Re: A multi-threaded NFS server for Linux

Olaf Kirch (okir@monad.swb.de)
Tue, 26 Nov 1996 23:09:08 +0100

 * Messages sorted by: [1][ date ] [2][ thread ] [3][ subject ] [4][ author ]
 * Next message: [5]Olaf Kirch: "Re: rpc.lockd/rpc.statd"
 * Previous message: [6]Paul Christenson: "smail SPAM filter?"
 * Next in thread: [7]Linus Torvalds: "Re: A multi-threaded NFS server for Linux"
 * Reply: [8]Linus Torvalds: "Re: A multi-threaded NFS server for Linux"

_________________________________________________________________

Hi all,

here are some ramblings about implementing nfsd, the differences
between kernel space and user space, and life in general. It's become
quite long, so if you're not interested in either of these topics,
just skip it...

On Sun, 24 Nov 1996 12:01:01 PST, "H.J. Lu" wrote:
> With the upcoming Linux C library 6.0, it is possible to
> implement a multi-threaded NFS server in the user space using
> the kernel-based pthread and MT-safe API included in libc 6.0.

In my opinion, servicing NFS from user space is an idea that should
die. The current unfsd (and I'm pretty sure this will hold for any
other implementation) has a host of problems:

1. Speed

This is only partly related to nfsd being single-threaded. A while
ago I ran some benchmarks comparing my kernel-based nfsd to the
user-space nfsd. In the unfsd case I was running 4 daemons in
parallel (which is possible even now, as long as you restrict
yourself to read-only access), and found that peak throughput topped
out at around 800 KBps; the rate for sustained reads was even lower.
In comparison, the kernel-based nfsd achieved around 1.1 MBps peak
throughput, which is almost the theoretical cheapernet limit; its
sustained rate was around 1 MBps. Testers of my recent knfsd
implementation reported a sustained rate of 3.8 MBps over 100 Mbps
Ethernet.

Even though some tweaking of the unfsd source (especially getting rid
of the Sun RPC code) may improve performance some more, I don't
believe the user-space implementation can be pushed much further.

[Speaking of the RPC library, a rewrite would be required anyway to
safely support NFS over TCP. You can easily hang a vanilla RPC server
by sending an incomplete request over TCP and keeping the connection
open.]

Now add to that the synchronization overhead required to keep the
file handle cache in sync between the various threads... which leads
me straight to the next topic:

2. File Handle Layout

Traditional nfsds usually stuff a file's device and inode number into
the file handle, along with some information on the exported inode.
Since a user-space program has no way of opening a file given just
its inode number, unfsd takes a different approach: it essentially
creates a hashed version of the file's path. Each path component is
stat'ed, and an 8-bit hash of the component's device and inode number
is stored in the handle.

The first problem is that this kind of file handle is not invariant
under renames from one directory to another. Agreed, this doesn't
happen too often, but it does break Unix semantics. Try this on an
NFS-mounted file system (with appropriate foo and bar):

    (mv bar foo/bar; cat) < bar
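To make the layout concrete, here is a minimal sketch of how such a
hashed-path handle might be built. This is my own illustration, not
the actual unfsd code: build_fhandle, FH_PATH_LEN and the particular
one-byte hash are all invented for the example.

    #include <limits.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    #define FH_PATH_LEN 28  /* assumed room for the hashed path */

    /* Fold a component's device and inode number into one byte. */
    static unsigned char hash_dev_ino(dev_t dev, ino_t ino)
    {
        return (unsigned char)((dev ^ ino ^ (ino >> 8) ^ (ino >> 16)) & 0xff);
    }

    /*
     * Hash every component of an absolute path into the handle.
     * Returns the number of components hashed, -1 if a stat fails.
     */
    static int build_fhandle(const char *path, unsigned char *fh)
    {
        char buf[PATH_MAX], partial[PATH_MAX] = "";
        struct stat st;
        char *comp;
        int depth = 0;

        strncpy(buf, path, sizeof(buf) - 1);
        buf[sizeof(buf) - 1] = '\0';

        for (comp = strtok(buf, "/"); comp != NULL && depth < FH_PATH_LEN;
             comp = strtok(NULL, "/")) {
            strcat(partial, "/");
            strcat(partial, comp);  /* grow the prefix: /usr, /usr/local, ... */
            if (stat(partial, &st) < 0)
                return -1;
            fh[depth++] = hash_dev_ino(st.st_dev, st.st_ino);
        }
        return depth;
    }

Since the handle is a pure function of the path, moving a file into
another directory silently gives it a new handle while clients still
hold the old one; and with only eight bits per component, collisions
on a large tree are unavoidable.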
The second problem is a lot worse. When unfsd is presented with a
file handle that is not in its cache, it must map it back to a valid
path name. This is basically done in the following way:

    path = "/"; depth = 0;
    while (depth < length(fhandle)) {
    deeper:
        dirp = opendir(path);
        while ((entry = readdir(dirp)) != NULL) {
            if (hash(entry's dev, ino) matches fhandle[depth]) {
                remember dirp;   /* to resume the scan when backtracking */
                append entry to path;
                depth++;
                goto deeper;
            }
        }
        closedir(dirp);
        backtrack;               /* pop one level, resume parent's scan */
    }

Needless to say, this is not very fast. The file handle cache helps a
lot here, but this kind of mapping operation occurs far more often
than one might expect (consider a development tree where files get
created and deleted continuously). In addition, the current
implementation discards conflicting handles when there's a hash
collision.

This file handle layout also leaves little room for any additional
baggage. Unfsd currently uses 4 bytes for an inode hash of the file
itself and 28 bytes for the hashed path, but as soon as you add other
information such as the inode generation number, you will sooner or
later run out of room.

Last but not least, the file handle cache must be strictly
synchronized between the different nfsd processes or threads. Suppose
you rename foo to bar, which is handled by thread 1, and then try to
read the file, which is handled by thread 2. If the latter doesn't
know the cached path is stale, the read will fail. You could of
course retry every operation that fails with ENOENT, but that would
add even more clutter and overhead to the code.

3. Adherence to the NFSv2 Specification

The Linux nfsd currently does not fulfill the NFSv2 spec in its
entirety. Especially when it comes to safe writes, it is really a
fake: it neither makes an attempt to sync file data before replying
to the client (which could be implemented, along with an `async'
export option for turning this behavior off), nor does it sync
meta-data after inode operations (which is impossible from user
space). To most people this is no big loss, but this behavior is
definitely not acceptable if you want industrial-strength NFS.

But even if you did implement at least synchronous file writes in
unfsd, be it as an option or as the default, there seems to be no way
to implement some of the more advanced techniques like gathered
writes. With gathered writes, the server tries to detect whether
other nfsd threads are writing to the same file at the same time
(which frequently happens when the client's biods flush out data on
file close); if they are, it delays syncing the file data for a few
milliseconds so the others can finish first, and then flushes all the
data in one go. You can do this in kernel-land by watching
inode->i_writecount, but you're totally at a loss in user space.

4. Supporting NFSv3

A user-space NFS server is not particularly well suited to
implementing NFSv3 either. For instance, NFSv3 tries to help cache
consistency on the client by providing pre-operation attributes for
some operations, for instance the WRITE call. When a client finds
that the pre-operation attributes returned by the server agree with
those it has cached, it can safely assume that any data it has cached
was still valid when the server replied to its call, so there is no
need to discard the cached file data and meta-data. However, pre-op
attributes can only be provided safely when the server retains
exclusive access to the inode throughout the operation, and that is
impossible from user space.

A similar example is the exclusive create operation, where a verifier
is stored in the inode's atime/mtime fields by the server to
guarantee exactly-once behavior even in the face of request
retransmissions. These values cannot be checked atomically by a
user-space server.
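As an illustration of the pre-op attribute problem, here is a sketch
of what a user-space server would have to do for WRITE. It is my own
example, not code from any real server (nfsd3_write and struct
pre_op_attr are invented names); the comment marks the window that a
kernel server could close by holding the inode locked.

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Pre-operation attributes as NFSv3 wants them reported. */
    struct pre_op_attr {
        off_t  size;
        time_t mtime;
        time_t ctime;
    };

    /* Handle a WRITE call: snapshot pre-op attributes, then write. */
    static ssize_t nfsd3_write(int fd, const void *buf, size_t len,
                               off_t offset, struct pre_op_attr *pre)
    {
        struct stat st;

        if (fstat(fd, &st) < 0)
            return -1;
        pre->size  = st.st_size;    /* pre-op snapshot */
        pre->mtime = st.st_mtime;
        pre->ctime = st.st_ctime;

        /*
         * RACE: a local process (or another nfsd thread) may modify
         * the file right here, so the "pre-op" attributes we return
         * may already be stale by the time the write happens.  A
         * kernel server could hold the inode across both steps.
         */
        if (lseek(fd, offset, SEEK_SET) == (off_t) -1)
            return -1;
        return write(fd, buf, len);
    }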
What this boils down to is that a user-space server cannot, without
violating the protocol spec, implement many of the advanced features
of NFSv3.

5. File Locking over NFS

Supporting lockd in user space is close to impossible. I've tried it,
and I have run into a large number of problems. Some of the
highlights:

 * lockd can hold only a limited number of locks at the same time,
   because it has only a limited number of file descriptors.

 * When lockd blocks a client's lock request because of a lock held
   by a local process on the server, it must continuously poll
   /proc/locks to see whether the request could be granted (see the
   sketch after this list). What's more, if there's heavy contention
   for the file, it may take a long time before it succeeds, because
   it cannot add itself to the inode's lock wait list in the kernel.
   That is, unless you want it to create a new thread just for
   blocking on this lock.

 * lockd must synchronize its file handle cache with that of the NFS
   servers. Unfortunately, lockd is also needed when running as an
   NFS client only, so you run into problems with who owns the file
   handle cache and how to share it between these two services.
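To give an idea of what that polling looks like: a single-threaded
user-space lockd cannot sleep in F_SETLKW without stalling all its
other clients, so whether it scans /proc/locks or, as in this sketch
of mine, simply retries a non-blocking F_SETLK, it ends up with a
loop like the following (retry_blocked_lock is an invented name, not
lockd source):

    #include <errno.h>
    #include <fcntl.h>

    /*
     * Retry one client's blocked lock request.  Returns 1 if the
     * lock was granted (notify the client), 0 if it is still held
     * by someone else, -1 on a real error.  Called from the main
     * loop at some polling interval; there is no way to queue on
     * the inode's lock wait list from here, so a contended lock
     * may never be won.
     */
    static int retry_blocked_lock(int fd, struct flock *fl)
    {
        if (fcntl(fd, F_SETLK, fl) == 0)
            return 1;
        if (errno == EAGAIN || errno == EACCES)
            return 0;   /* still blocked, poll again later */
        return -1;
    }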
6. Conclusion

Alright, this has become rather long. Some of the problems I've
described above may be solvable with more or less effort, but I
believe that, taken as a whole, they make a pretty strong argument
against sticking with a user-space nfsd. In kernel space, most of
these issues are addressed more easily, and more efficiently.

My current kernel nfsd is fairly small. Together with the RPC core,
which is used by both client and server, it takes up something like
20 pages--don't quote me on the exact number. As mentioned above, it
is also pretty fast, and I hope to provide fully functional file
locking soon as well.

If you want to take a look at the current snapshot, it's available at
ftp.mathematik.th-darmstadt.de/pub/linux/okir/dontuse/linux-nfs-X.Y.tar.gz.
This version still has a bug in the nfsd readdir implementation, but
I'll release an updated (and fixed) version as soon as I have the
necessary lockd rewrite sorted out.

I would particularly welcome comments from the Keepers of the Source
on whether my NFS rewrite has any chance of being incorporated into
the kernel at some time... that would definitely motivate me to sink
more time into it than I currently do.

Happy hacking
Olaf
--
Olaf Kirch        |  --- o --- Nous sommes du soleil we love when we play
okir@monad.swb.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax
For my PGP public key, finger okir@brewhq.swb.de.

_________________________________________________________________

References

 1. http://www.ussg.iu.edu/hypermail/linux/net/9611.3/date.html#18
 2. http://www.ussg.iu.edu/hypermail/linux/net/9611.3/index.html#18
 3. http://www.ussg.iu.edu/hypermail/linux/net/9611.3/subject.html#18
 4. http://www.ussg.iu.edu/hypermail/linux/net/9611.3/author.html#18
 5. http://www.ussg.iu.edu/hypermail/linux/net/9611.3/0019.html
 6. http://www.ussg.iu.edu/hypermail/linux/net/9611.3/0017.html
 7. http://www.ussg.iu.edu/hypermail/linux/net/9611.3/0020.html
 8. http://www.ussg.iu.edu/hypermail/linux/net/9611.3/0020.html