Design and Implementation of the Sun Network Filesystem
Sandberg, Goldberg, Kleiman, Walsh, Lyon
Overview
NFS uses RPC and XDR to provide a system-independent protocol
for accessing a remote filesystem. It uses a stateless, idempotent protocol to
obviate the need for crash recovery. NFS is implemented in the kernel, and is
transparent to existing applications; programs need do nothing different to
access remote files.
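To make the "stateless, idempotent" point concrete, here is a minimal sketch of what a read request carries; this is not the protocol's actual XDR definition, and the names and sizes are illustrative. The client names the file and the offset explicitly on every call, so the server keeps no per-client state and a retransmitted request simply returns the same data again.

    /* Hypothetical sketch of a stateless, idempotent read request.  The real
     * protocol defines these in XDR; field names here are illustrative. */
    struct fhandle {
        unsigned char data[32];     /* opaque (to the client) file handle */
    };

    struct read_args {
        struct fhandle file;        /* which file to read */
        unsigned int   offset;      /* where to start; no server-side "current position" */
        unsigned int   count;       /* how many bytes to return */
    };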
The Protocol
Files and directories are identified not by pathnames
(which are UNIX-specific; NFS is intended to run on non-UNIX machines as well),
nor by inodes, but by file handles. A file handle is an opaque
identifier for a file. File handles are returned by the server to a client after
the client creates a new file or directory, or looks up a file by its component
name (not pathname). In each case, the directory in which the operation is
performed is identified by a file handle. The file handle for the root of the
exported filesystem (called the root file handle, which has nothing to
do with the user called root) is obtained via a completely separate
protocol called mount. The mount protocol is the one that does all of
the permission checking to see whether the client is allowed to access the exported
filesystem; if so, the root file handle is returned to the client. This means
that if you can find out the root file handle for some exported disk (often you
can do this just by having a user account on a machine which mounts the disk;
for example, the file handle for /var/spool/mail on orodruin is
0000000000001e0f00080000100021cf3a45000000080000100021cf3a450000), you can go
home to your Linux box and mount the disk, often getting complete read/write
access (except sometimes for files owned by root; see below). File
handles, which are supposed to be opaque, are actually composed of a filesystem id,
plus the inode number and inode generation number of both the file being
referenced and the root directory (a generation number is a field in the inode
structure which is incremented when an inode is freed; this prevents a file
handle from referencing the wrong file when the associated file is deleted and
a different file is given its inode number). By exploiting this structure, it is
usually possible to mount an entire remote partition from an NFS server, even if
only part of the partition was meant to be exported.
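A rough sketch of what such an "opaque" handle holds on a typical UNIX server is below; the field names and sizes are illustrative, not the real kernel layout.

    /* Illustrative layout only: the client is supposed to treat all of this
     * as an opaque blob, but the server packs real inode information into it. */
    struct svc_fhandle {
        unsigned int fsid;          /* which exported filesystem */
        unsigned int file_ino;      /* inode number of the file being referenced */
        unsigned int file_gen;      /* generation number of that inode */
        unsigned int root_ino;      /* inode number of the export's root */
        unsigned int root_gen;      /* generation number of the root inode */
        /* ...padding up to the fixed handle size... */
    };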
When a user on the local machine accesses a remote file, the local kernel
sends an RPC request which contains the file handle of the file, the numerical
uid and gids of the user, the request (read, rename, etc.), and any necessary
arguments. The server uses the uid and gids in the request to determine whether
to grant access to the file, using normal Unix permission semantics, with the
exception that root on the local machine is given the permissions of
nobody on the remote machine (this seems to be the limit of the thought
that was put into security). This requires a global uid/gid namespace, and also
complete trust that a user (with or without root access) on one client won't
send an RPC request to the server claiming to be a different user.
The server eventually returns a result (after synchronously
updating its disk image) and the local kernel then returns from the system call
that initiated the RPC. If the server has crashed, the local client will hang,
periodically retransmitting the request, until the server comes back up and
replies to it.
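That retry behaviour can be sketched as a simple loop; rpc_call() and the types below are hypothetical stand-ins, not a real API. Because every request is idempotent, retransmitting it indefinitely is safe.

    struct server; struct read_args; struct read_reply;    /* hypothetical types */

    /* Hypothetical helper: send one request, wait up to 'timeout' seconds. */
    extern int rpc_call(struct server *srv, struct read_args *args,
                        struct read_reply *reply, int timeout);

    void do_request(struct server *srv, struct read_args *args,
                    struct read_reply *reply)
    {
        /* Idempotence makes endless retransmission harmless. */
        while (rpc_call(srv, args, reply, 1) != 0)
            ;   /* no reply: server crashed, rebooting, or just slow -- send it again */
    }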
Implementation
In order to make NFS transparent to user programs, the
concept of an inode was abstracted away (all problems in computer science...) to
that of a VFS and a vnode. A VFS represents a mounted file
system of any type (local, MSDOS, NFS, etc.). Each vnode represents one file in
a VFS, and a set of operations (see the paper) is defined that is assumed to
be all-encompassing for any file system you would want to mount. The advantage of
the VFS/vnode system is that a single machine could have multiple types
of file systems mounted simultaneously, in what appears to be a single directory
hierarchy.
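The idea can be sketched as a table of function pointers supplied by each filesystem type; the rest of the kernel calls through the table without caring whether the file is local or remote. The operation names below are an illustrative subset, not the paper's full list.

    struct vnode;                   /* one file in some mounted filesystem */

    /* Illustrative subset of the per-filesystem operations table. */
    struct vnodeops {
        int (*open)  (struct vnode *vp, int flags);
        int (*read)  (struct vnode *vp, char *buf, long off, long len);
        int (*write) (struct vnode *vp, const char *buf, long off, long len);
        int (*lookup)(struct vnode *dir, const char *name, struct vnode **out);
        int (*remove)(struct vnode *dir, const char *name);
    };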
Initially, the NFS server was also part of the kernel. So that it could block
when no requests were pending, or while waiting on disk I/O, a user-level process
called nfsd was run, which made a single system call that never
returned. This user context is what the kernel uses to let the server sleep. (A
similar trick was used until recently by the bdflush daemon in the
Linux kernel; the correct solution is to use kernel threads, which Linux has
done. Nowadays, nfsd is actually a user-level process on most systems,
so the point is moot.) On the client side, a block I/O daemon (biod)
does the same thing to provide a context to sleep in when the (blocking) RPC
read-ahead and write-behind requests are not yet complete. This is done so that
simultaneous requests can be made and handled.
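The trick can be sketched as below; the nfssvc()-style entry point is an assumption about the interface (the exact name and signature varied between systems), but the shape is the same: fork a few ordinary processes, each of which makes one system call that never returns and thereby lends its context to the in-kernel server.

    /* Sketch only: the system-call name and signature are assumptions. */
    #include <unistd.h>

    extern int nfssvc(int sock);    /* enters the in-kernel server; never returns */

    int main(void)
    {
        for (int i = 1; i < 4; i++) /* several copies => several concurrent requests */
            if (fork() == 0)
                break;
        nfssvc(0);                  /* donate this process's context to the kernel */
        return 0;                   /* not reached */
    }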
Non-implementation
Several (arguably desirable) features were not
included in the initial version of NFS:
Root filesystems
You could not have your / directory be on a
remote machine. The paper argues against having / on
many clients be mapped to the same directory on the server, but there's nothing
wrong with client A's / being mounted from /usr/exports/A on
the server, and client B's / being mounted from /usr/exports/B
on the server. More recently, this has, in fact, been achieved; a Linux system
can be booted with just a network card and a boot PROM.
Filesystem Naming
The name of an exported directory on a server has
nothing to do with the name of the directory on which it gets mounted on the
client; the client is free to mount it on top of any directory (even another
remote one). What happens if clients are also servers, and machine A has a local
directory /usr/A, and machine B has a local directory /usr/B,
and A mounts B:/usr/B onto /usr/A, and B mounts
A:/usr/A onto /usr/B? An access to /usr/A on A would
go back and forth from machine to machine, trying to resolve what directory is
actually there. NFS avoided this problem by not re-exporting directories. In the
above example, after the mounts, /usr/A on A would appear to contain
the original contents of /usr/B on B, and vice versa. A couple of NFS
daemons today have options to follow mount points when exporting filesystems;
it's up to the admins not to create loops like this.
Security
As mentioned above, the only implemented security is the
mapping of root to nobody. All this does is prevent
root on a client machine from accessing files that only root
on the server machine can access. If any other uid can access a file on the
server, root on the client can just change to that user, and access the
file. More recently, tiny improvements have been made. For example, most
versions of mountd have an option (though it's often not used, as it
disallows valid mounts from some older systems, like Ultrix) to only
allow mount requests from port numbers reserved for root. This prevents
regular users on a machine that is allowed to mount a file system from a
remote server from discovering the root file handle of that file
system. Also, very very recently (I've only seen the Linux nfsd do
this), nfsd will reject packets from clients that aren't listed in the
exports file; before this, it was assumed that if a client knew a file handle,
it must have been previously verified by mountd, so it was OK.
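The reserved-port check mentioned above amounts to very little code; a sketch (assuming the sender's address has already been obtained from recvfrom()) is:

    #include <netinet/in.h>
    #include <arpa/inet.h>

    /* Accept the request only if it came from a port below 1024, which on
     * classic UNIX systems only root-owned processes are allowed to bind. */
    static int from_privileged_port(const struct sockaddr_in *from)
    {
        return ntohs(from->sin_port) < IPPORT_RESERVED;  /* IPPORT_RESERVED is 1024 */
    }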
File Locking
NFS doesn't have any; that would be state. Sun made a
separate daemon called lockd which handles lock requests. I've found
it's not very useful in a heterogeneous environment. Similarly, there's a
problem with multiple concurrent writers: they tend to overwrite each other's
data, as there is no way to atomically append to a file.
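The append problem can be sketched as follows, with hypothetical helper names standing in for the client's internal calls: "append" is really two separate stateless requests, so two clients can read the same size and then both write at that offset, and one overwrites the other.

    struct fhandle;                                     /* opaque NFS file handle */
    extern long nfs_getattr_size(struct fhandle *);     /* hypothetical helpers */
    extern void nfs_write(struct fhandle *, long off, const char *buf, long len);

    void append_racy(struct fhandle *fh, const char *buf, long len)
    {
        long size = nfs_getattr_size(fh);   /* may already be stale */
        nfs_write(fh, size, buf, len);      /* another client's data may land here too */
    }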
Open File Semantics
Under Unix, access to a file is checked when it is
opened. After that point, the file can be made unreadable or unwritable, or even
deleted entirely, and access through the already-open descriptor should still work.
Many programs open a temp file and immediately delete it, writing to and reading
from the open file descriptor. In order to make this work over stateless NFS, a
client that had a remote file open would never ask the server to remove it, as
long as it stayed open; it asked to have it renamed (to a temporary name)
instead. When the file was closed, the temporary name was removed. This only
solved the "deletion" problem, and only for a single client; if one client has a
remote file open, and another client deletes it, the first client loses access.
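A sketch of that client-side workaround, with hypothetical helper names, looks like this; the hidden name traditionally begins with ".nfs".

    #include <stdio.h>
    #include <unistd.h>

    struct vnode;                   /* the still-open remote file */
    extern void nfs_rename(struct vnode *dir, const char *from, const char *to);
    extern void remember_to_remove_on_close(struct vnode *vp, const char *name);

    /* Hypothetical sketch: called when a program unlinks a file that some local
     * process still has open.  Rather than asking the server to remove it,
     * hide it under a temporary name and remove that name at last close. */
    void nfs_remove_busy(struct vnode *dir, const char *name, struct vnode *vp)
    {
        char tmp[32];
        snprintf(tmp, sizeof tmp, ".nfs%05d", (int)getpid());
        nfs_rename(dir, name, tmp);
        remember_to_remove_on_close(vp, tmp);
    }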
Consistency
The paper does not mention this at all in the context of
simultaneous access, but from other papers (and personal experience), we know it
was basically ignored. Changes made by one client may not show up in another
client for up to 30 seconds.
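The delay comes from client-side caching of file attributes and data; a sketch of the cache-validity check, with illustrative names and the timeout the document cites, is below.

    #include <time.h>

    /* Illustrative attribute-cache entry: cached attributes are trusted until
     * they reach a maximum age, so another client's changes go unnoticed in
     * the meantime. */
    struct attr_cache_entry {
        long   size;                /* cached file size */
        time_t mtime;               /* cached modification time */
        time_t fetched;             /* when these attributes were fetched */
    };

    int attrs_still_fresh(const struct attr_cache_entry *e, time_t now)
    {
        return (now - e->fetched) < 30;   /* reuse for up to ~30 seconds */
    }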
Concluding Remark
This is a perfect example of why not to
ignore security when designing a system. Security is not a "feature" to be added
in later; it must be an integral part of your system design.