6.033--Computer System Engineering
Suggestions for classroom discussion
Topic: R[ussel]. Sandberg, D[avid]. Goldberg, S[teve]. Kleiman, D[an]. Walsh, and B[ob]. Lyon. Design and Implementation of the Sun Network Filesystem. Proceedings of the Summer 1985 USENIX Conference & Exhibition, Portland, Oregon, June 1985, pp. 119-130.
By J. H. Saltzer, 18 March 2002. Minor update 16 March 2004.
We are reading this paper now because it provides an example of an end-to-end protocol that does something useful. Unfortunately, the paper is couched in a quantity of UNIX jargon and detail that will exceed the quota of all but the most determined readers. The material of interest is all in the first five pages, and the most interesting things from a network point of view are all exposed in the first three pages.
A second challenge is that, until reading Appendix A of Chapter 5, the only clue to the way the UNIX file system works comes from the Ritchie & Thompson paper, which didn't give any details.
-
Should we take this paper as an example of excellent technical communication?
(Heavens, no!)
- List some ways that it falls down. (Where to start?...
Excessive use of jargon.
Assumes total recall of very detailed knowledge of UNIX.
Organization reduces to a laundry list after the client-side section.
Spelling errors.
Five authors, but the text uses "I".
The acknowledgements thank some of the paper's own authors.
All references except Kepecs are to Sun internal documents or Sun publications.)
-
The design goals all seem to be motherhood. Why? (In 1985 there were a half-dozen other competitors for the title of remote file system category killer. The goals here are specifically targeted at the weaknesses of some of those competitors.)
-
Explain the goal of "Crash Recovery". Why the concern about "many different servers"? (Their model is that a typical workstation will mount file systems from several different servers containing, e.g., command libraries on one server, mail queues on a second, printer queues on a third, my personal files on a fourth, etc. If you are busy editing one of your personal files at the moment that the server holding the printer queues crashes, you don't want to either (1) have to reboot and possibly lose the editing you were doing, or (2) have to stop work until the crashed server recovers.)
-
How about the components of "Transparent access", "no pathname parsing", "no special libraries", "no recompiling"? (Pathname parsing: Some competitive remote file system designs required that the client present a file name in a syntax different from that of UNIX, e.g. with slashes replaced with colons or with the name of the server prefixed to the file name. With those systems, people either had to remember to use the special syntax when using remote files, or write scripts that transformed filenames. No special libraries: some of the competitors consisted of replacements for library programs such as open, read, write, and close. To use those it was necessary to reload each file-using application program with this special library. No recompiling: some of the competitors couldn't even work with standard binaries; to use them, each file-using application had to be recompiled with different .h files.)
-
The first design goal includes operating system independence. The fourth requires maintaining UNIX semantics. Aren't these contradictory? (Absolutely. They are walking a tightrope here. They want to allow participation by other systems, especially clients running other systems, as long as it doesn't compromise the requirement that the UNIX clients are happy. All operating systems are equal but UNIX is primus inter pares (first among equals).)
-
What do they mean when they say "stateless"? If I write my file on a file server, I expect to be able to read it back someday. Doesn't that mean that the file server maintains state? (Certainly. What they really mean is that every network protocol request contains all the information needed to carry out that request, without relying on anything remembered from previous protocol requests.)
-
What is the advantage? (Every NFS operation must be idempotent, which supports the crash recovery goal. If the server crashes and restarts between protocol requests, as long as no files are lost the client will never know.)
- What is the down side? Can you think of any Unix file system operations that might be problematic? (This question is hard at this point in the term, because of lack of familiarity with Unix file system semantics. The simplest example is read().)
-
How does UNIX read() map into NFS read()?
UNIX:
read(filedescriptor, buffer, nbytes);
read() starts at a position in the file given by the
file offset associated with filedescriptor. The file offset is
incremented by the number of bytes actually read.
NFS:
read(fh, offset, count) returns (attr, data)
read() returns up to count bytes of data from a file starting
offset bytes into the file.
So what is the problem? (There are two issues. The first one, fairly trivial, is marshalling--also called presentation. That is, we have to translate filedescriptor into fh, nbytes into count, and when the data comes back copy it into buffer.
The second issue is the interesting one; it is a good illustration of what stateless means. UNIX keeps track of how much of the file you have read, and each read starts from where you left off last time. NFS requires that you tell it, each time, where you want to start. The "file offset", sometimes called a "cursor", is an example of state that the UNIX file system remembers from one call to the next.)
-
But NFS doesn't remember things from one call to the next. So how can we possibly translate from UNIX semantics to NFS semantics? (Have the client remember the file cursor state variable. When a UNIX read call comes in, use the cursor value as the offset in the call to NFS, and when the data comes back add the number of bytes actually returned to the cursor so that you are ready for the next read call.)
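A minimal sketch of that client-side translation, in C. The names here (fh_t, nfs_read, the fdtable layout) are illustrative assumptions, not the actual Sun implementation:

    #include <sys/types.h>   /* off_t, size_t, ssize_t */

    #define NOFILE 64        /* max open files per client (assumed) */

    typedef struct { unsigned char opaque[32]; } fh_t;  /* opaque NFS handle */

    /* Assumed RPC stub: read up to nbytes starting at offset; returns
       the number of bytes actually read, or -1 on error. */
    ssize_t nfs_read(fh_t fh, off_t offset, size_t nbytes, void *buffer);

    struct open_file {
        fh_t  fh;      /* handle obtained when the file was looked up */
        off_t cursor;  /* the state the *client* keeps: next read position */
    };

    static struct open_file fdtable[NOFILE];  /* indexed by UNIX descriptor */

    ssize_t unix_read(int fd, void *buffer, size_t nbytes)
    {
        struct open_file *f = &fdtable[fd];

        /* Stateless request: handle, offset, and count all go along. */
        ssize_t n = nfs_read(f->fh, f->cursor, nbytes, buffer);

        if (n > 0)
            f->cursor += n;   /* advance by the bytes actually returned */
        return n;
    }

A side benefit: if this request is accidentally duplicated, the same fh and offset go to the server both times, so the same data comes back--exactly the idempotency the crash-recovery argument needs.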
-
The list of protocol procedures includes read and write. Why isn't there an open and close? (Because that wouldn't be stateless. The interesting thing about the NFS procedure interface is that it is completely, 100%, totally stateless. Every operation is idempotent.)
- The idea of a stateless protocol is perhaps best illustrated by comparing it with AFS, which uses a stateful protocol. The key differences between AFS and NFS are summarized in this slightly simplified table that shows the interaction between the client and the server:
Application says  | NFS                  | AFS
------------------|----------------------|--------------------------
Open(filename)    | return file handle   | copy file to the client
Read(data)        | get data from server | none
Write(data)       | put data to server   | none
Close()           | none                 | copy file back to server
The slight simplification in that table is that the NFS client has a block cache that sometimes avoids the need for the action "get data from server" and sometimes delays the action "put data to server" for up to 30 seconds.
- What state is the AFS server holding? (It is remembering that the client has checked out a copy of the file, so that it can tell other prospective clients whether or not they can open the file. If the AFS server crashes and restarts, the clients have to notice the restart and remind the AFS server about the files they have checked out. If a client crashes, the AFS server has to detect that and mark the client's files as no longer checked out. Warning: we are edging into recovery, a chapter 8 topic, here.)
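For contrast, the whole-file model looks roughly like this (a caricature with hypothetical names, not real AFS code):

    /* AFS-style whole-file access.  Assumed stubs: */
    int  cache_open(const char *name);          /* create a local cache copy */
    void fetch_file(const char *name, int fd);  /* server -> cache; server
                                                   records that this client
                                                   now holds the file */
    void store_file(const char *name, int fd);  /* cache -> server */

    int afs_open(const char *name)
    {
        int fd = cache_open(name);
        fetch_file(name, fd);   /* one big transfer; the server's record of
                                   who holds the file is the state in question */
        return fd;              /* read() and write() now touch only the cache */
    }

    void afs_close(const char *name, int fd)
    {
        store_file(name, fd);   /* copy the (possibly modified) file back */
    }

Reads and writes between open and close involve no network traffic at all, which is the "none" in the middle rows of the table.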
-
In talking about the virtues of being stateless, the paper says "If a client just resends requests until a response is received, data will never be lost due to a server crash." That is true, but it hides an interesting problem. The way they implemented this was simply to set a timer at the client; when the timer expires, resend the request. Keep doing this forever if necessary (a hard mount), or give up after a few minutes (a soft mount). What is the problem with this approach? (The problem is choosing a timer value.)
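In outline, that client loop is just the following (a sketch; send_request, await_reply, and the constants are assumptions, not Sun's code):

    #include <stdbool.h>

    #define TIMEOUT_MS 150   /* fixed retransmit timer (assumed value) */
    #define SOFT_LIMIT 20    /* soft mount: give up after this many tries */

    struct request;          /* opaque; contents don't matter here */
    struct reply;

    /* Assumed stubs: transmit a request; wait up to ms for the answer. */
    void send_request(struct request *req);
    bool await_reply(struct reply *rep, int ms);

    /* Hard mount: retry forever.  Soft mount: fail after SOFT_LIMIT tries. */
    bool rpc_call(struct request *req, struct reply *rep, bool hard_mount)
    {
        for (int tries = 0; hard_mount || tries < SOFT_LIMIT; tries++) {
            send_request(req);                 /* (re)send the same request */
            if (await_reply(rep, TIMEOUT_MS))
                return true;                   /* got an answer in time */
            /* Timer expired: just send it again.  Safe, because every
               request is idempotent--but watch what it does to the load. */
        }
        return false;   /* soft mount gave up */
    }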
-
So let's see what happens. Suppose the timer value is set to 150 ms. Will that work? (If the server is on the same Ethernet as the client it will probably get the message in a millisecond or so, and it takes maybe 20 milliseconds to do the disk read and send the data. No problem.)
-
What happens when we load the server up a bit? Suppose it uses synchronous file I/O, so it can handle 50 requests/second. Suppose each client makes 1 request per second. How many clients can it handle? (Looks like 50, but that turns out to be optimistic.)
-
Suppose we have 45 clients, making 45 requests/second. What fraction of the capacity of the server are we using? (90% of its capacity.)
-
And how long are its queues? (With utilization r = 0.9, the average queue length is about 1/(1-r) = 1/(1-0.9) = 10 requests.)
-
So how long does an average request wait in the queues? (10 * 20 ms = 200 ms)
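Spelling the arithmetic of the last three questions out as a tiny program (the numbers are the assumed ones from the example, not measurements):

    #include <stdio.h>

    int main(void)
    {
        double service_ms = 20.0;                 /* one disk read + reply */
        double capacity   = 1000.0 / service_ms;  /* = 50 requests/second */
        double offered    = 45.0;                 /* 45 clients, 1 req/s each */

        double r     = offered / capacity;        /* utilization = 0.9 */
        double queue = 1.0 / (1.0 - r);           /* ~10 requests queued */
        double wait  = queue * service_ms;        /* ~200 ms average wait */

        printf("utilization %.0f%%, queue %.0f, wait %.0f ms\n",
               100.0 * r, queue, wait);           /* 90%, 10, 200 ms */
        return 0;
    }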
-
What is going to go wrong? (That client will time out at 150 ms and resend its request.)
-
What is wrong with that? That looks like our standard problem that sometimes duplicates happen. (Yes, but look at it from the server's point of view: It has two identical requests from the same client in its queue.)
-
So why doesn't it do a standard deduplication, just like we saw in lecture? (Because it is stateless!)
-
Uh-oh. So it performs the read twice and the client has to ignore the second response. No big deal, right? (Wrong. Every request will find ten things ahead of it in the queue. Each client will time out and send a duplicate. We have doubled the load on the server, which means that it is getting only half as much real work done as we hoped. The clients are presenting a load of 180% of its capacity. The queue is now growing. When the queue length gets to 15 the average wait will be 300 ms and clients will start to time out a second time and resubmit their requests yet again. So now we are tripling the load, and only 1/3 of the work of the server is actually useful. And the queues are still growing. In a little while the server will be doing nothing but redoing work it has already done several times. We have a spectacular example of the congestion collapse discussed in section F of the notes.)
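A back-of-the-envelope way to watch the collapse: each round of timeouts adds one more copy of every request to the offered load (illustrative arithmetic only, not a simulation of a real server):

    #include <stdio.h>

    int main(void)
    {
        double capacity = 50.0;   /* requests/second the server can handle */
        double real     = 45.0;   /* genuine client load */

        for (int copies = 1; copies <= 4; copies++) {
            double offered = real * copies;
            printf("%d copies: %.0f req/s offered = %.0f%% of capacity, "
                   "%.0f%% of completed work useful\n",
                   copies, offered, 100.0 * offered / capacity,
                   100.0 / copies);
        }
        return 0;
    }

The output reproduces the numbers in the answer above: two copies of everything is 180% of capacity with half the work useful, three copies is 270% with a third useful, and so on.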
-
Congestion collapse is caused when resources get wasted as the load increases. Where is the waste? (As the load increases, the server wastes more of its time performing duplicate requests.)
-
So let's tell the clients to back off by doubling their timeouts. Does that help? (Not if the queue already contains three copies per client. Congestion collapse is stable!)
-
How do we fix it? (Exponential backoff. Each time the client times out, it should double its timer interval. Eventually that will lower the overall load to a point where the server can empty the queue.)
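The change to the earlier retry sketch is essentially one line (same hypothetical send_request/await_reply stubs; the 60-second cap is an assumption, not from the paper):

    /* Exponential backoff: double the retransmit interval on every
       timeout, so a persistently overloaded server sees the offered
       load fall instead of multiply. */
    bool rpc_call_backoff(struct request *req, struct reply *rep)
    {
        int timeout_ms = 150;            /* initial timer, as before */
        for (;;) {                       /* hard-mount behavior: forever */
            send_request(req);
            if (await_reply(rep, timeout_ms))
                return true;
            timeout_ms *= 2;             /* the only change: back off */
            if (timeout_ms > 60000)
                timeout_ms = 60000;      /* cap, so retries do continue */
        }
    }

Each doubling roughly halves that client's retransmission rate, so the aggregate offered load drops until the server can drain its queue.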
-
There was another problem in the initial NFS that isn't mentioned in the paper. There was a bug in the 4.2 BSD implementation of UDP such that it calculated checksums wrong in certain cases. As a workaround, NFS turned off UDP checksums. But it still worked just fine. Why? (When first deployed, the server and the client were always on the same Ethernet, because routers were so underpowered that you couldn't consider running a file server through a router. And Ethernets have their own CRCs, so bad data is never delivered to the client.)
-
How long do you expect this to be a stable situation? (Until routers started getting fast enough that people could put them in the path between the server and the client. Since the Ethernet CRC protects data only while it is on the wire, bus errors, hardware problems in routers and network interfaces, and software errors in the link and network protocols will show up as bad data in your NFS file. Once there are a couple of routers in the path, the probability of this class of error starts to rise from the undetectable level to the annoying level. So it was necessary to turn UDP checksums back on (after fixing the bug in the 4.2 BSD checksum calculation).)
Comments and suggestions: Saltzer@mit.edu