By J. H. Saltzer, 24 February 2002. Revised heavily 10 March 2002; minor typos fixed February 2003.
This paper takes some care to discuss, because in some ways we are reading it a bit early. Although the bulk of the paper concerns threads and caches, it also depends on a small amount of knowledge about networks, which we won't start investigating for another week or so, and the discussion of pathname translation gets into naming, which will come up even later. The best way to deal with these problems is to postpone discussions about details of network mechanics, rather than trying to explain them now. As questions about the network come up, write them on the board, promise to explore them in depth in a couple of weeks, and figure out a way to finesse them so as to keep the discussion of the current topics moving.
The reason we are reading this paper is that it describes a set of resource management design decisions, applying techniques that we are studying in chapters two and three of the notes. Because it makes use of practically every idea that has come up in 6.033 so far, it can easily occupy discussion for two full recitation meetings.
(In the Introduction, the authors talk about a single throughput measure: requests served per second. [Later they silently switch to the term "connection rate" for this measure.])
(Yes, but they also measure something else. The performance charts report two distinct things. The first is "bandwidth", which is a misnomer for output data rate. The second is connection rate.)
(Because when clients request mostly large files, the bottleneck is the rate at which the server can transfer data from the disk to the network, but when clients request mostly small files, the bottleneck is the fixed cost of handling a request. A server architecture that is good at handling the first may not be good at handling the second; you need to measure both.)
(Receiving the request, interpreting the request to identify the file needed, looking up the file first in the cache and then in the file system, and finally sending at least one response message to the client.)
(suggested by Saman Amarasinghe) At this point you might try the following exercise to get discussion going: print out the five slides in this pdf file, break your class up into five groups, give each group one slide, and tell them that each group is a marketing team that has ten minutes to work up a presentation about why their web server architecture is better than any of the other four.
(The main server program hands off to parallel ("helper") processes anything that might require waiting so that the main program can get on to do something else.)
(Because threads as supplied by UNIX are defective. Processes, which consist of one thread plus one address space, are too heavyweight. There does exist a thread package that does the right thing, but it isn't universally available on all UNIX systems. The 6.033 interest in this paper is that it reports the rationale for certain engineering decisions made under a particular set of design constraints. The real world is always a bit messy.
[A less charitable way of putting this is that the main contribution of this paper is "Here is a way to program around a missing feature of UNIX." Many USENIX papers are of this same form, and the systems community tends to deprecate them. But that doesn't make the paper any less interesting for our purposes.])
(Waits on the disk for data that isn't cached in RAM.)
(It is often helpful to distinguish technology problems from intrinsic problems. They are battling a technology problem. This is a 1999 paper. At that time, RAM cost about 10 times as much as it does today. And their server probably didn't have enough hardware slots to install 5 Gbytes of RAM. Today this may be a niche problem that needs solving only for servers with unusually large data bases. But in 1999 it was widely applicable.)
(Because some technology problems vanish if you can wait for the technology to improve. If you can't wait, they may be candidates for temporary worse-is-better solutions.)
However, it may be difficult in this architecture to perform optimizations (such as a shared cache of valid URLs) that rely on global information.
(
    do forever {
        while nothing to do { wait }
        figure out what event has happened
        call that event handler to deal with it
    }

The main point is that there may be many different event handler programs, each of which deals with a different event, but there is exactly one call to wait in the entire server program. The event handler programs never call wait or do anything that would cause a wait on their own.)
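For concreteness, here is a minimal sketch of what that single-wait loop might look like in C, using select() as the one blocking call. The descriptor names and handler functions are hypothetical, not taken from the paper:

    #include <sys/select.h>

    /* Hypothetical handlers; in a real server each one runs to completion
       without ever blocking. */
    extern void handle_new_connection(int fd);
    extern void handle_helper_reply(int fd);

    void event_loop(int listen_fd, int helper_pipe_fd)
    {
        for (;;) {                                  /* do forever */
            fd_set readable;
            FD_ZERO(&readable);
            FD_SET(listen_fd, &readable);
            FD_SET(helper_pipe_fd, &readable);
            int maxfd = listen_fd > helper_pipe_fd ? listen_fd : helper_pipe_fd;

            /* The only call to wait in the entire server. */
            if (select(maxfd + 1, &readable, NULL, NULL, NULL) <= 0)
                continue;

            /* Figure out which event happened and dispatch its handler. */
            if (FD_ISSET(listen_fd, &readable))
                handle_new_connection(listen_fd);
            if (FD_ISSET(helper_pipe_fd, &readable))
                handle_helper_reply(helper_pipe_fd);
        }
    }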
(It is a kind of specialized and restricted thread package that can be implemented entirely by the application with no support from the kernel. It is specialized in that the central coordinator knows about all possible events and handlers, in contrast with the thread manager, which is unaware of what the threads are doing.)
(This is significant; the paper never mentions it. Don't try AMPED or SPED on a Macintosh OSX server, or any other machine with multiple processors, because the extra processors won't help. The event-driven architecture requires that only one processor be allowed to run the server thread. They mention that a server with MT architecture would require interlocks. Event-driven architectures avoid interlocks, but at the price of not allowing multiple processors. The paper should have mentioned this restriction in section 4.2, where it compares architectures, but it doesn't. More on this when we come to 4.2.)
(A glitch in UNIX. The designer of the read and write operations thought that they should be real procedure calls, which don't return till the operation is complete.)
(More glitches in UNIX. Remember that the event-driven architecture has just a single call to the kernel wait entry. For that to work, you want a wait that will return if any event occurs: a disk operation completes, a parallel process writes something to a pipe, a new network request arrives. But UNIX doesn't provide that kind of wait. And the following sentence suggests that you can't easily program around it.)
(An alternative way of reading files if you have a virtual memory with demand paging. The idea is that instead of saying "open(...), read(...), write(...), close(...)" the programmer says "p = map(...)". The operating system returns as the value p an address in virtual memory where the file can be found. It isn't in physical memory yet, but if the program touches that part of virtual memory, the demand paging system will bring it in, one page at a time. So the programmer can read the file as if it were a variable of the program, using loads and stores just as with any other data in RAM. mincore is a related function; if you give it a pointer to an address in virtual memory, it returns True iff that page of VM is in real memory.)
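Here is a rough, self-contained sketch (not Flash code) of the mapped-file style of reading, in C. The file name is invented, and note that the real mincore call fills in a vector of per-page flags rather than returning a single boolean:

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        int fd = open("somefile.html", O_RDONLY);   /* hypothetical file */
        if (fd < 0) return 1;

        struct stat st;
        fstat(fd, &st);

        /* p = map(...): the file now appears as ordinary memory. */
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) return 1;

        /* Touching p[i] may fault; the demand-paging system brings the
           page in from disk behind the program's back. */
        long lines = 0;
        for (off_t i = 0; i < st.st_size; i++)
            if (p[i] == '\n') lines++;
        printf("%ld lines\n", lines);

        /* mincore: ask whether the first page is currently in RAM.
           (Only a hint: the page could be evicted right after we ask.) */
        unsigned char vec[1];
        if (mincore(p, 1, vec) == 0)
            printf("first page resident: %s\n", (vec[0] & 1) ? "yes" : "no");

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }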
(First, performance. If done correctly, it can avoid one copy of the data. Usually the kernel "read" interface reads the data into a kernel buffer, then copies it into the place where the user really wanted it. With mapping, the data can go directly from the disk into an unused block of RAM, and that block can be handed over to the user by adjusting one page map entry. Second, the application code is usually much simpler. The program can read or write the data using named program variables.)
(For the web server application it conceals the disk interface a little too well. You can't tell what is actually in physical memory. The mincore call tries to work around this problem, but it provides only a hint, since the page removal algorithm may remove the page by the time you try to read it. And, it pretty much eliminates any hope you might have had of controlling when your thread calls wait, because the calls to wait for disk activity are buried inside the demand paging manager.)
(Because an event-driven system is restricted to a single processor. In effect, this sentence turns the restriction on its head and claims it as a feature. That doesn't make event-driven less useful, and given the set of constraints they were working under, which included running on a single-processor machine, event-driven is a reasonable engineering choice.)
(The explanation in the paper is quite obscure, and on top of that we haven't discussed naming yet, so it is unlikely that very many students figured this one out. Most URLs contain, among other things, the name of a file. But in many cases that name is not identical to the local file system name for that file. After performing the translation, this cache holds the result, so that it doesn't have to be figured out again. The cache is probably important only for relative URLs, but the concepts needed to discuss that idea won't be available till just before spring vacation.)
(None of the above. This is a reference to some processor cache. Since it is in the middle of a discussion of application caches, the paper should have been more explicit.)
(This is really mysterious at first. The problem is that because they are using the mapping interface to read files, they can't tell what pages are actually in physical memory. They know the kernel is using a clock algorithm to decide which pages to remove, and they are trying to guess which pages that algorithm has removed. Since clock is a rough approximation of LRU, they do an exact LRU and hope that it predicts the same page removals as the kernel's clock. This is probably the creakiest part of the design.)
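To make that guessing game concrete, here is a much-simplified sketch (not the actual Flash code) of an application keeping its own LRU bookkeeping over mapped file chunks and unmapping the least recently used one when it thinks memory is getting tight, on the theory that the kernel's clock algorithm would have evicted roughly the same page anyway:

    #include <sys/mman.h>
    #include <stddef.h>

    #define MAX_CHUNKS 128

    /* One mapped chunk of a cached file; use_stamp records the last use. */
    struct chunk { void *addr; size_t len; unsigned long use_stamp; int in_use; };

    static struct chunk cache[MAX_CHUNKS];
    static unsigned long tick;

    /* Call on every access, so the chunk counts as recently used. */
    void touch(struct chunk *c) { c->use_stamp = ++tick; }

    /* When the application guesses RAM is overcommitted, unmap the chunk
       it used least recently, hoping the kernel's clock agrees. */
    void evict_one(void)
    {
        struct chunk *victim = NULL;
        for (int i = 0; i < MAX_CHUNKS; i++) {
            if (!cache[i].in_use) continue;
            if (!victim || cache[i].use_stamp < victim->use_stamp)
                victim = &cache[i];
        }
        if (victim) {
            munmap(victim->addr, victim->len);
            victim->in_use = 0;
        }
    }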
(Both this and the LRU approximation are examples of too many cooks in the kitchen. We have a good opportunity for a chapter 4 end-to-end argument here: The application knows exactly what is worth caching, but the operating system is trying to second guess the application. This comes up both in the existence of the file cache and also in the management of virtual memory pages.)
(The short answer is that X optimizes interactive response latency (the time to complete an individual request), while Flash (and other such Web servers) optimizes system throughput, the total number of requests handled per second. As a result, X worries a lot about how to handle network round-trip times of tens of milliseconds and uses pipelining on a per-client basis, to avoid overall latencies that are too much worse than the underlying network latency for any given request.
In contrast, Flash doesn't really worry about bounding the latency of any *given* request. Instead, it worries about how many it can serve overall in unit time, and while it leverages caching to reduce response time, what it's really doing is increasing the *capacity* of the system compared to the case when blocked file I/O events occur.
Each of these optimizations is appropriate for the domain in which the design operates.)
(Rather than directly posting a disk I/O, the Flash server stuffs the I/O request into a pipe to another process, so the client must experience an extra delay waiting for that other process to get to run. And there is a similar scheduling delay on signalling the completion of the I/O. But in return for this delay, the Flash server can start working on another incoming request.)
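A bare-bones sketch of that handoff, assuming the main server and the helper were set up with a pair of pipes (the function names are invented, and real code would frame the messages on the pipe rather than relying on one write per request): the server writes the pathname down the request pipe, the helper does the blocking disk read (which also warms the file cache), and then writes a one-byte completion notice back, which shows up as a readable descriptor in the server's single select() wait.

    #include <unistd.h>
    #include <fcntl.h>
    #include <string.h>

    /* Helper process: loop forever, doing the blocking disk work on behalf
       of the main server so the main server's event loop never blocks. */
    void helper_main(int request_fd, int reply_fd)
    {
        char path[256], buf[4096];
        ssize_t n;
        while ((n = read(request_fd, path, sizeof path - 1)) > 0) {
            path[n] = '\0';
            int fd = open(path, O_RDONLY);
            if (fd >= 0) {
                /* Blocking reads are fine here; they also pull the file
                   into the kernel's buffer cache for the main server. */
                while (read(fd, buf, sizeof buf) > 0)
                    ;
                close(fd);
            }
            write(reply_fd, "x", 1);   /* completion notice */
        }
    }

    /* Main server: instead of reading the file itself (and maybe blocking),
       hand the pathname to the helper and go back to the event loop. */
    void start_disk_read(int request_fd, const char *path)
    {
        write(request_fd, path, strlen(path));
    }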
(As a temporary workaround if (a) you can't afford enough RAM and (b) your UNIX doesn't support multiple threads in a single address space. After all is said and done, this paper is about engineering under constraints. For the constraints these authors set out to meet, AMPED does the job better than anything else.)
([suggested by Hari Balakrishnan] Yes. Consider a streaming media server that provides access to, for example, a movie archive. In this case, the wait until you can afford enough RAM to cache the entire archive is going to be more than temporary. So your choices are pretty much AMPED or finding an operating system that properly supports multiple threads.)
Saltzer@mit.edu