
6.033--Computer System Engineering

Suggestions for classroom discussion


Topic: Vivek S. Pai, Peter Druschel, and Willy Zwaenepoel. Flash: An efficient and portable Web server. Proceedings of the 1999 USENIX Annual Technical Conference, pages 199-212.

By J. H. Saltzer, 24 February 2002. Revised heavily 10 March 2002; minor typos fixed February 2003.


This paper requires some care to discuss, because at times it seems as if we are reading it a bit early. Although the bulk of the paper concerns threads and caches, it also depends on a small amount of knowledge about networks, which we won't start investigating for another week or so, and the discussion of pathname translation gets into naming, which will come up even later. The best way to deal with these problems is to postpone discussions about the details of network mechanics rather than trying to explain them now. As questions about the network come up, write them on the board, promise to explore them in depth in a couple of weeks, and figure out a way to finesse them so that the discussion of the current topics keeps moving.

The reason we are reading this paper is that it describes a set of resource management design decisions, applying techniques that we are studying in chapters two and three of the notes. Because it makes use of practically every idea that has come up in 6.033 so far, it can easily occupy discussion for two full recitation meetings.

  1. Buzzword patrol (words and acronyms that the paper uses without defining, and to which the class has not yet systematically been exposed):

    (Possible entries: HTTP, CGI, MP, MT, SPED, AMPED, mmap, mincore, select, IPC.)

  2. How do you evaluate the performance of a web server?

    (In the Introduction, the authors talk about a single throughput measure: requests served per second. [Later they silently switch to the term "connection rate" for this measure.])

  3. And is that what the authors of this paper measure?

    (Yes, but they also measure something else. The performance charts report two distinct things. The first is "bandwidth", which is a misnomer for output data rate. The second is connection rate.)

  4. Why measure two things?

    (Because when clients request mostly large files, the bottleneck is the rate at which the server can transfer data from the disk to the network, but when clients request mostly small files, the bottleneck is the fixed cost of handling a request. A server architecture that is good at handling the first may not be good at handling the second; you need to measure both.)
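
    (A back-of-the-envelope illustration, with invented numbers: suppose the fixed cost of handling a request is 1 millisecond and the server can move data at 10 megabytes/second. Then a 100-kilobyte file takes 1 ms + 10 ms, the data movement dominates, and the server tops out near 90 requests/second; a 1-kilobyte file takes 1 ms + 0.1 ms, the fixed cost dominates, and the server tops out near 900 requests/second while moving less than 1 megabyte/second of data. A server architecture tuned for one regime can easily be mediocre in the other.)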

  5. What are the components of the fixed cost?

    (Receiving the request, interpreting the request to identify the file needed, looking up the file first in the cache and then in the file system, and finally sending at least one response message to the client.)

    (suggested by Saman Amarasinghe) At this point you might try the following exercise to get discussion going: print out the five slides in this pdf file, break your class up into five groups, give each group one slide, and tell them that each group is a marketing team that has ten minutes to work up a presentation about why their web server architecture is better than any of the other four.

  6. What is the high-level strategy of the Flash architecture?

    (The main server program hands off to parallel ("helper") processes anything that might require waiting so that the main program can get on to do something else.)

  7. But that is what threads are for. Why don't they use threads?

    (Because threads as supplied by UNIX are defective. Processes, which consist of one thread plus one address space, are too heavyweight. There does exist a thread package that does the right thing, but it isn't universally available on all UNIX systems. The 6.033 interest in this paper is that it reports the rationale for certain engineering decisions made under a particular set of design constraints. The real world is always a bit messy.

    [A less charitable way of putting this is that the main contribution of this paper is "Here is a way to program around a missing feature of UNIX." Many USENIX papers are of this same form, and the systems community tends to deprecate them. But that doesn't make the paper any less interesting for our purposes.])

  8. What is the main source of waits that they are avoiding?

    (Waits on the disk for data that isn't cached in RAM.)

  9. Is this a real concern? Today I can buy RAM for $100/gigabyte (PC-133 DIMM's, 2/24/03). Why not just buy 5 gigabytes of RAM and load the entire web server's database into RAM as part of server initialization?

    (It is often helpful to distinguish technology problems from intrinsic problems. They are battling a technology problem. This is a 1999 paper. At that time, RAM cost about 10 times as much as it does today. And their server probably didn't have enough hardware slots to install 5 gigabytes of RAM. Today this may be a niche problem that needs solving only for servers with unusually large databases. But in 1999 it was widely applicable.)

  10. Why is the distinction between technology and intrinsic problems useful?

    (Because some technology problems vanish if you can wait for the technology to improve. If you can't wait, they may be candidates for temporary worse-is-better solutions.)

  11. Section 3.1, second paragraph. What is the following sentence trying to say?

    "However, it may be more difficult to perform optimizations in this architecture that rely on global information, such as a shared cache of valid URLs."

    (Probably this: "However, it may be difficult in this architecture to perform optimizations (such as a shared cache of valid URLs) that rely on global information.")

  12. What does "event-driven" mean?

    (
          do forever
          {
          while nothing to do
             {
             wait
             }
          figure out what event has happened
          call that event handler to deal with it
          }
    The main point is that there may be many different event handler programs, each of which deals with a different event, but there is exactly one call to wait in the entire server program. The event handler programs never call wait or do anything that would cause a wait on their own.)
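
    (For concreteness, here is a minimal sketch in C of what such a loop looks like on UNIX, with select() as the single call to wait. The descriptor names and handler procedures are invented for illustration; this is not Flash's actual code.

          /* Sketch of a single-wait event loop; names are hypothetical. */
          #include <sys/select.h>

          extern int listen_fd;    /* new network requests arrive here */
          extern int helper_fd;    /* helper processes reply here */
          extern void handle_new_connection(void);
          extern void handle_helper_reply(void);

          void event_loop(void)
          {
              for (;;) {
                  fd_set readable;
                  FD_ZERO(&readable);
                  FD_SET(listen_fd, &readable);
                  FD_SET(helper_fd, &readable);
                  int maxfd = listen_fd > helper_fd ? listen_fd : helper_fd;

                  /* The ONE call to wait in the entire server. */
                  select(maxfd + 1, &readable, NULL, NULL, NULL);

                  /* Figure out which event happened and dispatch.
                     Handlers must return quickly and never block. */
                  if (FD_ISSET(listen_fd, &readable))
                      handle_new_connection();
                  if (FD_ISSET(helper_fd, &readable))
                      handle_helper_reply();
              }
          }

    Note that select() reports readiness of sockets and pipes; as question 15 explains, reads of disk files can still block, which is exactly the hole the helper processes are there to plug.)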

  13. How does this architecture relate to threads?

    (It is a kind of specialized and restricted thread package that can be implemented entirely by the application with no support from the kernel. It is specialized in that the central coordinator knows about all possible events and handlers, in contrast with the thread manager, which is unaware of what the threads are doing.)

  14. In what way is it "restricted"?

    (This is significant; the paper never mentions it. Don't try AMPED or SPED on a Macintosh OS X server, or any other machine with multiple processors, because the extra processors won't help. The event-driven architecture requires that only one processor be allowed to run the server thread. The paper mentions that a server with the MT architecture would require interlocks. Event-driven architectures avoid interlocks, but at the price of not allowing multiple processors. The paper should have mentioned this restriction in section 4.2, where it compares architectures, but it doesn't. More on this when we come to 4.2.)

  15. What does it mean in section 3.3, paragraph 4, when it says that "non-blocking read and write operations...may actually block when used on disk files."?

    (A glitch in UNIX. The designer of the read and write operations thought that they should be real procedure calls, which don't return till the operation is complete.)

  16. But then it says that there are alternate calls that are asynchronous. Why not use those?

    (More glitches in UNIX. Remember that the event-driven architecture has just a single call to the kernel wait entry. For that to work, you want a wait that will return if any event occurs: a disk operation completes, a parallel process writes something to a pipe, a new network request arrives. But UNIX doesn't provide that kind of wait. And the following sentence suggests that you can't easily program around it.)

  17. What are mmap and mincore all about?

    (An alternative way of reading files if you have a virtual memory with demand paging. The idea is that instead of saying "open(...), read(...), write(...), close(...)" the programmer says "p = mmap(...)". The operating system returns as the value p an address in virtual memory where the file can be found. It isn't in physical memory yet, but if the program touches that part of virtual memory, the demand paging system will bring it in, one page at a time. So the programmer can read the file as if it were a variable of the program, using loads and stores just as with any other data in RAM. mincore is a related function; if you give it a pointer to an address in virtual memory, it tells you whether that page of VM is in real memory.)
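
    (A sketch of the two calls in C, with error handling elided; the checksum loop is just an invented way of touching the file's bytes:

          /* Illustrative only: reading a file through the VM system. */
          #include <sys/mman.h>
          #include <sys/stat.h>
          #include <fcntl.h>
          #include <unistd.h>

          long checksum_file(const char *name)
          {
              int fd = open(name, O_RDONLY);
              struct stat st;
              fstat(fd, &st);

              /* p now names the file's bytes in virtual memory;
                 nothing has actually come off the disk yet. */
              char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

              /* Ask which of those pages are in real memory.  (mincore
                 actually reports on a whole range at once, one byte of
                 the vector per page.) */
              long pagesize = sysconf(_SC_PAGESIZE);
              size_t npages = (st.st_size + pagesize - 1) / pagesize;
              unsigned char resident[npages];
              mincore(p, st.st_size, resident);

              /* Touching a page that isn't resident faults it in from
                 the disk (a hidden wait, which is the problem for an
                 event-driven server). */
              long sum = 0;
              for (off_t i = 0; i < st.st_size; i++)
                  sum += p[i];

              munmap(p, st.st_size);
              close(fd);
              return sum;
          }
    )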

  18. What is the attraction of mapping files into VM?

    (First, performance. If done correctly, it can avoid one copy of the data. Usually the kernel "read" interface reads the data into a kernel buffer, then copies it into the place where the user really wanted it. With mapping, the data can go directly from the disk into an unused block of RAM, and that block can be handed over to the user by adjusting one page map entry. Second, the application code is usually much simpler. The program can read or write the data using named program variables.)

  19. What is the down side?

    (For the web server application it conceals the disk interface a little too well. You can't tell what is actually in physical memory. The mincore call tries to work around this problem, but it provides only a hint, since the page removal algorithm may have removed the page by the time you try to read it. And it pretty much eliminates any hope you might have had of controlling when your thread calls wait, because the calls to wait for disk activity are buried inside the demand paging manager.)

  20. In 4.2 it says "The MT model uses a single cache, but the data accesses/updates must be coordinated...Both AMPED and SPED can use a single cache without synchronization". How is it that AMPED and SPED manage to avoid the need for synchronization?

    (Because an event-driven system is restricted to a single processor. In effect, this sentence turns the restriction on its head and claims it as a feature. That doesn't make event-driven less useful, and given the set of constraints they were working under, which included running on a single-processor machine, event-driven is a reasonable engineering choice.)

  21. What is pathname translation caching?

    (The explanation in the paper is quite obscure, and on top of that we haven't discussed naming yet, so it is unlikely that very many students figured this one out. Most URL's contain, among other things, the name of a file. But in many cases that name is not identical to the local file system name for that file. After performing the translation, this cache holds the result, so that it doesn't have to be figured out again. The cache is probably important only for relative URL's, but the concepts needed to discuss that idea won't be available till just before spring vacation.)
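
    (To make the mechanism concrete: the cache is just a table mapping the URL's pathname to the file system's name for the file. A hypothetical sketch in C, with a fixed-size table, no locking, and a crude overwrite-on-collision policy; translate() stands in for the real, expensive translation step:

          #include <stdlib.h>
          #include <string.h>

          #define NSLOTS 1024
          static struct { char *url; char *local; } slot[NSLOTS];

          /* e.g. "/~alice/x.html" -> "/home/alice/public_html/x.html" */
          extern char *translate(const char *url);

          static unsigned hash(const char *s)
          {
              unsigned h = 5381;
              while (*s) h = h * 33 + (unsigned char) *s++;
              return h % NSLOTS;
          }

          char *lookup_local_name(const char *url)
          {
              unsigned h = hash(url);
              if (slot[h].url != NULL && strcmp(slot[h].url, url) == 0)
                  return slot[h].local;          /* hit: skip translation */
              free(slot[h].url);
              free(slot[h].local);
              slot[h].url = strdup(url);         /* miss: do it, remember it */
              slot[h].local = strdup(translate(url));
              return slot[h].local;
          }
    )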

  22. Near the end of section 5.5 the paper refers to "32-byte cache lines". Which of the three caches (5.2 pathname translation, 5.3 response header, and 5.4 mapped files) is this referring to?

    (None of the above. This is a reference to some processor cache. Since it is in the middle of a discussion of application caches, the paper should have been more explicit.)

  23. Why does it say in section 5.4 'We use LRU to approximate the "clock" page replacement algorithm.'

    (This is really mysterious at first. The problem is that because they are using the mapping interface to read files, they can't tell what pages are actually in physical memory. They know the kernel is using a clock algorithm to decide which pages to remove, and they are trying to guess which pages that algorithm has removed. Since clock is a rough approximation of LRU, they do an exact LRU and hope that it predicts the same page removals as the kernel's clock. This is probably the creakiest part of the design.)
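
    (To make "exact LRU" concrete: the standard bookkeeping is a doubly linked list of the cached chunks, ordered by recency of use. A hypothetical sketch in C, not Flash's actual code:

          /* LRU bookkeeping: head.next is most recently used,
             head.prev is least recently used. */
          #include <stddef.h>

          struct chunk {
              struct chunk *prev, *next;
              /* ...bookkeeping for one cached file chunk... */
          };

          static struct chunk head = { &head, &head };   /* empty list */

          void lru_add(struct chunk *c)     /* call when first cached */
          {
              c->next = head.next;  c->prev = &head;
              head.next->prev = c;  head.next = c;
          }

          void lru_touch(struct chunk *c)   /* call on every later use */
          {
              c->prev->next = c->next;      /* unlink... */
              c->next->prev = c->prev;
              lru_add(c);                   /* ...and move to the front */
          }

          struct chunk *lru_victim(void)    /* the guess at what the
                                               kernel's clock took first */
          {
              return head.prev == &head ? NULL : head.prev;
          }
    )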

  24. Back in section 4.2 it said something about cache memory competing with the filesystem cache for physical memory. What is the problem here?

    (Both this and the LRU approximation are examples of too many cooks in the kitchen. We have a good opportunity for a chapter 4 end-to-end argument here: the application knows exactly what is worth caching, but the operating system is trying to second-guess the application. This comes up both in the existence of the file cache and also in the management of virtual memory pages.)

  25. What are the main messages that one can read out of figures 6 through 12?
    (
    1. Apache is pretty clunky under all conditions. If you need performance, the other implementations are all better.
    2. Buy lots of RAM. If the data fits in RAM, the architecture doesn't matter that much. The first sentence of section 6.2 points out that with large RAM, the data of figure 6, which shows almost no difference among architectures, applies. But still avoid Apache.
    3. If you need the last ounce of performance, MT + RAM is the way to do it under almost every condition they tested. It is simpler than AMPED, the performance is the same or better, and it allows multiple processors to be applied to the job.
    )

  26. [suggested by Hari Balakrishnan] Why is the "Web server" design so different in feel from the "X server" design? Aren't they all just "servers" doing similar things, dealing with clients that might be on the other side of a network?

    (The short answer is that X optimizes interactive response latency (the time to complete an individual request), while Flash (and other such Web servers) optimizes system throughput, the total number of requests handled per second. As a result, X worries a lot about how to handle network round-trip times of tens of milliseconds and uses pipelining on a per-client basis, to avoid overall latencies that are too much worse than the underlying network latency for any given request.

    In contrast, Flash doesn't really worry about bounding the latency of any *given* request. Instead, it worries about how many it can serve overall in unit time, and while it leverages caching to reduce response time, what it's really doing is increasing the *capacity* of the system compared to the case when blocked file I/O events occur.

    Each of these optimizations is appropriate for the domain in which the design operates.)

  27. Give a specific example of where Flash exchanges larger latency for higher throughput.

    (Rather than directly posting a disk I/O, the Flash server stuffs the I/O request into a pipe to another process, so the client must experience an extra delay waiting for that other process to get to run. And there is a similar scheduling delay on signalling the completion of the I/O. But in return for this delay, the Flash server can start working on another incoming request.)
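
    (A hypothetical sketch in C of the hand-off; the message format and procedure names are invented, and real code would check for errors and short writes:

          #include <unistd.h>

          struct io_request { int file_id; };   /* invented message format */

          int to_helper[2];     /* created with pipe() at startup */
          int from_helper[2];   /* from_helper[0] is what the main
                                   loop's single wait watches */

          /* Main process: post the request and return immediately. */
          void post_disk_read(int file_id)
          {
              struct io_request r = { file_id };
              write(to_helper[1], &r, sizeof r);
              /* now go serve other connections */
          }

          /* Helper process: do the slow part, then announce completion. */
          void helper_main(void)
          {
              struct io_request r;
              while (read(to_helper[0], &r, sizeof r) == sizeof r) {
                  /* ...bring the file into the cache here; this step
                     may block on the disk, but only the helper stalls,
                     not the main server... */
                  write(from_helper[1], &r, sizeof r);
              }
          }

    Each write and read of a pipe costs a kernel crossing and a scheduling delay, which is the extra latency; the payoff is that the main process never blocks.)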

  28. So what is AMPED good for, after all this?

    (As a temporary workaround if (a) you can't afford enough RAM and (b) your UNIX doesn't support multiple threads in a single address space. After all is said and done, this paper is about engineering under constraints. For the constraints these authors set out to meet, AMPED does the job better than anything else.)

  29. Are there any other applications, perhaps less temporary?

    ([suggested by Hari Balakrishnan] Yes. Consider a streaming media server that provides access to, for example, a movie archive. In this case, the wait until you can afford enough RAM to cache the entire archive is going to be more than temporary. So your choices are pretty much AMPED or finding an operating system that properly supports multiple threads.)


Comments and suggestions: Saltzer@mit.edu