6.033--Computer System Engineering

Suggestions for classroom discussion


Topic: Gregory R. Ganger and M. Frans Kaashoek. Embedded inodes and explicit grouping: Exploiting disk bandwidth for small files. Proceedings of the Usenix Association 1997 Annual Technical Conference, pages 1-17.

By J. H. Saltzer, April 9, 1997, with much help from Greg Ganger.


N.B.: This paper is new to 6.033 in 1997, so there is no experience yet using it in a recitation. The following material is assembled from a series of e-mail messages exchanged with Greg in January, when we were discussing whether or not to try this paper this year.

This paper is generally quite accessible with the terminology and concepts that will be available late in the term; it is not nearly as dense as some of the file system papers we have tossed out, and it is focused on one topic. Probably the only serious hitch is that it is written assuming a fair amount of familiarity with FFS, which the class doesn't know and which, while mentioned in Tanenbaum, isn't treated there in enough depth to understand this paper.

The paper also lays the groundwork for lots of interesting discussions about how to design file systems. The high-level picture (not articulated in the paper) is that we have a disk with particular physical properties (input number 1), and we have an application that, left to its own devices, would generate particular access patterns (input number 2). The exercise for the file system designer is to provide a mapping of data to the disk that gets the highest performance for the two given inputs.

Unfortunately no one knows how to do this mapping in a systematic way--or even how to characterize the access pattern input. So our approach is ad hoc. That isn't necessarily bad, but it is unsatisfying.

  • There is a plausible-sounding, but somewhat off-the-wall, sentence in section 2.1, paragraph 3, "The dominating performance characteristic of modern disk drives is that the per-request cost is much larger than the per-byte cost." What this sentence means to say is that for a typical disk transfer requested by a standard-design file system, the setup time is much larger than the data transmission time. Not all modern file systems have this problem: My Macintosh reads and writes 9 Gbyte disks in 140 Kbyte blocks, so for that case the transmission time dominates the setup time.
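
    To make the size effect concrete, here is a small back-of-the-envelope calculation in C. The disk numbers (12 ms of setup, a 64 Kbyte track read in one 8 ms rotation) are illustrative assumptions in the spirit of the paper's Table 1, not figures taken from it.

        #include <stdio.h>

        int main(void) {
            /* Illustrative mid-1990s-style disk parameters (assumptions,
               not the paper's measurements). */
            double setup_ms = 8.0 + 8.0 / 2.0;          /* avg seek + avg rotational delay */
            double bytes_per_ms = 64.0 * 1024.0 / 8.0;  /* 64 KB track per 8 ms rotation */

            /* Compare a sector, an FFS-style block, and a 140 KB block. */
            double sizes[] = { 512.0, 4096.0, 140.0 * 1024.0 };
            for (int i = 0; i < 3; i++) {
                double xfer_ms = sizes[i] / bytes_per_ms;
                printf("%7.0f bytes: setup %.1f ms, transfer %.2f ms (setup/transfer = %.1f)\n",
                       sizes[i], setup_ms, xfer_ms, setup_ms / xfer_ms);
            }
            return 0;
        }

    With these assumptions the per-request cost swamps the per-byte cost for small blocks, and only at roughly track size (the Macintosh example above) does the transmission time take over.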

  • The paper comes across as a "We found another way to improve on UNIX" paper, rather than as contributing fundamental insights into how you ought to design file systems to take best advantage of current technology. Why? (The intended audience: This paper was written for a USENIX conference. If it had been written for SOSP, then it would have been more general--or not accepted. So there is an extra challenge for the reader of extracting the general concepts from the specific.)

  • There is a possibility that IBM has explored this kind of stuff in great detail. One might expect that Jim Gray would take one look at this paper and say, "Oh, yes, this is the same idea they used in KICS-2 Fastpath back in 1979" or something similar. IBM has had a cast of thousands in San Jose working for 30 years trying to figure out how to lay out data to minimize seeks and maximize application performance. A lot of that stuff got reported in the IBM Systems Journal under the topic labeled "Data base management systems".

  • The fundamental message hiding in the disk specs in Table 1, but never explicitly stated, is that it takes an average of 8 ms to seek to a track, and the rotational delay to get just one bit from that track averages 4 ms. So reading the minimum amount of data requires a typical latency investment of 12 ms. For just another 4 ms of investment you can read the whole track. Conclusion: having invested the seek time, you might as well go ahead and read the whole track on the chance that the data might be useful. Reading more than one track is where the total service time curve starts to climb steeply again, so it is probably best to stop with one. This paper picks up on this property of disks by arranging things to increase the probability that the whole track will be useful.
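
    The same numbers, as a short calculation (again assumptions in the spirit of Table 1: 8 ms average seek, 8 ms per rotation so 4 ms average rotational delay, and a 64 Kbyte track):

        #include <stdio.h>

        int main(void) {
            double seek_ms = 8.0;        /* average seek (assumed) */
            double rotation_ms = 8.0;    /* one full rotation (assumed) */
            double track_kb = 64.0;      /* assumed track size */

            /* One sector: seek, then wait half a rotation on average,
               then a negligible transfer. */
            double one_sector_ms = seek_ms + rotation_ms / 2.0;

            /* Whole track, read zero-latency style: seek, then exactly
               one rotation of transfer with no positional wait. */
            double whole_track_ms = seek_ms + rotation_ms;

            printf("one sector : %4.1f ms for 0.5 KB  (%5.2f KB/ms)\n",
                   one_sector_ms, 0.5 / one_sector_ms);
            printf("whole track: %4.1f ms for %3.0f KB  (%5.2f KB/ms)\n",
                   whole_track_ms, track_kb, track_kb / whole_track_ms);
            printf("marginal cost of the rest of the track: %.1f ms\n",
                   whole_track_ms - one_sector_ms);
            return 0;
        }

    The marginal 4 ms buys two orders of magnitude more data, which is the return-on-investment argument in the text.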

  • Another requirement for this trick to work: RAM space to buffer the entire track must be cheap, too. How big is a track, anyway, and what will it cost to store it in RAM? (The numbers in Table 1 almost seem to be trying to hide the answer. Assuming a "sector" is 512 bytes--as suggested on page 2--a track seems to consist of between 64K and 128K bytes in the examples cited. RAM is running about $6/Mbyte right now, so one must devote roughly 40 to 75 cents worth of RAM to buffer an entire track.)
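
    Spelled out (the $6/Mbyte price is from the text; the 64K-128K track sizes are the ones inferred above):

        #include <stdio.h>

        int main(void) {
            double dollars_per_mbyte = 6.0;        /* 1997 RAM price from the text */
            double track_kb[] = { 64.0, 128.0 };   /* inferred track-size range */
            for (int i = 0; i < 2; i++)
                printf("%3.0f KB track buffer costs about %.0f cents of RAM\n",
                       track_kb[i], track_kb[i] / 1024.0 * dollars_per_mbyte * 100.0);
            return 0;
        }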

  • This point can be illustrated by turning the curves of figure 2 on their side and looking at them in a mirror so that the service time is the X-coordinate and the amount of data returned is the Y-coordinate. It is then much more apparent that, at least to start, a small additional investment gets a large yield in volume of data read.

  • The curve is smooth and it continues rising, though more slowly, even after one track. But there is a special case surrounding reading exactly one track: the opportunity to do what is called a "zero-latency read" (or write). The idea is that if you have a RAM buffer large enough for the whole track, and think of that buffer as circular, then you can begin reading or writing as soon as the seek has completed--there is no need to wait for any special rotation position to begin data transfer. (Actually, you can't start just anywhere; you must start at the next sector position that comes by, but a sector start is guaranteed to come by in less than 1% of a rotation time.) Zero-latency reading gives us a completely flat segment in the return-on-investment curve (between half a track and a full track). Linear programming tells us that it will be better to operate at one end of the flat segment; intuition tells us it is the full-track end. (A sketch of the bookkeeping behind a zero-latency read appears at the end of this bullet.) Greg also points out that,

    ...this efficiency will definitely come with a considerable complexity cost due to information-hiding disk interfaces (e.g., SCSI) and modern data layout techniques (e.g., zoned-bit recording).

    Once you have that picture in mind, then the design question for the file system designer is "OK, the latency tradeoff for the disk is at its most effective point when it delivers track-sized blocks. How can I best organize my file system to take advantage of that?" The answer should change with technology, as the size of a track increases. When Multics and UNIX were designed (1960's), we were reading track-sized blocks--but they were only 500 to 1000 bytes long. Tracks are now 100 times larger, but many file systems still read blocks whose sizes were chosen in the 1960's--they haven't been redesigned. So now it is time to regroup. (FFS represents a redesign at the 10X point. LFS is an alternative redesign at the 100X point, but it only managed to group things so that writes are large. C-FFS is a different redesign that tries to group things so that both reads and writes are large.)
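
    Here, as promised, is a minimal sketch in C of the bookkeeping behind a zero-latency read. The disk is simulated by an in-memory array, and the sector count and size are invented; a real implementation would have to fight the SCSI-style information hiding and zoned-bit recording that Greg mentions.

        #include <stdio.h>
        #include <string.h>

        #define SECTORS_PER_TRACK 128     /* assumed geometry */
        #define SECTOR_BYTES      512

        static char track_on_disk[SECTORS_PER_TRACK][SECTOR_BYTES];
        static char track_buffer[SECTORS_PER_TRACK][SECTOR_BYTES];

        /* Start transferring at whatever sector happens to be under the
           head after the seek, and drop each sector into its proper slot
           in the (conceptually circular) buffer.  After one rotation the
           buffer holds the whole track in logical order. */
        static void zero_latency_read(int first_sector_under_head) {
            for (int i = 0; i < SECTORS_PER_TRACK; i++) {
                int s = (first_sector_under_head + i) % SECTORS_PER_TRACK;
                memcpy(track_buffer[s], track_on_disk[s], SECTOR_BYTES);
            }
        }

        int main(void) {
            for (int s = 0; s < SECTORS_PER_TRACK; s++)
                snprintf(track_on_disk[s], SECTOR_BYTES, "data for sector %d", s);

            zero_latency_read(93);    /* head happened to land at sector 93 */

            printf("%s ... %s\n",
                   track_buffer[0], track_buffer[SECTORS_PER_TRACK - 1]);
            return 0;
        }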

  • It may provide some insight to decompose the high-level mapping design process into two steps:
    1. Disk bandwidth is scarce, and certain parameter settings make better use of it than others.
    2. For those good parameter settings, how can we cleverly organize the data or the file system to take advantage?
    Step one leads to the conclusion that transmitting large blocks is good. Step two tells you to organize things so that you will find lots of the stuff you needed in those large blocks. That way, if the technology changes and different parameter settings become preferable, you can still follow the two-step procedure. The two-step procedure doesn't guarantee that you get an optimum, but it may be better than being completely ad hoc.
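
    As a skeleton, the two steps might look like this in C; the disk parameters and the 50%-of-peak-bandwidth target are illustrative assumptions, and step two is only named, not implemented.

        #include <stdio.h>

        /* Step 1: from a simple disk model, find the smallest transfer size
           that achieves a target fraction of the raw media bandwidth. */
        static double efficiency(double bytes, double setup_ms, double bytes_per_ms) {
            double total_ms = setup_ms + bytes / bytes_per_ms;
            return (bytes / total_ms) / bytes_per_ms;   /* fraction of peak */
        }

        int main(void) {
            double setup_ms = 12.0;       /* assumed: avg seek + avg rotational delay */
            double bytes_per_ms = 8192.0; /* assumed: 64 KB track per 8 ms rotation */
            double target = 0.5;          /* assumed: half of peak bandwidth */

            double size = 512.0;
            while (efficiency(size, setup_ms, bytes_per_ms) < target)
                size *= 2.0;
            printf("step 1: transfer at least %.0f KB per request\n", size / 1024.0);

            /* Step 2 (policy, not shown): lay out data -- embedded inodes,
               grouped small files, files next to their directories -- so that
               a request of that size usually comes back full of useful bytes. */
            return 0;
        }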

  • Putting the inode in the directory entry is the way the Multics file system did it in 1965. UNIX, in 1970, demonstrated an advance over Multics: there is a good reason for thinking of the inode as a logically separate thing. For simplicity, UNIX implemented the inode as a physically separate thing. The C-FFS redesign maintains the logical separation while eliminating the physical separation, at some cost in complexity.
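
    A pair of C struct sketches may make the logical-versus-physical distinction concrete. The field names and sizes are invented for illustration; they do not reproduce the actual UNIX or C-FFS on-disk formats.

        #include <stdint.h>
        #include <stdio.h>

        /* Classic UNIX: the directory entry holds only a name and an inode
           number; the inode lives in a separate table elsewhere on disk, so
           opening a small file typically costs two widely separated reads. */
        struct unix_style_dirent {
            uint32_t inode_number;        /* index into the separate inode table */
            char     name[60];
        };

        /* C-FFS-style embedding: the inode's contents travel with the
           directory entry, so one directory read also delivers the file's
           metadata.  The inode is still a logically distinct object; only
           its physical placement has changed (cases such as hard links need
           extra machinery, which is part of the complexity cost). */
        struct embedded_inode {
            uint32_t size;
            uint32_t mode;
            uint32_t block_pointers[8];   /* direct pointers, enough for a small file */
        };

        struct cffs_style_dirent {
            struct embedded_inode inode;  /* embedded, not referenced by number */
            char   name[60];
        };

        int main(void) {
            printf("UNIX-style entry: %zu bytes; embedded-inode entry: %zu bytes\n",
                   sizeof(struct unix_style_dirent), sizeof(struct cffs_style_dirent));
            return 0;
        }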

  • If you were told that you must always read and write blocks of 30 to 200 Kbytes at a time, and that the block size you are required to use will depend on the particular disk being used, how would you design the file system? Embedding inodes in the directory, grouping small files that you think are related, and placing small files next to their directories would come to mind, along with other strategies, such as using (track, offset, length) triples as the way of locating small files, and perhaps taking the opportunity to garbage collect or reorganize the contents of a track any time it is in RAM. The possibility of designing a corresponding high-performance segmented one-level store, as in Multics and AIX, is also quite promising. Whether or not any of these ideas is useful, this observation may be the real value of the paper.
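
    As one illustration of where that question leads, here is a sketch of locating small files by (track, offset, length) triples and packing a directory's small files into a single track-sized extent. The types, names, and geometry are invented; this is not the paper's design.

        #include <stdint.h>
        #include <stdio.h>

        #define TRACK_BYTES (64 * 1024)   /* assumed track size */

        /* Locate a small file by saying exactly where its bytes sit. */
        struct extent {
            uint32_t track;
            uint32_t offset;    /* byte offset within the track */
            uint32_t length;    /* file length in bytes */
        };

        /* Toy allocator: pack a directory's small files into one track so
           that a single track-sized read brings them all into RAM. */
        struct track_packer {
            uint32_t track;
            uint32_t next_free;
        };

        static int pack_small_file(struct track_packer *p, uint32_t length,
                                   struct extent *out) {
            if (p->next_free + length > TRACK_BYTES)
                return -1;               /* caller must start another track */
            out->track  = p->track;
            out->offset = p->next_free;
            out->length = length;
            p->next_free += length;
            return 0;
        }

        int main(void) {
            struct track_packer packer = { .track = 7, .next_free = 0 };
            uint32_t sizes[] = { 3000, 12000, 800 };   /* three small files in one directory */
            for (int i = 0; i < 3; i++) {
                struct extent e;
                if (pack_small_file(&packer, sizes[i], &e) == 0)
                    printf("file %d -> (track %u, offset %u, length %u)\n",
                           i, e.track, e.offset, e.length);
            }
            return 0;
        }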

  • The primary colocation idea tried out here is that if files are near each other in the directory hierarchy they may be read at about the same time. What other colocation ideas might be worth trying? What are the requirements on any colocation strategy? (Should be simple to implement. Should be easy for the customer to understand. Shouldn't badly degrade performance for applications that don't happen to follow the hoped-for pattern.)

Comments and suggestions: Saltzer@mit.edu