6.033--Computer System Engineering

Suggestions for classroom discussion of:

Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. ACM Thirteenth Symposium on Operating Systems Principles, in Operating Systems Review 25, 5 (December, 1991) pages 1-15. Also appeared in ACM Transactions on Computer Systems 10, 1 (February, 1992) pages 26-52.

by J. H. Saltzer, April 17, 1996, with suggestions from Liuba Shrira. Several suggestions of Steve Ward added, April 1998. Minor updates, April 2001, April 2003, and April 2004.


In order to talk about LFS, it is important first to review UNIX inodes and how they separate naming from file management. LFS also uses inodes, but it manages them a bit differently, so first you have to know how they are managed in a regular UNIX file system. A good exercise is to look up the file named /etc/passwd in an ordinary UNIX file system. The key idea is that each disk partition that contains a file system begins with a fixed-size array containing inodes. Usually, about 10% of the disk capacity is invested in this array. Appendix 5-A of the class notes describes how the lookup actually works, at the end of subsection 2.
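
Here is a minimal sketch of that lookup in Python. The structures and names are invented for illustration (the real on-disk formats differ), but the essential chain is the same: directory data maps names to inode numbers, and the inode array maps inode numbers to the blocks holding the data.

    # Invented, simplified structures for illustration only.
    # The inode array sits at a fixed place on the disk; each inode lists the
    # block numbers that hold the corresponding file's data.
    inodes = {
        2:  {"type": "dir",  "blocks": [100]},        # "/" (root inode number fixed by convention)
        11: {"type": "dir",  "blocks": [200]},        # "/etc"
        79: {"type": "file", "blocks": [300, 301]},   # "/etc/passwd"
    }

    # A directory's data block maps names to inode numbers.
    disk_blocks = {
        100: {"etc": 11},
        200: {"passwd": 79},
        300: b"root:x:0:0:...",
        301: b"...more password-file entries...",
    }

    def lookup(path, root_inumber=2):
        """Resolve a pathname to an inode number by walking one directory per component."""
        inumber = root_inumber
        for component in path.strip("/").split("/"):
            inode = inodes[inumber]                        # index into the inode array
            entries = disk_blocks[inode["blocks"][0]]      # read the directory's data block
            inumber = entries[component]                   # name -> inode number
        return inumber

    print(lookup("/etc/passwd"))                           # -> 79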

This paper was written 10 years ago. Is the first sentence still applicable? (Yes; if anything, things have gotten worse. Seek and disk rotation times have hardly changed since 1992, but CPUs are something like 40 times faster. 50 MHz CPUs are now 2 GHz CPUs.)

How about the first sentence of the second paragraph, about memory sizes? (Yes, memories have gotten cheaper, and have grown in size by a similar factor. A desktop computer with a Gigabyte or more of RAM is commonplace today.)

What's the real motivation behind LFS?

(With large RAM caches absorbing most of the reads, disk traffic becomes dominated by writes, and in a conventional file system those writes are dominated by seeks. LFS exploits this cache-driven shift in the read/write ratio by eliminating seeks on most writes.)

LFS is best understood by considering the log aspects independently of the garbage-collection aspects. First, assume an infinite disk. Show how to find inodes, how to maintain an end-of-log pointer, how to write a new file followed by its inode, and how to construct an inode map in RAM. Also, show how to recover from a system crash in which RAM is lost, first without, and then with the help of, checkpoints.
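
Before turning to the questions, a toy model can make the board exercise concrete. The sketch below (Python; all names are invented, and segments and crash recovery are ignored for the moment) shows the three essentials: writes are appended at the end-of-log pointer, a fresh copy of the inode is written immediately after the file's blocks, and an inode map kept in RAM records where the latest copy of each inode lives.

    # Toy model of LFS writing, assuming an infinite disk (a growing list).
    log = []          # the "infinite disk": one entry per block
    inode_map = {}    # in RAM: inode number -> log offset of that inode's latest copy

    def write_file(inumber, data_blocks):
        """Append the file's blocks to the log, then a new inode describing them."""
        start = len(log)                                  # current end-of-log pointer
        for block in data_blocks:
            log.append(("data", inumber, block))
        # The new inode records where this version's blocks now live.
        log.append(("inode", inumber, list(range(start, start + len(data_blocks)))))
        inode_map[inumber] = len(log) - 1                 # remember the inode's new location

    def read_file(inumber):
        """Use the inode map to find the latest inode, then fetch the blocks it names."""
        _, _, block_offsets = log[inode_map[inumber]]
        return [log[off][2] for off in block_offsets]

    write_file(79, ["old contents"])
    write_file(79, ["new contents"])    # a rewrite: the old copy simply becomes garbage
    print(read_file(79))                # -> ['new contents']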

  1. Chapter 8, section D, talks about logs. It says they have three uses. What are they? (1. Stability (holding a replica of cell storage). 2. Archive (remembering old values of data). 3. Making actions recoverable.)

  2. LFS comes up with a fourth purpose. What is it? (4. Enhancing performance.)

  3. Suppose we had available an infinite amount of disk. How would you organize the file system differently from the way it is done in UNIX? (Draw a strip across the board representing the disk as a linear storage device. To the left is stuff we have written in the past. To the right is empty space, never before used. In the middle is a pointer to the next place we can write. Respond to a write request by writing as many blocks of data as needed, in a contiguous region starting at the pointer. [N.B.: data is always written in sector-sized blocks, because a sector is the smallest unit of writing that the disk supports.])

  4. What is the main virtue of this contiguity? (At most, only one seek is needed, then we can write all of the data. The bigger the chunks we write, the fewer seeks we will have to perform.)

  5. Why "at most"? (Because if the last thing the disk did was write something, the seek arm is already in the right position. Remember that the big RAM buffer cache is absorbing most of the reads, so there is a good chance that a write will immediately follow another write.)

  6. But who writes large quantities of stuff all at one time? Most applications do disk writes in small chunks, as they go. (The buffer cache manager, whether or not the user asked for it. The buffer cache manager may actually write several files at once, or blocks from several different files.)

  7. But if I want to update a file, don't I have to seek back to where it is located and rewrite it? And what if it gets longer? (The basic idea of the log is that you never rewrite things, you always write a new version. So you never have to go back. Just write the new version of the changed data starting at the pointer. If it is bigger, no problem.)

  8. So writing is easy. But if I want to read a file, how do I find where the latest version is located? (The inode tells you where each block is.)

  9. Where does LFS store the inode? (Whenever it writes a batch of blocks for a file to the log, it writes a new, updated copy of the inode to the log immediately after the file blocks. That eliminates another seek at write time.)

  10. But how do I find the latest version of the inode? (This is the interesting part. Keep a table in RAM, called the inode map, of {inode number, disk offset} pairs. If a directory says that you need the file described by inode 79, look in the inode map for the 79th entry. If an inode is rewritten, just record the new location of the inode in the inode map.)

    [If section 4 of the paper was assigned, this is a good time to insert questions L1 through L11, on log-based crash recovery. If that section is assigned in a later reading, then these questions can be picked up at that time.]

  11. So why did LFS choose to use UNIX inodes as the data structure to describe a file? As long as they were changing things around, why not tinker with that structure too, for example to add access control lists? (One of the points was to minimize the work required to slip LFS in under UNIX. By retaining the inode structure, the entire UNIX file system above the disk layout level can be used without touching it.)

  12. The disk really isn't infinite. What happens when we run into the end? (Garbage collect, then move the pointer to the beginning of the free area.)

  13. What kind of garbage collector does LFS use? Mark-sweep? Stop-and-Copy? Why? (It is a copying garbage collector. Mark-sweep produces a chain of released blocks, which might be fine if we wanted to use just one block at a time. For this application, chaining non-contiguous blocks doesn't produce free areas that allow seekless writing. Copying produces contiguous free space for new data and it also provides the option of gathering together previously separated blocks of a live file to reduce seeks while reading.)
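
    [A small sketch of the copying cleaner may help here. Structures and names are invented; the real cleaner also updates inodes and the inode map to point at the moved blocks, which this fragment omits.]

        # Greedy copying cleaner: clean the least-utilized segments by copying their
        # live blocks to the end of the log, leaving the cleaned segments wholly free.
        def utilization(segment, is_live):
            return sum(1 for block in segment if is_live(block)) / len(segment) if segment else 0.0

        def clean(segments, is_live, log_append, n_to_clean=2):
            victims = sorted(segments, key=lambda seg: utilization(seg, is_live))[:n_to_clean]
            for seg in victims:
                for block in seg:
                    if is_live(block):
                        log_append(block)     # live data migrates to the log's end
                seg.clear()                   # the whole segment becomes contiguous free space
            return victims

        # Example: two half-dead segments; pretend even-numbered blocks are still live.
        segments, new_log = [[1, 2, 3, 4], [5, 6, 7, 8]], []
        clean(segments, is_live=lambda b: b % 2 == 0, log_append=new_log.append)
        print(segments, new_log)              # -> [[], []] [2, 4, 6, 8]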

  14. How can we identify which files and inodes are live? (Live inodes have pointers in the current inode map. Live blocks have pointers in current inodes. So in principle there is enough information.)

  15. LFS does something complicated called "segments". Why not just garbage collect the whole disk? (At garbage collection time you would have to copy every live file. Compacting the whole disk would take a very long time. In addition, some files are quite long-lived and it would be nice to take advantage of that property.)

  16. How does LFS avoid garbage collecting the whole disk? (By dividing it into units called segments, and garbage collecting a few at a time.)

  17. What idea dominates the determination of how big or small a segment should be? (The notion that the disk buffer manager normally writes a whole segment at once.)

  18. What does that notion tell us about the maximum size of a segment? (It ought to be smaller than the disk buffer cache. Quite a bit smaller, so that when the cache manager decides it needs space it can have accumulated enough dirty pages to fill up a segment. We also should consider the added latency effect on reads that are posted during one of these long writes. The read--which presumably some client is actually held up waiting for--can't be handled until this write is finished.)

  19. What does that notion tell us about the minimum size of a segment? (It ought to be large enough that the writing time is long compared with the seek time required to get to the beginning of the segment. That way we get to use most of the available bandwidth to the disk.)
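
    [A back-of-the-envelope calculation can make the minimum-size argument concrete. The numbers below are purely illustrative, not taken from the paper.]

        # Fraction of the disk's raw bandwidth delivered, given one seek per segment write.
        seek_time = 0.015                  # seconds of seek plus rotation to reach the segment
        bandwidth = 5e6                    # bytes/second of sequential transfer
        for segment_size in (64e3, 1e6, 8e6):
            transfer = segment_size / bandwidth
            fraction = transfer / (seek_time + transfer)
            print(f"{segment_size / 1e6:5.2f} Mbyte segment -> {fraction:.0%} of peak bandwidth")
        # Roughly: 64 Kbytes gets about half the bandwidth; 1 Mbyte over 90%; 8 Mbytes about 99%.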

  20. So LFS divides a 2 Gbyte disk into 2,000 one-Mbyte segments. Suppose we are asked to examine the contents of one of these segments. How can we figure out which blocks in that segment are live and which are dead? Where do we start? (Check the segment summary block. It says, for each block in the segment, which inode it is associated with, and whether the block represents the inode itself or a block of the associated file. The segment summary is written at the end of the segment as the last step in writing the segment.)
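
    [A sketch of the liveness check driven by the segment summary block, with invented structures. Each summary entry names the owning inode, says whether the block is the inode itself or a data block, and, for data, which block of the file it is.]

        def live_blocks(summary, segment_base, inode_map, read_inode):
            """Return the offsets within the segment of the blocks that are still live."""
            live = []
            for offset, (inumber, kind, file_block_no) in enumerate(summary):
                addr = segment_base + offset
                if inumber not in inode_map:
                    continue                               # the file no longer exists: dead
                if kind == "inode":
                    # An inode copy is live only if the inode map still points right here.
                    if inode_map[inumber] == addr:
                        live.append(offset)
                else:
                    # A data block is live only if the *current* inode still points at it.
                    inode = read_inode(inode_map[inumber])
                    if inode["blocks"].get(file_block_no) == addr:
                        live.append(offset)
            return live

        # Example: a 3-block segment at disk address 1000 holding a data block of inode 7,
        # inode 7 itself, and a data block of inode 9, whose file has since been deleted.
        inode_map = {7: 1001}
        inodes_on_disk = {1001: {"blocks": {0: 1000}}}
        summary = [(7, "data", 0), (7, "inode", None), (9, "data", 0)]
        print(live_blocks(summary, 1000, inode_map, lambda addr: inodes_on_disk[addr]))   # -> [0, 1]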

  21. What segment size would you choose today, with a 200 Gbyte disk? (Since RAM is probably also 100 times larger, with a proportionately larger RAM disk buffer, 2,000 segments of 100 Mbytes each seems plausible. But there is an incommensurate scaling concern: the data rate to the disk isn't 100 times faster than it was a decade ago, only about 10 times faster. So our 100X larger writes are going to tie up the disk channel for ten times as long, and that could be a problem for user threads waiting to read. About the best we can do is pray that the larger disk buffer reduces the frequency of reads enough to compensate. The average may be OK, but the variation in read time may be a problem.)
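
    [A quick check of that arithmetic, with illustrative rather than measured numbers:]

        # Segment size grows ~100x while disk bandwidth grows only ~10x, so each
        # segment write keeps the disk busy roughly ten times longer.
        old_busy = 1e6 / 5e6        # 1 Mbyte segment at ~5 Mbytes/s    -> 0.2 s per write
        new_busy = 100e6 / 50e6     # 100 Mbyte segment at ~50 Mbytes/s -> 2.0 s per write
        print(old_busy, new_busy)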

  22. Suppose the segment summary says that the block we are looking at is an inode. How can we tell if it is live? (Look in the inode map for the entry that corresponds to this inode number. If that entry points to this disk block, the inode is live. Otherwise the inode is dead and its space can be reclaimed.)

  23. Can we also reclaim the blocks of the file that immediately precede this inode? (Not without looking further!)

  24. Why not? (The reason a new inode was written to replace this old inode may be that the file was extended in length--or one block near the middle was rewritten. In either case the new inode, somewhere farther out in the log, may contain pointers back to these blocks of the file.)

  25. So in order to tell whether the blocks are live, we have to look at the current inode. That involves a seek to wherever the inode currently is. Is there any shortcut to avoid these seeks? (Unfortunately, if these blocks are live, the only way we are going to find out is by inspecting the current inode. The best hope is that a copy of the inode is still in an inode cache in RAM.)

  26. If the blocks are dead, there is a shortcut that helps avoid a seek in some cases. Suppose that the reason the inode was rewritten was that this file was actually deleted, and the inode has been reused for some other, completely different file. How can LFS avoid the seek in that case? (By keeping in the inode map a version number that is incremented each time an inode is reused for a completely new purpose. A copy of this version number also goes into the segment summary block. Now, if we are wondering about the liveness of a block, just compare the version number found in the segment summary block with the current version number in the inode map. If they are the same, we are going to have to look at the inode. But if they are different, we know this is a dead block and we can avoid the seek to the current inode.)
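
    [A sketch of the version-number shortcut, with invented structures. Both the inode map and the segment summary carry a version number that is incremented whenever an inode number is reused for a brand-new file.]

        def provably_dead_without_seek(summary_entry, inode_map):
            """True if in-RAM state alone proves the block is dead; False means we must read the inode."""
            inumber, version_when_written = summary_entry
            if inumber not in inode_map:
                return True                         # the file is gone entirely
            current_version, _inode_addr = inode_map[inumber]
            # If the inode number has been reused since this block was written,
            # everything written under the old version is dead; no seek is needed.
            return version_when_written != current_version

        inode_map = {79: (3, 123456)}               # inode 79 is on its 3rd incarnation
        print(provably_dead_without_seek((79, 2), inode_map))   # -> True: dead, skip the seek
        print(provably_dead_without_seek((79, 3), inode_map))   # -> False: must consult the inode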

  27. Where have we seen the same idea used before? (In NFS, where they called it a generation number.)

  28. (from Steve Ward) What's depicted in Figs 5, 6, and 10? (on pages 39 and 47) What is the significance of the labels on the vertical axis? [This is an example of poor presentation, and it's a good bet that most of your students are confused by them. These charts show the distribution of segment utilizations, and are histograms even though they look like x-vs-y graphs. The numbers on the vertical axis are thus pretty meaningless; they reflect the choice of bin size for the histograms and have no absolute significance.]

  29. (from Steve Ward) Why the discontinuity in Fig 5 at around 55%? [The "greedy cleaner" cleans segments with lowest utilization, leading to a natural "cleaning threshold". The clustering of segments in the distribution just above this point is identified by the authors as a performance problem.]

  30. (from Steve Ward) Why do the authors think this clustering is bad? [Because "cold" segments - ones holding long-lived data - tend to dominate this group. This thesis is not shown directly by the chart (again, a plausible presentation complaint) but inferred from the fact that clustering around the threshold is increased in simulations using their hot/cold locality model compared to simulations using uniform distribution of accesses.]

  31. (from Steve Ward) Why the dual thresholds in the cost-benefit curve of Fig 6? [This is an artifact of the authors' hot/cold model, in which 90% of accesses go to the "hot" 10% of files. Note that the arbitrary (and somewhat hokey) hot/cold distinction does not enter into the cost-benefit heuristic that dictates the cleaning decision, which is a quite reasonable continuous model. More realistic distributions are shown in Fig 10, which uses real data.]

  32. (from Steve Ward) The authors note the similarity of LFS to scavenging garbage collectors used, e.g., by the LISP community. They point out the difference that efficient random access is possible in the latter case, but not in LFS. Why? [This is a debatable point. This statement may presume that the GC'ed storage is in RAM. Of course, in interesting cases it occupies virtual memory, and is largely disk-based. The cost that sophisticated GCs are minimizing is thus disk access time, and primarily seek time -- much the same as LFS. I think the connection between LFS and temporal GC is even closer than the authors suggest.]

  33. (From Karen Sollins) I found it helpful to discuss why one would make one or another engineering choice, and what some of the problems were with the underlying assumptions of LFS. We are certainly in a different world from the one LFS was designed for. So I found these to be valuable issues:

    1. Is the assumption of locality of reads (only going to the cache) reasonable?
    2. Is the assumption about the mixture of files reasonable?
    3. Are the access patterns of the mixture of files reasonable?
    4. What happens to the model if you assume, as often happens now, that there are very good and large write caches? Suppose that, when the write cache is flushed, the writes are organized to optimize a single scan through the disk, so that once every 30 seconds or a minute or whatever, one scan from the beginning to the end of the disk makes all the necessary changes?
    5. How is all this affected if the machines are no longer timesharing systems, but single user workstations?
    6. What is the impact of the changing relationship between speeds (all 3 aspects of speed) of disks and processors?

    Finally (a nice lead-in to System R), it is nice to bring up the "ancient" quote from Jim Gray about why he would never put a file system between his DBS and the raw disk. He said that the number of disk accesses should be no more and no less than 3. Fewer than three implies a toy system (you need a nested naming scheme plus access to the file itself), and if the FS required more than 3, it was in his way because he could do it in 3. Since disk accesses were the major delays (at least until they got involved in the network, with R*), minimizing disk accesses was critical.

  34. If LFS is so great, why doesn't everyone use it?

    (From: Hari Balakrishnan
    Subject: Re: How widely used is LFS?

    For most real workloads, it isn't that much better than UNIX's Extent-based File System(s) (EFS) and its variants. Under some workloads, it's actually worse.

    http://nms.lcs.mit.edu/~hari/papers/efs_stuff.html has some info on comparisons from a project some other students and I worked on under Margo Seltzer's supervision many years ago.

    I know it exists in various BSD releases. I don't believe it's used much.

    --------------
    From: "Karen R. Sollins"
    Subject: Re: How widely used is LFS?

    My understanding is that they never really got the garbage collection part working well. It's really hard to tune those policies for usage models. As is frequently the case it was all or part of a thesis, so production code was not the objective.

    That said there are a number of journalling file systems. I believe one of the more interesting contributions to this is a new version of ReiserFS. I'm not an NT user, but my understanding is that there is also a journalling file system in a recent version of NT. In my section we also talked a little about the differences between journalling FS's and the purer form in LFS and some of the tradeoffs.

    --------------
    From: "Berthold K.P. Horn"
    Subject: Re: How widely used is LFS?

    I also suspect that all of these various efforts from ten years ago have to be reviewed in the light of modern day realities of much larger disk drives, wireless connections, larger memories, faster computers, and so on.

    --------------
    From: Frans Kaashoek
    Subject: Re: How widely used is LFS?

    Another factor reducing the need for LFS and journaling file systems is that another solution now exists that doesn't require a complete overhaul of the file system layout. Ganger's soft updates deals well with the problems identified in the LFS paper, but it doesn't require cleaners and doesn't require a new file system layout. The BSD operating systems now ship with soft updates, and that is what many sites use (e.g., we in PDOS do).

    See http://www.ece.cmu.edu/~ganger/smallfiles.html for more details. This page is slightly out of date, since the detailed study with WAL exists now.

    --------------)

    The following items are related to log-based crash recovery, and go well just before diving into garbage collection, if part 4 of the paper has been read.

    L1. This log stuff all sounds great as long as we don't crash and lose our volatile storage. What happens when RAM is reset? Don't we lose the inode table? (Yes, but it can be reconstructed. Scan the log from the beginning to rebuild the inode table. This scan is called roll-forward, and it requires that the inode blocks be identifiable.)

    L2. That will take forever. Can't we do better? (If we can recover the end-of-log pointer then we can scan backwards instead of forwards.)

    L3. How is that better? (Because we can stop scanning as soon as the inode table is completely filled in.)

    L4. Still looks like it could take a long time. Suppose the first file we wrote has never been rewritten. Remember the disk is infinite. How can we shortcut the need to scan everything previously written on the whole disk? (Every once in a while write the current inode map on the disk.)

    L5. This thing written occasionally has a special name. What is it? (A checkpoint.)

    L6. How does it help to have a checkpoint? (Now, if we lose RAM, we can recover the inode table by reading the most recent checkpoint and then rolling forward only through the part of the log written after that checkpoint, rather than scanning everything ever written.)
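
    [A sketch of that recovery, using the same toy log representation as the earlier fragment; the checkpoint format is invented.]

        def recover(log, checkpoint):
            """Rebuild the in-RAM inode map after a crash: start from the checkpoint's
            snapshot, then roll forward over only the log written since it was taken."""
            inode_map = dict(checkpoint["inode_map"])
            for offset in range(checkpoint["log_position"], len(log)):
                kind, inumber = log[offset][0], log[offset][1]
                if kind == "inode":
                    inode_map[inumber] = offset       # a newer copy of this inode
            return inode_map

        # Example: inode 79 was rewritten after the checkpoint was taken at log position 2.
        log = [("data", 79, "x"), ("inode", 79, [0]), ("data", 79, "y"), ("inode", 79, [2])]
        checkpoint = {"inode_map": {79: 1}, "log_position": 2}
        print(recover(log, checkpoint))               # -> {79: 3}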

    L7. How do you know you have come to the end of the log? If there was a crash, the last write may have been aborted in mid-stream and there may be some random collection of old and new data out there. (The paper is silent on this point except to say that after writing anything to the disk, LFS also writes a "summary block". This summary block is the same one that is used in segment cleaning. The summary block consists of flags that tell which of the blocks of the segment are data, which are inodes, and which are empty. Although the paper doesn't mention it, this block also contains a timestamp and a checksum of the data that was just written, so that a later reader can identify and ignore a summary block if there was a crash in mid-write. So the problem reduces to finding the latest summary block. Here is Frans Kaashoek's analysis:

    Here is my understanding. Updating the segment has two cases:
    
    1. The complete-segment write. It is easy: write out the data into the
    segment, followed by a write updating the segment summary block.
    
    2. The partial-segment write.  It has to be done carefully.
    
    Let's say you start with an empty segment, which has a segment summary
    block at its end. The summary block says that this segment is empty.
    It is in a known fixed location (at the end of the segment).
    
     It is easiest to understand partial writes by drawing a picture of
     the segment:
    
        --------------------------------------------------X
    
      The empty segment with the segment summary block (X), representing
      the end of the log.  X is at the end of the segment in a well-known
      location.
    
     A partial write first calculates what the end of the log will be,
     *backwards* from the current end of the log (i.e., when the segment is
     empty, it is the segment summary block at the end of the segment, in
     the well-known location).  At that position, the write writes a new
     segment summary block representing the end of the log; then it writes
     all the data, at which point the head is at the end of the segment, just
     before the segment summary block at the end of the segment.  Next it
     updates that segment summary block, including a pointer to the new
     end of the log.
    
     After this partial write, the segment looks like:
    
        -------------------------------YDDDDDDDDDDDDDDDDDDX'
    
      Y is the new end of the log (the first segment summary block
      written), D the data, followed by the updated segment summary block
      X, which contains among other things a pointer to Y.
    
    
     The next partial write does just the same thing.  It computes the new
     end of the log, backwards from the current end of the log (the
     first segment summary block written in the previous partial write).
     Then, it continues the same procedure as before.  Etc.
    
     After the second partial write, the segment looks like:
    
    
      ---------------------ZDDDDDDDDDDDY'DDDDDDDDDDDDDDDDDDX'
    
      Z is the new end of the log (the first segment summary block
      written), D the data of the partial write, followed by the updated
      segment summary block Y, which contains a pointer to Z.  Etc.
    
      Note that Y becomes the new end of the log only after X has been
      updated, which is the right commit point.
    
      Note that from the segment summary X (in the well-known location)
      one can find the end of the log.
    This design requires one disk seek for each partial segment write, to move the write head back to the place where writing should begin, and it leaves the seek head at the end of the prior summary block. Most other proposals seem to require two seeks per partial segment write.)

    L8. All we have to do now is figure out where the last checkpoint was. How? (When it writes a checkpoint in the log, LFS makes a list of the disk sector numbers into which it wrote. The paper calls this list a "checkpoint region". Then LFS writes the checkpoint region to a fixed, well-known location on the disk.)

    L9. What if we crash while writing the checkpoint region and leave a muddled mess in the fixed location? (We could do the semi-infinite scan to reconstruct the inode map. What LFS does instead is reserve two fixed locations on the disk. To write a checkpoint, read both regions, choose the one with the smaller timestamp, and overwrite it with the new checkpoint followed by a new timestamp. Then, if you have to recover from a crash, read both checkpoint regions and use the one with the larger timestamp.)

    L10. A crash while writing the timestamp could make it look bigger than it should be, couldn't it? (Sure, but since the timestamp is written last, the data giving the sector addresses of the checkpoint region have already been written safely, so it doesn't matter! This is an example of a recoverable put from Chapter 8. N.B.: it is a variant of the technique first used in the American Airlines SABRE reservation system back in 1961.)
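
    [A sketch of the two-region scheme, with invented names. Writing the timestamp last is what makes it a recoverable put.]

        # Two fixed checkpoint regions on the disk; the timestamp commits the write.
        regions = [{"time": 0, "data": None}, {"time": 0, "data": None}]

        def write_checkpoint(data, now):
            """Overwrite the region holding the *older* timestamp; stamp it last."""
            target = min(regions, key=lambda r: r["time"])
            target["data"] = data        # a crash here leaves the other region intact
            target["time"] = now         # writing the timestamp is the commit point

        def read_checkpoint():
            """After a crash, trust the region with the *newer* timestamp."""
            return max(regions, key=lambda r: r["time"])["data"]

        write_checkpoint({"inode_map_blocks": [10, 11]}, now=100)
        write_checkpoint({"inode_map_blocks": [42, 43]}, now=200)
        print(read_checkpoint())         # -> {'inode_map_blocks': [42, 43]}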

    L11. (From Steve Ward) The paper observes that LFS differs from transaction logs primarily in that the latter systems augment the logs with separate storage of the actual data. What justifies this difference? [Simple answers involve the different access patterns and overheads associated with variable vs. file access. However, it is plausible that an efficient write-ahead log could exploit an LFS-like architecture, accessing data directly from RAM-buffered log pages.]


    Comments and suggestions: Saltzer@mit.edu