Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. ACM Thirteenth Symposium on Operating Systems Principles, in Operating Systems Review 25, 5 (December, 1991) pages 1-15. Also appeared in ACM Transactions on Computer Systems 10, 1 (February, 1992) pages 26-52.
by J. H. Saltzer, April 17, 1996, with suggestions from Liuba Shrira. Several suggestions of Steve Ward added, April 1998. Minor updates, April 2001, April 2003, and April 2004.
In order to talk about LFS, it is important first to review UNIX inodes, and how they separate naming from file management. LFS also uses inodes but it manages them a bit differently, so first you have to know how they are managed in a regular UNIX file system. A good exercise is to look up the file named /etc/passwd assuming an ordinary UNIX file system. The key idea is that each disk partition that contains a file system begins with a fixed-size array containing inodes. Usually, about 10% of the disk capacity is invested in this array. Appendix 5-A of the class notes describes how the lookup actually works, at the end of subsection 2.
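Before getting to LFS, it may help to make the conventional lookup concrete. The following is a toy sketch in Python (a fabricated in-memory "disk"; the structure names, inode numbers, and directory contents are all invented, not the real on-disk format) of how resolving /etc/passwd walks from the root inode through one directory per pathname component.

    # Toy model of a UNIX-style lookup of /etc/passwd.
    # A directory inode's "data" maps names to inode numbers; a file
    # inode's "data" stands in for its list of data-block addresses.
    INODES = {
        2:  {"type": "dir",  "data": {"etc": 11, "usr": 12}},   # the root directory
        11: {"type": "dir",  "data": {"passwd": 57, "hosts": 58}},
        57: {"type": "file", "data": "root:x:0:0:..."},
    }
    ROOT_INODE = 2   # the root directory lives at a well-known inode number

    def namei(path):
        """Resolve a path to an inode, one directory level at a time."""
        inode = INODES[ROOT_INODE]
        for component in path.strip("/").split("/"):
            if inode["type"] != "dir":
                raise NotADirectoryError(component)
            inumber = inode["data"][component]  # directory entry: name -> inode number
            inode = INODES[inumber]             # in a real FS, another disk read or two
        return inode

    print(namei("/etc/passwd")["data"])

The point to carry forward is that the name-to-inode-number mapping (the directories) and the inode-number-to-file mapping (the fixed inode array) are separate structures; LFS keeps that separation but moves the inodes out of the fixed array and into the log.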
This paper was written more than ten years ago. Is the first sentence still applicable? (Yes, if anything things have gotten worse. Seek and disk rotation times have hardly changed since 1992, but CPUs are something like 40 times faster. 50 MHz CPUs are now 2 GHz CPUs.)
How about the first sentence of the second paragraph, about memory sizes? (Yes, memories have gotten cheaper, and have grown in size by a similar factor. A desktop computer with a Gigabyte or more of RAM is commonplace today.)
What's the real motivation behind LFS? (With large RAM caches absorbing
most reads, disk traffic becomes dominated by writes. LFS exploits this
cache-driven shift in the read/write ratio by batching writes into large
sequential log writes, eliminating seeks on most writes.)
LFS can be understood best by considering the log aspects independently from the garbage collection aspects. First, assume an infinite disk. Show how to find inodes, to maintain an end-of-log pointer, to write a new file followed by its inode, and construct an inode map in RAM. Also, show how to recover from a system crash in which RAM is lost, first without, and then with the help of, checkpoints.
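To make the thought experiment concrete, here is a minimal sketch in Python of the infinite-disk version. Everything in it (the list used as a disk, the tuple block format, the names write_file and read_file) is invented for illustration; it is not LFS's on-disk layout, just the idea that the disk is written strictly append-only while a RAM-resident inode map records where the latest copy of each inode landed.

    # Thought experiment: an infinite disk used purely as an append-only log.
    disk = []            # the infinite disk: a growing list of blocks
    inode_map = {}       # RAM only: inode number -> disk address of its latest inode
    end_of_log = 0       # RAM only: address of the next free block

    def append(block):
        """Write one block at the current end of the log; return its address."""
        global end_of_log
        disk.append(block)
        addr = end_of_log
        end_of_log += 1
        return addr

    def write_file(inumber, data_blocks):
        """Write the file's data, then its inode, then update the RAM inode map."""
        block_addrs = [append(("data", inumber, b)) for b in data_blocks]
        inode_addr = append(("inode", inumber, block_addrs))
        inode_map[inumber] = inode_addr      # older copies on disk become garbage

    def read_file(inumber):
        """Follow the RAM inode map to the inode, then to the data blocks."""
        _, _, block_addrs = disk[inode_map[inumber]]
        return [disk[a][2] for a in block_addrs]

    write_file(57, ["root:x:0:0", "daemon:x:1:1"])
    write_file(57, ["root:x:0:0", "daemon:x:1:1", "nobody:x:99:99"])  # a rewrite appends, never overwrites
    print(read_file(57), "end of log at block", end_of_log)

Note that a read never scans the log; it goes straight through the inode map, so reads cost no more than in a conventional file system as long as the map is in RAM.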
[If section 4 of the paper was assigned, this is a good time to insert questions L1 through L9, on log-based crash recovery. If that section is assigned in a later reading, then these questions can be picked up at that time.]
Finally (a nice lead-in to System R), it is nice to bring up the "ancient" quote from Jim Gray about why he would never put a file system between his database system and the raw disk. He said that the number of disk accesses should be neither more nor less than three. Fewer than three implies a toy system (you need a nested naming scheme plus an access to the file itself), and if the file system required more than three, it was in his way, because he could do it himself in three. Since disk accesses were the major source of delay (at least until databases got involved in the network, with R*), minimizing disk accesses was critical.
How widely used is LFS?
(From: Hari Balakrishnan
For most real workloads, it isn't that much better than UNIX's extent-based file system (EFS) and its variants. Under some workloads, it's actually worse.
http://nms.lcs.mit.edu/~hari/papers/efs_stuff.html has some info on comparisons from a project some other students and I worked on under Margo Seltzer's supervision many years ago.
I know it exists in various BSD releases. I don't believe it's used much.
--------------
My
understanding is that they never really got the garbage collection
part working well. It's really hard to tune those policies for usage
models. As is frequently the case, it was all or part of a thesis, so
production code was not the objective.
That said, there are a number of journalling file systems. I believe
one of the more interesting contributions to this is a new version of
ReiserFS. I'm not an NT user, but my understanding is that there is
also a journalling file system in a recent version of NT. In my
section we also talked a little about the differences between
journalling FS's and the purer form in LFS and some of the tradeoffs.
--------------
I also suspect that all of these various efforts from ten years ago
have to be reviewed in the
light of modern day realities of much larger disk drives, wireless
connections, larger memories, faster computers, and so on.
--------------
Another factor reducing the need for LFS and journaling file systems
is that another solution now exists that doesn't require a complete
overhaul of the file system layout. Ganger's soft updates deal with
the problems identified in the LFS paper well, but require neither a
cleaner nor a new file system layout. The BSD operating systems now
ship with soft updates, and that is what many sites use (e.g., we in
PDOS do).
See http://www.ece.cmu.edu/~ganger/smallfiles.html for more details.
This page is slightly out of date, since the detailed study with WAL
exists now.
--------------)
The following items are related to log-based crash recovery, and go well
just before diving into garbage collection, if part 4 of the paper has
been read.
L1. This log stuff all sounds great as long as we don't crash and lose
our volatile storage. What happens when RAM is reset? Don't we lose the inode table? (Yes, but it can be reconstructed. Scan the log from the beginning to rebuild the inode table. This scan is called roll-forward, and it requires that the inode blocks be identifiable.)
L2. That will take forever. Can't we do better? (If we can recover
the end-of-log pointer then we can scan backwards instead of forwards.)
L3. How is that better? (Because we can stop scanning as soon as the inode
table is completely filled in.)
L4. Still looks like it could take a long time. Suppose the first file we wrote has never been rewritten. Remember the disk is
infinite. How can we shortcut the need to scan everything previously
written on the whole disk? (Every once in a while write the current inode map
on the disk.)
L5. This thing written occasionally has a special name.
What is it? (A checkpoint.)
L6. How does it help to have a checkpoint? (Now, if we lose RAM, we
can recover the inode table by reading the most recent checkpoint and
then rolling forward only through the part of the log written after
that checkpoint.)
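Continuing the toy model from the sketch above (this snippet assumes the disk, inode_map, end_of_log, and write_file definitions given there; the names checkpoint_store, write_checkpoint, and recover are again invented), here is one way the checkpoint-plus-roll-forward idea could be arranged:

    checkpoint_store = None    # stands in for a fixed, well-known disk location

    def write_checkpoint():
        """Save a copy of the RAM inode map plus the log position it covers."""
        global checkpoint_store
        checkpoint_store = {"inode_map": dict(inode_map),
                            "covers_up_to": end_of_log}

    def recover():
        """Rebuild the RAM inode map after a crash wipes it out."""
        recovered = dict(checkpoint_store["inode_map"]) if checkpoint_store else {}
        start = checkpoint_store["covers_up_to"] if checkpoint_store else 0
        # Roll forward: only the tail written after the checkpoint is scanned,
        # and only inode blocks matter.  (Knowing where the log really ends is
        # exactly the problem of question L7; the toy cheats with len(disk).)
        for addr in range(start, len(disk)):
            kind, inumber, _ = disk[addr]
            if kind == "inode":
                recovered[inumber] = addr
        return recovered

    write_checkpoint()
    write_file(58, ["127.0.0.1 localhost"])   # written after the checkpoint
    inode_map.clear()                         # simulate losing RAM in a crash
    print(recover())                          # checkpointed entries plus inode 58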
L7. How do you know you have come to the end of the log?
If there was a crash, the last write may have been aborted
in mid-stream and there may be some random collection of old
and new data out there. (The paper is silent on
this point except to say that after writing anything to the disk, LFS also
writes a "summary block". This summary block is the same
one that is used in segment cleaning. The summary block consists of
flags that tell which of the blocks of the segment
are data, which are inodes, and which are empty. Although the paper
doesn't mention it, this block also contains a timestamp
and a checksum of the data that was just written, so that a later
reader can identify and ignore a summary block if there was a crash in
mid-write. So the problem reduces to finding the latest summary block.
Here is Frans Kaashoek's analysis of how that is done; it is
reproduced in full at the end of these notes.)
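The role of the timestamp and checksum can be sketched as well. The block layout below is invented (the real summary-block format is in the paper); the point is only that a summary torn by a crash fails its checksum and is ignored, and among the believable summaries the newest timestamp marks the end of the log.

    import zlib

    def checksum(blocks):
        """Checksum over a run of just-written blocks (CRC-32, chosen only for the sketch)."""
        return zlib.crc32(repr(list(blocks)).encode())

    def write_with_summary(segment, timestamp, blocks):
        """Append data blocks, then a summary carrying a timestamp and their checksum."""
        segment.extend(blocks)
        segment.append({"summary": True, "timestamp": timestamp,
                        "count": len(blocks), "checksum": checksum(blocks)})

    def latest_valid_summary(segment):
        """Skip summaries whose checksum no longer matches; keep the newest survivor."""
        best = None
        for i, blk in enumerate(segment):
            if isinstance(blk, dict) and blk.get("summary"):
                covered = segment[i - blk["count"]:i]
                if checksum(covered) == blk["checksum"]:
                    if best is None or blk["timestamp"] > best["timestamp"]:
                        best = blk
        return best

    seg = []
    write_with_summary(seg, 100, ["a", "b"])
    seg.extend(["c", "d"])      # crash: data made it to disk, its summary did not
    print(latest_valid_summary(seg)["timestamp"])   # 100 -- the torn tail is ignored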
L8. All we have to do now is figure out where the last checkpoint was.
How? (When it writes a checkpoint in the log, LFS makes a list of the disk sector numbers into which it wrote. The paper calls this list a "checkpoint region". Then LFS writes the checkpoint region to a fixed, well-known location on the disk.)
L8a. What if we crash while writing the
checkpoint region and leave a muddled mess in the fixed location?
(We could do the semi-infinite scan to reconstruct the inode
map. What LFS does is reserve two fixed locations on the
disk. Read both regions, choose the one with the smaller
timestamp, and write the checkpoint followed by a new
time-stamp. Then, if you have to recover from a crash, read
both checkpoint regions and use the one with the bigger time-stamp.
)
L9. A crash while writing the time-stamp could make it look bigger than
it should be, couldn't it? (Sure, but since the time-stamp is written
last, the data values giving the sector addresses of the checkpoint region have all been written safely, so it doesn't matter! This is an example of a recoverable put from
chapter 8. N.B.: it is a variant of the one first used in the American
Airlines SABRE reservation system back in 1961.)
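The two-region scheme and the "timestamp written last" argument can also be written down directly. The structures below are invented stand-ins for the two fixed disk locations; the essential ordering is that the checkpoint contents are written before the timestamp, so the region with the larger timestamp is complete by construction.

    # Two fixed checkpoint regions; the timestamp is the last field written,
    # so a checkpoint is committed exactly when its timestamp reaches the disk.
    regions = [{"timestamp": 0, "addresses": []},
               {"timestamp": 0, "addresses": []}]

    def write_checkpoint_region(addresses, now, crash_before_timestamp=False):
        """Overwrite the *older* region: checkpoint data first, timestamp last."""
        target = min(regions, key=lambda r: r["timestamp"])
        target["addresses"] = list(addresses)     # step 1: the checkpoint itself
        if crash_before_timestamp:
            return                                # simulated crash in mid-checkpoint
        target["timestamp"] = now                 # step 2: the commit point

    def recover_checkpoint():
        """Read both regions and trust the one with the larger timestamp."""
        return max(regions, key=lambda r: r["timestamp"])

    write_checkpoint_region([10, 11], now=1)
    write_checkpoint_region([20, 21], now=2, crash_before_timestamp=True)  # torn
    print(recover_checkpoint())    # still the now=1 checkpoint: addresses [10, 11]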
L10. (From Steve Ward) The paper observes that LFS differs from
transaction logs primarily
in that the latter systems augment the logs with separate storage
of actual data. What justifies this difference?
[Simple answers involve different access patterns and overheads
associated with variable vs file access. However, it is plausible
that an efficient write ahead log could exploit an LFS-like
architecture, accessing data directly from RAM-buffered log pages.]
--------------
From: Frans Kaashoek
Here is my understanding. Updating the segment has two cases:
1. The complete-segment write. It is easy: write out the data into the
segment, followed by a write updating the segment summary block.
2. The partial-segment write. It has to be done carefully.
Let's say you start with an empty segment, which has a segment summary
block at its end. The summary block says that this segment is empty.
It is in a known fixed location (at the end of the segment).
It is easiest to understand partial writes by drawing a picture of
the segment:
--------------------------------------------------X
The empty segment with the segment summary block (X), representing
the end of the log. X is at the end of the segment in a well-known
location.
A partial write first calculates what the end of the log will be,
*backwards* from the current end of log (i.e., when the segment is
empty, it is the segment summary block at the end of the segment, in
the well-known location). At that position, the write writes a new
segment summary block representing the end of the log; then it writes
all the data, at which point the head is at the end of the segment,
just before the segment summary block there. Next it
updates that segment summary block, including a pointer to the new
end of the log.
After this partial write, the segment looks like:
-------------------------------YDDDDDDDDDDDDDDDDDDX'
Y is the new end of the log (the first segment summary block
written), D the data, followed by the updated segment summary block
X' (the updated X), which contains, among other things, a pointer to Y.
The next partial write does just the same thing. It computes the new
end of the log, backwards from the current end of the log (the
first segment summary block written in the previous partial write).
Then, it continues the same procedure as before. Etc.
After the second partial write, the segment looks like:
---------------------ZDDDDDDDDDDDY'DDDDDDDDDDDDDDDDDDX'
Z is the new end of the log (the first segment summary block
written), D the data of the partial write, followed by the updated
segment summary block Y' (the updated Y), which contains a pointer to Z. Etc.
Note that Y becomes the new end of the log only after X has been
updated, which is the right commit point.
Note that from the segment summary X (in the well-known location)
one can find the end of the log.
This design requires one disk seek for each partial-segment write, to move the write head back to the place where writing should begin, and it leaves the head just past the prior summary block. Most other proposals seem to require two seeks per partial-segment write.
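The picture can be animated with a small simulation. The layout below is invented (a 16-block segment, single-character data blocks); the point it illustrates is that the segment fills backwards from the fixed summary block at its end, and that each partial write commits only when the previous summary block is updated to point at the new one.

    SEGMENT_SIZE = 16
    EMPTY = "."

    # The segment starts empty except for a summary block at its fixed,
    # well-known position at the end; that block is the initial end of the log.
    segment = [EMPTY] * (SEGMENT_SIZE - 1) + [{"summary": True, "next": None}]
    prior_summary = SEGMENT_SIZE - 1      # index of the last committed summary block

    def partial_write(data):
        """Write a new summary, then the data, then commit via the prior summary."""
        global prior_summary
        new_summary = prior_summary - len(data) - 1             # computed *backwards* from the end of the log
        segment[new_summary] = {"summary": True, "next": None}  # Y: not yet reachable
        segment[new_summary + 1: prior_summary] = data          # D...D, written sequentially
        segment[prior_summary]["next"] = new_summary            # commit point: X now points at Y
        prior_summary = new_summary

    def render():
        return "".join("X" if isinstance(b, dict) else b for b in segment)

    print(render())                 # ...............X
    partial_write(list("dddd"))
    print(render())                 # ..........XddddX
    partial_write(list("dd"))
    print(render())                 # .......XddXddddX

Only the updates of the "next" pointers are commit points; until the prior summary block has been rewritten, a crash leaves the old end of the log intact and the half-finished partial write is simply ignored.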
Comments and suggestions: Saltzer@mit.edu