6.033--Computer System Engineering
Suggestions for classroom discussion of:
Jim [N.] Gray, Paul McJones, Mike Blasgen, Bruce G. Lindsay, Raymond
[A.] Lorie, Tom Price, Franco Putzolu, and Irving [L.] Traiger. The
recovery manager of the System R database manager. ACM Computing
Surveys 13, 2 (June, 1981) pages 223-242.
by J. H. Saltzer, April 30, 1996, from 1982 recitation notes and 1984
teaching staff notes. Updated 1998, May 2001, April 2002, and May 2003.
(Reminder: appendix 8-C of the notes is a guide to reading this paper.)
System R consists of two parts, the data management system and the
underlying storage system, called RSS. This paper examines in detail only the
storage system, but the requirements on RSS are imposed by the higher layer
data management system and aren't always explicitly mentioned in the paper.
Why is RSS so very complex as compared with the atomicity
algorithms described in chapter 8 and in the lectures? (Performance is one
prime reason. To make a high-performance storage system, the System R
designers used disk buffering with lazy write strategies. But since RAM
buffers are volatile, the buffering interacts fiercely with the atomicity
algorithms. The write-ahead log protocol requires careful control of the order in
which writes touch the disk, and this control is hard to exert through a
buffer cache. Page-level locking--described in another paper--is another
performance enhancement that increases complexity.)
Another reason is that there are several simultaneous goals. What
are they?:
(- archiving: keeping old data values in case the user needs to look at
them again.
- recoverability: the system may crash, and individual in-progress
transactions may abort, but committed transactions are always remembered.
- storage modularity: data may cross page boundaries.
- storage efficiency: pages may contain data of different transactions.
- coordination and isolation: the locks aren't mentioned in this
paper, but they must have complicated the design.)
A third reason has to do with a change of goals. What changed?
(Early on, their model was that data should survive crashes, so
they focused on implementing recoverable *objects* and invented shadow pages
to help make this happen. Later, they realized that it is also
important to provide consistency between different objects, which
requires recoverable *transactions*. So they added the log. In making
that addition, they didn't rip out the shadow pages; instead, they
designed a log system that took advantage of the fact that the shadow
pages were already there. So RSS ended up with
both mechanisms.)
Yet another view is that there are several interacting mechanisms.
List the mechanisms:
( - Shadowed files
- Log copy of new/old record values [the EXOR log was a later idea]
- Log record of transaction start/end/save
- Log record of system checkpoint
- archive copy from disk -> tape
- RSS actions are individually atomic
- Log is duplexed)
Let's work through these mechanisms one at a time. What do they
mean by "shadow file"? Is this like the shadows of the RAID system?
(No. The RAID people used the DEC definition of "shadow", which means
replication. IBM calls replication "duplexing". IBM uses the term
"shadowing" for the idea, described in chapter 8, of "work on a copy of
the data rather than the original."

                                        RAID      DEC       IBM
    multiple, identical copies          mirror    shadow    duplex
    modify a copy, not the original                         shadow
    unit of disk storage                block               page
)
How do shadow copies work? (When a file is opened for writing, its
catalog entry is read from the disk into primary memory. The primary
memory copy of the catalog entry is expanded slightly to contain not one,
but two pointers to the page map that shows where on disk each page of the
file is actually stored. Until the instant when the file is actually
written, both "current" and "backup" point to the same page map. When the
first write occurs, the system allocates a second page map, filled with the
same file block pointers, and places a pointer to the new page map in
"current". It also allocates a new page, writes the data to that new page,
and places a pointer to the new page in the "current" page map. FILE SAVE
flushes any unwritten pages from the disk buffer, writes the second
page map to disk, writes the directory to disk, sets backup to current,
and releases any file blocks that appear only in what was the backup page
map. FILE RESTORE sets current to backup and releases any file blocks
found only in what was the current page map.)
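The page-map bookkeeping just described can be sketched in a few lines
of Python. This is a minimal illustration, not System R's actual code;
the class and method names are invented, and real disk I/O, buffer
flushing, and directory writes are elided.

```python
# Sketch of the shadow page-map mechanism: "current" and "backup"
# share a page map until the first write; FILE SAVE and FILE RESTORE
# just swap pointers. (Invented names; illustration only.)

class ShadowedFile:
    def __init__(self, page_map):
        # Until the first write, current and backup are the same map.
        self.backup = page_map
        self.current = page_map

    def write(self, page_no, disk_block):
        # First write: allocate a second page map, copied from backup.
        if self.current is self.backup:
            self.current = dict(self.backup)
        # New data goes to a freshly allocated disk block; only the
        # current map points at it.
        self.current[page_no] = disk_block

    def file_save(self):
        # Make the current state permanent: backup := current.
        # (The real system first flushes buffers and writes the page
        # map and directory to disk.)
        self.backup = self.current

    def file_restore(self):
        # Discard changes since the last save: current := backup.
        self.current = self.backup


f = ShadowedFile({0: 17, 1: 18})  # pages 0,1 live in disk blocks 17,18
f.write(1, 42)                    # new version of page 1 in block 42
f.file_restore()                  # crash/abort: back to block 18
assert f.current[1] == 18
f.write(1, 43)
f.file_save()                     # the shadow state becomes permanent
assert f.backup[1] == 43
```

Note that neither operation copies any data pages; releasing the blocks
that appear only in the discarded map is all the cleanup required.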
Suppose that we crash before doing the FILE SAVE step? Could we
use the "current" page map to get to the latest versions of the data?
(No--because there is no assurance that the data mentioned in the current
map is actually on the disk--it may have never gotten beyond the RAM buffer cache.)
The shadow copy is for a whole file at a time. What if two
transactions T1 and T2 both are writing data to different pages of the
same file? When a FILE SAVE occurs don't both transactions get
committed? (No. FILE SAVE has nothing to do with committing the
transaction, it is done only to coordinate with checkpoints as part of
crash recovery, not for transaction consistency.)
Then when is a transaction committed? (When it writes a commit
record in the log. Until that time, no one else can see its data--locks
prevent this, but the paper doesn't talk about them--and if the
transaction aborts, its changes will be backed out by UNDOing the
transaction (unless we crash before the next FILE SAVE, in which case
the shadow mechanism discards them anyway). We need to start talking
about how the log works.)
It says the log is duplexed. What does that mean? (See definition
above. They make two copies. This is a simple redundancy measure; two
copies proved adequate.)
But what threat does duplexing the log counter? (hardware decay
failures; the goal is to force copies onto different physical drives.)
Doesn't it also help with software failures? (Not really, because
independence is needed, and software errors will probably affect both
copies.)
Doesn't the log also provide archiving? (It doesn't provide
archiving by itself, but it helps in the archiving process. There is
a distinct archive mechanism, and the log's primary purpose is to
be the transaction atomicity mechanism. This difference is one of the
keys to untangling much of what is going on in System R.)
How far back does the log go? (Long enough to include
- every incomplete transaction
- the latest checkpoint
- the latest time when data was archived to tape
The authors describe the log as "on-line", meaning that because it
doesn't go back to the beginning of recorded history, they can keep it in
a file that is on the disk. Some systems have logs that go back forever,
so the beginning of the log is usually on a series of tapes, dismounted
long ago, with only the most recent log entries available on disk. When
you see a log like that, the designers are probably using the log as an
archive.)
This model suggests that System R distinguishes among three
apparently similar things: Archiving, Replication, Journaling. What
are these functions, more precisely?
(
- Archiving:
- keep a record of old data values, for future auditing
- Replication:
- fight back against hardware decay of current data values
- Journaling:
- group updates to different records into recoverable transactions)
Doesn't log replication also provide extra read performance and
convenience of access? (For things that are read a lot, this is
certainly a possibility. But if no crashes occur and no transactions
abort, no one ever reads the log, so read performance improvements seem
uninteresting. So System R doesn't pick up on this opportunity.)
What is a "write-ahead log"? (The protocol is that you always put
a record of proposed changes to data in the log before making those
changes to the data itself.)
What good does that do? (It is part of the pre-commit atomicity
mechanism: make sure that you can back out. It guarantees that all
modifications to data can be discovered and undone, if necessary, by
examining the log. If the data were modified first, and a crash occurs
before the log is written, those data changes could never be discovered
to decide whether or not they should be retained. Chances are they are
part of an incomplete transaction that should be backed out.)
How can System R use a WAL protocol given that it has an
LRU-managed disk buffer pool? If the LRU algorithm decides when to
write a particular page out, how can anyone be sure that the log gets
written to disk before the data? (The shadow file system provides
the missing, precise control. If the LRU buffer pool decides to
write a new page out, that affects only the "current" copy, which
will be automatically discarded if the system crashes. Only when
someone does a SAVE_FILE does the data become visible. So the WAL
protocol in system R becomes "make sure the log page that mentions a
file update is written to disk before doing a SAVE_FILE on that
file." The paper doesn't say much about log writing, but it does
talk about forcing the log, which means writing any buffered log pages
to disk.)
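The restated WAL rule can be sketched as follows. All names here are
invented for illustration; this is not RSS's interface, just the
ordering constraint: the log record describing an update must reach
the disk before the FILE SAVE that makes the update permanent.

```python
# Sketch of the write-ahead rule as restated above: force the log
# before any FILE SAVE. (Invented names; illustration only.)

log_on_disk = []   # log records already forced to disk
log_buffer = []    # log records still in the RAM buffer

def log_write(record):
    log_buffer.append(record)

def force_log():
    # Push any buffered log records to disk.
    log_on_disk.extend(log_buffer)
    log_buffer.clear()

def file_save(file_id):
    # WAL rule: the log goes to disk before the shadow swap that
    # makes this file's updates permanent.
    force_log()
    # ... now it is safe to set backup := current for file_id ...

log_write(("MODIFY", "T1", "fileA", "page 3"))
file_save("fileA")
assert ("MODIFY", "T1", "fileA", "page 3") in log_on_disk
```

The LRU buffer pool never appears in this rule: it may write data
pages whenever it likes, because those writes touch only the shadow
("current") copy, which a crash discards.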
How is the log implemented? (This is one of the
things that is hard to read out of the paper. Parts of the mechanics
are described in four different places, and some things are left for
speculation. Some of the following comes from reading between the lines,
and may be wrong. But it is all consistent with what the paper says.
- They allocate a separate disk partition--called an "extent"--to
hold the log. (p. 240) If duplexed, there are two such partitions.
- There is a log manager program that directly reads and writes to
this partition; the log is not a file in the file system.
- The log partition is managed as a big ring buffer of disk sectors.
(p. 236) The next log page to be written goes into the next sector
of the partition.
- There is a single RAM buffer for the log; that is where things are
"written to the log". (p. 232) When the buffer fills, it is written to
the disk in the current log sector, the buffer is cleared, and the
record of the current log sector is advanced by one. When checkpoint
and commit records are written, the current buffer is simply written
to the disk in the current log sector, even if it is not yet full.
This is what "forcing the log" means.)
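Since the paper leaves the mechanics to speculation, the following is
a toy reconstruction of the inferred design, with invented names and
an absurdly small sector size: a ring of sectors, one RAM buffer, and
a force operation for commit and checkpoint records.

```python
# Toy sketch of the inferred log manager: the log partition is a ring
# of sectors; one RAM buffer collects records and is written out when
# full or when the log is forced. (Invented names and sizes.)

SECTORS = 8          # sectors in the log partition
SECTOR_RECORDS = 3   # records per sector (tiny, for the example)

disk = [None] * SECTORS   # the log partition, as a ring
buffer = []               # the single RAM log buffer
next_sector = 0           # where the next buffer-load goes

def _write_buffer():
    global next_sector
    disk[next_sector] = list(buffer)
    buffer.clear()
    next_sector = (next_sector + 1) % SECTORS   # the ring wraps

def log_append(record):
    buffer.append(record)
    if len(buffer) == SECTOR_RECORDS:   # buffer full: write it out
        _write_buffer()

def force_log():
    # Commit or checkpoint record: write the buffer out immediately,
    # even though it is not full.
    if buffer:
        _write_buffer()

log_append("r1"); log_append("r2"); log_append("r3")  # fills sector 0
log_append("COMMIT T1")
force_log()                                           # partial sector 1
assert disk[0] == ["r1", "r2", "r3"]
assert disk[1] == ["COMMIT T1"]
```

A real implementation must of course not let the ring's head overwrite
log sectors still needed for recovery; that is why the log only has to
reach back to the oldest incomplete transaction and latest checkpoint.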
Does System R use a backout-list or an intentions-list logging
protocol? (Yes. It uses a backout-list protocol for things before
the latest checkpoint and an intentions-list protocol for things
after the latest checkpoint.)
Suppose that a transaction decides that it has gotten into
trouble--perhaps the client looked at the available hotel room and
decided to cancel the trip--how does it UNDO? (go to the log, follow
the chain of events for this transaction backward, undoing each action.
When you get to the START record, the effects of the transaction are
completely undone.)
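The backward walk just described can be sketched as follows. The log
record format here is invented for illustration; the real system
chains each transaction's records with back-pointers rather than
scanning the whole log.

```python
# Sketch of UNDO for one aborting transaction: walk its log records
# backward, restoring old values, until the START record is reached.
# (Invented record format; illustration only.)

log = [
    ("START",  "T7"),
    ("MODIFY", "T7", "page 4", {"old": "AAA", "new": "BBB"}),
    ("MODIFY", "T7", "page 9", {"old": "X",   "new": "Y"}),
]
pages = {"page 4": "BBB", "page 9": "Y"}   # state after T7's writes

def undo(tid):
    for rec in reversed(log):
        if rec[1] != tid:
            continue
        if rec[0] == "START":
            break                  # transaction completely undone
        _, _, page, values = rec
        pages[page] = values["old"]

undo("T7")
assert pages == {"page 4": "AAA", "page 9": "X"}
```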
Suppose instead that the system crashes. How is atomicity of
transactions guaranteed?
(The key observation is that since System R always writes transaction data
to shadow files, those writes always go to a copy that doesn't become
permanent until someone does a FILESAVE. If a crash occurs after a write
but before a FILESAVE, it is as if that write never happened.)
Now consider four different ways System R could be organized to take
advantage of this property of shadow files.
First, suppose no one ever calls FILESAVE. What would things look like, and how would recovery go?
(In that case, the system might
run all day, logging planned writes and then doing those writes, and then
logging commits. Then, a crash occurs. All writes are lost; the
disk looks just like it did when the system started--except for the log,
which is not a shadowed file, it is in a partition of its own. But since
the log contains all the writes and commit records, we can recover by
REDO'ing every committed transaction.)
When a crash occurs, how can we identify the committed transactions?
(Every commit includes forcing the log to disk, so that
data is safely stored. So we start at the last written page
of the log and work backwards looking for commit records. When we get to
the front of the log we have a complete list of all commit records. Then
we can scan the log forward, and make sure that every data change
belonging to a committed transaction actually has been made to the
corresponding file as seen on the disk.)
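The two-pass recovery just described can be sketched as follows, with
an invented log format: a backward pass collects the commit records,
then a forward pass REDOes every write that belongs to a committed
transaction.

```python
# Sketch of recovery for the "no one ever calls FILESAVE" case:
# backward pass finds committed transactions, forward pass REDOes
# their writes. (Invented record format; illustration only.)

log = [
    ("MODIFY", "T1", "page 2", "v1"),
    ("COMMIT", "T1"),
    ("MODIFY", "T2", "page 5", "v9"),   # T2 never committed
]

def recover(log):
    committed = set()
    for rec in reversed(log):           # backward: find the winners
        if rec[0] == "COMMIT":
            committed.add(rec[1])
    disk = {}
    for rec in log:                     # forward: REDO their writes
        if rec[0] == "MODIFY" and rec[1] in committed:
            disk[rec[2]] = rec[3]
    return disk

assert recover(log) == {"page 2": "v1"}   # T2's write is simply lost
```

T2's write needs no UNDO at all: since no FILESAVE ever happened, the
crash already discarded it along with everything else.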
Second, suppose that System R watches, and whenever it notices that things
are quiet and there are no active transactions (every transaction that has
begun has committed or aborted) it takes the opportunity to perform a
FILESAVE on every open, shadowed file. Then, sometime later, a crash occurs.
(It now needs to REDO only transactions that committed after the FILESAVE.
The log from before that point can be discarded.)
Third, suppose instead that system R performs a FILESAVE (again, on every
open, shadowed file) every time anyone commits any transaction. Then, after some more work, there is a crash.
(All writes up to that commit are permanent, even writes for
uncommitted transactions that happened to be in progress at the time the
most recent transaction committed. All writes after that commit are as if
they had never happened. From this state, we can recover by UNDO'ing the
logged writes of every transaction that was pending at the time of the last
commit.)
Last (this is what system R actually does) suppose that the only time
anyone ever issues a FILESAVE is at a periodic checkpoint, say every ten
minutes, and we make a log entry to show when the checkpoint was taken.
(If
a crash occurs after a checkpoint, all writes up to the checkpoint are
permanent (perhaps including some writes for transactions that never got to
commit), all writes after the checkpoint are lost (perhaps including some
writes for transactions that did commit). So the recovery procedure is to
REDO any writes after the checkpoint that belong to committed transactions,
and UNDO any writes before the checkpoint that belong to losers.)
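The rule for this last case can be sketched as follows, again with an
invented log format. The key assumption, stated above, is that only
checkpoints do FILESAVE, so the disk at crash time holds exactly the
state as of the latest checkpoint.

```python
# Sketch of checkpoint-based recovery: REDO winners' writes logged
# after the checkpoint, UNDO losers' writes logged before it.
# (Invented record format; illustration only.)

log = [
    ("MODIFY", "T1", "page 1", {"old": "a", "new": "b"}),
    ("COMMIT", "T1"),
    ("MODIFY", "T2", "page 2", {"old": "c", "new": "d"}),  # loser
    ("CHECKPOINT",),
    ("MODIFY", "T3", "page 3", {"old": "e", "new": "f"}),
    ("COMMIT", "T3"),
    ("MODIFY", "T4", "page 4", {"old": "g", "new": "h"}),  # loser
]

def recover(log, disk):
    ckpt = max(i for i, r in enumerate(log) if r[0] == "CHECKPOINT")
    winners = {r[1] for r in log if r[0] == "COMMIT"}
    # Post-checkpoint writes never reached the permanent (backup)
    # state, so post-checkpoint losers (T4) need no UNDO at all.
    for rec in log[ckpt + 1:]:                 # REDO winners after
        if rec[0] == "MODIFY" and rec[1] in winners:
            disk[rec[2]] = rec[3]["new"]
    for rec in reversed(log[:ckpt]):           # UNDO losers before
        if rec[0] == "MODIFY" and rec[1] not in winners:
            disk[rec[2]] = rec[3]["old"]
    return disk

# Disk as of the checkpoint: T1's and T2's writes made it out.
disk = {"page 1": "b", "page 2": "d"}
assert recover(log, disk) == {"page 1": "b",
                              "page 2": "c",
                              "page 3": "f"}
```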
At this point it is appropriate to sketch a log
scenario on the board such as the one of figure 11 on page 235 of
the paper, with the addition of
pointers to show the status of shadow versus current pages for various
files that the five transactions may happen to have open. As an
example, there is a scan of a nearly unintelligible hand-drawn figure.
(Karen Sollins contributed easier-to-read Acrobat and PowerPoint
versions of most of that same figure.
The idea is to examine each of the transactions T1 - T5 in turn to see how
it is affected by a system crash at the point indicated. Those five
transactions include every possible combination of {completed, incomplete}
at crash time and {preceding, spanning, following} the latest checkpoint.)
How about a transaction that
aborted just before the crash? It has recorded an ABORT
record in the log but its UNDO's are trapped in the buffer
cache. (While doing the log scan we should have made a list
of them, too, and made sure that every change they made has
actually been undone. This set of COMMITed and ABORTed
transactions is called the list of "winners".)
Another transaction may have written several changes to the log, and
also done FILESAVE on those changes, but not yet committed at the time of
the crash. What about those transactions? (During our backward log scan
we also made a list of "losers" and performed UNDO on any modifications
they made.)
How are losers discovered? (The first evidence of their existence
during the LIFO log scan is a MODIFY record in the log, rather than a
COMMIT or ABORT record.)
So what is a checkpoint? (A note in the log intended to help reduce
the distance back you have to scan.)
What are the simplest checkpoint contents? (A note saying "Hey,
look. There aren't any outstanding transactions right now, so I just
flushed the disk buffer cache and did a FILE_SAVE on all open files.")
Why does that limit the distance back that we have to review the
log? (Because when you come across such a note, you know that your list
of winners and losers is complete. There can't be any more losers,
because when the checkpoint was written there weren't any outstanding
transactions. There is no need to look for more winners because we are
assured that all the data of older winners--both committed and
aborted--is safely on the disk in shadow = current versions.)
So why don't we write checkpoints frequently? (We may go a long
time before we see an instant with no transactions in progress.)
How does system R deal with that problem? (By writing hairier
checkpoints. 1. Write a note on the log giving a list of all
outstanding transactions. 2. Do a FILE SAVE on everything in sight.
3. Pray that we don't crash halfway through doing the checkpoint.)
Why does that help? (When the LIFO log review comes to this
checkpoint record it can immediately complete its list of winners and
losers--any transaction it hasn't run across later in the log it adds
to its list of losers. Then it has to continue scanning till it finds
the START record of the earliest-starting loser, but that is as far as it has
to go back.)
What about the shadow state of files? (With checkpoints, there is a
subtle modification to the protocol. The *only* FILESAVE operations to
data files are done by the checkpoint routine. The log is forced to disk both
at checkpoint time and also whenever a commit record is written.)
How does that help? (A simplification--one of the few in System
R--anything shown as modified after the checkpoint has already been
UNDONE by the change of backup -> current. So we only need REDO
winners' modifications that are logged following the checkpoint.
Anything logged as modified before the checkpoint was recorded by the
change of current -> backup that was done by the FILESAVE at checkpoint
time. So we need to UNDO only losers' modifications that were logged
before the checkpoint.)
What is this about praying that we don't crash halfway through a
checkpoint? These guys seem to be pretty smart; didn't they come up
with a more systematic method that doesn't involve prayer? (Yes, but
you have to read between the lines to discover it. Checkpointing
really comprises:
- write a checkpoint record to the log
- force the log to disk
- force all shadow page maps to disk
- force to disk new directory entries that refer to the new page maps
- force to disk a new copy of the file system directory that contains
these new directory entries
- write a one-sector root record (containing the pointer to the file
system directory and the pointer to the log address of the checkpoint)
twice, using a recoverable-put algorithm like the one described in
chapter 8.
If there is a crash any time before the last step completes, when the
system comes back up it will find the original root record, and none of
the stuff written in the previous steps will be visible, so it looks as
if the entire checkpoint hasn't taken place yet.
So at the end it all comes down to a careful-put of a single sector as
the last step of the recoverable-put.
All this comes from studying a three-sentence explanation on page
234. (Thanks to John Chapin for sorting it out.))
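The final step can be sketched as a recoverable-put in the chapter 8
style. The layout below (two reserved sectors, a version number, and
a checksum to detect a torn write) is illustrative; the paper does not
give System R's exact root-record format.

```python
# Sketch of a recoverable-put for the root record: write it twice,
# with a version number and checksum, so that a crash mid-write still
# leaves one good copy. (Invented layout; illustration only.)

import zlib

root = [None, None]   # two disk sectors reserved for the root record

def _checksum(payload):
    return zlib.crc32(repr(payload).encode())

def put_root(payload, version):
    rec = (version, payload, _checksum(payload))
    root[0] = rec    # a crash between these two writes leaves the
    root[1] = rec    # other copy intact

def get_root():
    # Recovery: keep only copies whose checksum verifies, and take
    # the one with the highest version number.
    good = [r for r in root
            if r is not None and _checksum(r[1]) == r[2]]
    return max(good, key=lambda r: r[0])[1]

put_root({"directory": 101, "ckpt_log_addr": 5000}, version=1)
put_root({"directory": 102, "ckpt_log_addr": 5400}, version=2)
assert get_root()["directory"] == 102

root[0] = (3, {"directory": 999}, 0)   # simulate a torn write
assert get_root()["directory"] == 102  # the surviving copy is used
```

A crash before the root record is rewritten leaves version 1 in both
sectors, so the whole checkpoint looks as if it never happened, which
is exactly the property claimed above.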
(From Seth Teller) The System R paper doesn't mention file
open or file close operations, and thus it doesn't say anything about
the relation between those operations and SAVE_FILE. In particular,
does closing a file force a SAVE_FILE? If it did, that might
interact with the recovery system. (File open and file close
operations occur when the application starts in the morning or shuts
down for the day, and all files of interest to the application simply
remain open during normal operation of the system. Thus file close won't
happen until all transactions have been recorded, saved, and committed.
This is a database system, not a file system.)
Can you find any examples of fail-fast design in System R? (Page
226: "System R initiates crash and restart whenever it detects damage to
its data structures". Page 227: "We postulate that these errors are
detected and that the system crashes before the data are seriously
corrupted".)
Page 227 invokes Murphy's law. What is Murphy's law? (Something
different from what these authors think. See sidebar in chapter 1, appendix A.)
Section 1.1 says that each RSS action is atomic. How do they
achieve atomicity of these actions? (The paper doesn't tell us. We
have to take it on faith. Or read between the lines, as in the description of how the checkpoint action is atomic.)
How does RDS achieve atomicity? (The RSS recovery functions BEGIN,
ABORT, COMMIT, UNDO, together with write logging are tools to be used
by RDS to achieve atomicity of transactions implemented by RDS. We
have a two-layer atomicity model here.)
Section 2.9, on Media Failure, says that archive copies are made by
copying the data on disk to tape. What makes this such a complicated
job? (The goal is to get a consistent snapshot, so the copy to tape
should be made atomically with respect to current activity.)
Here is a simpler method: wait till an instant when there aren't
any transactions outstanding, start copying to tape, and refuse to start any
new transactions till the tape is complete. Why didn't they choose that
technique? (You may wait forever for a time when the system is
quiescent. And even if you find one, the time required to make the copy
to tape is too long to delay start of new transactions.)
So what do they do instead? (Make a "fuzzy dump" to the tape without
quiescing the system. Dump the current log to tape also. Later, use the
log copy to "focus" the tape copy. Gray says in another paper, "the details
of the algorithm are left as an exercise for the reader." Presumably the
exercise consists of choosing a serialization instant for the dump, and
making sure that the log contains the outcome of every transaction that
wasn't yet committed or aborted at that instant. At restore time, one
can run the log forward and complete or abort those transactions that
committed or aborted before the serialization instant.)
Does this exercise result in the equivalent of an atomic dump?
(Yes. It is as if the dump had been done atomically at the chosen
serialization instant. All transactions that committed or aborted
before that time are represented correctly in the restored copy; any
transactions that committed or aborted after that time do not exist.)
Although not mentioned anywhere in the paper, System R raises an
interesting trade-off among security, privacy, and reliability. Suppose
you are managing a database using System R, and the FBI comes to you with a
report that one of your employees has recorded a state secret--the
president's cell-phone number--in your database. They want you to delete
it. You initiate a transaction that modifies the database, and the
transaction commits. Is the information gone? (Customers who enter the
database through its RDS interface won't see the secret data--unless the
system crashes, damages the disk, and has to be restored from last week's
archive tape. In addition, a system wizard intent on doing so can almost
certainly locate that phone number. The shadow page with the old data is
probably still on the disk somewhere, until it happens to be reallocated to
some other purpose and rewritten; there are two copies of the log on the
disk, each containing old and new data values, the log has been written
onto the most recent archive tape, and all of the old archive tapes still
contain the secret data. A typical policy on archive tapes is to keep one
each month for a year, and one each year forever. So you may never be able
to assure the FBI that the information is really gone. It can be very
hard to really expunge data in a system that employs redundancy.
System R is designed not to forget things.)
The assignment question from 1992: When system R takes a system
checkpoint it waits until the system state is action consistent but not
transaction consistent. Discuss briefly the reasons for this decision
and its implications for recovery.
There are several very useful questions that might be used to spur
discussion on pages 7-101 and 7-102. The following two additional
discussion questions were suggested by Eric McDonald:
(1) System R possesses the property of "consistency," essentially
transforming concurrent transactions on a shared object (i.e. a file)
into a serial set of transactions. Discuss the reasoning behind this
decision. Describe a situation involving concurrency and shared
objects where System R could cause problems if it didn't serialize
concurrent transactions on shared objects.
(2) In section 3.9, the authors describe their failure to include a
recovery system for output messages as a "major design error." Do
you think they're being a little hard on themselves? Try to identify,
by giving two significantly different real-world examples, why they're
so concerned with message recovery.
Comments and suggestions: Saltzer@mit.edu