6.033 - Computer System Engineering | Handout 27, April 8, 1999 Revision of April 16, 1999, 11:30 a.m. |
Motivation
Cheap disk storage now makes it possible to keep an entire library on line, for easy accessibility over the Internet. But as the quantity of material grows, it is important to consider methods of maintaining integrity of the data. Traditional methods of backup, such as making a copy to a tape each week, may not scale up to the sizes needed, and are not particularly well suited to the very slow rate of change of the data in a library.
I. The Design Problem
The Library of Congress has decided to digitize all of its 105,000,000 objects and store them on line with an array of low-cost hard disks. The objects vary in size from a few bytes to several Gigabytes, but they have an average size of 1 Megabyte each. Using the latest Seagate 50 Gigabyte disks, the Library would need to purchase an array of about 2,000 disks to hold all of this material. A disk array that large would be stretching things for a single computer system, but since disk sizes double roughly every 18 months, by the time they get finished digitizing everything in 2005, they figure that only 125 disks will be needed; this is well within the size range for a single large server. They call a system that holds a complete copy of this data a replica site.
The real problem that the Library of Congress is concerned with is maintaining the integrity of that many bits. To help ensure that a hurricane, meteor strike, earthquake, or civil war followed by NATO bombing doesn't destroy the entire collection, they have contracted with a set of widely separated site operators---in Sydney (Australia), Novosibirsk (Russia), Capetown (South Africa), and Quito (Ecuador)---each to install appropriate hardware and maintain a replica site holding the entire collection, in addition to the Library's own replica site in Washington, D.C. They are satisfied that the probability of all the replica sites being destroyed at the same time is negligible.
The set of replica sites, plus an appropriate set of protocols and procedures, becomes the Repository---an archival storage system---for the digital library. Other library functions, such as creating bibliographic records and searching those records or searching full-text indexes of the library contents, are handled by other systems; those systems use the Repository by storing or retrieving objects by object identifier via the Internet.
The Library has asked your team to design the protocols and procedures to add objects to the Repository and to update and maintain the replica sites, giving consideration to at least the following specific problems:
You may assume that someone else has designed a simple file system that can gracefully deal with a very large number of files (the typical UNIX implementation would have some trouble keeping track of 105,000,000 files). Each replica site runs a copy of the file system. The file system provides exactly one directory per disk, each with a fixed name from the series /disk001/, /disk002/, etc. Below that the naming system is flat, which means that all of the files stored on the first disk are in the single directory named /disk001/.
This file system has the following methods in its interface:
status = create-file(directory,
metadata)
This creates a file consisting of the
given metadata structure, {filename,
size in bytes} and creates an
empty data space of the given size.
status will
be non-zero if there is not enough room for the file on this disk.
write-file(directory,
filename, data)
where data
is
assumed to be an array of bytes of the size that was specified in the metadata
at create-file time.
result = read-file(directory,
filename)
where result
is
a structure consisting of two components:
{file
data, file metadata}.
remove-file(directory, filename, data)
result = list-files(directory)
where result is
a structure of two components:
{number of files
in directory, array with one file name per element}
When the file system encounters a hard error (that is, an error for which retries don't help) on a disk, it flags that disk sector so that it will never be used again and it updates its record of the file contents to omit that sector. As a result, a read-file operation may deliver data that is missing one or more sectors, and a write-file operation may store an incomplete file. The file system uses files to store its own directories, but it maintains enough redundancy in its directory structure to reliably detect damaged or missing directory information. Thus it never returns or uses a partial or otherwise erroneous directory entry; the worst thing that happens is that one or more directory entries silently vanish.
Note that the file system delivers no warnings of missing data, incomplete writes, or vanishing directory entries. The reason is that, even though it may know about bad disk sectors, there are other failure cases that it cannot reliably detect. It is possible, for example, that an unnoticed hardware or software failure has led the file system to read the wrong set of disk tracks, or that the file system itself may have a bug and deliver data to its client in the wrong order. Using an end-to-end argument, its designers decided it was better to leave the problem of detecting errors in client data entirely to the client of the file system.
The file system is designed to work correctly with multiple threads. Each of these methods is isolated (that is, atomic with respect to parallel threads that invoke the same or other file system methods). In addition, create-file and remove-file are recoverable. That is, following a crash all directories are intact and a file that was in the midst of being created or removed will either be intact or gone. Write-file is not recoverable; the file system makes no attempt to reconstruct or recover file contents following a crash. On the other hand, crashes do not affect files that were not being written at the moment of the crash. The file system does not come with any backup or logging features, and there is no provision for security (that is, no access control lists, owners, or permission bits). The metadata of the file system is configurable, so if you would like to have the file system store any additional metadata, you should specify carefully what you need. You may discover other requirements or features for the file system. If they seem consistent with the general design philosophy so far, you can request that the file system designers add those features or meet those requirements.
III. Other Considerations
One property of large research libraries should be kept in mind: much of the material in such a library is rarely used! It is typical to find that 50% of the objects collected by a really big library are not requested by anyone during the first 100 years they are in the collection, and that 95% of the objects are used less than once a decade. (One might try to exploit this usage pattern to keep rarely used material on lower-cost, slower storage, but that is not the objective of this design project.)
IV. Summary of Your Task
As indicated above, your team's job is to design the Repository. (Section VI offers some suggestions for writing and organizing your writeup.) The repository is to be used as a component of a larger system in which readers search a catalogs or indexes or otherwise discover references that consist of object identifiers, and they will present this object identifier to the get method of the Repository. Librarians will assemble a collection of bits that repesents a new object, and they will add that object to the Repository by invoking the put method.
Thus as part of the overall report, your design should include discussion supported by pseudocode (which can include, for example, actions by human operators) for at least the following specific procedures:
V. Suggestions (not mandatory, but you may find useful ideas here)
Your paper should be as long as is necessary to explain the problem, your solution, the alternatives you considered, and the reasons for your choices. It should be no longer than that. We estimate that a good paper will be 8 to 10 pages in length.
A good paper begins with an abstract. The abstract is a very short summary of the entire paper. It is not an outline of the organization of the paper! It states the problem to be addressed (in one sentence). It states the essential points of your solution, without any detailed justification. And it announces any conclusions you have drawn. Good abstracts can fit in 100-150 words, at most.
The body of your paper should expand the points made in the abstract. Here you should:
1. Explain the problem and the externally imposed constraints.
You should explain to your intended audience the background of the problem in terms that the audience can understand.
2. Explain and elaborate your solution.
Be sure to explain the approach or architecture conceptually before delving into details, components, and mechanics. (Remember, you aren't writing a mystery novel!)
3. Show how your solution satisfies the constraints and solves the problem (or how it fails to do so);
Explain how the properties of your solution that result from choices you have made in your design are reasonable or desirable in the context of the problem statement. Include analysis of the estimated (or measured, if it applies) performance characteristics of your solution.
4. Describe the alternative approaches that you considered and rejected, and why you rejected them.
Your paper will be more convincing if you say not just why your idea is good, but why it is better than the alternatives. (For example, if another approach would meet all of the objectives perfectly, but the cost would be 100 times higher, then you should mention that as a reason for choosing your less general but cheaper approach.)
5. Document your sources, giving a list of all references (including personal communications). The style of your citations (references) and bibliography should be similar to the styles in the technical papers you're reading in this class. In particular, a bibliography at the end and no citations in the text of your paper is insufficient; we'd like to see what specific pieces of information you learned from where as we read your paper.
Write for an audience consisting of colleagues who took 6.033 five years ago. That is, they understand the underlying system and network concepts and have a fair amount of experience applying them in various situations, but they have not thought carefully about the particular problem you are dealing with. Assume that your paper will also be used to convince the Library staff that you have a viable design. Finally, give enough detail that they can turn the project over to their Information Technology programmers for implementation with some confidence that you won't surprised by the result.
VII. How do we evaluate your paper?
When evaluating your report, your instructor will look at both content and writing.
Some content considerations:
VIII. Revision History
Questions or comments regarding 6.033? Send e-mail to the TAs at
6.033-tas@mit.edu.
Questions or comments about this web page? Send e-mail to
6.033-webmaster@mit.edu.
Top // 6.033 home // Last updated $Date: 1999/04/18 15:38:53 $ by $Author: ilya_shl $