6.033 - Computer System Engineering | Handout 33 - April 1999
Should we be concerned about possible creation of phantom files, such that it happens in all the replica sites and cannot be detected even by majority polling?
There are at least two things that you might worry about. If a failure occurs in the middle of your protocol for adding a new file you might accidentally add a bogus file to all the replicas. And some algorithms for getting all the replicas to agree have the nasty property that if they find a bogus file on one site they immediately replicate it at all the other sites.
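To see what the second worry is about, note that a well-behaved agreement pass should let the majority overrule a stray copy, not propagate it. Here is a minimal sketch of majority polling over per-file fingerprints; the function name and the data shapes are invented for illustration, not part of any interface given in the design project:

    from collections import Counter

    def majority_verdict(fingerprints):
        """fingerprints has one entry per replica: the file's hash at that
        replica, or None if the replica has no copy.  Returns the value a
        strict majority agrees on, or raises if there is no strict majority."""
        value, count = Counter(fingerprints).most_common(1)[0]
        if count * 2 <= len(fingerprints):
            raise ValueError("no strict majority -- needs human attention")
        return value

    # A phantom file present at only one of five sites loses the vote, so the
    # right response is to delete the stray copy, not to replicate it:
    print(majority_verdict([None, None, None, None, "f3a9"]))   # prints None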
How is the metadata of a file stored? Is it in the directory entry, in the same region as the file, or somewhere else?
We didn't really specify it. One good assumption is that small metadata items are stored in the directory entry, and are thereby subject to the same level of recoverability as that entry. A large metadata item would probably be handled by the file system by placing it in some storage region similar to but distinct from the file contents, and having the same recoverability as the file contents. (So you can't use the file system to provide recoverability for the file contents by keeping a complete copy of the file in the metadata!)
Where is the boundary of the "repository"? The design description talks about operations on the repository, but there are also replicas and file systems. I'm not clear on how these things fit together.
Here is one way to think about it: the repository is a system that provides a network service with a well-defined protocol for putting and getting things, and it also has internal processes that maintain integrity. Among the components of the repository are replicas, which may have their own network protocols for putting and getting things, and their own internal processes for maintaining integrity. Among the components of a replica is a file system, with the specified interface.
What is the budget? We could store everything in RAM to speed things up, and install several dedicated optical fiber links across the Indian Ocean and tunneling the Himalayas to connect Capetown with Novosibirsk. We know that it is a government project, but does that really mean that money is no object?
We didn't ask you to minimize cost or maximize performance because the problem is hard enough as it is. However, you should exert a certain amount of realism--these are your tax dollars, and if your design is grossly expensive, someone else's design will be used instead. RAM has historically cost about 30 times as much as disk, so proposing to keep a backup copy of the entire database in RAM sounds a bit outlandish. And if you need to negotiate agreements with the governments of 21 countries in order to lay your fiber it seems likely that bureaucratic delays will sink your proposal.
As far as performance is concerned, just make your design work well enough to be useful to its intended audience. If a typical get operation takes a week, you did something wrong. On the other hand, if a typical get operation puts a document on my screen in a few seconds, there is no need to reduce that to milliseconds. We intentionally didn't say how many simultaneous reads to plan for, because we want you to focus on integrity rather than performance.
On the other hand, if you spot an expensive solution that would really simplify things, it is worth mentioning it in your report for possible future use.
In the section that describes the file system methods it says:
status = create-file(directory, metadata)
What is the behavior of this function when you attempt to create a filename that already exists in the directory?
Oops, that got left out of the spec. Assume that it returns a distinguishable error status.
What is the expected cost of the 800GB drives the library will be purchasing in 2005? (in 2005 US$)
The experience for the last 20 years has been that the disk capacity doubles every 18 months but the disk itself stays the same price, so the price per Gigabyte drops in half. Today routine disks are priced around $20/Gbyte, while high-performance disks run more like $60/Gbyte. And compared with the rate of improvement in technology, inflation is in the noise.
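If you want an actual dollar figure for the budget, here is the extrapolation spelled out (a back-of-the-envelope sketch; the $20/Gbyte starting point and the 18-month halving period are the assumptions just stated, not vendor promises):

    # Rough 2005 price of an 800-Gbyte drive, extrapolating the trend that
    # price per Gbyte halves every 18 months while drive prices stay flat.
    price_per_gb_1999 = 20.0                  # routine disks today, $/Gbyte
    halvings = (2005 - 1999) * 12 / 18.0      # six years = four halving periods
    price_per_gb_2005 = price_per_gb_1999 / 2 ** halvings
    print(price_per_gb_2005)                  # ~1.25 $/Gbyte
    print(800 * price_per_gb_2005)            # ~$1000 for an 800-Gbyte drive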
I was looking up Seagate's Cheetah 36 (ST136403FC0 hard disk specs, which advertises 5.7 msec average seek time (read) and 6.5 msec (write). Is this the value from the point where the read / write request is made to the time the disk starts getting read from / written to, or is there a missing factor here?
Yes, you are missing rotational delay. Check the paper by Katz, Gibson and Patterson for a brief tutorial on how disks work. That paper is ten years old, so all the disk sizes are a couple of orders of magnitude smaller than today, but the mechanics haven't changed.
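As a concrete example of where the time goes on a single small read, here is the usual accounting (a sketch; the 10,000 RPM spindle speed is an assumption typical of high-performance disks in this class, and transfer time for a small request is ignored):

    # Average time for one small random read: average seek + half a rotation.
    avg_seek_ms = 5.7                  # advertised average read seek time
    rpm = 10000                        # assumed spindle speed
    rotation_ms = 60000.0 / rpm        # 6 ms per full rotation
    avg_rotational_delay_ms = rotation_ms / 2
    print(avg_seek_ms + avg_rotational_delay_ms)   # ~8.7 ms per random read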
The speed of data transfer from disks is around 10 - 12.5 MByte/sec now. Do we assume this stays about the same through 2005?
Data transfer rates are limited by logic and bus speeds. When you double the linear density of data on a disk, and keep the same rotational speed, the rate that bits come off the platter doubles. And if you spin the disk faster to get shorter rotational delay times, the bits come out at yet a higher rate. These increasing raw bit rates are the reason why the industry keeps re-inventing SCSI as Fast-SCSI, Fast-and-Wide SCSI, SCSI 2, SCSI 3, etc. The fastest mass-produced channels today transfer around 80 MBytes/sec., but that speed is designed to cater to high-performance disks. A reasonable guess is that low-performance disks might be transferring around 160 Mbytes/sec in 2005.
Is the 125-disk figure for a single replica a limit, or can we assume there is room for more?
You can probably get away with attaching 150 disks to a single system, but probably not 300. A typical disk channel allows you to string 15 disks on it. There are systems that allow you to attach more than 10 channels, but you are moving out of the mass-produced server world into made-to-order stuff.
If money is no object, the traditional mainframe manufacturers (IBM, Amdahl, etc.) will sell you a machine that can attach 100 or 200 disk channels, and perhaps 1400 disks. But you may have to develop a 3-dimensional machine configuration to keep everything within the cable length specs. And don't try to run all 200 disk channels at the same time without investigating the bus specs to see if it can handle that aggregate data rate.
Where is the library going to store all these bits until 2005?
You can assume that the capacity of current disk technology to store the bits is always well ahead of the rate at which the Library manages to acquire the bits. If they started this year by installing 125 50-Gbyte disks, they could hold 6 million documents, which is more than the 5 million they have actually acquired so far.
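Spelled out, the arithmetic behind the six-million figure looks like this (a sketch; the roughly one-megabyte average document size is the assumption that figure implies, not a number from the problem statement):

    # Capacity of a replica built today versus the Library's current holdings.
    disks = 125
    bytes_per_disk = 50e9                 # 50-Gbyte disks available in 1999
    avg_doc_bytes = 1e6                   # assumed ~1 Mbyte per document
    print(disks * bytes_per_disk / avg_doc_bytes)   # ~6.25 million documents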
What kind of network do we have to work with? Can we expect to have 1MB/sec connections for each client? Or is it 100MB/sec? Or is it only 56k/sec?
Assume something like today's internet, with many telephone-based clients but some with cable modems. Improving communications involves convincing public utilities commissions and digging up streets, so it happens slowly. The replica site in the United States might have a 45 Mbit/sec connection to the Internet, but you will be lucky to get Capetown and Novosibirsk connected at a rate above 1.5 Mbits/second.
With hard drives, what is the likelihood of different errors? How do temporary read failures, sector damage leading to read failures, and whole-disk failures compare in likelihood? What other errors are common with respect to hard drives?
Disk vendors don't post a lot of information on this topic--they prefer to tell you that the minimum seek time is 2.0 ms--and what is posted is somewhat mysterious, unexplained, and possibly even self-serving. Here is what Samsung has to say about their 6.4 Gbyte hard disk. Other vendors post similar numbers, though non-recoverable read errors are often reported to be at rates that are one or even two orders of magnitude higher.
Samsung SV0644A (6.4 Gigabytes)
    Seek error:                   1 in 10^6 seeks
    Recoverable read error:       1 in 10^10 bits read
    Non-recoverable read error:   1 in 10^14 bits read
    MTBF:                         500,000 power-on hours
    MTTR (typical):               5 minutes
    Start/stop cycles:            50,000
    Component design life:        5 years
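To get a feel for what those rates mean at the scale of a replica, here is a rough count of the non-recoverable read errors to expect in one complete scan (a sketch; the 2005 replica size is taken from the disk-count discussion elsewhere in this handout, and the pessimistic rate reflects the caveat about other vendors):

    # Expected non-recoverable read errors per complete scan of one replica.
    replica_bytes = 125 * 800e9              # 125 disks x 800 Gbytes, ~100 Tbytes
    bits_read = replica_bytes * 8            # ~8 x 10^14 bits per full scan
    for label, rate in [("quoted rate (1 in 10^14 bits)", 1e-14),
                        ("pessimistic rate (10x worse)", 1e-13)]:
        print(label, bits_read * rate)       # ~8 and ~80 expected errors per scan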
Is it safe to define a UID/URN as the MD5 digest of a file? According to the MD5 docs, "it would take approximately 2^50 attempts on average to find 2 inputs producing the same digest." But we only have approximately 2^28 objects to store. This seems to say that the probability of UID/URN collisions based on MD5 digests would be "low."
But design-wise, is this a good design principle to follow? Is this tradeoff justified?
(Not being sure what to make of the 2^50 attempts, we sent this question to Ron Rivest. Here is his answer.)
You can think of MD5 or other hash functions (e.g. SHA1) as being essentially a "random" function of their inputs. Thus, they are a good basis for unique document identifiers, etc.
Given one document D, the effort required to come up with another document D' having the same MD5 value (i.e., MD5(D) = MD5(D')) is about 2^128.
On the other hand, the difficulty of coming up with two documents D and D' that have the same MD5 value is about sqrt(2^128) = 2^64, by the birthday paradox.
SHA1 has an output of 160 bits, rather than 128 bits as for MD5. Thus the first effort above is raised to 2^160, and the second to 2^80. I think that a 160-bit output is probably a better design choice for these reasons, and thus I would recommend it.
If you have 2^28 objects in your repository, the chance that two of them have the same SHA1 value can be estimated as
(2^28 choose 2) / 2^160, which is roughly 2^55 / 2^160 = 2^(-105).
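If you would rather compute that bound than trust the exponent arithmetic, here is a quick check (a sketch using Python's standard math module; nothing here is part of the design project interface):

    from math import comb, log2

    # Birthday-bound estimate: chance that any two of n random 160-bit hash
    # values collide is roughly (n choose 2) / 2^160.
    n = 2 ** 28                       # number of objects in the repository
    p = comb(n, 2) / 2 ** 160
    print(log2(p))                    # about -105: utterly negligible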
Don't worry, be happy...
Can any replica site add items to the repository, or do all additions come from one site only?
You may have missed an important piece of the big picture. Replica sites don't add new items to the repository. Replica sites are components of the repository. Other people, such as librarians sitting at their office workstations, add items to the repository by invoking the put and get operations via network protocols that you design. You get to decide exactly how those operations work--what site or sites the librarian's workstation contacts, what data is sent where, how the protocol decides when the repository has agreed to store the file (that is, when it commits), and so on.
On the other hand, depending on how you do integrity checking, it may be that a replica site on its own figures out that it is missing a copy of an item that the other replicas have, in which case it may take the initiative to get a copy from another replica site. But that would be described as adding a copy to a replica site, not as adding a new item to the repository.
Although there are many designs that will meet the requirements of the design project, most of them probably involve building up the design in several layers, and placing client/server boundaries between some of those layers. You may find it helpful to look at the layering buildup that is illustrated in chapter 8 section B and appendix 8-B. Those examples develop a different set of layers from the ones that you need for the repository, but they do show a style of layer thinking that may be useful. Similarly, you may find it helpful to look at the discussion of multi-site commit in chapter 8, section E. Again, the protocol illustrated there probably isn't the protocol that you would use (it is quite a bit more heavy-duty than you need), but it does illustrate placing a client/server boundary in the middle of an update implementation.
Can we assume that the original digitized source that is entered by the human librarians is correct?
That must be a philosophical question. Nothing created by humans can be assumed to be correct, but it is the best we have. The important point is that if a human librarian said "put it in the repository" the repository shouldn't ask questions (except maybe to notice that it already has another object with exactly the same pattern of bits in it). The responsibility of the repository is to guarantee that things intended to be there survive durably, and that things not intended to be there aren't.
What is meant by 'bogus data' in the description of dp2? Is that data not entered by human librarians, or data that has errors in it, or something else entirely?
"Bogus data" is anything that was not originally supplied as an argument to a repository "put" operation.
Say we want to create a new replica site in London because the replica on the island of Diego Garcia was swept out to sea in a tidal wave. The design project statement says "any procedure that attempts to copy 105,000,000 files is likely to be interrupted a few times before it completes." That seems to imply that we are expected to transfer all the replica data over the Internet to the new replica site. Are we allowed to consider copying the data onto a set of new disks in Washington, and sending them by FedEx to London?
Absolutely. The comment about interruptions was not meant to imply that the copy should necessarily be made over the Internet--it had a completely different purpose. Whether you copy over the Internet or over dedicated fiber lines or directly onto 125 new disks to be shipped by FedEx, you still have to give some careful thought to how you are going to make the copy operation recoverable. Restarting the copy procedure from the beginning every time it is interrupted may not be a winning approach, but reliably restarting in the middle requires being very organized. At the same time, be sure to think about the last sentence of suggestion number 6 of the design project statement.
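One way to make the copy recoverable without restarting from the beginning is to keep a small durable checkpoint of progress and verify each file before recording it. A minimal sketch, in which every helper (load_checkpoint, list_source_files, copy_one, checksum_matches, append_checkpoint) is a hypothetical stand-in for whatever your design actually provides:

    # Restartable bulk copy: skip files already recorded in the checkpoint,
    # and record a file only after its copy has been verified, so an
    # interruption at any point costs at most one file's worth of rework.
    def copy_replica(source, target, checkpoint_path):
        done = load_checkpoint(checkpoint_path)        # names copied so far
        for name in list_source_files(source):
            if name in done:
                continue                               # finished before the interruption
            copy_one(source, target, name)
            if not checksum_matches(source, target, name):
                raise IOError("copy of %s did not verify; retry on restart" % name)
            append_checkpoint(checkpoint_path, name)   # durable only after verification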
Is it okay to change the spec for the repository-level put? Currently it takes only one argument, but we would like to pass along some metadata to be stored with the file, and that would require adding another argument.
No problem; the interface suggested in the design project writeup was intended only to illustrate the function and style, not to constrain the details.
Why does remove-file require a data argument? Is that a safety measure or what?
Nope, it is another bug in the spec. It should say just
remove-file(directory, filename)
We want to keep a single bit flag on disk in order to indicate whether or not an event has occurred. We would like to read and write this bit directly without going through the file system. Can we assume that there is an interface to the disk called "write" that will bypass the file system and write a bit (or maybe a byte) directly onto disk? If so, is this atomic? (In lecture, Frans assumed that he could atomically write one bit to disk.)
This approach creates a couple of problems you probably really don't want to get into: the file system specification doesn't offer a raw-disk interface, so you would have to invent one and then somehow coordinate with the file system over which sectors it owns; and disks read and write whole sectors, not individual bits or bytes, so a direct one-bit write isn't an operation the hardware actually provides.
In that case can we assume that writing a single bit to disk with write-file is atomic?
That question is headed down a better path. You can't guarantee that a crash won't occur before the write-file completes, but writing a single bit either will happen or it won't; you don't have to worry about a half-written single bit. On the other hand, when you try to read it, you may find that the track the bit was written on had a hard error and has been revectored.
For purposes of assessing network traffic demands, can we assume that repository-put requests will come predominantly from the Washington, DC area?
That seems like a perfectly reasonable assumption.
We were thinking of using {choose one: Library of Congress call number, Dewey Decimal System call number, Library of Congress cataloging number, International Standard Book Number, the book title} as object identifiers, but we would have to extract them from deposited objects--if they are in there--and that seems like an abstraction violation.
Yes, peering inside the objects the librarians give you should be a no-no. The right way to do this is to respecify the repository put request and require that the librarian supply whatever piece of metadata you had in mind.
If the replica sites run the same software, do the same mechanisms/protocols always necessarily produce the same errors (if there are any)? That is, given the same circumstances, do the bugs of the same mechanisms in different sites always generate the same results?
Unfortunately, the answer is yes and no. Some bugs replicate very nicely; when presented with the same inputs the program always fails in exactly the same way. Other bugs--particularly coordination errors, which are time-dependent, and bugs that involve reading uninitialized data--typically produce different results each time, because they depend on the state of other things that happened previously or are happening at the same time.
Should we compile statistics for the failure modes of hardware (disk, processor, etc.), software, and network, or do we just focus on the fault analysis of our protocols? If we need to worry about the former, how can we provide a long-term analysis if the specifications for the components constantly change over time?
We were hoping you would do a brief fault analysis of the various things that can go wrong both for hardware and software, like appendix B of chapter 8 does for atomic-put. But that wouldn't be time-dependent--the nature of the errors for a given technology doesn't change that much with time. It sounds like you may be thinking of calculating numerical reliability or MTBFs. That isn't required, though there are a few places where it might be a good idea to support your design. Rough back-of-the-envelope estimates to the nearest order of magnitude or so are probably more than adequate for that, and again, those won't change much with time.
Since Prof. Rivest recommended using SHA1 over MD5 as a hash function, our group is wondering how fast SHA1 runs and what the prospects are for speedups over the next five years. And is anyone likely to implement SHA1 in hardware?
(From Rivest) Schneier's book Applied Cryptography on page 456 gives a table comparing speeds of a dozen different hash algorithms on an Intel 486SX running at 33 MHz. The following two entries from that table are pertinent to the question:
    SHA     75 Mbytes/second
    MD5    174 Mbytes/second
SHA1 is a slightly tweaked version of SHA that probably runs at about the same speed. For those performance levels on a 33 MHz machine, they must have been working with data in RAM--perhaps even preloaded in cache--because both rates are faster than the usual disk channel. A modern 350 MHz machine could presumably do both functions 10 times faster, so the time to do either SHA1 or MD5 looks negligible compared with the time required to read the file in from the disk.
Wow! Those must have been really hot implementations of MD5 and SHA that were reported in Schneier's book (see Q & A above). A 33 MHz 486 doing one 4-byte add from the cache per machine cycle would only be able to add numbers at 132 Mbytes/second. How were they able to get MD5 to run faster than addition?
Apparently you can't believe everything you read. After a bit more asking around, Ron Rivest turned up what seems to be a competent, properly-explained comparison of a large number of hash and encryption implementations, at http://www.esat.kuleuven.ac.be/~bosselae/fast.htm.
The numbers posted there, for a 90 MHz Pentium with optimized code and with all data already in the on-chip cache, are:

              Mbit/sec    Mbyte/sec
    SHA-1        55            7
    MD5         137           17
These numbers seem much more plausible than the ones quoted by Schneier. If we again assume that these numbers scale with clock speed, it appears that a late-model 360 MHz Pentium running either hash can easily keep up with a modest-performance disk channel, although it would not be a negligible effort.
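If you want to check that conclusion on a machine you actually have, here is a quick measurement sketch using Python's standard hashlib (the buffer size and repetition count are arbitrary choices):

    # Rough SHA-1 and MD5 throughput on data already in memory.
    import hashlib, time

    buf = b"\0" * (8 * 1024 * 1024)            # 8 Mbytes of in-RAM data
    for name in ("sha1", "md5"):
        h = hashlib.new(name)
        start = time.time()
        for _ in range(16):                    # hash 128 Mbytes in total
            h.update(buf)
        elapsed = time.time() - start
        print(name, "%.0f Mbyte/sec" % (128 / elapsed))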
We are still puzzled over the suggestion that we do an error event analysis. What is its scope? Is it just on the subsystems we design? Does it include the hardware? The whole system? What is its focus?
The event analysis should really be part of your design procedure. Start with the components you are building on: replica sites running the given file system, with a network interconnecting them. List the anticipated errors, classify them as tolerable or intolerable, and identify the intolerable errors you intend to do something about. Then begin adding layers, processes, and procedures, doing a new error event analysis as you go, so that the reader can tell just what you intend each subsystem or layer to do. You should end up with an error event analysis for the repository as a whole. The example in chapter 8, page 8-24 demonstrates the idea, and that demonstration continues in Appendix 8-B, page 8-79. Remember that the purpose of the event analysis is to get your assumptions out on the table--in particular, to make clear exactly which errors you expect each component of your system to handle.
What's a reasonable limit on how long a new replica creation should take? Six months seems too long, and a week is probably fine. What about 2 weeks? A month? It seems difficult to compute an MTTF for a massive site failure caused by things like intentional destruction.
The systematic answer is that if there are nominally N replicas, and K replicas must be operational in order to handle routine disk decay events and network outages, then the chance that N-K+1 sites are lost to disasters at the same time must be kept tolerably small. So if you propose an MTTR (for replica creation) the person who configures the system can then calculate how large a difference to maintain between N and K to attain a particular MTTF (for repository loss) assuming the MTTF (for replica loss) is known. The next step is to consult an insurance actuary about the MTTF for replica loss.
Whatever the actuary says, you want to make the MTTR for replica creation as small as economically reasonable, because smaller MTTR means that fewer replica sites are needed. Your intuition about acceptable MTTR's is based on what seems easily attainable, and that is an appropriate way to proceed.
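To make that reasoning concrete, here is the corresponding arithmetic (a sketch assuming independent replica losses; the five-replica configuration, the ten-year replica MTTF, and the two-week rebuild MTTR are made-up numbers for illustration, not recommendations):

    from math import comb

    # Chance that, at any given moment, at least N-K+1 replicas are lost and
    # not yet rebuilt -- the situation the repository cannot survive.
    def p_repository_exposed(n, k, mttf_days, mttr_days):
        p = mttr_days / float(mttf_days + mttr_days)   # fraction of time one replica is missing
        need = n - k + 1                               # this many missing at once is fatal
        return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(need, n + 1))

    # Example: five replicas, three needed, a replica destroyed about once a
    # decade, two weeks to rebuild one.  Halving the MTTR cuts this exposure
    # by roughly a factor of eight, which is why MTTR matters so much.
    print(p_repository_exposed(n=5, k=3, mttf_days=3650, mttr_days=14))   # ~6e-7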
We were wondering if the information coming across the repository put interface is compressed. If it is not, we are thinking of compressing it to reduce disk read time and data transfer time.
The Library of Congress, being versed in end-to-end arguments, chooses storage formats that it considers appropriate for its different classes of materials; that includes carefully thinking through which compression algorithms to use for each class. So you should assume that compression is taken care of before the data gets handed to the repository, and that people who call repository-get will be able to figure out how to decompress whatever you hand them. (The Library will probably have also signed the data and in some cases maybe even encrypted it. All of these activities should be considered the responsibility of some layer above the repository.)
When deciding which disk to use for adding new data, we'd like to have some idea of the space in use and available on each disk. There's no method provided for this, short of trying to create files on each disk and checking the status. Is it reasonable to request ("insist on") a file system method that returns disk usage information for mounted disks?
Adding a get-disk-metadata method seems quite compatible with the statement in the design project writeup that says,
You may discover other requirements or features for the file system. If they seem consistent with the general design philosophy so far, you can request that the file system designers add those features or meet those requirements.
The idea behind that provision is not to discourage suggestions for features, but rather to get you to give a little thought to how reasonable a proposed feature is.
It says in the Q & A that we can assume a disk rate in 2005 of 160 MBytes/second. A back-of-the-envelope calculation suggests that at that rate it would be possible to read and check every file in a replica in about a week. My intuition says that is optimistic. What is missing from the back of my envelope?
The main problem is that your seven-day calculation is based on the assumption of continuous reading from a disk at its peak transfer rate, which is actually hard to sustain. For example, you have to stop to move the seek arms every once in a while; every time you move a seek arm some rotational delay will almost certainly follow. The data has to be laid out in exactly the right order to allow continuous reading. The disk has to be designed so that after you read all the data from one track you can immediately start reading the data from another track in the same cylinder without waiting for one rotational delay. Client reads, updates from the librarians, recovery from recoverable read errors, and repair activities will also stretch that number out. So you may be able to read one file at that rate, but without extreme cleverness, reading a whole replica is likely to take five to ten times as long.
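For reference, here is the envelope itself with a derating factor attached (a sketch; the five-to-ten-times factor is the guess from the paragraph above, not a measurement):

    # Time to read every byte of one 2005 replica, first at the peak rate of a
    # single 160-Mbyte/sec channel (which is where the week-long figure comes
    # from), then derated for seeks, rotational delays, client traffic,
    # recoverable-error retries, and repair activity.
    replica_bytes = 125 * 800e9                 # ~100 Tbytes
    peak_rate = 160e6                           # bytes/second at peak
    ideal_days = replica_bytes / peak_rate / 86400.0
    print(ideal_days)                           # ~7 days of uninterrupted peak-rate reading
    print(ideal_days * 5, ideal_days * 10)      # more realistically, five to ten weeks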
Even if we assume that non-recoverable read errors are ten times as high as the rates quoted for the Samsung disk, they are so rare that the chance of exactly the same sector being lost in two replicas is vanishingly small. So what is the big deal about hurrying to fix missing sectors?
Should we optimize our system to allow users to use the closest site to them for reads?
Thanks to design project one, you can safely assume that everyone in the known universe is an expert on distributing clients among replicas, and you can therefore dismiss this concern with one sentence or less in your writeup. For this design project, concentrate on reliability, not performance.
The file system spec says that when bad sectors are discovered they are marked so that they will never be used again. Yet it also says that write-file may store an incomplete file. This seems inconsistent.
The wording "may store an incomplete file" means that the file system is rather simple-minded. If the last time it used sector 914287, that sector was OK, it will try to write a block of your file there; if this write operation reveals that the sector has turned bad, then it will mark the sector bad, and in addition that block of your file just doesn't get written. A cleverer file system might immediately look for another place to put this block, but our file system designer was paid to be simple and predictable, not clever (and potentially buggy). Besides, that sector might go bad a moment after the write, which produces exactly the same effect.
As we write this thing up we are having a hard time squeezing it into the suggested 8-10 pages. Are we going to lose big if we go over that limit?
"Your paper should be as long as is necessary to explain the problem, your solution, the alternatives you considered, and the reasons for your choices. It should be no longer than that. We estimate that a good paper will be 8 to 10 pages in length."
So 8-10 pages is an estimate, not a hard limit. A paper longer than that won't be automatically assumed to be a loser, but the reader will grow impatient if the content doesn't justify the length; this impatience undoubtedly grows non-linearly with length.
One area that design teams may be able to shorten: the assignment suggests that the first thing the paper should do is "Explain the problem and the externally imposed constraints". This section should not simply repeat what it says in the design project handout. The real purpose of that section is for you to point out problems and constraints that the group addressed and that were not especially obvious from the original problem statement.
Beyond that, if your paper is coming out to 20 pages, it may be that you are working on a bigger problem than the one you were asked to do. Or that your solution is more complicated than necessary.