6.033--Computer System Engineering
Suggestions for classroom discussion of:
Randy H. Katz, Garth A. Gibson, and David A. Patterson. Disk system
architectures for high performance computing. Proceedings of the IEEE
77, 12 (December, 1989) pages 1842-1857.
by J. H. Saltzer, revised April 15, 1996, with many suggestions from
Greg Ganger. Minor update with 2000 disk prices, April 2000, and
small corrections, January 2001. Updated April 2001 with 2001 disk prices.
Note that the preparation of M.I.T. students for this paper is variable.
Some are very familiar with the principles of error detection and
correction, while others know only what it says in chapter 7. Some have a
good intuition for how disks work; others don't have a clue. This paper is
the opportunity to bring everyone up to the same level in both areas.
A second point to note is that since the time of this paper a small
industry has grown up around design and implementation of RAID systems.
While providing the high reliability suggested by the RAID paper, these
modern systems also try to ensure very high availability, and they
fine-tune performance in sometimes exotic ways. The discussion here
generally ignores both availability measures and anything other than
gross performance issues.
Related to this second point is that the various RAID levels involve two
orthogonal concepts: varying forms of redundancy, and varying forms of
interleaved data organization. The paper is presented primarily from
the point of view of reliability, and it gives the impression that the
choices of various data organizations are driven primarily by
reliability considerations. In practice, choice of data organization
is driven by performance considerations.
The data organization itself has at least two parameters. The first is
the depth of interleaving. If you have N disks and interleave across K
of them, you can set K = N (deep interleaving), 1 < K < N
(shallow interleaving), or K = 1 (no interleaving). The second data
organization parameter is grain: you can interleave bits, bytes,
words, sectors, tracks, or cylinders. Since sectors are the minimum
quantity that the disk can actually read or write, setting the grain
finer than that doesn't accomplish much, even though this paper
describes such organizations.
One can construct the following matrix in which the two parameters of data
organization go across the top and the various redundancy techniques
run down the left:
                  ------------------ interleaving ------------------
                  none        shallow              deep
   grain:                     coarse    fine       coarse    fine
   __________________________________________________________________
   irredundant                ------------ RAID 0 --------------
   replication    RAID 1
   ECC                                                         RAID 2
   parity         ------ RAID 4 ------             RAID 3'    RAID 3
with RAID 3' being a simplifying variant on RAID 3 and RAID 5 being an
implementation hack that improves RAID 4. One could design a system
using any point in this matrix. The paper creates an artificial
linearization involving just five matrix entries; this linearization
makes it appear that the design considerations are driven by
reliability. It may provide some insight to construct the above matrix
while going through the discussion below.
- Buzzword patrol. What is the fad word for "interleaving"?
(Striping.) What is a sector? (The smallest unit that can be read
from a disk, generally the unit protected by EDC and ECC.) What is a
track? (All of the sectors that can be read from one side of a single
platter without moving the heads.) What is a cylinder? (All of the
tracks that can be read without moving the heads.) What is a
zero-latency read? (Suppose you give the controller a read request for
a full track. The disk controller can begin taking data off the disk as
soon as the next sector comes under the read head; it isn't necessary
to wait until the disk rotates to the first sector of the track. The
rotation of the sectors to the correct position is done in a RAM buffer
before delivering the data to the client.) What do EDC and ECC stand
for? (Error-Detecting Code and Error-Correcting Code. Details later.)
- This paper was written in June, 1988. Now, a typical
low-performance mass market disk specification is
     Internal hard drive, April 2001 (from CompUSA, 4/23/01)
        Maxtor V0696000H, IDE ATA/66 interface w/ 2 Mbyte buffer
        60 Gbytes
        $180
        7 watts (est.)
        3.5 inch
        5400 RPM
        9.0 ms average seek time
Are the technology curves described then holding up today?
(The curves
at the beginning of the paper discuss only technology changes, not cost.
But look at figure 11, page 1852. Figure 11 suggests a drop from
$28/Mbyte to $4/Mbyte between 1982 and 1990, an improvement of about 35%
per year. Carrying that rate forward another 11 years to 2001 we should
have about $35/Gbyte. The price quoted above is $3/Gbyte. Disk prices
have actually dropped about 47% per year since 1990, quite a bit faster
than they did in the years covered by the paper.)
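For anyone who wants to check the arithmetic, here is a quick Python sketch
of the extrapolation (the dollar figures are the ones quoted above; nothing
here comes from the paper beyond the $4/Mbyte starting point):

    # Back-of-the-envelope check of the disk-price extrapolation above.
    dollars_per_gb_1990 = 4.0 * 1000          # $4/Mbyte in 1990, from figure 11
    rate = 0.35                               # 35%/year decline, carried forward
    projected_2001 = dollars_per_gb_1990 * (1 - rate) ** 11
    print(round(projected_2001))              # about $35/Gbyte

    actual_2001 = 180.0 / 60                  # the $180, 60-Gbyte drive above = $3/Gbyte
    implied_rate = 1 - (actual_2001 / dollars_per_gb_1990) ** (1 / 11)
    print(round(implied_rate, 2))             # about 0.48, i.e. the ~47%/year quoted above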
- What are the numbers that would be filled in for figure 12 (page 1852) today?
(We are still in the rightmost column, with 3.5-inch disks. But the
60 Gbyte disk is 300 times bigger, so the MB/watt and MB/cuft are 300 times
bigger. Packing density is now 3000 Gbytes/cubic foot. If you are short of room for
your digital library, spinning disks now have better volumetric efficiency
than CD's in jewel boxes! 120 jewel boxes per cubic foot, close-packed and
filled with DVD's holding 6 Gbytes each, provide only 720 Gbytes/cu ft.)
- What is figure 9 on page 1848 telling us? (Simply put, that reading
small objects from a disk burdens an I/O system differently from reading
big objects. But this picture is a good opportunity to verify that
everyone in the class understands the underlying causes of seek time,
rotational latency, and data transfer bandwidth.)
- What is interleaving (sometimes called AID or RAID level
0 in advertising, but not in this paper)? (If you have N disks,
write a file by putting the first {bit, byte, sector, block,
track, cylinder} of data on disk one, the second on disk two,
..., the Nth on disk N, the N+1st on disk one, etc.)
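As a concrete sketch (block-grain interleaving, with 0-based numbering rather
than the 1-based numbering in the answer above), the mapping from a logical
block number to a physical location is just a remainder and a quotient:

    # Block-grain interleaving ("striping") across n_disks disks, 0-based.
    def locate(logical_block, n_disks):
        disk = logical_block % n_disks        # which disk holds this block
        offset = logical_block // n_disks     # position of the block on that disk
        return disk, offset

    # Logical blocks 0..7 on 4 disks land on disks 0,1,2,3,0,1,2,3,
    # at offsets 0,0,0,0,1,1,1,1.
    print([locate(b, 4) for b in range(8)])

Shallow interleaving (below) is the same mapping applied to a group of K < N
disks rather than to all N.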
- What is the point? (You can write to and read from all N disks in
parallel, increasing the data bandwidth by a factor of N. If your
application is in a hurry, this can be a win.)
- Why do I have to interleave to get that factor of N? (You don't.
But if you don't interleave, the data on disk 2 probably is completely
unrelated to the data on disk 1, so your application isn't very likely
to ask to read from both at the same time. If you put the first block
of a file on disk 1 and the second block of the same file on disk 2,
then when an application asks to read the file you can fetch both
blocks at the same time. Interleaving is a strategy for taking advantage
of a common pattern of access to data, converting a sequential pattern
into a parallel one. We are exploiting spatial adjacency, but in a
different way from a cache exploitation of that same property.)
- What happens if you interleave across just some of the disks? (This
is called shallow interleaving, and it gets you only part of the
possible data bandwidth without requiring that you tie up all the disks
on every read or write. [Create table and insert RAID 0])
- When is shallow interleaving preferred over deep interleaving?
(When the natural unit of data size for the application is less than N
sectors.)
- So what is RAID 1? (Just write everything in the same place on
two--or N--disks. No interleaving. [Add RAID 1 to the table])
- What is the difference between "mirror" and "shadow"? (This paper
seems to use the word "mirror" to mean exactly two copies, counting the
original as one of them. "Shadow" here means N copies, for N > 2.)
- How on earth did we end up with such ridiculously overlapping
terminology? (These things often arise because different companies,
organizations, etc. each pick a name without considering, knowing, or
caring that someone else has already done so. So, Digital calls it
"shadowing", IBM calls it "duplexing", Berkeley calls it "RAID 1", and
"mirroring" came from somewhere else. "Shadowing" is not really a
generalization of "mirroring"; it is just an accident of the fact that
Digital supported--and promoted, for read performance
purposes--multi-copy replication and the others didn't.)
- What is the slowdown factor "S" all about? For example, it says
that the write bandwidth is D/S writes per second compared to a single
disk. (If I write one copy, there will be some seek + rotation delay,
averaging about half the maximum amount for that disk. If I write two
copies on different, independently rotating, disks at the same time, I
can't claim the write is complete until *both* have been written, so I
have to wait till the *longer* of the two seek + rotation times. For N
copies, I have to wait till the *longest* seek + rotation time of any
of the N disks. As N grows, one would expect this delay to increase
from about half the maximum delay to something closer to the maximum
delay. S is a divisor whose value is between 1 and 2.)
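A small simulation makes S concrete. This is only a sketch, under a
simplifying assumption that is not from the paper: each disk's seek +
rotation delay is modeled as uniform between 0 and the maximum.

    # Rough simulation of the slowdown factor S for N-copy writes.
    import random

    def average_write_delay(n_copies, trials=100_000, max_delay=1.0):
        total = 0.0
        for _ in range(trials):
            # The write completes only when the slowest of the N disks finishes.
            total += max(random.uniform(0, max_delay) for _ in range(n_copies))
        return total / trials

    # One copy averages ~0.5 of the maximum; N copies average ~N/(N+1), so
    # S = average(N)/average(1) climbs from 1 toward 2 as N grows.
    for n in (1, 2, 4, 8):
        print(n, round(average_write_delay(n), 3))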
- What about the "D" part? (This paper uses "D" to mean the number
of different places you can write data, and for mirroring that is the
number of two-disk groups, not the number of disks. In principle you can
have an independent write going to each such group. So the maximum
attainable bandwidth ought to be D times as great as for one disk,
degraded by the factor S, so the result is D/S.)
- Then why does it say "RAID Level 1 can perform 2D/S
sector reads/second compared to a single disk"? (First, that
paragraph is talking about mirroring, which means N = 2. There
are D groups of 2 disks each, for a total of 2D disks and if you
get lucky you can be reading a different sector from every one of
the disks at the same time, which accounts for the 2D. The
slowdown factor S is a little harder to explain--in fact, the
paper doesn't explain it anywhere. If your interface to
the client requires waiting to deliver any data until all the
data is available, then the same analysis as for writes
applies. If you can begin delivering data as soon as the first
sector comes off the disk, S would not apply. Note that you lose
this option when you start to add interleaving. Note also that
when you get to part C of the paper, the slowdown factor silently
disappears from the comparison discussion.)
- If things slow down on multiple-copy writes, shouldn't they speed
up on multiple-copy reads? (Yes! If you were clever enough to post
reads for both copies of a mirrored sector, and return the first one
that comes back, you should on average have a shorter latency, and the
slowdown factor becomes a speedup factor instead. This tactic does use
up bandwidth, though, so it is unlikely to be worthwhile unless you are
smart enough to predict which copy is going to come to a read head
first and post a read only to that one. Some disks can report their
position, with the goal of making this kind of cleverness feasible.)
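The mirror image of the simulation above: if you post reads to both copies
and keep whichever returns first, the expected delay is the minimum of two
draws rather than the maximum (same uniform-delay assumption as before, so
again only a sketch):

    # Expected positioning delay when reads go to both mirror copies
    # and the first one back wins.
    import random

    def average_first_copy_delay(n_copies, trials=100_000, max_delay=1.0):
        total = 0.0
        for _ in range(trials):
            total += min(random.uniform(0, max_delay) for _ in range(n_copies))
        return total / trials

    # One copy: ~0.5 of the maximum; two copies: ~0.33 -- a speedup factor
    # instead of a slowdown, at the price of occupying both disks.
    print(round(average_first_copy_delay(1), 3), round(average_first_copy_delay(2), 3))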
- So what is wrong with RAID 1 that would cause anyone to want to do
anything more elaborate? (The factor of N reduction in disk capacity
is a bigger hit than many people are prepared to pay.)
- What does RAID 2 do to improve that? (It applies an
Error-Correcting Code to the data, just as if it were RAM.)
- How does RAM ECC work? (Give an example. The simplest
one that has intuitive pedagogical value is probably to use 4
bits of data with three check bits. The three check bits carry
exactly enough information to distinguish 8 cases: no error,
error in bit 1, error in bit 2, ... error in bit 7.
                                  P3        P2   P1
               bit      7    6    5    4    3    2    1
                        +         +         +         +   = bit 1 is exor of every other bit
                        +    +              +    +        = bit 2 is exor of every other pair
                        +    +    +    +                  = bit 4 is exor of every other four

Suppose bit 5 flips. When we recalculate the ECC, P3 and P1 won't
agree with the bits that came with the word. If we write P3,P2,P1
showing the troubled bits as ones, we notice that their binary
value is 101, or 5. They are pointing at the culprit.
The efficiency of this example is a little misleading. If we had chosen a
63-bit code we could have done the same thing with 56 data bits and 7 for
parity checks.)
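The same example, worked as a short Python sketch: fill in the three check
bits, flip bit 5 in "storage," and watch the recalculated checks point at it.
(Bit positions are numbered 1 through 7, exactly as in the diagram above.)

    # Hamming (7,4): check bits live at positions 1, 2, and 4; the check bit
    # at position p covers every position whose binary expansion includes p
    # (so position 1 covers 1,3,5,7; position 2 covers 2,3,6,7; 4 covers 4-7).
    def recompute_checks(word):               # word: dict position -> bit
        return {p: sum(word[i] for i in range(1, 8) if i & p) % 2
                for p in (1, 2, 4)}

    # Data bits go in positions 3, 5, 6, 7; set each check bit so that the
    # exclusive OR over everything it covers comes out zero.
    word = {3: 1, 5: 0, 6: 1, 7: 1, 1: 0, 2: 0, 4: 0}
    for p in (1, 2, 4):
        word[p] = sum(word[i] for i in range(1, 8) if (i & p) and i != p) % 2

    word[5] ^= 1                              # bit 5 flips in storage

    syndrome = recompute_checks(word)         # the checks that now disagree
    print(4 * syndrome[4] + 2 * syndrome[2] + syndrome[1])   # prints 5: the culprit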
- How do we apply this idea to an array of disks? (Buy 63 disks.
Suppose they are organized in 1 Kbyte sectors. Tell your customers that
they have to write data in 56 Kbyte blocks. When you are handed a 56Kbyte
block, plan to put the first bit on sector one of disk 1, the second bit on
sector one of disk 2, etc., continuing till you get to disk 56. Then
calculate an ECC code of 7 more bits and write them as the first bit
on disks 57 through 63. Now go back to disk one and place bit 57 in
the second bit of the first sector, etc. Of course you can't begin writing
anything until you have laid all the bits out in RAM, but that is
a detail that doesn't affect the idea. [Add RAID 2 to the table])
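A sketch of that layout as code (the disk and bit numbering follow the
description above; a real RAID 2 controller of course does this in hardware):

    # RAID 2 bit-interleaved layout: 63 disks with 1-Kbyte sectors,
    # data accepted in 56-Kbyte blocks; disks 1..56 hold data bits and
    # disks 57..63 hold the 7 computed ECC bits for each 56-bit slice.
    DATA_DISKS = 56

    def place_data_bit(i):
        """Map the i-th data bit (0-based) of a block to
        (disk number, bit offset within the sector)."""
        return i % DATA_DISKS + 1, i // DATA_DISKS

    # Bits 0..55 become bit 0 of disks 1..56; bit 56 becomes bit 1 of disk 1.
    print(place_data_bit(0), place_data_bit(55), place_data_bit(56))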
- What is the problem with this approach? (1. 56 Kbyte blocks are pretty
big; there may not be a lot of applications with that much to write.
2. To read anything you have to launch 56 parallel reads, and wait
for the last one to complete. The read slowdown factor is probably
pinned at 2, and we are really soaking up bandwidth.)
- How does RAID 3 fix this? (By recognizing that
erasure errors are easier to correct than unknown
errors. The RAM ECC is designed assuming that we have no
idea which bit might be in error. But modern disk controllers
are fail-fast: they write each sector of data on the disk
with an Error-Detecting Code (EDC). Then, when that
controller reads back the data it can check the EDC and
return a status report. If the disk controller reports that
a sector contains bad data, we don't need to have the RAID
ECC point to the bad disk, what we need is to figure out what data
value was on that bad disk. One bit of parity for each
bit of a sector is sufficient for that. [Add RAID 3 to the
table])
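A sketch of why one parity disk is enough once the failing disk identifies
itself: define the parity sector as the exclusive OR of the data sectors, and
the lost sector is just the exclusive OR of everything that survived.
(Byte-wise XOR here; the grain doesn't change the idea.)

    # Erasure correction with a single parity disk (the RAID 3 idea).
    from functools import reduce

    def xor_sectors(sectors):
        # Byte-wise exclusive OR of equal-length sectors.
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), sectors)

    data = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"]   # three data "sectors"
    parity = xor_sectors(data)                                   # goes on the parity disk

    # Disk 1's controller reports its sector unreadable; rebuild from survivors.
    lost = 1
    survivors = [s for i, s in enumerate(data) if i != lost] + [parity]
    assert xor_sectors(survivors) == data[lost]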
- So we don't need 63 disks, we can get along with 57? (Better. We
can apply this approach to any number of disks, e.g. 3 data disks and
one parity disk. That way we only demand of the client a data transfer
size of 3 Kbytes.)
- So why doesn't anyone actually use RAID 3? (Bit interleaving
would take extra steps to implement and would be fragile; the designer
can gain simplicity and robustness by switching to sector
interleaving. If we call that RAID 3', we get to fill in
another slot in our table, above. There are real, marketable products
using RAID 3', though they are advertised as RAID 3 because that is
the closest thing in this paper and it is easier to market something
with the same label that the paper used.)
- But doesn't sector interleaving make it harder to calculate
the ECC in RAID 2 or the parity bits in RAID 3? (Not really. In RAID
3 all you need to calculate is parity, and that can be done rapidly
with word-wide exclusive-or's. This is easier to see if the exact
bit layout is on the board. For sector-interleaved RAID 2, the code
we described above is also done entirely with exclusive-or's. The
program may look a little complicated, but it should be fast--if you
ever get it debugged.)
- Unfortunately, even RAID 3' doesn't address the problem that we are
forced to read every data disk for any read. How does RAID 4 fix that? (By
realizing that parity works on unrelated data just as well as it does on
related data. We can switch to shallow interleaving or even stop
interleaving completely, to avoid the need to get data from every disk
on every read. The non-interleaved form works as follows: On sector one
of disk one, write a normal 1 Kbyte block of data. On sector one of disk
two, write another normal 1 Kbyte block of data. And do the same on sector one
of disk three. On sector one of disk four, write a block that consists
of the exclusive OR of the three blocks written to sector one of the
other three disks. [add RAID 4 to the table])
- Why does that help? (Now you can get a block of data by launching
a single disk read, say to sector nine of the second disk. The disk
controller will tell you if the data on sector nine is good. If it
isn't, only then do you read all the other sector nines, and calculate
their exclusive OR to reconstruct the sector nine on the disk that you
couldn't read. So this tactic optimizes the common case, when there are
no disk errors.)
- Isn't writing still a mess? And updating a single block of data
even worse? (Not as bad as it looks. You might think in order to
rewrite a single sector you have to read all of the sectors in the group
in order to calculate the new exclusive OR sector. But actually all
you have to do (barring a read error) is read the sector you are going
to rewrite and the parity sector. If you exclusive-OR these two,
you get the same thing you would have gotten by reading the other
sectors and exclusive-ORing them. Then EXOR the new data into the
parity and write the new data sector and the parity sector. Two reads,
two writes, no matter how many sectors are participating in the
parity calculation.)
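A sketch of that read-modify-write, with in-memory byte strings standing in
for the real disks (a toy, not a controller implementation):

    # RAID 4 small-write update: two reads and two writes, no matter how many
    # disks share the parity group.
    def xor_bytes(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    # Toy array: disks 0-2 hold data, disk 3 holds the parity of the other three.
    disks = {0: b"\x01\x02", 1: b"\x10\x20", 2: b"\xaa\xbb"}
    disks[3] = xor_bytes(xor_bytes(disks[0], disks[1]), disks[2])

    def rewrite(data_disk, parity_disk, new_data):
        old_data = disks[data_disk]                   # read 1
        old_parity = disks[parity_disk]               # read 2
        # EXOR out the old data's contribution, EXOR in the new data's.
        disks[parity_disk] = xor_bytes(xor_bytes(old_parity, old_data), new_data)  # write 1
        disks[data_disk] = new_data                   # write 2

    rewrite(1, 3, b"\x55\x66")
    # The parity sector still equals the XOR of the three data sectors.
    assert disks[3] == xor_bytes(xor_bytes(disks[0], disks[1]), disks[2])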
- How can RAID 5 possibly improve on this? (Another clever
implementation hack. The problem it addresses is that every
write involves a write of the parity disk, so with heavy writing
that disk might become a bottleneck. So don't use the same disk
for parity on every sector. To do this hack, you need to invent a
systematic way of figuring out which disk does hold the parity
for any given sector. One way: With N disks, for sector Z use
disk (Z mod N) as the parity disk. Disk 1 then holds parity for
sectors 1, N+1, etc.)
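The rotation is a one-liner (0-based disk and sector numbers here, unlike the
1-based example just given):

    # RAID 5 parity placement: rotate the parity disk from sector to sector.
    def parity_disk(sector, n_disks):
        return sector % n_disks

    # With 5 disks, sectors 0..9 put their parity on disks 0,1,2,3,4,0,1,2,3,4,
    # so no single disk absorbs every parity write.
    print([parity_disk(z, 5) for z in range(10)])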
- And RAID 6? (Totally speculative: apply a full-bore ECC
to protect against multiple simultaneous erasures. The two-dimensional
array version in the paper is just an example of an easy-to-understand ECC. So
far, no one has ever actually found it useful. The industry has gone
off in different directions, inventing different
error-correcting codes that can handle multiple erasures; these are
sometimes referred to as RAID 5+ or RAID 6 in current papers. Any time
you hear a RAID number larger than 5, ask for a definition; there isn't
any standard practice.)
In analyzing performance, the following table may be of some help. It
was assembled by Carl Manning in 1994, comparing the RAID levels.
Note two things:
1. It omits the slowdown factor in places where it doesn't belong, so it
disagrees with the paper in a few places.
2. In the Shadow line of RAID 1, the capacity, bandwidth, and throughput
formulas seem to be wrong; C should be replaced with C+1.
RAID = Redundant Array of Inexpensive Disks
G = number of data disks in a group
C = number of check disks in a group
T = total number of disks (including check disks)
D = total number of disks with data (not including check disks)
N = D/G = number of groups
S = slowdown factor to synchronize disks. 1 < S < 2.
w,h = width,height numbers of data disks in 2d array.
f.t. = fault tolerant
                           capacity        bandwidth                 max req throughput
                          -------------  ----------------------   --------------------------
                          utili-  f.t.   read         write       read          write
level  arch.     G    C   zation  overhd ratio util   ratio util  ratio util    ratio util
---------------------------------------------------------------------------------------------
ID   : 1disk    --    -   100%    ---    1     100%   1     100%  1     100%    1     100%
AID  : Ndisks    1    0   100%    ---    T=D   100%   D     100%  D     100%    D     100%
RAID1: Mirror    1    1   50%     100%   T=2D  100%   D/S   1/2S  2D    100%    D     50%
       Shadow    1    C   1/C     C      T=CD  100%   D/S   1/CS  CD    100%    D     1/C
RAID2: 10bECC   10    4   71%     40%    D/S   71%/S  D/S   71%/S N     7%      N     7%
       25bECC   25    5   83%     20%    D/S   83%/S  D/S   83%/S N     3%      N     3%
RAID3: BitPar    G    1   G/(G+1) 1/G    D/S          D/S         N     1/(G+1) N     1/(G+1)
RAID4: SecPar    G    1   G/(G+1) 1/G    D            D/S         D     G/(G+1) N/2   1/(G+1)
RAID5: RotPar    G    1   G/(G+1) 1/G    T            D/S         T     100%    T/4   25%
RAID6: 2dPar    w*h  w+h                 T            D/2S?       T     100%    T/6   16%
Assume that for high-bandwidth writes in RAID 4 or 5 the controller can
compute parity over a whole group of sectors, so no read is needed.
Shows max, not *expected* throughputs---contention not taken into account.
RAID 1 -- Mirrored Disk: drives can do independent reads, but both must be
written during write.
RAID 2 -- ECC: interleaved at bit level, using error correction codes; all
drives read and write together.
RAID 3 -- Bit Error Detect+Parity: interleaved at bit level, but since
a drive signals read error, only 1 extra drive for parity bit is
needed to correct 1 bit error. All drives read and write together.
RAID 4 -- Sector Error Detect+Parity: interleaved at sector level. Since
a drive signals read error, only 1 extra drive for parity sector is
needed to enable correction. Drives can do independent sector
reads. For writes, both data drive and parity drive must perform
READ(recompute-parity)WRITE together, quadrupling I/O accesses, thus
quartering performance. All drives read together only to correct error.
RAID 5 -- SectorErrDetect+RotatedParity: like RAID 4, but rotate the
     parity sector among drives to enable concurrent writes.
RAID 6 -- Two dimensional parity: like 5, but arrange disks in 2d
array and keep rotated parity sectors for both rows and columns.
3 disks must be accessed for a small write, capping performance at 1/6.
Comments and suggestions: Saltzer@mit.edu