6.033--Computer System Engineering
Suggestions for classroom discussion of:
Randy H. Katz, Garth A. Gibson, and David A. Patterson. Disk system
architectures for high performance computing. Proceedings of the IEEE
77, 12 (December, 1989) pages 1842-1857.
by J. H. Saltzer, revised April 15, 1996, with many suggestions from
Greg Ganger. Minor update with 2000 disk prices, April 2000, and
small corrections, January 2001. Updated April 2001 with 2001 disk prices.
Note that the preparation of M.I.T. students for this paper is variable.
Some are very familiar with the principles of error detection and
correction, while others know only what it says in chapter 7. Some have a
good intuition for how disks work; others don't have a clue. This paper is
the opportunity to bring everyone up to the same level in both areas.
A second point to note is that since the time of this paper a small
industry has grown up around design and implementation of RAID systems.
While providing the high reliability suggested by the RAID paper, these
modern systems also try to ensure very high availability, and they
fine-tune performance in sometimes exotic ways. The discussion here
generally ignores both availability measures and anything other than
gross performance issues.
Related to this second point is that the various RAID levels involve two
orthogonal concepts: varying forms of redundancy, and varying forms of
interleaved data organization. The paper is presented primarily from
the point of view of reliability, and it gives the impression that the
choices of various data organizations are driven primarily by
reliability considerations. In practice, choice of data organization
is driven by performance considerations.
The data organization itself has at least two parameters. The first is
the depth of interleaving. If you have N disks and interleave across K
of them, you can set K = N (deep interleaving), 1 < K < N
(shallow interleaving), or K = 1 (no interleaving). The second data
organization parameter is grain: you can interleave bits, bytes,
words, sectors, tracks, or cylinders. Since sectors are the minimum
quantity that the disk can actually read or write, setting the grain
finer than that doesn't accomplish much, even though this paper
describes such organizations.
One can construct the following matrix in which the two parameters of data
organization go across the top and the various redundancy techniques
run down the left:
                  ------------------ interleaving ------------------
                  none        shallow              deep
   grain:                     coarse    fine       coarse    fine
   __________________________________________________________________
   irredundant                ------------ RAID 0 --------------
   replication    RAID 1
   ECC                                                         RAID 2
   parity         ------ RAID 4 ------             RAID 3'    RAID 3
with RAID 3' being a simplifying variant on RAID 3 and RAID 5 being an
implementation hack that improves RAID 4. One could design a system
using any point in this matrix. The paper creates an artificial
linearization involving just five matrix entries; this linearization
makes it appear that the design considerations are driven by
reliability. It may provide some insight to construct the above matrix
while going through the discussion below.
- Buzzword patrol. What is the fad word for "interleaving"?
(Striping.) What is a sector? (The smallest unit that can be read
from a disk, generally the unit protected by EDC and ECC.) What is a
track? (All of the sectors that can be read from one side of a single
platter without moving the heads.) What is a cylinder? (All of the
tracks that can be read without moving the heads.) What is a
zero-latency read? (Suppose you give the controller a read request for
a full track. The disk controller can begin taking data off the disk as
soon as the next sector comes under the read head; it isn't necessary
to wait until the disk rotates to the first sector of the track. The
rotation of the sectors to the correct position is done in a RAM buffer
before delivering the data to the client.) What do EDC and ECC stand
for? (Error-Detecting Code and Error-Correcting Code. Details later.)
- This paper was written in June, 1988. Now, a typical
low-performance mass market disk specification is
     Internal hard drive, April 2001 (from CompUSA, 4/23/01)
        Maxtor V0696000H, IDE ATA/66 interface w/ 2 Mbyte buffer
        60 Gbytes
        $180
        7 watts (est.)
        3.5 inch
        5400 RPM
        9.0 ms average seek time
Are the technology curves described then holding up today?
(The curves
at the beginning of the paper discuss only technology changes, not cost.
But look at figure 11, page 1852. Figure 11 suggests a drop from
$28/Mbyte to $4/Mbyte between 1982 and 1990, an improvement of about 35%
per year. Carrying that rate forward another 11 years to 2001 we should
have about $35/Gbyte. The price quoted above is $3/Gbyte. Disk prices
have actually dropped about 47% per year since 1990, quite a bit faster
than they did in the years covered by the paper.)
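For anyone who wants to check the arithmetic, here is a quick Python sketch
of the extrapolation (the dollar figures are the ones quoted above; nothing
here comes from the paper beyond the $4/Mbyte starting point):

    # Back-of-the-envelope check of the disk-price extrapolation above.
    dollars_per_gb_1990 = 4.0 * 1000          # $4/Mbyte in 1990, from figure 11
    rate = 0.35                               # 35%/year decline, carried forward
    projected_2001 = dollars_per_gb_1990 * (1 - rate) ** 11
    print(round(projected_2001))              # about $35/Gbyte

    actual_2001 = 180.0 / 60                  # the $180, 60-Gbyte drive above = $3/Gbyte
    implied_rate = 1 - (actual_2001 / dollars_per_gb_1990) ** (1 / 11)
    print(round(implied_rate, 2))             # about 0.48, i.e. the ~47%/year quoted above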
- What are the numbers that would be filled in for figure 12 (page 1852) today?
(We are still in the rightmost column, with 3.5-inch disks. But the
60 Gbyte disk is 300 times bigger, so the MB/watt and MB/cuft are 300 times
bigger. Packing density is now 3000 Gbytes/cubic foot. If you are short of room for
your digital library, spinning disks now have better volumetric efficiency
than CD's in jewel boxes! 120 jewel boxes per cubic foot, close-packed and
filled with DVD's holding 6 Gbytes each, provide only 720 Gbytes/cu ft.)
- What is figure 9 on page 1848 telling us? (Simply put, that reading
small objects from a disk burdens an I/O system differently from reading
big objects. But this picture is a good opportunity to verify that
everyone in the class understands the underlying causes of seek time,
rotational latency, and data transfer bandwidth.)
- What is interleaving (sometimes called AID or RAID level
0 in advertising, but not in this paper)? (If you have N disks,
write a file by putting the first {bit, byte, sector, block,
track, cylinder} of data on disk one, the second on disk two,
..., the Nth on disk N, the N+1st on disk one, etc.)
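As a concrete sketch (block-grain interleaving, with 0-based numbering rather
than the 1-based numbering in the answer above), the mapping from a logical
block number to a physical location is just a remainder and a quotient:

    # Block-grain interleaving ("striping") across n_disks disks, 0-based.
    def locate(logical_block, n_disks):
        disk = logical_block % n_disks        # which disk holds this block
        offset = logical_block // n_disks     # position of the block on that disk
        return disk, offset

    # Logical blocks 0..7 on 4 disks land on disks 0,1,2,3,0,1,2,3,
    # at offsets 0,0,0,0,1,1,1,1.
    print([locate(b, 4) for b in range(8)])

Shallow interleaving (below) is the same mapping applied to a group of K < N
disks rather than to all N.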
- What is the point? (You can write to and read from all N disks in
parallel, increasing the data bandwidth by a factor of N. If your
application is in a hurry, this can be a win.)
- Why do I have to interleave to get that factor of N? (You don't.
But if you don't interleave, the data on disk 2 probably is completely
unrelated to the data on disk 1, so your application isn't very likely
to ask to read from both at the same time. If you put the first block
of a file on disk 1 and the second block of the same file on disk 2,
then when an application asks to read the file you can fetch both
blocks at the same time. Interleaving is a strategy for taking advantage
of a common pattern of access to data, converting a sequential pattern
into a parallel one. We are exploiting spatial adjacency, but in a
different way from a cache exploitation of that same property.)
- What happens if you interleave across just some of the disks? (This
is called shallow interleaving, and it gets you only part of the
possible data bandwidth without requiring that you tie up all the disks
on every read or write. [Create table and insert RAID 0])
- When is shallow interleaving preferred over deep interleaving?
(When the natural unit of data size for the application is less than N
sectors.)
- So what is RAID 1? (Just write everything in the same place on
two--or N--disks. No interleaving. [Add RAID 1 to the table])
- What is the difference between "mirror" and "shadow"? (This paper
seems to use the word "mirror" to mean exactly two copies, counting the
original as one of them. "Shadow" here means N copies, for N > 2.)
- How on earth did we end up with such ridiculously overlapping
terminology? (These things often arise because different companies,
organizations, etc. each pick a name without considering, knowing, or
caring that someone else has already done so. So, Digital calls it
"shadowing", IBM calls it "duplexing", Berkeley calls it "RAID 1", and
"mirroring" came from somewhere else. "Shadowing" is not really a
generalization of "mirroring"; it is just an accident of the fact that
Digital supported--and promoted, for read performance
purposes--multi-copy replication and the others didn't.)
- What is the slowdown factor "S" all about? For example, it says
that the write bandwidth is D/S writes per second compared to a single
disk. (If I write one copy, there will be some seek + rotation delay,
averaging about half the maximum amount for that disk. If I write two
copies on different, independently rotating, disks at the same time, I
can't claim the write is complete until *both* have been written, so I
have to wait till the *longer* of the two seek + rotation times. For N
copies, I have to wait till the *longest* seek + rotation time of any
of the N disks. As N grows, one would expect this delay to increase
from about half the maximum delay to something closer to the maximum
delay. S is a divisor whose value is between 1 and 2.)
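A small simulation makes S concrete. This is only a sketch, under a
simplifying assumption that is not from the paper: each disk's seek +
rotation delay is modeled as uniform between 0 and the maximum.

    # Rough simulation of the slowdown factor S for N-copy writes.
    import random

    def average_write_delay(n_copies, trials=100_000, max_delay=1.0):
        total = 0.0
        for _ in range(trials):
            # The write completes only when the slowest of the N disks finishes.
            total += max(random.uniform(0, max_delay) for _ in range(n_copies))
        return total / trials

    # One copy averages ~0.5 of the maximum; N copies average ~N/(N+1), so
    # S = average(N)/average(1) climbs from 1 toward 2 as N grows.
    for n in (1, 2, 4, 8):
        print(n, round(average_write_delay(n), 3))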
- What about the "D" part? (This paper uses "D" to mean the number
of different places you can write data, and for mirroring that is the
number of two-disk groups, not the number of disks. In principle you can
have an independent write going to each such group. So the maximum
attainable bandwidth ought to be D times as great as for one disk,
degraded by the factor S, so the result is D/S.)
- Then why does it say "RAID Level 1 can perform 2D/S
sector reads/second compared to a single disk"? (First, that
paragraph is talking about mirroring, which means N = 2. There
are D groups of 2 disks each, for a total of 2D disks and if you
get lucky you can be reading a different sector from every one of
the disks at the same time, which accounts for the 2D. The
slowdown factor S is a little harder to explain--in fact, the
paper doesn't explain it anywhere. If your interface to
the client requires waiting to deliver any data until all the
data is available, then the same analysis as for writes
applies. If you can begin delivering data as soon as the first
sector comes off the disk, S would not apply. Note that you lose
this option when you start to add interleaving. Note also that
when you get to part C of the paper, the slowdown factor silently
disappears from the comparison discussion.)
- If things slow down on multiple-copy writes, shouldn't they speed
up on multiple-copy reads? (Yes! If you were clever enough to post
reads for both copies of a mirrored sector, and return the first one
that comes back, you should on average have a shorter latency, and the
slowdown factor becomes a speedup factor instead. This tactic does use
up bandwidth, though, so it is unlikely to be worthwhile unless you are
smart enough to predict which copy is going to come to a read head
first and post a read only to that one. Some disks can report their
position, with the goal of making this kind of cleverness feasible.)
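The mirror image of the simulation above: if you post reads to both copies
and keep whichever returns first, the expected delay is the minimum of two
draws rather than the maximum (same uniform-delay assumption as before, so
again only a sketch):

    # Expected positioning delay when reads go to both mirror copies
    # and the first one back wins.
    import random

    def average_first_copy_delay(n_copies, trials=100_000, max_delay=1.0):
        total = 0.0
        for _ in range(trials):
            total += min(random.uniform(0, max_delay) for _ in range(n_copies))
        return total / trials

    # One copy: ~0.5 of the maximum; two copies: ~0.33 -- a speedup factor
    # instead of a slowdown, at the price of occupying both disks.
    print(round(average_first_copy_delay(1), 3), round(average_first_copy_delay(2), 3))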
- So what is wrong with RAID 1 that would cause anyone to want to do
anything more elaborate? (The factor of N reduction in disk capacity
is a bigger hit than many people are prepared to pay.)
- What does RAID 2 do to improve that? (It applies an
Error-Correcting Code to the data, just as if it were RAM.)
- How does RAM ECC work? (Give an example. The simplest
one that has intuitive pedagogical value is probably to use 4
bits of data with three check bits. The three check bits carry
exactly enough information to distinguish 8 cases: no error,
error in bit 1, error in bit 2, ... error in bit 7.
                                  P3        P2   P1
               bit      7    6    5    4    3    2    1
                        +         +         +         +   = bit 1 is exor of every other bit
                        +    +              +    +        = bit 2 is exor of every other pair
                        +    +    +    +                  = bit 4 is exor of every other four

Suppose bit 5 flips. When we recalculate the ECC, P3 and P1 won't
agree with the bits that came with the word. If we write P3,P2,P1
showing the troubled bits as ones, we notice that their binary
value is 101, or 5. They are pointing at the culprit.
The efficiency of this example is a little misleading. If we had chosen a
63-bit code we could have done the same thing with 56 data bits and 7 for
parity checks.)
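The same example, worked as a short Python sketch: fill in the three check
bits, flip bit 5 in "storage," and watch the recalculated checks point at it.
(Bit positions are numbered 1 through 7, exactly as in the diagram above.)

    # Hamming (7,4): check bits live at positions 1, 2, and 4; the check bit
    # at position p covers every position whose binary expansion includes p
    # (so position 1 covers 1,3,5,7; position 2 covers 2,3,6,7; 4 covers 4-7).
    def recompute_checks(word):               # word: dict position -> bit
        return {p: sum(word[i] for i in range(1, 8) if i & p) % 2
                for p in (1, 2, 4)}

    # Data bits go in positions 3, 5, 6, 7; set each check bit so that the
    # exclusive OR over everything it covers comes out zero.
    word = {3: 1, 5: 0, 6: 1, 7: 1, 1: 0, 2: 0, 4: 0}
    for p in (1, 2, 4):
        word[p] = sum(word[i] for i in range(1, 8) if (i & p) and i != p) % 2

    word[5] ^= 1                              # bit 5 flips in storage

    syndrome = recompute_checks(word)         # the checks that now disagree
    print(4 * syndrome[4] + 2 * syndrome[2] + syndrome[1])   # prints 5: the culprit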
- How do we apply this idea to an array of disks? (Buy 63 disks.
Suppose they are organized in 1 Kbyte sectors. Tell your customers that
they have to write data in 56 Kbyte blocks. When you are handed a 56Kbyte
block, plan to put the first bit on sector one of disk 1, the second bit on
sector one of disk 2, etc., continuing till you get to disk 56. Then
calculate an ECC code of 7 more bits and write them as the first bit
on disks 57 through 63. Now go back to disk one and place bit 57 in
the second bit of the first sector, etc. Of course you can't begin writing
anything until you have laid all the bits out in RAM, but that is
a detail that doesn't affect the idea. [Add RAID 2 to the table])
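A sketch of that layout as code (the disk and bit numbering follow the
description above; a real RAID 2 controller of course does this in hardware):

    # RAID 2 bit-interleaved layout: 63 disks with 1-Kbyte sectors,
    # data accepted in 56-Kbyte blocks; disks 1..56 hold data bits and
    # disks 57..63 hold the 7 computed ECC bits for each 56-bit slice.
    DATA_DISKS = 56

    def place_data_bit(i):
        """Map the i-th data bit (0-based) of a block to
        (disk number, bit offset within the sector)."""
        return i % DATA_DISKS + 1, i // DATA_DISKS

    # Bits 0..55 become bit 0 of disks 1..56; bit 56 becomes bit 1 of disk 1.
    print(place_data_bit(0), place_data_bit(55), place_data_bit(56))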
- What is the problem with this approach? (1. 56 Kbyte blocks are pretty
big; there may not be a lot of applications with that much to write.
2. To read anything you have to launch 56 parallel reads, and wait
for the last one to complete. The read slowdown factor is probably
pinned at 2, and we are really soaking up bandwidth.)
- How does RAID 3 fix this? (By recognizing that
erasure errors are easier to correct than unknown
errors. The RAM ECC is designed assuming that we have no
idea which bit might be in error. But modern disk controllers
are fail-fast: they write each sector of data on the disk
with an Error-Detecting Code (EDC). Then, when that
controller reads back the data it can check the EDC and
return a status report. If the disk controller reports that
a sector contains bad data, we don't need to have the RAID
ECC point to the bad disk, what we need is to figure out what data
value was on that bad disk. One bit of parity for each
bit of a sector is sufficient for that. [Add RAID 3 to the
table])
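A sketch of why one parity disk is enough once the failing disk identifies
itself: define the parity sector as the exclusive OR of the data sectors, and
the lost sector is just the exclusive OR of everything that survived.
(Byte-wise XOR here; the grain doesn't change the idea.)

    # Erasure correction with a single parity disk (the RAID 3 idea).
    from functools import reduce

    def xor_sectors(sectors):
        # Byte-wise exclusive OR of equal-length sectors.
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), sectors)

    data = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"]   # three data "sectors"
    parity = xor_sectors(data)                                   # goes on the parity disk

    # Disk 1's controller reports its sector unreadable; rebuild from survivors.
    lost = 1
    survivors = [s for i, s in enumerate(data) if i != lost] + [parity]
    assert xor_sectors(survivors) == data[lost]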
- So we don't need 63 disks, we can get along with 57? (Better. We
can apply this approach to any number of disks, e.g. 3 data disks and
one parity disk. That way we only demand of the client a data transfer
size of 3 Kbytes.)
- So why doesn't anyone actually use RAID 3? (Bit interleaving
would take extra steps to implement and would be fragile; the designer
can gain simplicity and robustness by switching to sector
interleaving. If we call that RAID 3', we get to fill in
another slot in our table, above. There are real, marketable products
using RAID 3', though they are advertised as RAID 3 because that is
the closest thing in this paper and it is easier to market something
with the same label that the paper used.)
- But doesn't sector interleaving make it harder to calculate
the ECC in RAID 2 or the parity bits in RAID 3? (Not really. In RAID
3 all you need to calculate is parity, and that can be done rapidly
with word-wide exclusive-or's. This is easier to see if the exact
bit layout is on the board. For sector-interleaved RAID 2, the code
we described above is also done entirely with exclusive-or's. The
program may look a little complicated, but it should be fast--if you
ever get it debugged.)
- Unfortunately, even RAID 3' doesn't address the problem that we are
forced to read every data disk for any read. How does RAID 4 fix that? (By
realizing that parity works on unrelated data just as well as it does on
related data. We can switch to shallow interleaving or even stop
interleaving completely, to avoid the need to get data from every disk
on every read. The non-interleaved form works as follows: On sector one
of disk one, write a normal 1 Kbyte block of data. On sector one of disk
two, write another normal 1 Kbyte block of data. And do the same on sector one
of disk three. On sector one of disk four, write a block that consists
of the exclusive OR of the three blocks written to sector one of the
other three disks. [add RAID 4 to the table])
- Why does that help? (Now you can get a block of data by launching
a single disk read, say to sector nine of the second disk. The disk
controller will tell you if the data on sector nine is good. If it
isn't, only then do you read all the other sector nines, and calculate
their exclusive OR to reconstruct the sector nine on the disk that you
couldn't read. So this tactic optimizes the common case, when there are
no disk errors.)
- Isn't writing still a mess? And updating a single block of data
even worse? (Not as bad as it looks. You might think in order to
rewrite a single sector you have to read all of the sectors in the group
in order to calculate the new exclusive OR sector. But actually all
you have to do (barring a read error) is read the sector you are going
to rewrite and the parity sector. If you exclusive-OR these two,
you get the same thing you would have gotten by reading the other
sectors and exclusive-ORing them. Then EXOR the new data into the
parity and write the new data sector and the parity sector. Two reads,
two writes, no matter how many sectors are participating in the
parity calculation.)
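A sketch of that read-modify-write, with in-memory byte strings standing in
for the real disks (a toy, not a controller implementation):

    # RAID 4 small-write update: two reads and two writes, no matter how many
    # disks share the parity group.
    def xor_bytes(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    # Toy array: disks 0-2 hold data, disk 3 holds the parity of the other three.
    disks = {0: b"\x01\x02", 1: b"\x10\x20", 2: b"\xaa\xbb"}
    disks[3] = xor_bytes(xor_bytes(disks[0], disks[1]), disks[2])

    def rewrite(data_disk, parity_disk, new_data):
        old_data = disks[data_disk]                   # read 1
        old_parity = disks[parity_disk]               # read 2
        # EXOR out the old data's contribution, EXOR in the new data's.
        disks[parity_disk] = xor_bytes(xor_bytes(old_parity, old_data), new_data)  # write 1
        disks[data_disk] = new_data                   # write 2

    rewrite(1, 3, b"\x55\x66")
    # The parity sector still equals the XOR of the three data sectors.
    assert disks[3] == xor_bytes(xor_bytes(disks[0], disks[1]), disks[2])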
- How can RAID 5 possibly improve on this? (Another clever
implementation hack. The problem it addresses is that every
write involves a write of the parity disk, so with heavy writing
that disk might become a bottleneck. So don't use the same disk
for parity on every sector. To do this hack, you need to invent a
systematic way of figuring out which disk does hold the parity
for any given sector. One way: With N disks, for sector Z use
disk (Z mod N) as the parity disk. Disk 1 then holds parity for
sectors 1, N+1, etc.)
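The rotation is a one-liner (0-based disk and sector numbers here, unlike the
1-based example just given):

    # RAID 5 parity placement: rotate the parity disk from sector to sector.
    def parity_disk(sector, n_disks):
        return sector % n_disks

    # With 5 disks, sectors 0..9 put their parity on disks 0,1,2,3,4,0,1,2,3,4,
    # so no single disk absorbs every parity write.
    print([parity_disk(z, 5) for z in range(10)])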
- And RAID 6? (Totally speculative: apply a full-bore ECC
to protect against multiple simultaneous erasures. The two-dimensional
array version in the paper is just an example of an easy-to-understand ECC. So
far, no one has ever actually found it useful. The industry has gone
off in different directions, inventing different
error-correcting codes that can handle multiple erasures; these are
sometimes referred to as RAID 5+ or RAID 6 in current papers. Any time
you hear a RAID number larger than 5, ask for a definition; there isn't
any standard practice.)
In analyzing performance, the following table may be of some help. It
was assembled by Carl Manning in 1994, comparing the RAID levels.
Note two things:
1. It omits the slowdown factor in places where it doesn't belong, so it
disagrees with the paper in a few places.
2. In the Shadow line of RAID 1, the capacity, bandwidth, and throughput
formulas seem to be wrong; C should be replaced with C+1.
RAID = Redundant Array of Inexpensive Disks
G = number of data disks in a group
C = number of check disks in a group
T = total number of disks (including check disks)
D = total number of disks with data (not including check disks)
N = D/G = number of groups
S = slowdown factor to synchronize disks. 1 < S < 2.
w,h = width,height numbers of data disks in 2d array.
f.t. = fault tolerant
                           capacity        bandwidth                 max req throughput
                          -------------  ----------------------   --------------------------
                          utili-  f.t.   read         write       read          write
level  arch.     G    C   zation  overhd ratio util   ratio util  ratio util    ratio util
---------------------------------------------------------------------------------------------
ID   : 1disk    --    -   100%    ---    1     100%   1     100%  1     100%    1     100%
AID  : Ndisks    1    0   100%    ---    T=D   100%   D     100%  D     100%    D     100%
RAID1: Mirror    1    1   50%     100%   T=2D  100%   D/S   1/2S  2D    100%    D     50%
       Shadow    1    C   1/C     C      T=CD  100%   D/S   1/CS  CD    100%    D     1/C
RAID2: 10bECC   10    4   71%     40%    D/S   71%/S  D/S   71%/S N     7%      N     7%
       25bECC   25    5   83%     20%    D/S   83%/S  D/S   83%/S N     3%      N     3%
RAID3: BitPar    G    1   G/(G+1) 1/G    D/S          D/S         N     1/(G+1) N     1/(G+1)
RAID4: SecPar    G    1   G/(G+1) 1/G    D            D/S         D     G/(G+1) N/2   1/(G+1)
RAID5: RotPar    G    1   G/(G+1) 1/G    T            D/S         T     100%    T/4   25%
RAID6: 2dPar    w*h  w+h                 T            D/2S?       T     100%    T/6   16%
Assume that for high-bandwidth writes in RAID 4 or 5 the controller can
compute parity over a whole group of sectors, so no read is needed.
Shows max, not *expected* throughputs---contention not taken into account.
RAID 1 -- Mirrored Disk: drives can do independent reads, but both must be
written during write.
RAID 2 -- ECC: interleaved at bit level, using error correction codes; all
drives read and write together.
RAID 3 -- Bit Error Detect+Parity: interleaved at bit level, but since
a drive signals read error, only 1 extra drive for parity bit is
needed to correct 1 bit error. All drives read and write together.
RAID 4 -- Sector Error Detect+Parity: interleaved at sector level. Since
a drive signals read error, only 1 extra drive for parity sector is
needed to enable correction. Drives can do independent sector
reads. For writes, both data drive and parity drive must perform
READ(recompute-parity)WRITE together, quadrupling I/O accesses, thus
quartering performance. All drives read together only to correct error.
RAID 5 -- SectorErrDetect+RotatedParity: like RAID 4, but rotate the
     parity sector among drives to enable concurrent writes.
RAID 6 -- Two dimensional parity: like 5, but arrange disks in 2d
array and keep rotated parity sectors for both rows and columns.
3 disks must be accessed for a small write, capping performance at 1/6.
Comments and suggestions: Saltzer@mit.edu