6.033--Computer System Engineering
Suggestions for classroom discussion
Topic: Thomas E. Anderson, Susan S. Owicki, James B. Saxe, and Charles P. Thacker. High Speed Switch Scheduling for Local Area Networks. Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. (September, 1992), pages 98-110.
By J. H. Saltzer, March 3, 1997.
Wow, another paper from DEC SRC. We sure read a lot of their papers. Why? (1. They do good work, because they have a lot of smart people. 2. They put a high value on clear writing. There are other places that can claim either 1 or 2, but not very many places that can claim both.)
There are two challenges for an undergraduate reading this paper. First, the big picture is only implied; it isn't explicitly stated. Second, although there is a very elementary idea involved, you don't discover that till you get to section 3; Section 1 is filled with unfamiliar buzzwords and Section 2 is a very intense survey of some extremely complex related work that is quite unimportant to understanding this paper. So, although it isn't obvious, the appropriate first reading of this paper is to try to understand section 1, skim section 2, and read section 3 carefully, all the while discarding any paragraph containing the word "banyan".
Let's start with the big picture.
What kind of "switch" are these people talking about? (A box with 16 ports. Any port can be connected to any other port for the duration of one cell.)
It says that ports are "duplex". What does that mean? (Data can flow either way, in or out.)
Cell? What is that? (A short packet of data. It consists of 48 bytes of data--the payload--and a 5-byte header that tells where the packet is intended to go.)
Is a Cell the same thing as a frame? (No. In isochronous multiplexing, a frame doesn't carry a header; you figure out where the data in a frame is supposed to go by its time position.)
From the outside, how does the switch work? (A cell arrives at a port. The switch examines the header, decides which other port this cell should go to, and it sets up an internal connection from the arrival port to the departure port.
How do I build a local area network out of these switches? (If you have, say, 15 workstations and servers, plug one into each port and write a driver that knows how to divide up your data into cells.)
What if I have 100 workstations and servers? (You have to buy several switches, and plan to use some ports to interconnect the switches.)
I want to install a network that links 140 workstations and servers.
I'm about to write the purchase order. How many 16-port switches
should I order? This question will throw the class into utter,
enlightening confusion.
(The answer depends on the topology you choose, and one can get into a
lot of interesting tradeoff discussions. We clearliy need at least 140
ports. With 16 ports per switch, it looks like you need at least 9
switches. But you have to use some ports to connect those 9 switches
to one another. If you put the switches in a ring you use up 2
ports/switch for switch-to-switch links, leaving 14 workstation ports
per switch, so you had better order 10 switches. In a ring
configuration, any one switch can fail without denying service to
anyone but the workstations directly attached to it.
Another possibility is the "double torus" connection . Arrange the
switches in a square grid, connect each to its four nearest neighbors.
At the edges, there isn't a neighbor on one side, so run a wire to the
corresponding unconnected port at the other end of the same row
(column) of switches.
Now we are using 4 of the 16 ports on each switch for inter-switch
connections, and will need 12 switches, so a 3 x 4 array seems about
right.)
Now let's get to the internals of the switch in this paper. Just what is head-of-line blocking? (Imagine a bank where everyone is required to stand in a single queue. Each person in the queue may be waiting for a different service. The person at the front of the queue is waiting for the loan desk, and the loan desk is tied up in a very complex transaction right now. If the queue is handled as a FIFO, then even if all the deposit tellers are free, and you want to make a deposit, you can't get through, because you have to wait for the person at the front of the queue to be served first.)
Wow, that seems bizarre. Why not use output buffering? (The authors didn't say. Presumably the problem is that an input line delivers 2,000,000 cells/second to the switch, so you have to be very quick on your feet to keep up. Transferring a cell from an input buffer to an output buffer--or even just putting a buffer pointer into an output queue--takes time. This is just a guess; it would have been nice if the authors had told us their reason.)
What do they do to avoid HOL blocking? (queue jumping. If there is a free teller you can go to that teller without waiting for the loan transaction.)
Give an example of a maximal match and a maximum match. (In the figure below, each input has two waiting cells, indicated by the number of the output they want to get to.
cells input output
4 4 - -
3 4 - -
2 4 - -
2 1 - -
If we connect input 1 to output 4, input 2 to output 3, input 3 to output 2, and input 4 to output 1, every input and every output will be occupied, and there is no other way we could arrange things to get more cells through on this clock tick. This is a maximum match.
If we connect input 4 to output 1, input 3 to output 4, and input 2 to output 3, we discover that even though input 1 and output 2 are unmatched, none of the cells waiting at input 1 have any interest in output 2. So we can't add any more links to the picture and we have a maximal match. We can't improve the situation unless we rearrange connections.
What does all this have to do with their parallel iterative maching algorithm? (On each iteration, the request/grant/accept algorithm adds a few connections. Once made, there is no provision for rearranging connections. So they are headed for a maximal, not maximum, match at best.)
Do they make it? (Only if they are lucky. They iterate a few times, then stop.)
Why stop? (Because it is better to ship some data than to sit around all day looking for the best match. The algorithm starts iterating as soon as the first five bytes have arrived, and it continues to iterate until byte 53 is in. Four iterations with the present design. Then it goes with whatever it has got and hopes for the best.)
Does that work? Why should we believe that they have gotten anywhere close to a maximal match in only four iterations? (We shouldn't. They claim that simulations show that they will have a maximal match 98% of the times they try. But they don't tell us much about the simulation, except that it assumes that all input/output pairs are requested with equal probability. Some confirmation from the field would be nice to have, and this paper simply doesn't provide it.)
Comments and suggestions: Saltzer@mit.edu