6.033 Discussion Suggestions (PASS paper)

6.033--Computer System Engineering

Suggestions for classroom discussion of:

Alfred Spector and David [K.] Gifford. The space shuttle primary computer system. Communications of the ACM 27, 9 (September, 1984) pages 873-900.

by J. H. Saltzer, April 25, 1996.


The primary emphasis in this paper is the failure tolerance technology.
A good approach is to get the class to explain what is going on using the
terminology of Gray and Siewiorek.

1.  What is the meaning of the string of words "fail operational/fail
operational/fail safe"?  (The first failure should leave the system
completely operational.  So should the second one.  The third failure
should not cause the mission to crash.  It is a specific example of
fault-tolerance.)

2.  They installed five identical computers.  Does the system use a
fail-fast or a fail-vote technique?  (You can find evidence of both.
Generally, everything is done by voting, even to the point of having
multiple actuators pushing on the control surfaces--"force-fight voting".
But the computers are continually cross-checking among themselves and
if one thinks the others are not "in sync" it begins ignoring them and
doing its own thing.)
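
If the class has trouble separating the two ideas, a small sketch may help.
The Python below is an illustration only, not the shuttle's mechanism: the
function name vote, the computer names, and the integer outputs are all
hypothetical. It shows software majority voting, and it reports the
dissenters so that the rest of the system could stop listening to them.

```python
from collections import Counter

def vote(outputs):
    """Majority-vote over a dict mapping computer name -> reported output.

    Returns (majority_value, dissenters). Raises ValueError when no value
    is reported by a strict majority of the computers.
    """
    if not outputs:
        raise ValueError("no outputs to vote on")
    # Find the most frequently reported value and how many reported it.
    value, count = Counter(outputs.values()).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no strict majority")
    # Computers that disagree with the majority are candidates to ignore.
    dissenters = sorted(name for name, v in outputs.items() if v != value)
    return value, dissenters

# Hypothetical example: one of four computers disagrees.
result = vote({"gpc1": 7, "gpc2": 7, "gpc3": 9, "gpc4": 7})
print(result)  # (7, ['gpc3'])
```

A two-against-two split raises an error here; a good follow-up question is
what a real system should do in that case.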

3.  Is this TMR, duplexing, or what?  (Four of the computers are set up
with a particular form of 4MR.  The four together viewed as a unit and
the fifth are a duplex system with software diversity.)

4.  How big is the machine?

(Speed:  0.5 MIPS
 RAM:    424 KBytes
 Disk:   4 MBytes)

5.  Why so small?  (The configuration was set around 1972.)

6.  So why don't they get with it and switch to a Pentium with Windows
'95?  (Because they know this machine works.  Highly reliable systems
often use obsolete hardware and software, because that is the only
equipment that has a long enough track record to measure its
reliability.  When in doubt, make it stout, and of things you know
about.)

7.  How does the MTTF of the computer system compare with the expected
flight duration?  (MTTF = 6000 hours = 250 days.  Typical mission = 10
days.  This is a region of operation not emphasized by Gray and
Siewiorek.)
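
To put a number on the comparison, one can assume a constant failure rate
(an exponential failure model). That model is an assumption made here for
illustration, not something stated in the paper. Under it, the chance of
at least one failure during a mission of length t is 1 - exp(-t/MTTF):

```python
import math

def failure_probability(mission_hours, mttf_hours):
    """Probability of at least one failure within mission_hours,
    assuming a constant failure rate of 1/mttf_hours."""
    return 1.0 - math.exp(-mission_hours / mttf_hours)

MTTF_HOURS = 6000          # figure quoted in the answer above
MISSION_HOURS = 10 * 24    # typical 10-day mission

p = failure_probability(MISSION_HOURS, MTTF_HOURS)
print(f"{p:.3f}")  # 0.039
```

So even with an MTTF 25 times the mission length, this model gives roughly
a 4% chance of a failure on any one flight, which helps explain why
redundancy is still needed.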

8.  They urged NASA to "forgo new capability enhancements".  What
principle is this an example of?  (Keep it simple.)

9.  How does failure analysis fit into their overall plan?  (Every
failure, no matter how small, is analyzed, not just to see what failed,
but to see what about the design process allowed that failure to go
unanticipated.  Compare this approach with the banking/ATM approach
reported by Anderson.)

10. What was their hardest technical problem?  (Shared variable conflicts
among parallel threads.)

11.  What is an alibi?  (An English language comment attached to the
declaration of a shared variable that explains why that variable doesn't
need coordination protection.)

12.  How are alibis used?  (Every design change results in a complete
review of all alibis to see if their assumptions still hold.)

Comments and suggestions: Saltzer@mit.edu