Alfred Spector and David [K.] Gifford. The space shuttle primary computer system. Communications of the ACM 27, 9 (September, 1984) pages 873-900.
by J. H. Saltzer, April 25, 1996.
The primary emphasis in this paper is the failure tolerance technology. A good approach is to get the class to explain what is going on using the terminology of Gray and Siewiorek.

1. What is the meaning of the string of words "fail operational/fail operational/fail safe"? (The first failure should leave the system completely operational. So should the second one. The third failure should not cause the mission to crash. It is a specific example of fault tolerance.)
2. They installed five identical computers. Does the system use a fail-fast or a fail-vote technique? (You can find evidence of both. Generally, everything is done by voting, even to the point of having multiple actuators pushing on the control surfaces, so-called "force-fight voting". But the computers are continually cross-checking among themselves, and if one thinks the others are not "in sync" it begins ignoring them and doing its own thing.)
3. Is this TMR, duplexing, or what? (Four of the computers are set up with a particular form of 4MR. The four together, viewed as a unit, and the fifth form a duplex system with software diversity.)
4. How big is the machine? (Speed: 0.5 MIPS; RAM: 424 KBytes; Disk: 4 MBytes.)
5. Why so small? (The configuration was set around 1972.)
6. So why don't they get with it and switch to a Pentium with Windows '95? (Because they know this machine works. Highly reliable systems often use obsolete hardware and software, because that is the only equipment that has a long enough track record to measure its reliability. When in doubt, make it stout, and of things you know about.)
7. How does the MTTF of the computer system compare with the expected flight duration? (MTTF = 6000 hours = 250 days. Typical mission = 10 days. This is a region of operation not emphasized by Gray and Siewiorek.)
8. They urged NASA to "forgo new capability enhancements". What principle is this an example of? (Keep it simple.)
9. How does failure analysis fit into their overall plan? (Every failure, no matter how small, is analyzed, not just to see what failed, but to see what about the design process allowed that failure to go unaccounted for. Compare this approach with the banking/ATM approach reported by Anderson.)
10. What was their hardest technical problem? (Shared variable conflicts among parallel threads.)
11. What is an alibi? (An English language comment attached to the declaration of a shared variable that explains why that variable doesn't need coordination protection.)
12. How are alibis used? (Every design change results in a complete review of all alibis to see if their assumptions still hold.)
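
The comparison in question 7 can be made concrete with a short calculation. The sketch below uses the 6000-hour MTTF and 10-day mission quoted above; the constant failure rate (exponential model) is an assumption added here for illustration and is not stated in these notes.

```python
import math

MTTF_HOURS = 6000.0   # figure quoted in question 7
MISSION_DAYS = 10.0   # typical mission length quoted in question 7


def mttf_days(mttf_hours):
    """Convert an MTTF given in hours to days."""
    return mttf_hours / 24.0


def failure_probability(mission_hours, mttf_hours):
    """Probability of at least one failure during the mission.

    ASSUMPTION: failures arrive at a constant rate 1/MTTF
    (exponential model); the notes do not specify a model.
    """
    return 1.0 - math.exp(-mission_hours / mttf_hours)


mission_hours = MISSION_DAYS * 24.0
p_fail = failure_probability(mission_hours, MTTF_HOURS)

print(f"MTTF in days: {mttf_days(MTTF_HOURS):.0f}")
print(f"Mission is {mission_hours / MTTF_HOURS:.0%} of the MTTF")
print(f"P(failure during mission) = {p_fail:.3f}")
```

Under that assumption the mission occupies 4% of the MTTF and the failure probability during one mission comes out at roughly 4%, which is why the interesting regime is the very beginning of the failure distribution rather than long-run availability.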
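
For questions 2 and 3, it may help the class to see a generic majority voter over a redundant set. This is only an illustrative sketch: the function names, the computer ids, and the rule for dropping dissenters are hypothetical, and it does not model the shuttle's actual mechanism (actuator-level force-fight voting and synchronization cross-checks).

```python
from collections import Counter


def vote(outputs):
    """Majority-vote over a dict of {computer_id: output_value}.

    Returns (winner, dissenters). winner is None when no value
    is held by a strict majority of the participants.
    """
    counts = Counter(outputs.values())
    value, n = counts.most_common(1)[0]
    if n * 2 <= len(outputs):          # no strict majority
        return None, sorted(outputs)
    dissenters = sorted(cid for cid, v in outputs.items() if v != value)
    return value, dissenters


def run_step(active, outputs):
    """Vote among the active computers and drop any dissenters.

    HYPOTHETICAL policy: a computer that disagrees with the
    majority is removed from the redundant set.
    """
    visible = {cid: outputs[cid] for cid in active}
    value, dissenters = vote(visible)
    if value is None:
        return None, active            # disagreement detected, not resolved
    return value, [cid for cid in active if cid not in dissenters]


active = [1, 2, 3, 4]
v1, active = run_step(active, {1: 10, 2: 10, 3: 10, 4: 99})
after_first = list(active)
v2, active = run_step(active, {1: 10, 2: 7, 3: 10, 4: 99})
after_second = list(active)
v3, active = run_step(active, {1: 10, 2: 7, 3: 5, 4: 99})
print(v1, after_first, v2, after_second, v3, active)
```

In this sketch the set of four masks a first and a second faulty member by outvoting them, but once only two members remain a disagreement can be detected and not resolved by voting alone. That is one way to motivate why a separate fifth computer with different software is worth having.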