6.033 Reports - Spring 2003

6.033: Computer System Engineering - Spring 2003

The need for redundancy in safety-critical systems

Rohit Rao

The creators of the Therac-25 radiation therapy machine had to weigh a number of design tradeoffs when developing their system. In building the Therac-25, the developers forsook hardware interlocks and instead relied on software to catch and correct problems, a decision that ultimately led to catastrophic system failures.

The Therac-25 system relied almost exclusively on software interlocks to ensure safe operation of the machine. The program running on the main computer was responsible for continuously monitoring the state of the system and halting operation if it detected a problem. In comparison, older machines such as the Therac-20 relied more on hardware backups to help regulate safety. The developers of the Therac-25 omitted these backups in favor of what they thought was a more reliable method of policing the system. This tradeoff, however, is a risky one. Software has historically been a weak point in many computer systems. A large contributor to this weakness is the sheer complexity of most software; even programs of modest scope can run to many thousands of lines of code. Testing these programs is correspondingly complex, because it is very difficult to test a program's behavior in situations the developer never anticipated. With this in mind, relying exclusively on software to monitor and regulate a system can have catastrophic and unforeseen effects.

Hardware, on the other hand, can be just as unreliable as software. Unlike software, hardware is susceptible to errors caused by repeated use. Over time, hardware components tend to degrade and fail, and the consequences of these failures can be as serious as those of software problems. The key to ensuring safety, therefore, lies in redundancy. The concept itself is simple: several redundant subsystems should independently monitor the safety-critical components of a system. If one fails, another can step in to prevent a possible hazard.
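The redundancy idea described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a description of how any Therac machine was actually built: the state fields (high_power, shield_in_place, software_flag_ok) and the functions software_check, hardware_interlock, and beam_permitted are hypothetical names invented for the example. The sketch assumes that the beam may fire only if every independent monitor agrees the state is safe, and that a monitor which crashes is treated as reporting an unsafe state.

```python
from typing import Callable, Dict, List

# Hypothetical machine state; all field names are assumptions for illustration.
State = Dict[str, bool]
Monitor = Callable[[State], bool]


def software_check(state: State) -> bool:
    """Software interlock with a deliberate flaw: it trusts a cached flag
    instead of looking at the physical configuration."""
    return state["software_flag_ok"]


def hardware_interlock(state: State) -> bool:
    """Independent check of the physical configuration: high power is
    acceptable only when the shield is in place."""
    return (not state["high_power"]) or state["shield_in_place"]


def beam_permitted(state: State, monitors: List[Monitor]) -> bool:
    """Permit the beam only if every monitor independently reports safe.
    A monitor that raises an exception counts as reporting unsafe."""
    for monitor in monitors:
        try:
            if not monitor(state):
                return False
        except Exception:
            return False
    return True


# A hazardous state that the flawed software check fails to notice.
hazard: State = {
    "high_power": True,
    "shield_in_place": False,
    "software_flag_ok": True,
}

software_only = beam_permitted(hazard, [software_check])
redundant = beam_permitted(hazard, [software_check, hardware_interlock])
print("software only permits beam:", software_only)
print("redundant monitors permit beam:", redundant)
```

With the software check alone, the hazardous state slips through; once the independent interlock is added, the same state is blocked. The design choice worth noting is that the monitors share no logic or cached data, so a flaw in one is unlikely to be repeated in the other.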

An example of this type of redundant setup can be found in the Therac-20 system. An analysis of the Therac-20 showed that it contained several of the same design and software flaws as the Therac-25. However, the Therac-20 had a significantly lower rate of catastrophic failures. Though the software failed occasionally on both systems, the Therac-20 had additional hardware interlocks that, when the software failed, prevented the electron beam from turning on and harming anyone.

In the end, relying too heavily on any single subsystem is a dangerous design practice. One of the most important steps of the design process is determining which subsystems require redundancy and which can function well enough without it. Though implementing a number of redundant subsystems can greatly increase the cost of manufacturing a product, doing so is all but essential when dealing with safety-critical mechanisms.