Justin Pun
Recitation Instructor: Sam Madden
Section: TR 11-12
The Contributions of Complex System Development to the Therac-25 Accidents
Atomic Energy of Canada Limited (AECL) demonstrated a poor understanding of the precautions and steps required to build a complex system that operates successfully. Many mistakes were made in constructing the Therac-25, and these led to the radiation overexposure accidents between 1985 and 1987. Serious flaws in the design and testing stages of the development process were direct causes of the accidents.
In the design phase, AECL followed several poor practices, ranging from software structure to user interface design. One of the most serious design flaws was the replacement of the traditional electromechanical interlocks with software checks to ensure that the turntable and attached equipment are in the correct position when treatment starts. Generally, when adding software control to a hardware system with existing safety mechanisms, one should not remove those mechanisms. The software should instead add checks that complement or extend them, especially in complex systems where safety is a top priority. A mechanism such as the turntable can subject patients to lethal doses of radiation if it is not in the proper position, so ensuring its correct orientation should be of the highest importance. By removing the electromechanical interlocks, AECL placed nearly full responsibility for safety on software. Because of flaws in the software that AECL was not initially aware of, patients were exposed to lethal levels of radiation that they would not have received had the electromechanical interlocks still been in place. In the case of the Therac-20, where some of the same software bugs existed, there were no instances of lethal radiation exposure because the hardware verification mechanism was still intact.
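The principle that software checks should complement hardware interlocks rather than replace them can be sketched in a few lines. This is only an illustration under stated assumptions, not a description of how the Therac-25 or Therac-20 was actually built: the names Mode, Position, Turntable, software_check, hardware_interlock, and beam_permitted are all hypothetical, and a real electromechanical interlock is a physical circuit rather than a function call.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    ELECTRON = "electron"
    XRAY = "xray"


class Position(Enum):
    ELECTRON = "electron"
    XRAY = "xray"
    FIELD_LIGHT = "field_light"


@dataclass
class Turntable:
    # Hypothetical model: the position the software believes the turntable
    # is in, and the position an independent hardware sensor detects.
    software_reported: Position
    hardware_sensed: Position


def software_check(mode: Mode, table: Turntable) -> bool:
    """Software-only check: trusts the value held in program state."""
    return table.software_reported.value == mode.value


def hardware_interlock(mode: Mode, table: Turntable) -> bool:
    """Stand-in for an independent electromechanical interlock."""
    return table.hardware_sensed.value == mode.value


def beam_permitted(mode: Mode, table: Turntable) -> bool:
    """Layered design: both independent checks must pass."""
    return software_check(mode, table) and hardware_interlock(mode, table)


# A software bug leaves stale state: the program believes the turntable is
# in the X-ray position, but it is physically at the field-light position.
stale = Turntable(software_reported=Position.XRAY,
                  hardware_sensed=Position.FIELD_LIGHT)
software_only_result = software_check(Mode.XRAY, stale)
layered_result = beam_permitted(Mode.XRAY, stale)
```

In this sketch the software-only check passes on the stale state while the layered check refuses the beam, which is the kind of protection the paragraph above attributes to the interlocks that were still in place on the Therac-20.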
In the testing phase, AECL did not follow any commonly accepted testing scheme. Most of the testing was done at the system level, while unit testing and software testing were minimal. Software should always be subjected to an extensive testing regimen that includes unit testing, integration testing, system testing, and regression testing. Applying all of these methods helps ensure that the individual parts, as well as the system as a whole, work correctly. Any one of these methods alone is inadequate for ensuring that the software will work to specification. By not testing the software directly, either as a whole or module by module, AECL did not ensure that the software and all of its parts functioned properly under different scenarios and operational conditions.
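The difference between the testing levels named above can be illustrated with a small sketch. Everything in it is hypothetical and chosen only for illustration: MAX_DOSE, validate_dose, plan_treatment, and the test functions are invented names, the dose limit is an arbitrary number, and the "past defect" in the regression test is an assumed scenario rather than a reconstruction of the actual Therac-25 code.

```python
MAX_DOSE = 200  # hypothetical per-treatment limit, arbitrary units


def validate_dose(requested):
    """Single module: accept only positive doses up to MAX_DOSE."""
    return 0 < requested <= MAX_DOSE


def plan_treatment(mode, requested, turntable_position):
    """Combines the dose check with a turntable position check."""
    if not validate_dose(requested):
        raise ValueError("dose out of range")
    if turntable_position != mode:
        raise RuntimeError("turntable does not match selected mode")
    return {"mode": mode, "dose": requested}


def test_unit_validate_dose():
    # Unit test: exercises one module in isolation, including boundaries.
    assert validate_dose(100)
    assert validate_dose(MAX_DOSE)
    assert not validate_dose(0)
    assert not validate_dose(MAX_DOSE + 1)


def test_integration_plan_treatment():
    # Integration test: checks that the modules work together.
    plan = plan_treatment("electron", 100, "electron")
    assert plan == {"mode": "electron", "dose": 100}


def test_regression_mode_position_mismatch():
    # Regression test: pins down an assumed past defect (a mode that no
    # longer matches the turntable position) so it cannot silently return.
    try:
        plan_treatment("electron", 100, "xray")
    except RuntimeError:
        return
    raise AssertionError("mismatched mode and position was accepted")


def run_all():
    """Run every test and return the names of those that passed."""
    tests = [
        test_unit_validate_dose,
        test_integration_plan_treatment,
        test_regression_mode_position_mismatch,
    ]
    for test in tests:
        test()
    return [test.__name__ for test in tests]
```

Calling run_all() runs the three tests in order. System testing, the level at which most of AECL's testing was done, would sit above these and exercise the complete machine; it is not represented here.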
The mistakes made in the design phase made radiation overexposure possible. The mistakes made in the testing phase caused serious design problems and software bugs to be overlooked. Together, the problems in these two phases allowed the fatal accidents between 1985 and 1987 to occur. Had the mistakes in either phase been avoided, the accidents would most likely never have happened.