Reading 05

The Therac-25 accidents are among the worst software failures we know of. I really feel for the engineer who wrote this software (reportedly the work of a single person), as he was working on such a complex system that I think many if not most software engineers would have made similar mistakes. While the proximate cause of the accidents was the removal of the Therac-20's hardware fail-safes from the Therac-25, in theory replacing physical fuses and interlocks with software fail-safes should provide equivalent protection. I therefore agree with Nancy Leveson's analysis that "To attribute a single cause to an accident is usually a serious mistake" (http://courses.cs.vt.edu/professionalism/Therac_25/Therac_1.html). Ultimately, I see two key issues behind the engineering errors at the heart of the Therac-25's failures: a poor user experience and poor engineering practices.


I will first grant that the Therac-25 was created at a time when interface design was far less mature than it is now. Even so, two issues let this problem fester. The first is that the error code "Malfunction 54," which appeared every time the problem occurred, was undocumented. Operators therefore had no way of discovering what, if anything, had gone wrong. If there is no way for the operator to investigate, they will quickly lose interest and brush the message off as unimportant, when in reality this error was an important indicator of what had happened to their patients. The second is that no wait screen or interlock prevented the operator from editing the treatment data while the machine was still moving its equipment into position; edits made during that setup window were silently lost, and this race condition was the actual error.
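
To make the race concrete, here is a minimal sketch in C. This is not the actual Therac-25 code (that was PDP-11 assembly), and every name here is invented for illustration: the setup routine snapshots the prescription once and then spends several seconds positioning the magnets, while the entry screen keeps accepting edits with no interlock, so a late correction never reaches the hardware.

```c
#include <stdio.h>

/* Hypothetical sketch of the Therac-25 data-entry race: setup copies
 * the prescription once, then the bending magnets move for several
 * seconds. Nothing locks out the editor during that window, so edits
 * made while the magnets move never reach the hardware. */
typedef struct { char mode; int dose; } Prescription;

static Prescription screen = { 'X', 25000 };  /* what the operator sees  */
static Prescription staged;                   /* what the hardware gets  */

static void begin_setup(void) {
    staged = screen;   /* snapshot taken here; never re-read afterwards */
}

static void operator_edits(char mode, int dose) {
    screen.mode = mode;   /* UI accepts the edit with no interlock */
    screen.dose = dose;
}

int main(void) {
    begin_setup();
    operator_edits('E', 25000);   /* correction typed during the window */
    if (screen.mode != staged.mode || screen.dose != staged.dose)
        printf("DIVERGED: screen shows mode %c, hardware set for mode %c\n",
               screen.mode, staged.mode);
    return 0;
}
```

A wait screen of the kind I describe above would simply refuse `operator_edits` until setup completes, or re-read the screen and restart setup when it changes; either would close the window.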


At a more rigorous engineering level, proper protocols were never followed during the testing and verification of this system. Software tests were not implemented, and fault trees were never designed or evaluated. These oversights skipped important steps that likely would have caught the issues that plagued the Therac-25 (an integer overflow, for example, should be easy to check for). This is an issue that needs to be addressed at a company and cultural level. If management does not encourage employees to follow good practices and leave time for them in planning, it will be very difficult to prevent bugs in the product, regardless of its complexity.
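
As a hedged sketch of why such a test is easy, consider the documented overflow: a shared one-byte flag was incremented to mean "nonzero = recheck the equipment position," so every 256th increment it wraps to zero and the check is silently skipped. The C below is my own illustration, not the original code, but even a crude exhaustive loop over the byte's range exposes the bug immediately.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustration of the Therac-25 one-byte counter overflow: the flag
 * wraps to 0 on every 256th increment, and 0 is (wrongly) read as
 * "no check needed," so the safety check is skipped entirely. */
static uint8_t needs_check = 0;   /* analogous to the Class3 variable */
static int check_ran;

static void set_needs_check(void) {
    needs_check++;                /* BUG: wraps to 0 on the 256th call */
}

static void treat(void) {
    check_ran = (needs_check != 0);   /* position check gated on flag */
}

int main(void) {
    /* A simple exhaustive test sweeps the whole one-byte range: */
    for (int pass = 1; pass <= 256; pass++) {
        set_needs_check();
        treat();
        if (!check_ran)
            printf("safety check skipped on pass %d\n", pass);
    }
    return 0;
}
```

As I understand it from Leveson's account, the eventual fix was to set the flag to a fixed nonzero value instead of incrementing it; a test like the loop above would have caught the original behavior before it ever reached a patient.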


Ultimately, it strikes me as interesting that both of the errors that led to accidents (the race condition during setup and the integer overflow) could have been prevented from harming patients by a simple physical inspection of the machine prior to use. Since the extreme radiation resulted from components not being correctly in place, this strikes me as something that would be very easy to confirm visually and thus prevent deaths. While the software obviously needed to be fixed to ensure correct operation, a 30-second check seems worth a human life.


As a future software developer, I want to make sure that I follow correct engineering practices. I have learned in my internships the importance of testing, and studying this failure simply reinforces that lesson. Well-designed, well-documented systems create safety and efficiency both for me and for those who work on my systems after me. Large projects need to be approached with future maintainability and testing in mind from the beginning. And if, God forbid, the worst should happen, it is important that blame not be placed on an individual. Software is a team effort, for better or worse.