The term “Normal accident” really caught my attention – how can an accident be normal?
Are complex software systems, especially ones with technical debt and many knowledge transfers, destined to have (catastrophic) failures?
Below are some quotes from the related Wikipedia articles (emphasis mine) and the reference.
A system accident (or normal accident) is an “unanticipated interaction of multiple failures” in a complex system.
This complexity can either be of technology or of human organizations, and is frequently both.
A system accident can be easy to see in hindsight, but extremely difficult in foresight because there are simply too many action pathways to seriously consider all of them.
Charles Perrow first developed these ideas in the mid-1980s. William Langewiesche wrote in the late 1990s, “the control and operation of some of the riskiest technologies require organizations so complex that serious failures are virtually guaranteed to occur.”
Safety systems themselves are sometimes the added complexity which leads to this type of accident.
Once an enterprise passes a certain point in size, with many employees, specialization, backup systems, double-checking, detailed manuals, and formal communication, employees can all too easily resort to protocol, habit, and “being right.” …
In particular, it is a mark of a dysfunctional organization to simply blame the last person who touched something.
Perrow identifies three conditions that make a system susceptible to Normal Accidents. These are:
- The system is complex
- The system is tightly coupled
- The system has catastrophic potential
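The "tightly coupled" condition has a direct software analogue. Below is a minimal, hypothetical Python sketch (the service names and failure mode are invented for illustration) of how a single component failure cascades through a tightly coupled call chain, and how a buffer such as a retry queue loosens the coupling so the failure stays contained:

```python
# Toy illustration of tight vs. loose coupling between components.
# "fetch" stands in for a hypothetical downstream service that is
# briefly unavailable.

def fetch(order_id):
    # Simulated dependency failure.
    raise TimeoutError("inventory service timed out")

def tightly_coupled_checkout(order_id):
    # Each step calls the next synchronously: no buffer, no fallback.
    # The dependency's failure immediately becomes this function's failure.
    stock = fetch(order_id)
    return f"charged for {stock}"

def loosely_coupled_checkout(order_id, retry_queue):
    # A queue decouples the steps: the failure is absorbed locally
    # and retried later, and the rest of the system keeps running.
    try:
        stock = fetch(order_id)
        return f"charged for {stock}"
    except TimeoutError:
        retry_queue.append(order_id)  # handle later instead of cascading
        return "order queued for retry"

retry_queue = []
print(loosely_coupled_checkout(42, retry_queue))  # → order queued for retry
```

This is of course a caricature of Perrow's point, not an implementation of it: in his terms, the queue adds slack between components, so one failure no longer propagates instantly and unstoppably through the whole system.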