Normal accidents
In 1984, organisational sociologist Charles Perrow published the masterly book Normal Accidents: Living with High-Risk Technologies. In it, Perrow proposed that certain complex human systems — the Three Mile Island nuclear power station was his prime example, but there were others — are so complex, and the interaction of their components so tightly coupled, that certain modes of operation cannot be anticipated nor, if they happen, effectively stopped before they spin out of control. Therefore, from time to time, systems like this will suffer catastrophic failures. Meltdown at Three Mile Island was unavoidable. It was only a matter of when.
Perrow called such failures “normal” accidents: they do not arise from error, malfunction or malice but from unexpected interactions of system components during normal operation. Normal accidents are an occupational hazard of running the system. Says Professor Perrow:
It is normal not in the sense of being frequent or being expected—indeed, neither is true, which is why we were so baffled by what went wrong. It is normal in the sense that it is an inherent property of the system to occasionally experience this interaction.
If you operate a system like that, you must accept that, even without anyone being seriously “at fault”, the system will occasionally fail in ways you cannot anticipate or, therefore, avoid: such failures are an emergent property of the system’s design:
Though the failures were trivial in themselves, and each one had a backup system, or redundant path to tread if the main one were blocked, the failures became serious when they interacted. It is the interaction of the multiple failures that explains the accident.
Simple systems do not have these failure modes.
Take a bicycle, for example. The “system” is a machine: it has clear boundaries. It comprises static components that cannot think, much less change themselves, and that interact in linear ways. Absent component failure, we know exactly how a well-designed bicycle will behave. Even if its components fail, we still have a good idea how it will behave: the components fail in predictable ways, with predictable, containable outcomes.
But complex systems do not have clear boundaries. They are dynamic. They comprise multitudes — multitudes of autonomous, decision-making agents and volatile substances, multitudes of complex subsystems. None can be delineated. None is static. These systems can, and usually do, change their own configuration over time and depending on the situation, without input from the designer.
Subsystems and agents (and subsystems of agents) “think” for themselves. They have lives of their own. They may behave irrationally or mistakenly. They may interact unpredictably. They are especially likely to do this in edge cases and at times of unusual stress.
These are the times when operators want the system’s designed-in fail-safes and backups to work — but they are also the times when those fail-safes are most likely not to work, and most likely, instead, to impede safe operation.
Failure modes that can’t be anticipated can’t be avoided or designed out. Only once they happen, and are recognised for what they are, can the system be redesigned to prevent them happening again.
Non-catastrophic normal accidents
In Normal Accidents, Professor Perrow focused on an unusual subcategory of normal accidents: those that cause the catastrophic failure of the whole system they are part of — self-destructive failure modes. His examples, among them nuclear power stations, airliners, chemical factories and financial institutions, were systems of exactly this kind.
It is hard not to notice when your failure mode is catastrophic: there is usually a big crater where your system used to be.
But not all “normal” system failures are catastrophic. As long as it seems to be generating good outcomes, a system can be in a non-self-destructive failure mode indefinitely. System operators will happily continue to operate it.
And one of the prime features of a complex system is its “sorcerer’s apprentice” tendency to “misbehave” — to play up; to do something other than what the operator expected. The systems theorist John Gall called this tendency to antics “systemantics”.
We should expect complex systems to produce results that look acceptable while being quietly insidious. Such “latent” failures may go unrecognised for a long time. Until they are recognised, they cannot be fixed; until then, they are liable to throw off bad outcomes repeatedly.
Asbestos as a case study
Asbestos is a naturally occurring mineral. Humans have used it at least since the Stone Age, but it came into widespread use during the industrial revolution, when its insulating and fire-retardant properties made it valuable. In encouraging its use, the “system” appeared to be functioning well.
By the beginning of the 20th Century the “negative health effects” of asbestos were becoming apparent: the first recognised death was in 1906, and “asbestosis” was first diagnosed as a formal illness in 1924. But its true danger was not fully appreciated. Regulations increased over the middle of the century, but asbestos was only finally banned in the 1980s.
Of course, the catastrophic health effects only manifest years after exposure. Once the health risks of deteriorating asbestos were fully realised, its “latent failure mode” was obvious: asbestos was prohibited in new building projects and removed, carefully, from existing structures.
Because the construction “system” did not immediately recognise the failure mode, it tolerated (and repeated) the accident. It was not a mistake; it was not an error; there was no malice. This was just a misunderstood bad outcome of the system’s operation.
But asbestos was a fairly central failing in the “construction-industrial complex”. Other “latent normal failure modes” may be peripheral. They may therefore lie dormant for long periods, providing apparently trouble-free system operation, before being triggered.
They may only be set off by unlikely interactions between usually isolated system components.
The Post Office Horizon scandal
The Post Office Horizon scandal is an instructive case in point. Prompted partly by suspicions of endemic financial mismanagement in its branch network, the Post Office in the late 1990s introduced “Horizon”, a state-of-the-art computer accounting system built by Fujitsu, across its UK operation.
As the Horizon system was rolled out, it appeared to confirm management’s worst fears. Up and down the country, a pattern emerged of cash shortfalls in branch ledgers. Based on Fujitsu’s assurances that the system was robust, Post Office management concluded that Horizon data indicated widespread fraud among the sub-postmasters managing local post office branches. Though this conclusion was counterintuitive as a matter of common sense — the sorts of people who act as sub-postmasters tend to be “pillars of the local community”, and while there might be exceptions, one would not expect sub-postmasters as a group to share a tendency to fraud — Post Office managers preferred the data they were given and commenced prosecution and enforcement action.
Notwithstanding strident complaints from many sub-postmasters that the Horizon system was malfunctioning, the Post Office held its course. As the first prosecutions succeeded, management’s early suspicions appeared vindicated and the Horizon system’s reliability validated. This made subsequent challenges to the system even harder. The sub-postmasters’ complaints increasingly fell upon deaf ears.
In the end, the Post Office prosecuted nearly one thousand sub-postmasters over fifteen years, imprisoning more than two hundred, and convicting and fining hundreds more.
Of course, much later, it turned out the Horizon system was at fault, just as the sub-postmasters had alleged. If not all, then the overwhelming majority of the prosecutions were outrageous miscarriages of justice.
Tellingly, in most cases the Post Office pursued private criminal prosecutions rather than referring matters to the police or the Crown Prosecution Service. These prosecutions developed into their own cottage industry, involving teams of investigators, Fujitsu consultants, middle managers and in-house lawyers, as well as external solicitors and barristers. This was a complex system, itself composed of complex subsystems, each operating according to its own private priorities and defending its own interests. This prosecution system became opaque: it was so complex that no single actor could see the whole picture and appreciate how it could be contriving bad outcomes:
Fujitsu employees were incentivised to suppress criticisms of the Horizon system that would put them in conflict with Fujitsu management.
In-house teams, acting on instructions from Post Office middle management, went out of their way to ensure they handled prosecutions expeditiously, shielding upper management from interactions with sub-postmasters regarded as “troublemakers”, and withholding from their management chain mounting evidence that the Horizon system was malfunctioning.
External advisors presented the Post Office’s interests as favourably as was possible in litigation, at times using tactics that stretched — but didn’t quite break — the limits of acceptability, delaying and withholding material from defendants that was directly relevant to their cases.
Seen in isolation, and on the localised assumption that the postmasters were guilty, individual actions within the system were understandable, even if not entirely honourable. Each was insignificant in the wider scheme of things — it would be hard to point to one “bad apple” as directly causing a miscarriage — but that is what, in the aggregate, they caused.
Incentivised loose coupling and “system glitches”
Nothing is perfect, neither designs, equipment, procedures, operators, supplies, or the environment. Because we know this, we load our complex systems with safety devices in the form of buffers, redundancies, circuit breakers, alarms, bells, and whistles. Small failures go on continuously in the system since nothing is perfect, but the safety devices and the cunning of designers, and the wit and experience of the operating personnel, cope with them. Occasionally, however, two or more failures, none of them devastating in themselves in isolation, come together in unexpected ways and defeat the safety devices—the definition of a “normal accident” or system accident.
— Charles Perrow, Normal Accidents: Living with High-Risk Technologies
The Post Office Horizon scandal thus represents something different from what Charles Perrow had in mind: here, the “system” was loosely coupled. Many of the independent agents and gatekeepers in the system did have time and opportunity to intervene — the process unfolded over decades — but, because of their incentives and their limited view of the broader picture, they didn’t. Rather, the information emanating from the different parts of the system, conditioned as it was by the incentives, narratives and biases at play, had the effect of reinforcing existing preconceptions.
There was a loosely coupled chain non-reaction here: because one gatekeeper didn’t intervene, other similarly unsighted gatekeepers took it as read that the coast was clear and everything was in order, and so, notwithstanding any scruples of their own, they didn’t need to intervene either. Indeed, doing so might raise more questions than it answered. It might be career-limiting behaviour. The safest course was to stick to the original instructions.
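To make the mechanism concrete, here is a toy sketch in Python (nothing drawn from the Inquiry; the gatekeepers, thresholds and numbers are all invented assumptions) of how a chain of gatekeepers, each privately harbouring real doubts, can collectively fail to intervene because each reads the silence of those before them as reassurance.

```python
# A toy sketch, not anything from the essay or the Inquiry: a crude model
# of the "chain non-reaction" described above. Each gatekeeper holds some
# private level of doubt about the system, but only intervenes if that
# doubt outweighs both a baseline reluctance to speak up and the
# reassurance of seeing every earlier gatekeeper stay silent.
# All numbers are illustrative assumptions, not data.

def chain_non_reaction(private_doubts, base_threshold=0.6, deference=0.08):
    """Return the index of the first gatekeeper to intervene, or None."""
    for i, doubt in enumerate(private_doubts):
        # Every predecessor who stayed silent is read as evidence that
        # the coast is clear, so the bar for intervening rises.
        effective_threshold = base_threshold + deference * i
        if doubt > effective_threshold:
            return i
    return None

# Most gatekeepers privately harbour real doubts (0.0 = no doubt,
# 1.0 = certain something is wrong), yet none ever clears the rising bar.
doubts = [0.5, 0.55, 0.6, 0.65, 0.7, 0.72, 0.75, 0.8]
print("first intervener:", chain_non_reaction(doubts))  # -> None
```

Run any of the later gatekeepers first and they would speak up; run them in sequence and none does. That, in miniature, is the chain non-reaction.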
Important point: no malice, skulduggery or conspiracy was required for this outcome. Indeed, that is what is so insidious about the process: each step seems so innocuous. There are no red flags that anything is wrong: the system appears to be functioning normally. If anyone were acting with obvious malice, others in the system would quickly recognise it and adopt a more critical disposition. They might — incentives permitting — even call it out, though history tells us whistle-blowers are routinely ignored: the ruins of many a broken empire are littered with prescient warnings disregarded.
So, terminology check: If a “system accident”, in Professor Perrow’s sense, is “a catastrophic implosion caused by an unexpected chain reaction of tightly-coupled components”, perhaps we should call these latent “non-catastrophic defects in a complex system caused by unexpected non-interactions between incentivised loosely-coupled components” something else: I suggest “system glitches”.
The failure mode here arose not from uncontrollable tight coupling but from what you might call “incentivised loose coupling”: each subsystem’s decision-making discretion was shaped by a limited, and pre-coloured, view of the whole picture, and needed interventions were forestalled by the institutional pressures, incentives and inadvertent information-filtering processes bearing upon the wider system.
“Incentivised loose coupling” is insidious because the system accidents it creates are often the product of omission as much as commission. They seem preventable: indeed, the system appears designed to prevent exactly the scenarios that arise; it just doesn’t. Many of the affected gatekeepers are only there in the first place as fail-safes to stop this kind of thing happening.
To be sure, there are some misapprehensions here. Executives commonly suppose the in-house legal team operates, at some level, as the business’s conscience. But not being in the “operational stack”, the legal department does not see the daily flow of business — it is involved by exception — so it is poorly positioned to fulfil this role. In any case, delegating business judgment to legal would send a terrible message to the front office: that ensuring prudent business practice was somebody else’s problem.
Still: any of these gatekeepers could have stopped the bad outcome and, had the system worked as intended, should have. But the system wasn’t out of control the way a melting-down reactor might be. Instead, the system remained defiantly, inexorably, insistently, perversely in control: it was just latently misbehaving. This was classic “systemantics”.
Indeed, it was not the speed but the glacial slowness of the failure that was critical here: the Post Office Horizon debâcle unfolded over fifteen years. Had it happened over a weekend, someone might have noticed something was wrong.
Somebody Else’s Problem fields
“An SEP,” he said, “is something that we can’t see, or don’t see, or our brain doesn’t let us see, because we think that it’s somebody else’s problem. That’s what SEP means. Somebody Else’s Problem. The brain just edits it out, it’s like a blind spot. If you look at it directly you won’t see it unless you know precisely what it is. Your only hope is to catch it by surprise out of the corner of your eye.”
― Douglas Adams, Life, the Universe, and Everything (1982)
Slow-moving system accidents can be hard to spot in a large commercial firm: given the short half-life of role occupancy — like sharks, modern executives are compelled to keep moving, and good ones tend not to be in situ long enough to identify (or take the rap for) slow-burning errors — and the short timeframes of corporate decision-making — success is measured in quarters, not decades — undertakings like the Horizon prosecutions would have seemed almost stationary, and fully controlled, making it harder to recognise them as latent system failures.
Individuals could have been — but likely will not be — forgiven for assuming that, if something was wrong with the information they were being given, somebody else would surely have picked it up. It was, as Douglas Adams put it, “somebody else’s problem”.
The Post Office Horizon Inquiry ran for over three years. It has yet to report. The oddly unsatisfying evidence of the dozens of gatekeepers called as witnesses — Post Office and Fujitsu executives, middle managers, investigators, engineers, in-house lawyers up and down the chain, and external lawyers — illustrates the folly of looking for human causes of such a latent system failure. Lots and lots of “SEPs” contributed to the Post Office Horizon debâcle.
Every one of the subsystems worked, within tolerances, according to its own rules of engagement. Any shortcuts and variances individuals took — overlooking technical glitches, assuming certain fact patterns in cases of ambiguity — were justifiable, well-intentioned exercises of “wit and experience”, albeit coloured by the priorities and interests driving those individuals.
No particular malice or intentional foul play was needed among any of the operators. No-one privately believed they were prosecuting innocent sub-postmasters.
There are, I dare say, “somebody else’s problem fields” just like these in every organisation on the planet.
There is one other thing to say. Unlike the tightly-coupled normal accidents Professor Perrow had in mind — which, by definition, are extremely rare “tail events”, unprecedented before they happen, painfully obvious when they do, and which tend to “resolve” themselves, by self-destruction, as soon as they occur — loosely-coupled latent system failures need not be rare (see asbestos), will by definition not be obvious, and will tend to continue indefinitely until someone notices them.
Latent failures are therefore inherently more likely than catastrophic failures.
So what should we look for if we want to find some?
Next time, in our crime and punishment thread, a proposed case of latent system error: the healthcare serial murder cases. Isn’t it a bit weird that so many serial murderers are doing exactly the same thing?