Somewhere in a parallel universe…
It’s Monday morning. An emergency meeting has been called. The team are sitting in silence around the boardroom table. Several employees have logged in remotely via Zoom, appearing awkwardly supersized on the obnoxiously large boardroom VC screens. Besides cursory “g’mornings”, nobody speaks.
The CTO strides purposefully into the room, claiming the seat at the head of the table, while developers stare anxiously into their laptop screens, studiously dodging eye contact. Predictably, the CTO is the first to speak: “As of 9am this morning, it’s been an entire month without downtime. Which team member is responsible for this extended run of stability?”
Silence.
He continues, “We need to understand the root cause. Was this technical stability or a case of human success? You all know that we have a blameless culture in this company, but we need to understand what happened. We need to hold each other accountable for our outcomes.”
Meanwhile, back on planet Earth…
Pretty bizarre, huh? This never happens. Why would anyone go hunting for the root cause of a successful scenario, or suggest that the responsibility for such a success might lie with an individual and their actions?
Many of us will have experienced similar situations, however, albeit following more negative events: an unexpected scenario, perhaps an outage, followed swiftly by a hunt for the root cause. Occasionally it’ll transpire that somebody did something (or didn’t do something) which, with the benefit of hindsight, was deemed to have been an error.
A human error.
Why Success and Failure Are Both System Outcomes
If we’re honest, a diagnosis of human error can bring a comfort of sorts, because it provides a sense of closure, control, and simplicity in an otherwise complex and uncertain situation. It lends credence to the illusion that the socio-technical system within which we work is inherently safe and reliable, only failing when aberrations occur which conflict with its otherwise optimal, intended design.
So on the one hand, it feels natural to trace system failure back to (and no further back than) a single human action, but on the other, it feels decidedly unnatural to attribute success to a single human action.
I’d like you to consider for a moment that it’s unhelpful to do either.
Understanding Human Behaviour Within Complex Systems
“Asking what is the cause (of an accident), is just as bizarre as asking what is the cause of not having an accident. Accidents have their basis in the real complexity of the system, not their apparent simplicity.” - Sidney Dekker
In his book ‘The Field Guide to Understanding Human Error’, Sidney Dekker described human error as “not a cause but a symptom of deeper system issues”. Just as it feels natural to attribute success to ‘the system’, so it is true that human actions which may later be described as errors also emerge from the system as symptoms, offering a tantalising window into the reality that most complex systems are intrinsically hazardous and often run in a degraded state.
Dekker continues, “Instead of blaming individuals, we should examine the conditions that made the error possible”. The fact is, from a universe of possible actions, the action that was taken was the one that seemed most appropriate at the time. Understanding why it seemed so is more important than the action itself. While a human action (or lack thereof) may have been the proximate cause, it was inevitably just one of many contributing and interacting factors emerging from the imperfect system. So attributing an incident to human error is just as illogical as attributing success to a single factor amongst many.
The Misunderstanding of Blamelessness and Accountability
And so follows the idea of a blameless culture, with blameless post-mortems and the like. The pointy finger of blame naturally follows a diagnosis of human error, and there’s nothing that inhibits learning quite like the fear of blame and its associated consequences. So the more one can internalise the idea of human error being a symptom of the system, the more open and honest one can be about what actually happened.
There is occasional pushback on this topic. It often takes the form of, “We’ve taken this blameless culture thing so far that people are afraid of even mentioning the actions of individuals during post-mortems.” Or, “It’s led to a lack of accountability.” This is understandable, and is itself another phenomenon that emerges from wider system conditions. It can happen if teams have adopted a practice of blamelessness but haven’t fully grokked that its purpose is not to avoid accountability. Rather, the goal is to create conditions under which teams are able to provide a full, unfiltered account of what actually happened, rather than what should, might or could have happened.
References
How Complex Systems Fail - Richard Cook
The Field Guide to Understanding Human Error - Sidney Dekker
Further Reading
Two Views on Human Error - Lund University - Human Factors and Systems Safety - Dr. Johan Bergström
From Safety I to Safety II - A White Paper - E. Hollnagel



