
Introduction: The ‘Lethal Brew’ of Incidents
Two separate readings this week inspired me to write this post:
- In his 1990 book, ‘Human Error’, James Reason argued that accidents are not caused solely by individual errors, but by a combination of factors, including latent conditions and organisational issues. He famously described frontline operators as adding the ‘final garnish to a lethal brew whose ingredients have already been long in the cooking’, meaning they inherit system defects and ultimately trigger an accident.
- An article by David Woods, which examines recent aviation incidents and argues that the government should take action to get aviation safety back on track.
The two pieces suddenly shed new light on my own observation, based purely on personal experience over 15 years, that major incidents come in flocks.
Here is what I’ve observed over and over:
while( I_Understand_source_of_resilience != false){
You get a series of sev 1s and 0s;
They shake up tech leadership, and everyone is in action mode;
All of a sudden, the voices of engineers on tech debt, cultural debt, and delivery trade-off decision-making are heard;
Time and resource investment (in more productive cases) is allocated to incident analysis and operational excellence initiatives (training, addressing tech debt, war games and chaos tests). Sometimes big consultancies also get a good chunk of that investment (no C-level is fired for hiring the giants);
A period of stability follows; all ranks take the huge achievement for granted, and the unseen effort is forgotten;
Revert to type;
}
//just to scratch my coding itch
If the above resonates with you, you’ll enjoy the rest. If not, please share your experience. Have you observed any particular patterns? Is there a pattern at all?
From David Woods’s observations, it appears that aviation followed a similar pattern, albeit on a much longer timescale.
I’m not claiming that incidents are predictable. You never know when the next major incident will hit you, but I’m convinced that as a technology leader, if you look for signs, you can smell when the lethal brew is brewing. However, it takes a courageous and insightful leader to take action and stop the brew when no one else in the budget approval chain can see what they see.
A handful of smells to sniff for:
- If you have a very low number of incidents raised, or small issues are automagically fixed without transparency.
- If post-incident reviews are not written (most likely deprioritised) or written but not discussed openly among your engineers and leaders. When was the last time you were present at an incident review meeting?
- If, as a leader, you don’t know what engineers are moaning about over coffee breaks. The most interesting intel about your system is discussed there; forget leadership meetings.

[Image caption: Dissatisfaction may be expressed over espressos, not debriefs]
- If your post-incident review actions and problem records are mostly about bits and bytes, alerting, observability, runbooks, or human error, rather than the organisational conditions behind them.
- If, as a leader, you don’t understand why your systems are running day after day without incident. Recall Woods’s law of fluency, which describes how a lot of the work required to keep things running is hidden.
- If you have just gone through/are going through a huge organisational change.
- If you have had a loss of institutional memory (many of the old guard leaving, voluntarily or involuntarily).
- If you sense a loss of ownership, that personal pride people have over their code (interesting to observe in the AI coding space).
- If you are not confident that, if a major incident happens in five minutes' time, your team is ready to deal with it effectively.
- If, after a major incident, you hear people murmuring, “I’m not surprised this happened!”
Even if you don't smell any of the above, you’ll still get an astonishing incident completely out of the blue every now and then. There is no escape from fundamental surprises. But if you smell the above, there is a fertile ecosystem for some difficult incidents.
Finding a Practical Way Forward
I leave you with a few practical steps that you can take today as a tech leader - or if you are an engineer, you can encourage your leader to do so. Either way, it is not easy and takes courage to disrupt the norm, but I promise you that your employer, employees, and peers will be thankful and reward you. After all, success is one step beyond courage!
Actively invest in incident analysis and reviews
Whatever you do, please remember it takes significant effort and time, but I guarantee you it pays off many times over. A bad way of doing it is to add the responsibility of conducting incident reviews on top of everything else that engineers or incident managers have to do. A better way of doing it is to train your staff to do incident analyses, giving them time and space - something else will have to come off the list.
Support regular incident review presentations/discussions that dive deep into an incident review report
As a tech leader, be curious and join the sessions where possible. You do not need to incentivise people to join them; they will because the topic is interesting to them.
Spend time with frontline engineers
Have a coffee chat and ask what's up! Observe what they do; they are often preventing incidents without even realising it. Ask what surprised them in the last incident they dealt with.
Rotate engineers between teams
A fresh set of eyes can help identify issues that long-standing staff may overlook. Moreover, breaking down silos will help organisations increase their agility.
Run regular game days
This is an essential step. For example, either a real-life incident will teach you that no one knows where to find the support contract number for your third-party supplier, or you learn that proactively during a game day. I don’t recommend the former: it's painful and embarrassing.
Conclusion
These practical steps can help ‘tip the cauldron’ of lethal brew. You don’t need a heroic intervention. You need the discipline to be curious and the courage to challenge the status quo.
And the truth is, leadership meetings rarely smell the brew. But the engineers probably already have. So, what are you going to do next?
References
https://www.infoq.com/articles/series-enhancing-resilience-2/
https://surfingcomplexity.blog/2021/05/30/dealing-with-new-kinds-of-trouble/





