
Note: This blog post draws on insights from our latest webinar, ‘How to Deliver Incident Response Training That Improves Recovery, Resilience and Results’. The panellists were Courtney Nash (CEO & co-founder, The VOID), Sarah Butt (Principal Engineer - Centralised Incident Response), Morgan Collins (Incident Management Architect) and Stuart Rimell (Chief Product Officer, Uptime Labs). Many thanks to these experts for sharing their hard-won insights.
1. Incident response is a distinct skill (not just 'ops but faster'!)
To some, incident response looks like standard engineering or ops work. Yet its cognitive demands and coordination load are distinct.
Morgan noted that people can be excellent engineers yet struggle in incidents, because the skill set only partially overlaps.
So what? You can’t assume good engineers automatically make good incident responders or commanders. It needs its own practice and support.
2. Expertise is built by doing incidents, not just reading about them
You can build knowledge from books, talks and shadowing. But real expertise comes from actually running incidents.
The catch: you have far fewer incidents than daily tasks, so exposure is limited.
So what? Organisations need deliberate practice opportunities (drills, simulations, game days) to accelerate exposure, rather than waiting for production to burn.
3. The ‘law of fluency’: experts can’t easily explain why they’re good
As people become fluent, their skills become automatic and invisible even to themselves.
The downside is that this makes it difficult for them to teach others, and hard for orgs to capture that expertise.
So what? Work deliberately to surface expert know-how through structured debriefs, incident analysis, coaching, and teaching skills for senior responders.
4. Training usually stops at the ‘beginner’ level
You could divide incident response training into three levels:
Beginner – basic environment knowledge, runbooks, who to page, tools.
Intermediate – learning to treat runbooks as a floor, not a ceiling: pivoting, adapting, handling weird situations.
Expert – innovating, shaping the program, spotting systemic weaknesses, building organisational resilience.
Most companies stop at the beginner level: ‘Here’s the runbook, here’s the tool.’
So what? To get real maturity, try to intentionally support intermediate and expert development, not just onboarding.
5. Repetition isn’t enough; you need learning from repetition
Recognition-primed decision making (RPD) and pattern recognition come from repeated exposure plus reflection.
It’s not ‘do a lot of incidents’. It’s ‘do incidents + analyse what happened + contrast with others’ thinking’.
So what? Debriefs and incident analysis are critical, not optional overhead.
6. Culture decides whether adaptation is possible
Experts must deviate from runbooks in complex, novel situations.
If people get punished when an adaptation (made in good faith) goes wrong, everyone retreats to 'just follow the runbook,' even when that will fail.
So what? Effective incident culture:
- Accepts that actions are ‘gambles’ under uncertainty.
- Judges decisions based on information available at the time, not hindsight.
- Actively rewards thoughtful adaptation, not blind compliance.
7. Runbooks are valuable. They won’t save you.
Runbooks/checklists are great for cognitive offload and onboarding.
But in complex systems, no script can cover every situation.
So what? Treat runbooks as scaffolding, not safety nets that guarantee success. The real safety comes from skilled humans who know when and how to depart from the script.
8. Metrics like MTTR don’t tell you if learning is happening
Executives tend to want rolled-up numbers (incident counts, MTTR, etc.), but these don’t capture cognition, coordination, or expertise.
The real question is: ‘Is learning happening?’, not ‘Did MTTR go down by 3%?’
Signals of learning include:
- Strong incident analysis practice.
- Incident reports are being reused in design, planning, onboarding, and training.
- People report that training helps them under pressure, not just in slides.
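To make the point about rolled-up numbers concrete, here is a minimal sketch in Python. The recovery times are invented purely for illustration and do not come from the webinar. It shows how MTTR can 'improve' slightly from one quarter to the next while the worst incident gets much worse, and how neither figure says anything about whether responders are learning.

```python
from statistics import mean, median

# Hypothetical recovery times in minutes (invented for illustration only).
q1 = [30, 35, 40, 45, 50]
q2 = [10, 12, 15, 18, 140]


def mttr(durations):
    """Mean time to recovery: the arithmetic mean of the recovery durations."""
    return mean(durations)


for label, durations in (("Q1", q1), ("Q2", q2)):
    # Print the rolled-up mean next to the median and the worst case.
    print(label, "MTTR:", mttr(durations),
          "median:", median(durations),
          "worst:", max(durations))
```

With these made-up numbers, MTTR falls from 40 to 39 minutes, a 2.5% 'improvement', even though the longest recovery nearly triples. The average hides the shape of the data, and it certainly cannot show what responders took away from that long incident.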
9. Performance improvement: emphasise insight generation over error reduction
Morgan cited Gary Klein: performance improvement comes from both reducing errors and generating insights.
Tech organisations tend to overfocus on error reduction (fewer incidents, shorter outages) and underinvest in insight generation (better understanding, better decisions).
So what? If you don’t explicitly prioritise insight generation (via analysis, reflection, sharing), you cap how good your org can get.
10. Incident analysis is its own high-value speciality
There’s a strong case for dedicated incident analysts:
- They elicit hidden expertise.
- They surface friction and success patterns.
- They help translate incident learnings into org-wide change.
(We worked with Eric Dobbs, Principal Incident Analyst at Indeed, on the post ‘What Experts See That the Rest of Us Miss During Incidents’. We would love to know what you think.)
Practical Takeaways
- Treat incident response as its own skill
Don’t assume strong engineers automatically excel in high-pressure incidents. Provide explicit training and clear expectations for the role.
- Build practice time, not just theory
Reading runbooks isn’t enough. Utilise simulations, drills and shadowing to provide responders with safe and repeatable exposure to real-world complexity.
- Use runbooks as scaffolding, not scripts
They should support responders, not constrain them. Train people to recognise when adaptation is needed - and back them when they make rational decisions under uncertainty. As Sarah said: “There's a tendency to equate easy with good, and that's not the case in these messy, complex incidents.”
- Invest in teaching skills for your experts
The law of fluency means experts can’t always articulate what they know. Provide coaching frameworks, pair new responders with seasoned ones, and deliberately surface tacit knowledge.
- Analyse incidents to extract insight, not blame
Debriefs should focus on understanding decisions, cognitive load, coordination challenges and system behaviour - not judging outcomes with hindsight.
- Share learnings widely
Circulate incident reports, incorporate insights into onboarding and design reviews. Ensure executives understand what actually happens during incidents.
- Build a culture that supports adaptation
If people fear punishment for good-faith decisions, they will retreat to rigid playbook-following (exactly when flexibility is most needed!). Reward thoughtful experimentation, not neat stories.
- Measure learning, not just numbers
MTTR and incident counts won’t tell you if responders are getting better. Look instead for signals such as growing confidence, smoother coordination, increased cross-team awareness and insights emerging in product changes.
- Consider a dedicated incident analysis capability
Analysts help turn messy real-world events into organisational knowledge. This is one of the highest-leverage roles for improving resilience.





