
Ready to make incident response your competitive advantage?
See how Uptime Labs builds provable, scalable incident response capability across your organisation.
By far, the most common need that I hear about from customers and partners alike is:
“I want to see consistent high-quality incident response across all my team, regardless of who is responding and where they are located".
It feels like a solvable problem because when we see an experienced person competently and confidently handling an incident, we know it. But it’s hard to replicate it. High-quality incident response can be replicated, but it takes years of experience and many incidents working alongside master responders.
And one reason it’s hard to replicate is that it’s really difficult to measure the quality of performance and skills during incident response. That, in turn, creates new challenges: setting minimum quality standards, developing an effective skills-and-expertise roadmap and establishing a credible certification program to ensure consistency.
To add to the complexity, ‘consistent quality of response’ does not just depend on skills and expertise (note that there is an important difference between the two!). Rather, the entire ecosystem in which incident management occurs has a key role to play, namely organisational culture, technology, and processes.
An example from my real-life experience shows how factors other than skills and expertise come into play:
The setting is a financial trading firm. An incident is raised to flag slow “trade execution confirmation”. This could potentially have a regulatory impact, but at the start, there is no evidence to suggest it.
While the incident manager was establishing the facts, a Head of Department (who happened to be on the internal team mailing list and sitting next to the CEO) mentions the issue to CEO who then panics and calls the CTO. The CTO then feels incompetent because he heard it from his boss rather than his Incident Management team.
From this point, you can imagine where the incident manager’s time and attention are directed, and it is not fixing the incident. All of it happened in the first 5 minutes before the incident manager had a chance to send the initial comms.
The incident gained visibility amongst senior leadership as a badly managed incident with poor communication. Yet the incident manager was a very experienced and reliable incident manager. There’s no question that the incident comms was handled poorly, but not at the fault of the Incident Manager. It was a result of process, culture, and stakeholder education.
For the purposes of this post, I set aside the process, organisational culture, and technology implications and focus only on what it means to ensure that high-quality skills and expertise are consistently deployed in incident response. Two questions immediately jump to mind:
Question 1: How do we measure the quality of incident response?
When I ask what you notice when you notice a consistent high quality of incident response, I get: when, for every single incident, people on the incident project demonstrate confidence, show skills and expertise, and communicate clearly and concisely. Some also explicitly mention seeing evidence of adaptability.
The first challenge is the word ‘consistent’. Incidents, by nature, are a surprise or surprise2. What happens in incidents is mostly unique and inconsistent. Basically, the need is 'managing inconsistent events consistently', a little paradoxical!
I skipped this question for now. It’s a very important question, and we need an answer to define a standard that we can use to measure consistency. It’s a topic that requires thorough research. Stay tuned ;)
Question 2: How can we ensure that all on-call staff have the skills and expertise to run an incident and meet the minimum quality standard (as mentioned in question 1)?
Thankfully, there is at least one precedent to get inspiration from. Aviation is successfully delivering a high level of safety (operating in normal and abnormal situations) using an integrated and coherent system of standards and certification across all elements of its eco-system:
Aviation and healthcare have similarities in their approach to certification. Academic research (John J Norcini) suggests that effective certification assesses the knowledge and skills that truly matter for practice (in educational terms, it has content validity and predictive validity).

When it comes to ensuring adequate human performance, aviation combines training and certification that have content validity and predictive validity. It then reinforces those credentials through continuous learning - recurring checks and simulations.
A similar philosophy guides many medical programs, which rely on frameworks like such as Miller’s Pyramid of Clinical Competence. The framework (“knows,” “knows how,” “shows how,” “does”) is often used to design medical certifications that test not only factual knowledge but also applied skills and decision-making.
Back to our own industry, we are not short of standards and certificates (see Table 1 ), but I wonder why this problem of “delivering high-quality incident response all the time“ is still unresolved?
The other question to explore is how many IT certificates assess students beyond ‘knows’ and ‘knows how’ to do, giving confidence that the holder of the certificate can apply skills in real-life uncertain situations?
What are your thoughts on the applied value of current certificates in real-life situations? Is certification a path forward to achieve a consistent, high-quality incident response?
P.S. It felt wrong that after 25 years in this industry working on mission-critical IT systems, I was not aware of many of these standards. How many of them do you recognise?
Table 1 - Standards governing the IT ecosystem




