Mission Impossible? - Consistently Dealing With Surprise Inconsistencies

Hamed Silatani

July 10, 2025

By far, the most common need that I hear about from customers and partners alike is:

“I want to see consistent, high-quality incident response across my whole team, regardless of who is responding and where they are located.”

It feels like a solvable problem because we know competent, confident incident handling when we see an experienced person doing it. But it is hard to replicate. High-quality incident response can be reproduced, but only through years of experience and many incidents spent working alongside master responders.

And one reason it’s hard to replicate is that it’s really difficult to measure the quality of performance and skills during incident response. That, in turn, creates new challenges: setting minimum quality standards, developing an effective skills-and-expertise roadmap and establishing a credible certification program to ensure consistency.

To add to the complexity, ‘consistent quality of response’ does not just depend on skills and expertise (note that there is an important difference between the two!). Rather, the entire ecosystem in which incident management occurs has a key role to play, namely organisational culture, technology, and processes.

An example from my real-life experience shows how factors other than skills and expertise come into play:

The setting is a financial trading firm. An incident is raised to flag slow “trade execution confirmation”. This could potentially have a regulatory impact, but at the start, there is no evidence to suggest it.

While the incident manager was establishing the facts, a Head of Department (who happened to be on the internal team mailing list and was sitting next to the CEO) mentioned the issue to the CEO, who then panicked and called the CTO. The CTO, in turn, felt incompetent because he had heard about it from his boss rather than from his Incident Management team.

From this point, you can imagine where the incident manager’s time and attention were directed, and it was not at fixing the incident. All of this happened in the first 5 minutes, before the incident manager had a chance to send the initial comms.

The incident gained visibility amongst senior leadership as a badly managed incident with poor communication. Yet the incident manager was very experienced and reliable. There’s no question that the incident comms were handled poorly, but through no fault of the incident manager. It was a result of process, culture, and stakeholder education.

For the purposes of this post, I will set aside the process, organisational culture, and technology implications and focus only on what it means to ensure that high-quality skills and expertise are consistently deployed in incident response. Two questions immediately spring to mind:

Question 1: How do we measure the quality of incident response?

When I ask people what they notice when they see consistently high-quality incident response, the answer I get is that, for every single incident, the people working on it demonstrate confidence, show skills and expertise, and communicate clearly and concisely. Some also explicitly mention seeing evidence of adaptability.

The first challenge is the word ‘consistent’. Incidents, by nature, are a surprise or surprise². What happens in incidents is mostly unique and inconsistent. Basically, the need is ‘managing inconsistent events consistently’, which is a little paradoxical!

I will skip this question for now. It’s a very important question, and we need an answer to it in order to define a standard that we can use to measure consistency. It’s a topic that requires thorough research. Stay tuned ;)

Question 2: How can we ensure that all on-call staff have the skills and expertise to run an incident and meet the minimum quality standard (as mentioned in question 1)?

Thankfully, there is at least one precedent to draw inspiration from. Aviation successfully delivers a high level of safety (operating in both normal and abnormal situations) using an integrated and coherent system of standards and certification across all elements of its ecosystem.

Aviation and healthcare have similarities in their approach to certification. Academic research (John J. Norcini) suggests that effective certification assesses the knowledge and skills that truly matter for practice (in educational terms, it has content validity and predictive validity).

When it comes to ensuring adequate human performance, aviation combines training and certification that have content validity and predictive validity. It then reinforces those credentials through continuous learning - recurring checks and simulations.

A similar philosophy guides many medical programs, which rely on frameworks such as Miller’s Pyramid of Clinical Competence. The framework (“knows,” “knows how,” “shows how,” “does”) is often used to design medical certifications that test not only factual knowledge but also applied skills and decision-making.

Back to our own industry: we are not short of standards and certificates (see Table 1), so why is this problem of “delivering high-quality incident response all the time” still unresolved?

The other question to explore is how many IT certificates assess candidates beyond ‘knows’ and ‘knows how’, giving confidence that the holder of the certificate can apply those skills in uncertain, real-life situations?
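One way to make that question concrete is to treat Miller’s Pyramid as an ordered scale and ask where the highest level of each assessment sits. The snippet below is only a minimal sketch: the assessment names and the levels assigned to them are my own illustrative assumptions, not a rating of any real certificate.

```python
from enum import IntEnum


class MillerLevel(IntEnum):
    """The four levels of Miller's Pyramid, ordered from base to apex."""
    KNOWS = 1       # factual knowledge
    KNOWS_HOW = 2   # can explain how the knowledge would be applied
    SHOWS_HOW = 3   # demonstrates the skill in a simulated setting
    DOES = 4        # performs the skill in real practice


def assesses_applied_skill(highest_level: MillerLevel) -> bool:
    """Return True if an assessment reaches beyond 'knows' and 'knows how'."""
    return highest_level > MillerLevel.KNOWS_HOW


# Hypothetical assessment formats and assumed levels, for illustration only.
assessments = {
    "multiple-choice exam": MillerLevel.KNOWS,
    "written scenario analysis": MillerLevel.KNOWS_HOW,
    "simulated incident drill": MillerLevel.SHOWS_HOW,
}

# Keep only the formats that test applied skill.
applied = [name for name, level in assessments.items()
           if assesses_applied_skill(level)]
print(applied)
```

Filling in such a mapping for the certificates in Table 1 would require a judgment about each exam format, which is exactly the open question here.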

What are your thoughts on the applied value of current certificates in real-life situations? Is certification a path forward to achieve a consistent, high-quality incident response?

P.S. It felt wrong that after 25 years in this industry working on mission-critical IT systems, I was not aware of many of these standards. How many of them do you recognise?

Table 1 - Standards governing the IT ecosystem

IT-Ecosystem Area	Core Standard / Framework (↑ = mandatory regulation)	Typical Certifications that “plug in” to, prove, or operationalise that standard	Who is certified (Org / People)
Infrastructure reliability (data-centre / facility)	Uptime Institute Tier Standard	Tier Certification of Design Documents / Constructed Facility (TCDD/TCCF) • Accredited Tier Designer (ATD)	Org / People
Infrastructure reliability (data-centre / facility)	ANSI/TIA-942 ; ISO/IEC 22237 ; BICSI 002	TIA-942 Data-centre Conformity Certificate • Certified TIA-942 Design Consultant (CTDC) • ISO 22237 Certificate of Conformity • EPI CDCE/CDCX/CDCSP	Org / People
Resilient system design & security controls	ISO/IEC 27001 + 27002	ISO 27001 ISMS Certificate • ISO 27001 Lead Auditor / Lead Implementer • CISSP, CISM, CISA (map to 27002 control domains)	Org / People
	NIST SP 800-53 / Risk-Management Framework	FedRAMP / StateRAMP ATO • (ISC)² CAP (Certified Authorization Professional)	Org / People
	CIS Critical Security Controls	CIS Controls Practitioner Certificate • CIS SecureSuite Membership	— / People
	NIST SP 800-160 (Cyber-resilient engineering)	(guidance only – no formal cert)	—
Data protection / availability of information	ISO/IEC 27701 (Privacy ISM)	ISO 27701 Certificate • ISO 27701 Lead Auditor	Org / People
	PCI-DSS ↑	PCI-DSS ROC / SAQ compliance • QSA, ISA, PCIP (individual assessor creds)	Org / People
	SOC 2 (Trust Services – Security & Availability)	SOC 2 Type II Attestation • CPA / SOC Practitioner credentials	Org / People
Risk governance & cyber-resilience management	ISO 31000 ; ISO/IEC 27005	ISO 31000 Lead Risk Manager • ISO 27005 Risk Manager	— / People
	COBIT 2019	COBIT Foundation / Design & Implementation certs	— / People
	NIST Cyber-security Framework 2.0	(framework – often mapped to ISO/NIST controls; no direct cert)	—
	ISACA CRISC	(maps to ISO 31000 / COBIT risk domains)	— / People
Incident response & business continuity	ISO 22301 (BCMS)	ISO 22301 Certificate • ISO 22301 Lead Auditor / Implementer • DRI CBCP/ABCP	Org / People
	ISO/IEC 27035 ; NIST SP 800-61	GIAC GCIH, EC-Council ECIH • MIM® Professional / Expert • Blackrock 3 Incident-Mgr levels • MIM Operational Certification (whole IR function)	People / Org
	ITIL 4 – Service Continuity & Incident Mgt.	ITIL 4 Foundation / Managing Professional	— / People
Regulatory resilience & sector rules	↑ GDPR (Art 32) ; ↑ NIS2 ; ↑ EU DORA (2025)	GDPR, NIS2, DORA “compliance attestations” (external audits) • Emerging DORA Practitioner courses	Org / People
	↑ HIPAA Security-Rule (US)	HIPAA Compliance Audit • AHIMA CHPS	Org / People
	↑ FISMA / FedRAMP (US Gov)	FedRAMP Moderate / High ATO • CMMC Assessor (for contractors under CMMC 2.0)	Org / People
	↑ CMMC 2.0 (Level 1-3)	CMMC Certification (Level 1-3) • Certified CMMC Professional / Assessor	Org / People
	↑ NERC CIP (North-American grid)	NERC-CIP Compliance Audit • GridSec/ICS-focused incident-handler courses	Org / People
	↑ APRA CPS 234 & CORIE (AU) ; MAS TRM (SG)	CPS 234 Compliance Assessment • CORIE intelligence-led red-team exercise attestation • MAS TRM “notice of compliance”	Org / People

Hamed Silatani

Hamed is the co-founder and CEO of Uptime Labs. He has 20 years of experience in engineering leadership, reliability engineering and IT operations. Having spent the majority of his career at the sharp end of incident response in financial services, he's looking to help all companies master the unexpected.