5 Incident Response Principles for CTOs

Joe Mckevitt
|
February 26, 2026
Taggs:
Blog
Incident Management
Outages
IN THIS ARTICLE

Ready to make incident response your competitive advantage?

See how Uptime Labs builds provable, scalable incident response capability across your organisation.

Most organisations don’t truly feel the importance of reliability until they’ve lived through a period of instability – a run of incidents that comes in threes, as the joke goes. That’s usually when it becomes viscerally, painfully clear how central stability is to the core business. And as a CTO, it’s also when you feel the full weight of accountability.

So here are the principles I come back to, in every CTO conversation, when the topic turns to operational resilience. I am both a CTO at Uptime Labs and I’ve worked under CTOs – and it’s taken many many incidents, mistakes and discussions to refine these principles. Hopefully, these principles provide some food for thought.

Principle 1: Incidents Will Happen

Accept it: incidents are inevitable. You do your best across the software delivery lifecycle – you put in the checks and safeguards – but incidents will still occur. So the question is never whether they’ll happen. It’s whether you’re prepared when they do.

We also say – hope is not a strategy. Relying on a single heroic engineer to ride in and save the day is not a strategy. What you need is a playbook – and a culture – built through deliberate, repeatable practice that means your team is ready to respond effectively, every time.

Principle 2: Having The Technical Foundations is Important

On the technical side, resilience means building systems that can gracefully degrade. Auto-scaling, load balancing, architectural simplicity – these all matter. When everyone on your team understands how the system works, there’s no secret sauce that only one person holds (as Karan points out, there’s no need for everyone to know everything – a basic understanding is fine). Hence, no need to get Brett from the Phoenix Project on a call, because he’s the only one who understands the infrastructure.

Equally important is observability. You need to see what’s happening in production in real time, interpret the signals quickly, and act. Alerting strategy matters too – slow signal, slow response.

The biggest time sinks during outages are almost never the fix itself: they’re the time lost diagnosing what’s wrong. The stakes are significant: New Relic’s 2025 Observability Forecast, based on a survey of nearly 2,000 engineering and IT leaders, found that high-impact outages carry a median cost of $2 million USD per hour – or around $33,000 for every minute systems remain down. Ideally, you’re finding out about a problem before your customers or your business does.

Principle 3: The People Side Is Just as Critical

Yet, at the same time, technical architecture alone won’t save you. As my colleague and friend Stuart pointed out, incidents are fundamentally different from routine work. They’re ambiguous, time-pressured and demand an entirely different kind of thinking. The skills and preparedness of the people on call are equally important. When an incident hits, how does your team respond? Are they calm and clear about their roles?

There are two critical concepts at play here: psychological safety and a blameless culture.

Psychological safety means people feel comfortable surfacing problems without fear. As a leader, I reinforce this constantly – not just during incidents, but especially in the quiet periods between them. This might be a technical problem.

This psychological safety has to be an operating mode that everyone understands. It’s safe to say something went wrong. What matters is how we adapt.

A blameless culture is closely linked. When an incident occurs, the question isn’t “whose fault was it?” It’s “what can we learn, and how do we improve?” Clear leadership and role clarity during an incident help people answer the question calmly, rather than defensively.

That shift in framing changes *everything*. Leaders at every level need to model it.

Principle 4: Measure the Learning, Not Just the Downtime

This is a principle I genuinely believe in: mature organisations don’t just track downtime – they measure learning.

Every incident is a learning opportunity. The actions and improvements that come out of post-incident reviews should be tracked, reported on, and fed back into the software delivery lifecycle – addressing process gaps, architectural weaknesses, and yes, skill gaps too. Could communication have been clearer? Were roles well-defined? Did the team make decisions confidently?

There’s an important distinction worth naming here, our friends over at Adaptive Capacity Labs articulate well: learning and fixing are not the same thing. When post-incident reviews focus narrowly on identifying fixes – patching the specific thing that broke – they often miss the deeper value.

Crisis has a way of revealing truths about complex systems that only become visible under pressure. Developing a richer understanding of how and why an event unfolded consistently yields a greater return. Better fixes tend to follow from that deeper understanding, not the other way around.

Resilience loop infographic

As CTO, part of my role is watching these learnings accumulate and making sure they actually get actioned. It’s easy for remediation work to get deprioritised in favour of the next delivery feature. That’s a trap. The CTO needs to sponsor this work and hold the line.

Principle 5: Incidents Are Leadership Moments

When an incident happens, how you respond tells your team everything about the culture you’re building. Are you calm? Curious? Focused on solutions rather than blame? Do you communicate clearly?

You set the bar. Your team will take their cues from you. So use these moments to demonstrate the culture you want – one that’s safe and always looking to improve. The best way to make sure your team is ready is to practise under realistic pressure before the real thing hits – building the muscle memory that holds when it matters most.

One more thing worth considering: who actually reads your post-incident write-ups? Research by Adaptive Capacity Labs suggests that the goal of incident analysis should be to build the richest possible understanding of an event for the broadest possible audience — not just the engineers who responded, but colleagues across the business who might encounter similar situations, and engineers who haven’t even joined yet. If your write-ups are only ever read by the on-call team, you’re leaving most of the learning value on the table. The best incident write-ups become institutional memory.

Of course, you can’t force people to read the write-ups. But you can encourage a company culture where everyone’s invested in operational resilience.

Nothing is more frustrating than the same incident recurring because lessons weren’t acted on. Remember: Runbooks alone won’t save you. What saves you is a team that has genuinely internalised how to respond.

Get the learning done. Make the system more resilient each time. And make sure your people are more capable each time, too.

Summary

Summary - incident response principles

These five principles, bundled together, are a solid starting strategy for operational resilience. They’re what I find myself coming back to in every CTO conversation about what it means to be accountable for a reliable system.

Joe Mckevitt

Joe is the co-founder and CTO of Uptime Labs. A passionate technologist and developer, he has 17 years’ experience in building and scaling high-performing products and teams. Also a marathon runner, he’s wired for high performance. He loves creating cultures of constant innovation, and coaching people to develop their full potential.

Share this post

Ready to make incident response your competitive advantage?

— Chris Voss

See how Uptime Labs builds provable, scalable incident response capability across your financial services organisation.