Incidents *Will* Happen. Are You (Actually) Prepared?

Joe Mckevitt | April 2, 2026
Tags:
AI & Automation
Best Practices
Blog
Incident Management
Ready to make incident response your competitive advantage?

See how Uptime Labs builds provable, scalable incident response capability across your organisation.

Setting the scene

Sometimes, when I’m talking to someone, I detect a strong air of scepticism that there’s an incident on their horizon. After all, they haven’t had an incident for a long time, which means their people, processes and technology are sufficiently resilient, right?

I’m not here to burst anyone’s bubble, but one of my core beliefs is that incidents will happen. When they will happen, and how severe they will be, is unknowable. But they’re going to happen and you’ve got to be ready. Remember the old saying: failing to prepare is preparing to fail.

Prevention and preparation are not the same thing

Most organisations don’t have a preparedness problem on paper. They have runbooks. They have on-call rotas. They have post-mortem templates decaying in a Confluence page somewhere. What they often lack is the honest acknowledgement that underpins all of it: incidents are not edge cases.

That shift in mindset, from hoping incidents won’t happen to genuinely preparing for when they do, is the foundation of operational resilience. Everything else then builds on top of that.

Engineering preparation

A significant amount of engineering effort goes into prevention: code reviews, release gates, test coverage, careful change management. All of that is right and necessary. But prevention-only thinking carries a hidden assumption: that the system can be made reliable enough to avoid failure altogether. In complex distributed systems, under real-world load and real-world change, that assumption doesn’t hold.

The organisations that handle incidents well have internalised something different. They still invest in prevention. But they invest equally in preparation, because they know the two are not substitutes for each other.

The cost of getting this wrong is significant. When an organisation is not genuinely prepared, the incident itself is rarely the biggest problem. The chaos around it is. Wrong people scrambling for context. Communication breaking down. Time lost not to the fix, but to the confusion before it. The technical issue becomes a secondary concern to the organisational one.

Hope is not a strategy

There is a version of incident response that many organisations quietly rely on: the heroic engineer (think Brent from The Phoenix Project), the person who knows the infrastructure inside out, who gets called at 2am and somehow holds everything together. That might work once or twice, but it is definitely not a ‘strategy’.

A strategy is a playbook, built through deliberate and repeatable practice, that means your team is ready to respond effectively regardless of who is on call that night. It means the system’s behaviour under failure is understood in advance, not discovered in the middle of a live incident. It also means your team has practised the communication and decision-making that incidents demand, before a real one tests them.

As covered in my main article 5 Incident Response Principles for CTOs, this is the starting point for the whole framework. The principles that follow, covering technical foundations, people and culture, learning loops and leadership, only make sense once this first principle is internalised.

The mindset that changes everything

Accepting that incidents will happen is not pessimism. It is the precondition for building something that can handle them well. It moves the question from ‘how do we stop this happening?’ to ‘how do we make sure we are ready when it does?’

That is a more honest question. And it leads to much better answers.

The role of AI

Using AI in the execution path of incident response is inevitable; it’s clearly where things are heading. But its introduction also adds another layer of complexity to how incidents emerge and evolve.

As a result, the nature of incidents will change. Failure modes will not disappear, but they will take on different shapes and surface in less familiar ways. What that looks like in practice is still emerging, and I am actively trying to understand where those new edges are and how they behave under real conditions.

But increased complexity does not reduce the need for preparation. It reinforces it. When those unfamiliar incidents occur, they are unlikely to be neatly resolved by automated workflows, and the responsibility will still fall to human responders to make sense of them.

So while the tooling is changing, one thing is not: Incidents will still happen.

And the core principles of good incident response still apply. In my opinion, the skills that matter today will likely become even more important: preparation, decision-making, communication, and the ability to operate under pressure.

I’ll revisit this in a future post once there’s more evidence and AIOps matures. Even so, the direction of travel is already clear.

In short

The starting point is simple. Accept that incidents will occur. From there, be proactive. Be ready. Execute as effectively as you can to recover when they happen. And most importantly, take what you learn and feed it back into your organisation to drive continuous improvement.

That is how resilience is built in practice. Not by avoiding failure, but by getting better at handling it every time it happens.

Joe Mckevitt

Joe is the co-founder and CTO of Uptime Labs. A passionate technologist and developer, he has 17 years’ experience in building and scaling high-performing products and teams. Also a marathon runner, he’s wired for high performance. He loves creating cultures of constant innovation, and coaching people to develop their full potential.

