
This article is part of my ongoing ‘5 Incident Response Principles for CTOs’ series. Check it out – I’d love to hear your thoughts.
Recently, I described my first incident response principle – the inevitability of incidents. This was an interesting experience for me – for years, I’ve met folks who focus only on building systems that ‘should’ never have incidents and forget about principle 1.
But for those who accept that incidents will happen, the next step is thinking about how to build in technical resilience to mitigate the impact as far as is realistically possible.
How badly incidents hurt, and how quickly you recover, has a lot to do with the technical decisions you made long before the incident started. That’s what this article – and Principle 2 – is really about.
As a CTO, your goal isn’t only to prevent failure. It’s to ensure you’re building systems that behave predictably when failure arrives.
Building for graceful degradation
The foundation of technical resilience is architecting your systems to degrade gracefully rather than collapse entirely.
Architecture is at the core of technical resilience. Well-architected systems split responsibility across domains – following principles like domain-driven design, or the well-architected frameworks offered by AWS and GCP – so that when something fails, the blast radius stays contained.
That means prioritising auto-scaling, load balancing and redundancy. They aren’t luxuries: they’re the difference between a partial outage and a full one; between a P2 and a P1.
It also means designing explicitly for failure: if you have a dependency on a third-party service, the question isn’t whether it will go down, but whether your system is built to handle it when it does. Circuit breaker patterns and similar approaches exist precisely for this. Each incident, in turn, becomes a feedback loop – an opportunity to inspect how the architecture actually behaved under pressure and adapt it accordingly.
Resilience testing and chaos engineering take this further: deliberately introducing failure to verify that the system responds the way you designed it to.
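To make the circuit breaker idea concrete, here is a minimal Python sketch. It is an illustration under my own assumptions, not a production library: the names CircuitBreaker, CircuitOpenError and fetch_with_fallback, and the default thresholds, are all hypothetical. After a run of failures the breaker stops calling the dependency for a while and the caller serves a degraded fallback instead of failing outright.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    # All names and defaults here are illustrative assumptions.
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency calls are paused")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the breaker.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def fetch_with_fallback(breaker, fetch, fallback):
    """Return live data when possible, otherwise a degraded fallback."""
    try:
        return breaker.call(fetch)
    except Exception:
        return fallback
```

A resilience test in the same spirit would pass in a function that deliberately raises, standing in for the third-party outage, and then check that callers keep receiving the fallback and that the dependency stops being called once the breaker has opened.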
Use simplicity to combat dependency
Equally important is simplicity. Complex architectures create knowledge silos, and knowledge silos create single points of failure – in your people.
This is something we discuss frequently in the Uptime Labs team. Key person dependency, also known as the bus factor (the number of team members who would have to be ‘hit by a bus’ before the team loses critical knowledge it can’t recover), is a more common problem than you might think.
No high-performing team should need to call the one engineer who understands the infrastructure because they’re the only one who does. Likewise, there’s often one engineer who excels at responding to incidents who gets paged frequently – which is a one-way ticket to burnout.
You don’t need everyone to know everything in order to respond effectively; a basic shared understanding of a system’s layers is usually enough.
Moving to open standards helps here too: familiar patterns reduce cognitive load, especially during the moments that matter most. Keep things simple. Design patterns exist for good reason.
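One lightweight way to surface key person dependency is to write down who can actually support each part of the system and flag anything with a single name against it. The sketch below assumes a hand-maintained mapping; the component and engineer names are made up for illustration.

```python
def single_owner_components(knowledge_map):
    """Return components that only one person can support, keyed to that person."""
    return {
        component: next(iter(people))
        for component, people in sorted(knowledge_map.items())
        if len(people) == 1
    }


# Hypothetical data: component -> engineers who could respond to an incident in it.
knowledge_map = {
    "payments-api": {"asha", "ben"},
    "kubernetes-cluster": {"carlos"},
    "data-pipeline": {"asha", "dee", "ben"},
    "legacy-billing": {"carlos"},
}

for component, person in single_owner_components(knowledge_map).items():
    print(f"{component}: only {person} can support this")
```

In this made-up example one engineer is the only cover for two components, which is exactly the pattern that leads to both fragile response and burnout.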
Observability is not optional
You can have the most robust architecture in the world and still respond slowly if you can’t see what’s happening. Observability – real-time visibility into your production systems – is the nervous system of your incident response capability.
This means being able to take the signals, interpret them quickly, and act. It also means having an alerting strategy that’s actually calibrated: slow signal means slow response – and slow response is expensive. According to New Relic’s 2025 Observability Forecast, based on a survey of nearly 2,000 engineering and IT leaders, high-impact outages carry a median cost of $2 million per hour. That’s around $33,000 for every minute systems remain down.
The biggest time sink during an outage is almost never the fix itself. It’s the time spent diagnosing what’s wrong. Every minute of confusion in the early stages of an incident is a minute of cost. The investment in observability is straightforward to justify.
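The arithmetic behind that per-minute figure, and what faster diagnosis is worth, can be sketched in a few lines. The hourly cost is the median quoted above; the 45-minute and 15-minute diagnosis times are hypothetical numbers chosen purely for illustration.

```python
HOURLY_COST_USD = 2_000_000  # median cost per hour quoted above


def cost_per_minute(hourly_cost=HOURLY_COST_USD):
    """Convert an hourly outage cost into a per-minute cost."""
    return hourly_cost / 60


def outage_cost(minutes_down, hourly_cost=HOURLY_COST_USD):
    """Estimated cost of an outage lasting the given number of minutes."""
    return minutes_down * cost_per_minute(hourly_cost)


# Hypothetical scenario: better observability cuts diagnosis from 45 to 15 minutes.
saving = outage_cost(45) - outage_cost(15)

print(f"Per minute: ${cost_per_minute():,.0f}")
print(f"Saved by diagnosing 30 minutes faster: ${saving:,.0f}")
```

At the quoted median, every 30 minutes of diagnosis time removed from a high-impact outage is worth roughly $1 million.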
Get ahead of your customers
Another marker of technical maturity worth naming: do you always find out about problems before your customers and your business do? If your users are submitting support tickets before your monitoring has fired, your alerting strategy needs work. We’ve all been there, but reactive mode is not a good state to be in.
The gap between ‘we found out first’ and ‘the business told us’ is often just a matter of instrumentation decisions made earlier in the delivery lifecycle. This is why observability and alerting capabilities matter so much during the impact assessment phase of an incident: they shorten the mean time to detection. They also help you make informed decisions about the severity of the incident (SEV1 or SEV3?).
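If you record a few timestamps per incident, both questions (how long detection took, and who noticed first) become measurable. The sketch below assumes each incident record has started_at, alerted_at and an optional first_ticket_at field; those field names and the sample data are hypothetical, not taken from any particular tool.

```python
from datetime import datetime
from statistics import mean


def detection_stats(incidents):
    """Mean time to detection (minutes) and share of incidents customers reported first."""
    delays = []
    customer_first = 0
    for inc in incidents:
        delay = (inc["alerted_at"] - inc["started_at"]).total_seconds() / 60
        delays.append(delay)
        first_ticket = inc.get("first_ticket_at")
        # A ticket raised before the alert means the customer found out first.
        if first_ticket is not None and first_ticket < inc["alerted_at"]:
            customer_first += 1
    return {
        "mttd_minutes": mean(delays),
        "customer_first_ratio": customer_first / len(incidents),
    }


# Hypothetical incident records for illustration.
incidents = [
    {"started_at": datetime(2025, 3, 1, 10, 0), "alerted_at": datetime(2025, 3, 1, 10, 4),
     "first_ticket_at": datetime(2025, 3, 1, 10, 10)},
    {"started_at": datetime(2025, 3, 8, 14, 0), "alerted_at": datetime(2025, 3, 8, 14, 20),
     "first_ticket_at": datetime(2025, 3, 8, 14, 8)},
    {"started_at": datetime(2025, 3, 15, 9, 0), "alerted_at": datetime(2025, 3, 15, 9, 6),
     "first_ticket_at": None},
]

stats = detection_stats(incidents)
print(stats)
```

Tracking the customer-first ratio over time gives a simple signal of whether alerting is improving, alongside the more familiar mean time to detection.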
The technical foundation is the floor, not the ceiling
None of this replaces the human side of incident response (that’s Principle 3 – article coming soon!). But without the technical foundations in place, even the most prepared and psychologically safe team will be fighting with one hand tied behind its back. Get the systems right, and build the people capability on top.
Working on these foundations
Technical resilience isn’t a ‘one and done’ investment. It’s a set of ongoing decisions about how your systems are built, how they’re observed and how well your team understands them.
The post-incident review (PIR), or postmortem, process is the engine, or flywheel, for this practice. As a CTO, that’s where I look to ensure the actions and investment are built back into the system. That’s what enables the continuous development of technical resilience.
So – as CTO, how are you making sure that you’re converting your PIRs/postmortems into technical excellence?





