
It’s been two weeks since the first London Tech Leaders Summit, providing ample time to reflect on a fascinating couple of days in the company of nearly 300 of London’s most engaging technology leaders. The talks, covering topics as diverse as AI-powered cyber threats, unconscious bias training, and the evolution of the CTO role, were excellent. The panel sessions, discussing FOMO, regulatory challenges, and ‘wearing different hats as a leader,’ were great fun and genuinely illuminating.
In an era dominated by Zoom/Teams meetings and breakout rooms, the highlights for me were the roundtable sessions, where we were able to go deep on tough questions in person as a group.
From a resilience perspective, two roundtable questions in particular caught my interest. Describing them provides an opportunity to (re)cover some concepts from the world of resilience engineering that may be obvious to some and counterintuitive to others.
“Can you build tech that never breaks?”
The most succinct answer to this question is simply “no” (though this blunt answer may not have been conducive to such an engaging roundtable). Perhaps a more relatable formulation of this question might be: 'Can you build tech that isn’t always a little bit broken?'
Those of us with experience in developing and running real-world software tend to have an intuitive sense that complex systems run in degraded mode. This is to say that most software systems run in a perpetual state of being a little bit broken, requiring constant monitoring and tactical intervention from expert humans to remain stable. Furthermore, much of this effort tends to be invisible and therefore remains unacknowledged (and un-learned from) unless it’s explicitly sought out.
This is a good moment, if ever an excuse was needed, to reference Dr Richard Cook’s paper 'How Complex Systems Fail', which, though not originally written about software systems, might as well have been. For many, myself included, this paper has been the gateway to thinking more deeply about resilience.
Additionally, building tech that never breaks requires us to design solutions or mitigations for all possible failure modes. This, in turn, requires us to be able to imagine all possible failure modes. While many failure modes easily spring to mind (e.g., network, third-party, hardware failures) and we implement strategies to guard against them (e.g., failover, redundancy, backup), there remains an infinite number of ways in which systems of common commercial complexity can astonish us. While a DNS failure may be surprising, it is at the very least plausible, whereas a bug in an endpoint detection agent, causing businesses worldwide to experience the blue screen of death, would likely have been astonishing.
So given that the answer to the question “Can you build tech that never breaks?” is a pretty clear “no”, the next question might be, “Given that tech breaks and will continue to break, how might we adapt in a helpful way when it does?” Here lies the critical distinction between reliability and resilience.
And for those of you thinking about the invincible Nokia 3310 as a counterargument, can we just agree that it’s the exception that proves the rule?
“Can you effectively measure and communicate the financial and reputational costs of downtime?”
The next roundtable was equally engaging. Facilitated by the excellent folk of New Relic, we were initially challenged to guess the average ‘cost’ of an hour of downtime for companies in the UK and Europe. This question raises further questions about the nature of ‘cost’, which can be considered in purely financial terms, or can be extended to accommodate less directly measurable factors such as reputation, staff engagement and opportunity cost.
New Relic’s recently published 2025 Observability Forecast, which surveyed 1,700 IT and engineering teams, reports 'a $2M median cost per hour for a high-impact business outage across surveyed organisations'.
Of course, this is a large number, however one views it, and it lends a sense of urgency and importance to efforts to reduce and to respond effectively to incidents. It also raises questions about the usefulness of ‘average’ statistics, such as mean and median, in summarising non-normal data distributions. Those following the resilience engineering space over the past few years will be familiar with the dismantling of MTTR as a useful metric for measuring and tracking changes in incident response effectiveness, and average measures of impact suffer from the same statistical challenges. Nevertheless, this reality shouldn’t detract from the valuable insight that outages can be immensely impactful to the bottom line (as well as to less measurable organisational concerns).
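To make the point about averages concrete, here is a minimal Python sketch. The figures in `hourly_costs` are invented purely for illustration (they are not taken from the New Relic survey or any real incident data), and the helper name `summarise` is my own. It simply shows how a single extreme outage can pull the mean far away from the median in a skewed sample.

```python
from statistics import mean, median

# Hypothetical hourly outage costs in USD.
# Illustrative values only, NOT survey or incident data.
hourly_costs = [20_000, 35_000, 50_000, 80_000, 120_000, 200_000, 450_000, 9_000_000]


def summarise(costs):
    """Return the mean, the median, and the share of values below the mean."""
    avg = mean(costs)
    mid = median(costs)
    share_below_mean = sum(c < avg for c in costs) / len(costs)
    return {"mean": avg, "median": mid, "share_below_mean": share_below_mean}


summary = summarise(hourly_costs)
print(f"mean:   ${summary['mean']:,.0f}")
print(f"median: ${summary['median']:,.0f}")
print(f"share of outages below the mean: {summary['share_below_mean']:.1%}")
```

In this made-up sample the mean is more than ten times the median, and seven of the eight outages cost less than the 'average'. Neither number on its own describes what a typical outage looks like, which is the same weakness that makes MTTR a shaky yardstick.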
Thanks, London Tech Leaders
These deep, open conversations are what make events like these so valuable for tech leaders and practitioners. Many thanks to Luke Wilde, William Campbell, Robyn Davies, and David Crawford for making the inaugural London Tech Leaders Summit such a pleasure and such a success.



