
This is Principle #4 out of my 5 Principles for CTOs blog series.
When I think about measuring learning rather than just downtime, this principle really splits into two parts for me. They're inseparable, and as a CTO I'm accountable for both. There's also a third lens I want to bake into every article in this series from here on - the AI lens - which I'll come to at the end.
1. Create the space for a meaningful, blameless review
The first part is cultural. It's about setting up a blameless environment where the team has the space and the safety to do a genuinely deep post-incident review. Not a tactical paper exercise where we rush through, list a few actions and move on. I mean a proper investment in the review itself, so people can do an excellent job of it.
Some of our best decisions have come out of the most painful incidents - precisely because we took the time to pause, reflect and dig deep rather than skating across the surface. That is, we looked past the shallow causes and into the system, the assumptions and the conditions that produced the incident in the first place. Richard Cook's How Complex Systems Fail makes the perfect companion read on that same theme.
As a leader, my job is to start at the top with the word blameless and mean it. I'm signalling that it's safe to dig deep, and that fundamental gaps are welcome on the table. That might be underinvestment in monitoring and observability that we simply need to step up. It might be an architectural pattern that needs addressing. It might be a skills deficit inside the group. My job is to listen with an open mind and stay pragmatic about the actions that come out.
2. Sponsor and drive the actions back into your organisation
The second part is where a lot of organisations fall down. Once the learnings are extracted, somebody has to own them and push them through. That's me. As CTO, I'm actively looking at what we've learned from the incidents of the last month, the last three months - and, more importantly, how much of that we've actually actioned and put back into the product.
That work starts tactical and rolls into strategic. Tactically, it might be upskilling staff or improving a specific piece of technology. Strategically, it's the harder questions: is there an architectural pattern we need to address? Are we seeing the benefits of the technical strategic work we've already done? Are things actually getting more stable? It's a constant exercise of watching the activity feed back into the loop and making sure the actions are landing back into Product.
I treat this like being the product owner of the resiliency backlog. The feedback loop is the learning from incidents, and someone has to drive the change back in. That's a cultural aspect - continuous improvement - that I'm pushing from the top. Sometimes it means budget. Sometimes it means roadmap prioritisation, or fighting for prioritisation against competing demands.
I often hear, "We can't get product ownership or the business side to prioritise the resiliency pieces." When I hear this, I always go back and say that it's the job of the CTO or the senior technical leadership team to lobby for that work and get it prioritised.
You won't get it every time you ask, but over the long term you should see the items getting actioned, prioritised, budgeted and implemented. Fundamentally, that's the job of a CTO: making sure the actions are landing in the product backlog and getting into the sprints.
Often it means giving the work explicit focus and air cover so it doesn't quietly slip. It's on my agenda at the weekly, monthly and quarterly strategic level. If I'm not actively pulling those actions through, no one else will.
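One way to make the "how much have we actually actioned" question concrete is a small script over the review actions. This is only a sketch under stated assumptions: the ReviewAction shape, its field names and the sample data are all hypothetical, and in practice the records would come from whatever tracker holds your post-incident actions.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class ReviewAction:
    """One follow-up action from a post-incident review (hypothetical shape)."""
    title: str
    raised_on: date
    completed_on: Optional[date] = None  # None means the action is still open


def actioned_rate(actions: List[ReviewAction], today: date, window_days: int) -> Optional[float]:
    """Share of actions raised in the last `window_days` that are completed.

    Returns None when no actions were raised in the window, so that
    "nothing raised" is not confused with "nothing done".
    """
    cutoff = today - timedelta(days=window_days)
    recent = [a for a in actions if cutoff <= a.raised_on <= today]
    if not recent:
        return None
    done = sum(
        1 for a in recent
        if a.completed_on is not None and a.completed_on <= today
    )
    return done / len(recent)


# Made-up sample data, purely for illustration.
today = date(2024, 6, 30)
actions = [
    ReviewAction("Add alerting on queue depth", date(2024, 6, 10), date(2024, 6, 20)),
    ReviewAction("Run a failover drill", date(2024, 6, 15)),
    ReviewAction("Retire single-instance cache", date(2024, 4, 20), date(2024, 5, 30)),
    ReviewAction("On-call training for new joiners", date(2024, 5, 5), date(2024, 6, 1)),
]

last_month = actioned_rate(actions, today, 30)
last_quarter = actioned_rate(actions, today, 90)

print(f"Actioned, last 30 days: {last_month:.0%}")
print(f"Actioned, last 90 days: {last_quarter:.0%}")
```

The two windows mirror the monthly and three-monthly view described above. A single percentage is no substitute for reading the reviews themselves, but a falling number is a useful prompt to ask where the actions are getting stuck.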
3. A practical AI lens: listen, then close the loop on your tooling
Under this principle specifically, if you're introducing AI into your operations, the same feedback discipline applies - just with a new surface area to listen to.
When AI is in the loop on incident response, I want to know from the team what's actually working, and more importantly what isn't. What are the adoption problems? Where is the tooling falling short? Are people having to work around it, or compensate for what AI leaves behind? I take that feedback with an open mind and close the loop on it. Maybe there's more work to be done on the tooling side - and an opportunity presents itself to provide feedback to the tooling group. Maybe there's training needed to adapt to it. Maybe there are real gaps in the tool itself.
Either way, the learnings from AI-leveraged incidents are too valuable to leave on the table. We're literally changing the landscape of incident response, and every CTO should be focused on that.
Closing the resiliency loop
Put all of that together and you have a closed resiliency loop: create the space to extract real learnings from the incident, drive those changes back into the system, and keep listening to the team - including on how your AI tooling is performing and how your on-call staff are evolving - so the loop keeps improving. The CTO sits in the most important seat in that loop. No one else has the same combination of visibility, authority and accountability to make all of it happen.
If you don't do this work, it comes back to you anyway in the form of repeated severe incidents and persistent platform instability. So I see it as fundamentally a proactive move, not a reactive one. Every incident is an opportunity to make the platform a little harder to break next time. Measuring the learning, and acting on it, is how we cash in that opportunity.




