Turning Non-Prod Incidents into Resilience-Building Opportunities

Joe Mckevitt

March 3, 2025

If you couldn’t tell, we are obsessed with incident response at Uptime Labs. So when we have an incident of our own, we see it as an opportunity to practise what we preach.

I wanted to recap one of our most recent incidents, in which a team member accidentally deleted a critical resource in the login flow for our development environment.

Here’s how it unfolded and what we learned from it:

The Incident

The first sign of trouble came through our team Slack channel—a screenshot from a colleague reporting, “We can’t log in to dev.” Importantly, they also noted, “This is a dev-only issue; production is working fine.”

My immediate thought process kicked in:

Are we getting hacked?
Are we sure this isn’t affecting production?
Do we have this under control?

To quickly diagnose, I called out in Slack: “Hands up—does anyone know why this happened? Did anyone accidentally remove or change something?”

Within minutes, a developer owned up. They admitted, “Yep, it was me. I made a mistake.”

The Immediate Response

At this point, the situation was clear:

Not a security incident. I could downgrade the severity of the incident and focus on resolution.
Under control. It was a human error, not a systemic or malicious issue.

I immediately thanked the reporter for flagging the issue clearly and confirming it was isolated to dev. I also thanked the developer for their courage in owning the mistake.

The Learning Opportunity

Every incident—production or non-production—is an opportunity to improve. As a leader, I approached this as if it were a production incident:

Post-Incident Review (PIR): We treated this with the same seriousness as a production issue, analysing what went wrong and how to prevent it in the future.
Systemic Changes: We recognised that the developer’s actions were a symptom of a broader system issue. The focus wasn’t on blame, but on improving the system so that such errors couldn’t happen again.

The Culture of Blamelessness

It’s important to recognise, at least in my experience, that building a resilient organisation starts with fostering a culture of blamelessness: one where every team member feels safe to acknowledge risks, call out failures, and admit mistakes.

It’s not just a moral imperative but a practical one: when people are afraid to speak up, critical information gets lost, and the entire system suffers.

At the heart of a blameless culture is psychological safety. This means ensuring that team members:

Feel confident that raising concerns or reporting failures won’t lead to punishment or humiliation.
Understand that mistakes are not a reflection of personal incompetence, but opportunities to identify and address gaps in systems or processes.

Blame doesn’t solve problems; systemic improvement does.

Instead of asking, “Who is responsible?” focus on “What allowed this to happen?”
Recognise that mistakes are often the result of flawed processes, unclear expectations, or inadequate safeguards, rather than individual negligence.

In this case, we implemented technical changes to eliminate the possibility of the same mistake occurring again.
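
One common safeguard of this kind is a deletion guard: resources marked as critical cannot be removed unless the caller explicitly confirms the exact resource name. The sketch below is hypothetical and is not the change described in this post; `ResourceRegistry`, `ProtectedResourceError`, the `confirm_name` parameter, and the resource names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


class ProtectedResourceError(Exception):
    """Raised when a delete targets a protected resource without confirmation."""


@dataclass
class ResourceRegistry:
    # Hypothetical in-memory stand-in for whatever system owns the resources.
    resources: Dict[str, dict] = field(default_factory=dict)

    def add(self, name: str, protected: bool = False) -> None:
        """Register a resource, optionally marking it as protected."""
        self.resources[name] = {"protected": protected}

    def delete(self, name: str, confirm_name: Optional[str] = None) -> None:
        """Delete a resource; protected ones need confirm_name == name."""
        resource = self.resources[name]  # raises KeyError for unknown names
        if resource["protected"] and confirm_name != name:
            raise ProtectedResourceError(
                f"'{name}' is protected; pass confirm_name='{name}' to delete it"
            )
        del self.resources[name]


if __name__ == "__main__":
    registry = ResourceRegistry()
    registry.add("dev-login-client", protected=True)  # hypothetical name
    try:
        registry.delete("dev-login-client")
    except ProtectedResourceError as err:
        print(f"Blocked: {err}")
```

The design choice here is to make the safe path the default: an accidental delete fails loudly, while a deliberate one remains possible with an extra, explicit step.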

Key Takeaways

Even non-prod incidents matter. Treat them as seriously as production incidents to drive meaningful improvements.
Always learn. Every incident is a learning opportunity to refine systems and processes.
Foster a blameless culture. When people feel safe to speak up, the whole organisation benefits.
If you are a leader…call it out. When you see someone embodying the blameless culture ethos—whether flagging a risk or admitting their mistakes—acknowledge it. Over-communicate this point: celebrate the courage, share an internal memo or go a step further and write a whole blog post about it 😉. Whatever it takes to highlight the importance of these behaviours in your organisation.

This incident may have been small, but it’s helping us build a stronger, more resilient organisation. Never waste a good incident—even if it’s not in production.

Joe Mckevitt

Joe is the co-founder and CTO of Uptime Labs. A passionate technologist and developer, he has 17 years’ experience in building and scaling high-performing products and teams. Also a marathon runner, he’s wired for high performance. He loves creating cultures of constant innovation, and coaching people to develop their full potential.
