Beyond "5 Whys": A Better Way to Learn from Incidents

Hamed Silatani | March 12, 2025
Tags: Best Practices, Blog, Incident Management

Disclaimers Before We Begin

  1. Nothing I mention here is my original work. I’ve read, studied, and borrowed from experts in the field of safety. The only thing that is original is my own personal experience.
  2. I’ve practiced and advocated for “5 Whys” for many years. You may find material online from me in support of it—which is fine, and I’m proud of it. It’s just that after all these years, I’ve gained new insight. I hope you find it helpful.

The Purpose of Post-Incident Reviews

We can all agree that the most important purpose of a post-incident review (or post-mortem) is to learn from incidents. Implied in this learning is improving the system (people, processes, technology, and their interactions). All my reflections on the “5 Whys” technique refer back to whether or not the technique enhances our learning from incidents.

What I Love About the “5 Whys” Technique

1. It Feels Intuitive and Easy to Explain

It’s intuitive and very much in line with human nature, how we like to think about things. Therefore, it’s very easy to advocate for it and get buy-in from managers and fellow practitioners. It makes sense, right?

2. It’s Simple to Implement

It’s very simple to implement and start. I even had a template and script for engineers and incident managers to follow. Ask these, document the answers, post-incident review done! Who said post-incident reviews are hard?

Why I Stopped Using "5 Whys"

After a few years of running with “5 Whys”, I started to have doubts when I noticed I wasn’t getting the result I was hoping for: learning from incidents. This was when I started to ask “Why” about “5 Whys” and to study learning from incidents a little more seriously. Admittedly, the literature available was way above my intellectual capacity at the time, so I’ll try to explain using a personal experience. If you are thirsty for real science, you can start with John Allspaw’s post “The Infinite Hows (or, the Dangers Of The Five Whys)” and the “before you do a 5 WHYs root cause analysis” video from Steven Shorrock.

1. The Path of Discovery Was Too Narrow

Even though each "Why" could have more than one path to follow, the “Why” question itself implied that what I was asking “Why” on was at fault, a mistake, or something that should not have happened. This approach doesn’t help describe what happened—it attempts to explain what happened, which naturally introduces bias.

This is not helpful, because incidents can still happen even when everything works as designed and as expected. If you follow this path, at best you cater for one more specific scenario. It does not mean that you understand the circumstances around the given incident.

2. The Path of Discovery Was Predetermined

It felt like we were trying to justify what we thought was at fault. The first "Why" question gives an overall direction to what is noted as an explanation. Often, people mentally think several “Whys” ahead before constructing the first “Why” question.

For example:
“Why didn’t automated QA catch this issue?” → I’ve already decided that our QA process is insufficient. How am I going to learn something new?

3. "Why" Questions Can Sound Accusatory

In coaching, we tend to avoid questions framed with “Why?” because:

  • It can sound accusatory and lead to defensive responses
  • It doesn’t encourage open exploration or self-discovery
  • It tends to invite rationalisation rather than reflection; and rationalisation tends to invite people to reference their existing beliefs about what constitutes “good performance”, rather than reflecting on what actually happened.

Real-World Example: A Post-Incident Review with “5 Whys” vs. Open-Ended Questions

Let me go through a real-life example. I’ve abstracted some details (for obvious reasons) but tried to stay specific enough to communicate the point:

The Incident: A 4-Hour Outage & Degraded Service

Incident: “We had a 4-hour complete outage followed by several days of degraded service.” It was terrible! Not cool. The issue was half-noticed by an engineer on Monday morning when he was doing his usual curious checks and poking around.

How The "5 Whys" Would Frame the Incident:

WHY 1 – Why did users get “service unavailable”?

  • The internal DNS was messed up, and services couldn’t find each other.

WHY 2 – Why was the internal DNS messed up?

  • A deployment the night before (Sunday).
  • The wrong config was loaded to the K8s DNS service.
  • It wasn’t QA’d properly.

WHY 3.1 – Why would someone do a Sunday night deploy?

  • Because they wanted to deploy their work they finally completed over the weekend.

WHY 3.2 – Why didn’t the QA process catch it?

  • The engineer thought his own smoke test, plus the standard automated regression tests, would be enough.

Key Issue with “5 Whys”:

No matter how far we push the "Whys", the learning and actions will almost certainly revolve around improving QA and a better deployment discipline (whatever “better” is). There is value in both actions, but this approach misses massive learning opportunities!

Open-Ended Questions Instead of "5 Whys"

Now let’s go through how we ended up doing the post-incident review, and the impact that changing the questions had. (I’ll only share a few examples to illustrate the point; we don’t have space for all of it.)

I spoke to people involved separately for the initial chat, building a picture by avoiding "Why" and using “How” and “What” questions instead.

1. Learning from the Engineer Who Noticed the Issue

Me: How did you notice there was a problem?
Engineer 1: I was just doing my usual checks on Monday—it’s a habit.

Me: Interesting, so do we do these checks every morning?
Engineer 1: No, it’s a me-thing. A habit from my last job.

Bingo! How many incidents has Eng 1 prevented in the past without knowing? Is this something everyone should be doing?

2. Learning from the Engineer Who Deployed the Change

Me: What were you expecting to happen after triggering deployment?
Engineer 2: We do this all the time, so I expected my small changes to go through automated tests, get deployed, and I’d run a quick smoke test.

Me: What was surprising to you?
Engineer 2: My initial smoke tests were fine, but by the morning, things progressively got worse. Also, I accidentally deployed a bunch of other changes.

Bingo! Two crucial insights:

  1. Unintended changes were deployed
  2. The issue worsened over time instead of appearing immediately

Me: Have you ever had an issue with deploying unintended changes before?
Engineer 2: Not at this company, but it happened ages ago when a big project sat in a separate branch for too long before release.

Another Bingo! We’re learning about deployment practices, not just “fixing QA.”

Key Takeaways: Why Open-Ended Questions Work Better

I'll stop here and let you judge for yourself the difference in the richness of the learning in the second case, where I avoided “Whys”. On the back of this incident, we made some big changes in the way we work:

  • Morning checks – Encouraging engineers to perform habitual system checks
  • Shared knowledge on DNS & cache issues – Internal posts and discussions
  • No more big platform changes on separate branches – Weekly deployments instead
  • Reframing deadlines & estimates – Actively reinforcing that it’s OK to ask for help when estimates are off
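
To make the “morning checks” idea a bit more concrete, here is a minimal sketch in Python of one possible habitual check: confirming that a list of internal service names still resolves. It is an illustration only, not what our engineer actually ran. The function names check_dns and failed_lookups are hypothetical, and so are the hostnames in the usage note. Because this incident got progressively worse rather than failing at once, a check like this may be more useful when repeated than when run once straight after a deploy.

```python
import socket
from typing import Callable, Dict, Iterable, List


def check_dns(
    hostnames: Iterable[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, bool]:
    """Map each hostname to True if it resolved, False if the lookup failed."""
    results: Dict[str, bool] = {}
    for name in hostnames:
        try:
            resolve(name)
            results[name] = True
        except OSError:
            # socket.gaierror (name resolution failure) is a subclass of OSError.
            results[name] = False
    return results


def failed_lookups(results: Dict[str, bool]) -> List[str]:
    """Return the hostnames that did not resolve, sorted for stable reporting."""
    return sorted(name for name, ok in results.items() if not ok)
```

For example, failed_lookups(check_dns(["orders.internal.example", "payments.internal.example"])) would return whichever of those made-up names no longer resolve. The resolve parameter is there so the check can be exercised with a stand-in resolver, without touching real DNS.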

If the goal of post-incident reviews is learning, then “5 Whys” limits that learning by narrowing discovery, reinforcing biases, and preventing open exploration. By shifting to open-ended, reflective questions, we unlock more meaningful insights and systemic improvements.

Would love to hear your thoughts—what’s your experience with “5 Whys” in post-incident reviews? Drop a comment below! 👇

Hamed Silatani

Hamed is the co-founder and CEO of Uptime Labs. He has 20 years of experience in engineering leadership, reliability engineering and IT operations. Having spent the majority of his career at the sharp end of incident response in financial services, he's looking to help all companies master the unexpected.
