What Makes a Great Incident Simulation Scenario (and What Doesn’t)

Karan Nagarajagowda | February 11, 2026
Tags:
Best Practices
Blog
Ready to make incident response your competitive advantage?

See how Uptime Labs builds provable, scalable incident response capability across your organisation.

Karan was a Staff SRE at IG Group and has also worked at Morgan Stanley, Credit Suisse, Fidelity and Tata Consultancy Services. He is now a Senior Customer Success Engineer at Uptime Labs.

Real World Incidents vs. Simulations

Q: Are there any real-world incidents that don’t translate well to simulations?

Karan:
I don’t think there are incidents that can’t be simulated. Theoretically, you can convert any real incident into a simulation.

The challenge is that a simulation needs to be a comprehensive learning experience. We want incident managers to experience the variety of emotions that happen during an incident.

For example, one application going down and a restart fixing it is very common, and we can simulate it, but it might not deliver as much learning. It could work as an introductory simulation.

The bigger challenge is time. We typically want simulations to run around 40 minutes. Take ransomware: at one real company it took four and a half months to resolve. Compressing those learnings into 40 minutes is hard - especially because it can take months just to understand the ‘blast radius’ and how deep the attack has penetrated.

But apart from time constraints, I’d say incidents are generally simulable.

Simulations Aren’t in Real Time…

Q: Where do you compress time when planning simulations?

Karan:
It depends on the simulation. We interview people who were in the incident and map out how it started, how it was discovered, when mitigation started and what the steps were.

We look at where teams spent the most time, and where they learned the most. That’s what we don’t want to compromise on. The learning is the bread and butter.

We compress time where little of consequence is happening - where nothing really moved. And we keep time for the phases that matter: diagnosis, understanding where the issue is and forming a resolution strategy.
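
As a rough illustration of that trade-off (not Uptime Labs’ actual tooling), the sketch below squeezes a long incident timeline into a 40-minute budget. The phase names, durations, the learning flag and the two-minute allowance for quiet phases are all hypothetical; only the 40-minute target comes from the interview.

```python
BUDGET_MIN = 40  # target simulation length mentioned in the interview

# Hypothetical real-incident timeline: (phase, real minutes, carries_learning)
TIMELINE = [
    ("detection", 30, False),
    ("diagnosis", 240, True),
    ("waiting on a vendor fix", 600, False),
    ("forming a resolution strategy", 100, True),
    ("rollout", 90, False),
]


def compress(timeline, budget=BUDGET_MIN, quiet_min=2):
    """Give quiet phases a token slot and share the rest among learning phases."""
    quiet_count = sum(1 for _, _, learning in timeline if not learning)
    learning_total = sum(mins for _, mins, learning in timeline if learning)
    remaining = budget - quiet_min * quiet_count
    plan = {}
    for name, mins, learning in timeline:
        # Learning phases keep their relative proportions; quiet ones are cut short.
        plan[name] = remaining * mins / learning_total if learning else quiet_min
    return plan


plan = compress(TIMELINE)
for phase, minutes in plan.items():
    print(f"{phase}: {minutes:.0f} min")
```

With these made-up numbers, the ten-hour wait shrinks to two minutes, while diagnosis and strategy keep 34 of the 40 minutes between them.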

…And This Helps Them Feel Immersive

Q: So compressing time can actually help recreate the pressure of real incidents?

Karan:
Exactly. And another big thing people complain about is information overload.

There’s the cause of an incident, and then there are symptoms. One application might show the cause; others show symptoms. People often mistake symptoms for the cause.

For example, one application slows down and that delay cascades downstream. After three or four hops, the last application shows the biggest deviation, and teams assume that is the problem. But the underlying issue is the tiny delay earlier on.

During incidents you get multiple engineers feeding in different errors and theories. Collecting that information and forming a strategy is hard - so we simulate that overload too.
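
The cascading delay Karan describes can be sketched with a toy model. Everything here is an assumption made for illustration: the four service names, the 10 ms baseline, and the rule that each hop doubles the latency it inherits from upstream (a stand-in for retries or queueing).

```python
# Hypothetical chain of applications, ordered upstream -> downstream.
SERVICES = ["ingest", "pricing", "orders", "reporting"]


def observed_latencies(own_ms, amplification=2):
    """Latency seen at each hop: its own work plus amplified upstream latency."""
    observed = []
    upstream = 0
    for own in own_ms:
        upstream = own + amplification * upstream
        observed.append(upstream)
    return observed


baseline = observed_latencies([10, 10, 10, 10])
incident = observed_latencies([15, 10, 10, 10])  # only "ingest" is 5 ms slower

deviation = [now - before for now, before in zip(incident, baseline)]

# The loudest symptom is the hop with the biggest deviation...
loudest = SERVICES[deviation.index(max(deviation))]
# ...but the likely cause is the most upstream hop that deviates at all.
earliest = next(name for name, d in zip(SERVICES, deviation) if d > 0)

print(dict(zip(SERVICES, deviation)))
print("loudest symptom:", loudest, "| likely cause:", earliest)
```

In this model a 5 ms slowdown at the first hop shows up as a 40 ms deviation at the last one, which is why the final application tends to get the blame.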

The Difficulty in Designing a Cyber Incident

Q: How does the rise of cyber incidents influence which simulations you build?

Karan:
With normal incidents, you’re fighting your own system - something’s broken and you fix it. In cyber incidents, you’re also fighting an active adversary - someone intentionally trying to bring you down. This can add an extra degree of complexity.

In real life, people often assume Security Ops is responsible. But that’s not true. Developers, SREs, platform engineers - everyone has responsibility to protect the system.

Cyber simulations help teams learn that. They encourage Security Ops to work hand-in-hand with the rest of the organisation, not only to get out of the threat but also to protect the system in the future. That’s why we focus more on cybersecurity simulations.

Tabletops vs. Simulations

Q: If someone could run a tabletop (a discussion-based session), why choose a simulation instead?

Karan:
It’s like playing video-game cricket versus real cricket. Being good at the video game doesn’t necessarily translate into being able to play in real life.

In simulations, you get the same adrenaline you feel during a major incident. Same pressure, same need to form working theories quickly. You respond to what comes in; you don’t have time to sit and think.

A simulation is like a practice match: it doesn’t matter whether you win or lose, but it’s a great chance to practise. Tabletop exercises are discussions; you talk about what could have been done, but you’re not in the real situation.

That’s why I’d prefer simulations over tabletop exercises any day.

Old School Simulations

Q: So when you were leading SRE teams, would you have chosen simulations if the technology existed?

Karan:
Exactly. We used to have production, and then a lower environment (UAT/demo) that was similar. We’d intentionally run simulations there - take 40–45 minutes, set a deadline and resolve it within that time.

It helped me and the whole team prepare for bigger incidents. However, there was a lot of admin work to bring down an environment, and there was a risk of disruption for some users who were using the platform.

Making Meaningful Performance Stats

Q: How do you design simulation reports so they help people improve, rather than just presenting stats?

Karan:
There’s the report itself - and then reflection: What happened, what surprised me, what did I find difficult, what was easier than expected?

One of my managers used to say, “The actual incident starts once you close the incident.” Meaning resolution isn’t the end - learning is.

In real incidents, postmortems often become timelines for management or a way to assign blame. We want to remove that and build a blameless culture.

Our reports focus on what the incident manager could have done to move things faster - reduce time to recovery - and build muscle memory for the next incident.

We break reporting into four areas:

  • Sizing the issue (understanding scope)
  • Incident mechanics (admin work: priority, descriptions, etc.)
  • Communication (how they coordinated with stakeholders and engineers)
  • Resolution strategy (the plan they formed to resolve)

We enable the user to reflect on how they did and what they might do differently. That’s where learning comes from.

Building Simulations is Fun

Q: What’s the best part of designing a simulation?

Karan:
For me, it’s alpha testing.

We bring in SMEs. If it’s a cyber simulation, we use cybersecurity experts - four or five people with different strengths. That’s when we uncover things we never expected.

I might design the simulation one way, but seeing how players interpret it, and the feedback they give, is the best phase. Often we realise the simulation is giving more learning than we intended, and that’s fulfilling.

What Exactly Makes a ‘Bad’ Simulation?

Q: What makes a simulation ‘bad’? Boring - or unrealistic?

Karan:
There are two perspectives:

From the player’s perspective, a bad simulation is one that’s straightforward and easily resolved. Players want complexity: multiple working theories, real challenges.

From our perspective, a bad simulation is one where we had learning objectives, but the simulation didn’t deliver them. If the player resolves it without having to learn what we intended, that’s a bad simulation.

We also shouldn’t make every simulation complex. Some simulations target specific skills: sizing up the issue, communication, forming working theories. A mix of those - plus occasional complex simulations - is how you see players evolve.

But Good Simulations Look Like This…

Q: What else makes a good simulation?

Karan:
Good simulations have milestones. With a 40-minute time pressure, not every player will finish - but we try to design so they reach the milestones and get the intended learning.

The hardest part is designing:

  • what milestones we want them to reach
  • what learning belongs in each milestone
  • what signals we give to guide them
  • what red herrings we include (because real incidents include confusion too)

That’s what makes a simulation effective: reaching the milestones and cooking the learning into each stage.
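
One way to picture those design elements is as a small data structure. This is only a sketch under assumptions, not how Uptime Labs models scenarios: the `Milestone` class, its field names and the example content are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Milestone:
    name: str
    learning: str                                           # intended takeaway
    signals: List[str] = field(default_factory=list)        # clues that guide
    red_herrings: List[str] = field(default_factory=list)   # deliberate noise


def learning_delivered(milestones, reached):
    """Return the takeaways covered by the milestones a player reached."""
    return [m.learning for m in milestones if m.name in reached]


# Hypothetical scenario with three milestones.
scenario = [
    Milestone("size the issue", "scope the impact before acting",
              signals=["error-rate alert"],
              red_herrings=["unrelated deploy notice"]),
    Milestone("find the cause", "separate cause from symptoms",
              signals=["memory growth on one service"]),
    Milestone("agree a plan", "form a resolution strategy",
              signals=["engineer proposes a rollback"]),
]

# A player who ran out of time after the second milestone.
covered = learning_delivered(scenario, reached={"size the issue", "find the cause"})
missed = [m.learning for m in scenario if m.learning not in covered]

print("covered:", covered)
print("missed:", missed)
```

Tying each takeaway to a milestone rather than to the final resolution means a player who does not finish in time still leaves with the learning from every stage they reached.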

One of the best things I heard recently was from someone who played a simulation about a memory issue. The very next day, they had the same issue in production. They resolved it in 15–16 minutes because they recognised the signals.

That’s when we feel proud of what we’re doing. That’s the ingredient we want in every simulation.

Karan Nagarajagowda

Karan is a Senior Customer Success Engineer at Uptime Labs, designing and building the platform’s realistic incident simulations. Before joining Uptime Labs, he spent 14 years on the front line of major outages – leading response teams at Morgan Stanley, Credit Suisse, Fidelity, IG Group and Tata Consultancy Services.
