Karan Nagarajagowda, Senior Customer Success Engineer at Uptime Labs, designs and builds the platform’s realistic incident simulations. Before joining Uptime Labs, he spent 14 years on the front line of major outages - leading response teams at Morgan Stanley, Credit Suisse, Fidelity, IG Group and Tata Consultancy Services.
Miranda Hartley is Uptime Labs' editor.
Here, Karan talks big bank incidents, blame culture and designing the most realistic possible drills.
'Turning experience into teaching and helping people'
Miranda Hartley (MH): Let’s start at the beginning. When did you join Uptime Labs, and what drew you to the role?
Karan Nagarajagowda (KN): After 14 years of incident-response work at banks like Morgan Stanley, Credit Suisse and Fidelity, handling outages had become muscle memory. I liked the idea of turning that experience into a product that can teach and help people.
'I suddenly realised I knew nothing about the process'
MH: Your first incident back in banking sounded daunting. How did that shape the way you now teach new responders?
KN: During my very first incident call, a senior engineer handed me the phone - ‘Here’s the alert, escalate it’ - and I suddenly realised I knew nothing about the process. That moment kicked off a five-year learning curve in enterprise support.
Coming out of university, I’d assumed IT was just coding. I soon discovered the full ecosystem: support, QA, product, even marketing - all coordinating under intense time pressure when things break. It took those years to grasp how each team fits together and why incident response is a distinct, fast-moving discipline in large organisations.
I’m passionate about compressing that curve: if I can distil those years into a 90-minute briefing, a newcomer can start contributing almost immediately.
'The early years are the foundation'
MH: That’s an interesting point - the Uptime Labs site promises ‘years of incidents in ten days.' How does that work?
KN: If participants are ready for the intensity, gaining five to six years’ worth of hard-won experience in that time really is achievable. Those early years are the foundation; recreating them quickly is what matters.
'Anything that can fail will fail'
MH: UK retailers keep hitting the headlines with outages. What’s your take?
KN: I’m a firm believer in Murphy’s Law: anything that can fail will fail. Instead of trying to stop every incident, focus on recovery and on ensuring the same failure never happens twice.
'‘Burnout’ wasn’t even in our vocabulary'
MH: But surely repeated incidents or failures are inevitable, if the fix isn't sufficient the first time?
KN: Absolutely. We used to see the same failure fire over and over. At one bank, a single underlying defect triggered more than 120 separate outages. Each time it happened, we logged an incident and linked it back to the same root-cause ‘problem’ ticket, so the record eventually showed a long chain of more than 120 linked incidents.
Why did it keep repeating? The product really needed a full redesign, but the development team couldn’t tackle that overhaul straight away. Meanwhile, we lived with a stop-gap: every time the issue surfaced, we applied a well-worn workaround, restored service, and moved on - knowing it would break again.
That pattern is common in large organisations running legacy platforms. For example, at one organisation I worked for, we were still supporting code written in the 1960s and ’70s. It worked, so the business kept it alive, and the ops team kept firefighting until a proper rebuild could be funded.
'First-responder life is like firefighting'
MH: Repeated incidents sound exhausting. How do you avoid burnout?
KN: First-responder life is like firefighting — you don’t control how many blazes ignite on your shift. Early in our careers, we simply powered through 13-hour incident marathons; 'burnout' wasn’t even in our vocabulary. Today, I coach teams to finish the job, then truly switch off when the pager is silent.
'The expectation was to handle every single incident on time'
MH: And what about blame culture?
KN: In my first job, we’d get 50 to 60 incidents a day - yes, 50 to 60 - and the expectation was to handle every single one on time. As the first responder, I had only five minutes to look at each incident and decide: do I escalate it to Level 2 or Level 3, or do I follow the existing workaround, finish the job, and close the ticket?
I was expected to give every single one the same level of detail in the escalation log. If I missed even one, my manager - five or six desks behind me - would stand up and yell across the floor: ‘Karan, you’ve got an escalation on your name!’ and the whole floor would hear.
When I later became a lead, I set things up differently. In my team, we never point at the person - we point at the system. If a human can make a mistake, it’s because the system lets them. So every post-mortem starts with: 'How did the system allow this?' and ends with changes to make it resilient enough that the same error can’t happen again.
'In my favourite drill, players must dismiss decoys quickly'
MH: Which incident drill are you proudest of?
KN: ‘Discount Disaster.’ On the surface, it’s a DDoS attack, but I seeded red herrings - a flash-sale traffic spike and an internal load-test - so the same symptoms could stem from different causes. Players must dismiss those decoys quickly before involving security.
'I had to Google the term mid-incident'
MH: For newcomers, that sounds intimidating. How much experience do they need to tackle an incident drill like this?
KN: I first met a real DDoS after 12 years in the field and had to Google the term mid-incident. If someone tackles our drill in month two of their career, they’ll at least recognise the pattern next time and won’t waste precious minutes searching for definitions.
'I’ll grab a single word and improvise'
MH: How do you translate an idea, like DDoS, into a finished drill?
KN: We keep a #drill-ideas channel. I’ll grab a single keyword - say, ‘log-rotate’ - and improvise a text-only scenario with the team. Every run produces new questions; after countless iterations, we package the narrative, add telemetry and ship it to alpha testers. More iterations follow until the drill flows naturally.
'My drills give everyone a chance to ‘bat, bowl and field’'
MH: What makes a good finished drill?
KN: It must force the player to show every core skill - communication, triage, escalation, and technical troubleshooting. Think of cricket: a player judged only on bowling is unfairly labelled a poor batsman if he never gets to bat. My drills give everyone a chance to ‘bat, bowl and field.’
'Measurement must follow opportunity'
MH: Which upcoming product features excite you most?
KN: Richer reporting - today we score 40 behaviours; I want 80. A report should mirror how a manager sees their engineer in real incidents. Secondly, I’m refining drill design so those behaviours are actually observable; measurement must follow opportunity.
‘Never let the same failure beat you twice - everything else is secondary to that’
MH: Finally, what’s the single takeaway you hope every responder leaves with?
KN: Connect the dots, see the bigger picture and never let the same failure beat you twice. Everything else — tools, titles, even years of service — is secondary to those habits.
Interview edited for clarity and length.




