Baptism Of Fire - The Story of Every On-Call Engineer

Hamed Silatani | September 2, 2025
Tags: Best Practices, Blog, Incident Management

Ready to make incident response your competitive advantage?

See how Uptime Labs builds provable, scalable incident response capability across your organisation.

I was six months into a new job as an engineering team lead. After two months of shadowing experienced colleagues on incidents (though only during office hours), my time had finally come to take the BlackBerry.

(I’m revealing my age, but for the younger generation, think of it as the Stone Age equivalent of PagerDuty.)

I was nervous but felt ready. Over the course of two months of shadowing, I had the chance to observe and learn how experienced engineers respond to incidents, or at least I thought I had learned.

In my first week of support, every evening as I lay down to sleep, I stared hopelessly at the BlackBerry at my bedside, thinking that it was a matter of when, not if, it would buzz. Monday and Tuesday were pretty much uneventful, and on Wednesday, around 6am, the inevitable happened. It was an alert followed by a call from our Australian office: “We’ve a big problem, the platform is unresponsive.” Thankfully, by the time the call ended, it was all good again.

(For context, the platform was a web-based financial trading platform for retail traders, who get really angry when they can’t access their trades.)

I thought I had been lucky with my first real incident, but I immediately realised that whenever I had shadowed an on-call colleague, it had been during office hours with loads of support around. This time I was on my own, at an hour too unsociable to wake up others. The feeling of being alone and facing a potentially serious issue that I had never seen before was terrifying. When shadowing an expert, the environment and conditions in which a trainee experiences incident response are very different from those the trainee will face when running an incident alone. The first shortcoming of shadowing!

The next step was to send an incident update email (we did not have an ITSM tool at the time; all incidents were recorded on a Wiki). The moment I started to write an update, I realised that although I had seen many incident communications while shadowing, I had never written one myself at this company.

It always looked relatively straightforward when the person I was shadowing wrote these. They never spent much time working out the format and content; they just wrote it. Now I know that expertise can mask challenges: writing comms is not easy, but they were experts at it and made it look easy. It’s hard for a trainee to notice expertise just by watching an expert at work. The second shortcoming of shadowing!

After studying the previous incident comms, I finally managed to write mine. I double-checked that it was going to the correct distribution list, the timings were correct and I made it clear that the disruption was over and service was back to normal. I also described the impact as I understood it.

As soon as I triumphantly clicked the ‘Send’ button, the BlackBerry started buzzing, and alerts came back. I couldn’t believe that the exact same issue was happening again. And I immediately felt stupid for sending the email!

The same person from the APAC office called me, and by the time the call was over, the system was back up again. Then my boss called, very annoyed, because he had just had a call from the CTO. His first two questions were: for such an important issue, why didn't you call me? And why did you send an email saying the issue was resolved when you clearly didn’t know what was going on? He went on to explain that he never sends a resolution email until he fully understands what caused the issue and is confident that the cause has been addressed.

This conversation reveals the third shortcoming of shadowing: while shadowing, I saw what the expert did, but I did not have privileged access to his mind to understand how he made decisions and why he did what he did so quickly. In everyday life, this is a pretty obvious point: you can never learn to drive a car by sitting in the passenger seat and observing. It’s impossible to develop skills, and later expertise, without practising!

Obviously, organisations often use shadowing as the first step in training new on-call staff. They often begin by letting trainees handle incidents during office hours, when an expert is available to step in and correct mistakes. I had a couple of weeks like that, too. In reality, though, when a Sev 1/2 happens, the expert immediately, and rightly, takes over running the incident, because the goal is to restore the service as soon as possible. The side effect is that the trainee's skills development is limited.

To give credit to my employer at the time, they invested considerably in preparing engineers for their support week, mainly because the impact of service disruption is huge for financial services companies that process a large volume of transactions. It extends far beyond financial considerations (which are in the order of thousands of pounds per minute); it also has a regulatory and reputational impact.

We had classroom training in the first week of joining, followed by monthly fire drills, tabletop exercises (similar to incident lunches), and a months-long shadowing program. Yet, the support week was dreadful for all engineers. Going into support week after five weeks of pure development focus meant Fishy Skills were in a rotten state.

The core of the problem is industry-wide. It amazes me that after 15 years, methods of training for incident response have not evolved much, and every organisation tries to solve it in isolation. And I’ve yet to meet an engineer who doesn’t nod when I say incident response is the hardest job in IT!

This is astonishing when you think about it. The hardest and most critical job in the IT industry (IT systems are all mission-critical these days) does not have structured and effective training. I’ll let this sink in a little.

[Image: “Let that sink in” pun on the difficulty of incident response]

In reality, until now the only way to learn how to respond to incidents has been to run real-life incidents, which is immensely stressful, expensive and takes years (unless you receive a Sev 0/1 incident every week). Uptime Labs simulations offer, for the first time, an alternative that is fun, costs less, is a significantly faster way of learning, and is more effective in developing skills, as you can objectively track your skills development.

I’m not arguing that staged world simulations, like those offered by Uptime Labs, replace the other methods used today to train folks on incident response. My point is that the real skills development happens when the trainees actually do the work in a real-world environment (which is expensive, stressful and takes years) or a simulation environment that is realistic enough to exercise the exact same muscles that are needed in real-life incidents (this environment in the academic world is called the Staged World). You can easily experience it first-hand by clicking ‘Try it for free’ in the top right corner.

The incident in the story is beyond the scope of this post, but for readers who need closure, it was related to a lengthy Java Garbage Collection, which was caused by insufficient memory allocation. A seemingly small and non-critical application was regularly suffering from lengthy garbage collection, which meant that all incoming API calls to the application would hang as well. This caused all applications with synchronous API calls to this ‘non-critical’ application to stall. You can imagine how quickly the rest of the dominoes fell.
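For readers who like to see the mechanics, here is a minimal Python sketch of that failure mode. It is an illustration only, not the code or the fix from the incident: `small_service`, `blocking_call` and `call_with_timeout` are hypothetical names, and a `time.sleep` stands in for the long garbage collection pause. The first caller waits synchronously and is stuck for the whole pause; the second puts an upper bound on how long it is willing to wait and falls back to a degraded answer.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def small_service(pause_seconds: float) -> str:
    """Hypothetical stand-in for the 'non-critical' application.

    The sleep simulates a long pause during which the application
    answers no incoming requests.
    """
    time.sleep(pause_seconds)
    return "ok"


def blocking_call(pause_seconds: float) -> tuple[str, float]:
    """Synchronous caller: it is stuck for as long as the service is paused."""
    start = time.monotonic()
    result = small_service(pause_seconds)
    return result, time.monotonic() - start


def call_with_timeout(
    executor: ThreadPoolExecutor, pause_seconds: float, timeout_seconds: float
) -> str:
    """Caller that gives up waiting after timeout_seconds.

    Note: the timeout only frees the caller. The worker thread stays busy
    until the simulated pause ends, so this limits the blast radius rather
    than fixing the slow service itself.
    """
    future = executor.submit(small_service, pause_seconds)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return "degraded"


if __name__ == "__main__":
    result, elapsed = blocking_call(0.5)
    print(f"blocking caller got {result!r} after {elapsed:.2f}s")

    with ThreadPoolExecutor(max_workers=2) as pool:
        print("bounded caller got", call_with_timeout(pool, 0.5, 0.1))
```

Whether a timeout, more memory, or an asynchronous call is the right answer depends on the system; the sketch only shows why one paused application can freeze every caller that waits on it synchronously.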

Hamed Silatani

Hamed is the co-founder and CEO of Uptime Labs. He has 20 years of experience in engineering leadership, reliability engineering and IT operations. Having spent the majority of his career at the sharp end of incident response in financial services, he's looking to help all companies master the unexpected.
