A Brief Walkthrough of the History of Incident Response (and the Need to Adapt)

Hamed Silatani
|
September 18, 2025
Tags:
Best Practices
Blog
Incident Management

Ready to make incident response your competitive advantage?

See how Uptime Labs builds provable, scalable incident response capability across your organisation.

I’m at a stage of life where I have to think about my son’s secondary school and tutoring (the fun that parents in the UK are exposed to - Scandinavia, you are missing out big time!). I was amazed to realise that online tutoring has become the norm; 5 years ago, that was not a thing.

It’s a reminder that innovation comes from needs, and it also creates new needs - both technical (child-safe computers and applications to deliver the lessons and interact) and skills (my son now needs to learn how to stay focused in front of the screen and interact with a remote tutor and submit answers on an application). It’s an endless loop that has been gaining pace. Tutors who adapted quickly are making a killing. On the other hand, those who are less adaptive are less profitable.

So how does all of this link to incident response? I found myself connecting the need for online tutoring to my subject of passion - incident response - and asked myself, “Why do I desperately see a need to radically change how we train for on-call in our industry?” Why was I OK with it in 2005, and why am I not OK with it now (in fact, not OK to the point of setting up a whole business around it)?

I thought I would try to look at the question through the lens of the circular dependency between need and evolution, where one gives birth to the other. The pace of change in the IT industry has been gaining momentum over the last 30 years. Factors such as adoption, impact, complexity, rate of change, number of practitioners, specialisation of roles, interdependencies, and number of technical components are all going up. The result? The ways of working, practices, or technologies that were suitable 10 years ago are no longer fit for purpose. We already see that the higher-level abstraction and automation that AI is offering are changing the needs of tomorrow rapidly, and we have to adapt.

My aim is that through this exercise, I can cover various practices that are used for training for incident response (on-call) with a rich background on how they emerged. It’ll also allow me to imagine what the future of training for incident response needs to look like to support emerging new needs.

1970s

In the 1970s, life was much simpler. Software engineers used to cover a lot of aspects of building systems, from requirement gathering (mostly sitting next to the business stakeholders), designing, writing code, deploying code, installing operating systems and DBs, writing SQL, doing QA, and fixing systems when they behaved unexpectedly (incident response). There were a lot fewer engineers working on systems compared to today, and generally, users didn’t expect high availability. You could still buy your pizza if systems went down.

Though systems weren’t ‘simpler’ per se (OS/360 was very complex, for example), they were more contained and had fewer dependencies. Engineers had a good overall idea of how the system worked; they certainly had stronger vertical skills (which were very handy for troubleshooting), and a smaller number of people were involved. These elements on their own already make incident response much simpler. Add to that less pressure or interest from executives, which I bet would have ironically resulted in a smoother response and faster recovery time. I’ve not done any proper study to back any of the above, and I think there is room for a decent study (so I’m open to comments and being educated here!).

This era was the time of learning incident response on the job, via osmosis.

1980s

The 1980s marked the beginning of real specialisation within technology teams. Instead of 'one engineer does everything', roles started to fragment: database administrators focused solely on data, QA engineers tested for quality, network engineers looked after connectivity, and software engineers concentrated on writing application code. Each group brought its own expertise - and its own perspective on what mattered most.

Overall, this made systems more sophisticated, but it also meant that incidents now required coordination across multiple disciplines. A network glitch, a database lockup, or a code bug might all contribute to the same outage, and solving it often meant bringing several teams together, each speaking a slightly different technical language.

Thus, as systems became more complex, the human side of incident response got trickier: who takes the lead, how do you share information, and how do you avoid delays while specialists debate root causes? There’s little evidence of formal 'incident response training' in this era.

1990s

The 1990s changed the scale of everything. The rise of the internet and the dotcom boom pushed systems into new territory: web applications, e-commerce and 24/7 global availability. Suddenly, downtime wasn’t just inconvenient: it meant lost sales, angry customers and headlines. Incident response became more visible to executives, though practices were still often ad hoc.

Two famous cyberincidents were the Citibank heist (1994) and the Melissa virus (1999).

Firstly, the Citibank heist (a cybercrime committed by Russian hacker Vladimir Levin) contributed to the awareness that incidents were a business problem, not just an engineering problem. The $10.7 million transferred out of customer accounts made that very clear.

The Melissa outbreak - a rapid-spreading internet worm - prompted questions about preparedness, especially whether federal agencies and other organisations had adequate processes in place for detection, containment and recovery. The virus made it very clear that many systems were not ready for the growing threat. A notable quote from the GAO report on the attack is ‘Because of the increasing reliance on the Internet and standard COTS products as well as the increasing improvements in computer attacker tools and techniques, (as evidenced in the additional capability and techniques employed in the Melissa attack), it is likely that the next virus will propagate faster, do more damage, and be more difficult to detect and to counter’.

2000s - 2010s

In the 2000s, software teams started moving away from the old 'waterfall' way of working, with its big plans up front, long projects and releases that took months or even years. Instead, agile practices took hold, with shorter cycles, quicker releases and a focus on learning and adapting as you go.

This sped everything up: features and fixes could be shipped much faster, and teams could respond to customer needs sometimes almost in real time. But that faster pace also meant more changes hitting production more often, which naturally created more chances for things to break. Incident response had to keep up with that new rhythm.

Two of the era’s most notable outages happened in 2007: the BlackBerry service outage and the Skype global outage. The irony was that many people relied on their BlackBerry to be alerted about their own incidents.

In April, BlackBerry users across North America experienced a multi-day disruption to email and messaging services. Research In Motion (RIM) reported that a critical core switch failed, and the backup system didn’t function as expected. The incident highlighted weaknesses in RIM’s failover design and shook confidence in the reliability of BlackBerry’s once-dominant messaging platform.

In August 2007, Skype suffered a two-day worldwide outage affecting millions of users. The trigger was Microsoft’s Patch Tuesday, which caused a wave of Windows PC restarts that overloaded Skype’s peer-to-peer network. The deeper root cause was a flaw in Skype’s resource allocation algorithm, which couldn’t handle the sudden synchronised demand.

Both incidents underscored the reality of the new era: as systems became faster-moving and more complex, incident response practices had to evolve just as quickly to prevent single points of failure or unexpected surges from spiralling into global outages.

2010s - 2020s

By the 2010s, incident response had become vastly more complex. Frameworks like ITIL (the Information Technology Infrastructure Library), which originated in the UK in the 1980s via the Central Computer and Telecommunications Agency (CCTA), had by this time matured through multiple versions (ITIL v2, v3, etc.) and become widely adopted as a way to codify IT service management and align IT operations with business needs.

Meanwhile, the rise of cloud computing, mobile apps, microservices and the formalisation of DevOps and Site Reliability Engineering (SRE) dramatically changed how systems were built, deployed, monitored, and maintained. Teams grew more specialised (cybersecurity, platform, automation, product roles) and disruptions began to carry much higher business risk.

Outages could impact customers globally, attract executive attention and damage reputations (and profits) in real time. Training and best practices lagged somewhat behind the technical change, but shifts in culture helped. For example, in 2012, John Allspaw, then at Etsy, published 'Blameless PostMortems and a Just Culture', arguing that to learn from failures, organisations must focus on how mistakes happen rather than who made them.

These cultural shifts, together with more formal ITIL-style standards and structured incident response practices, marked a turning point: reacting to inevitable failures was no longer just an engineering inconvenience, but a business necessity.

Two well-known incidents from this era were the cascading failure in Amazon’s US-East-1 region and the Knight Capital trading loss.

In April 2011, a routine network change in Amazon’s US-East-1 region triggered a cascading failure in the Elastic Block Store (EBS) service. Misrouted traffic led to a massive “re-mirroring storm” as volumes tried to recover, overwhelming capacity and causing widespread outages across EC2, EBS, and RDS. Recovery took days, and a small percentage of volumes were permanently lost.

Because of this outage and its post-mortem, cloud architects began to take multi-AZ and multi-region design far more seriously. Redundancy was rethought across not just compute, but also storage, control planes, and networking. Practices like explicitly testing for AZ failures, simulating partial network loss, and building systems to degrade gracefully under capacity pressure became more common.

For anyone who’s worked in financial services, the name ‘Knight Capital’ should be immediately evocative. On 1 August 2012, in the space of 45 minutes, Knight Capital's trading system generated over 4 million unintended orders for just 212 customer trades, resulting in positions of nearly 397 million shares and losses of $460M. Internal systems produced 97 error emails before market open, but they weren’t configured as actionable alerts. The lack of automated kill switches and inadequate pre-trade risk controls violated SEC Market Access Rule 15c3-5, leading to a $12M fine and mandatory compliance reforms.
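To make the idea of an automated kill switch a little more concrete, here is a minimal sketch of a pre-trade guard. It is purely illustrative: the `KillSwitch` class, the `run_session` helper and both limits are hypothetical names and numbers I have made up for the example, and it is not a description of how Knight's systems worked or of what Rule 15c3-5 prescribes.

```python
from dataclasses import dataclass


@dataclass
class KillSwitch:
    """Hypothetical pre-trade guard: halts order flow once a limit is breached."""
    max_orders: int          # assumed cap on the number of orders per session
    max_gross_shares: int    # assumed cap on total shares across all orders
    orders_sent: int = 0
    gross_shares: int = 0
    halted: bool = False
    halt_reason: str = ""

    def allow(self, shares: int) -> bool:
        """Return True if the order may be sent; trip the switch otherwise."""
        if self.halted:
            return False
        if self.orders_sent + 1 > self.max_orders:
            self._halt("order count limit exceeded")
            return False
        if self.gross_shares + abs(shares) > self.max_gross_shares:
            self._halt("gross share limit exceeded")
            return False
        self.orders_sent += 1
        self.gross_shares += abs(shares)
        return True

    def _halt(self, reason: str) -> None:
        # Once tripped, the switch stays tripped until a human resets it.
        self.halted = True
        self.halt_reason = reason


def run_session(switch: KillSwitch, order_sizes: list[int]) -> int:
    """Push orders through the guard and return how many were actually sent."""
    sent = 0
    for shares in order_sizes:
        if switch.allow(shares):
            sent += 1
    return sent
```

With `KillSwitch(max_orders=1000, max_gross_shares=50_000)` and a runaway stream of 10,000 orders of 100 shares each, the guard lets 500 orders through and then refuses everything else. The design choice that matters is that the switch fails closed and stays closed: a human has to decide to resume, rather than the system deciding to carry on.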

Taken together, the 2010s showed that incident response was no longer a side concern for engineers alone. With the rise of globally distributed systems, interconnected financial platforms and regulatory scrutiny, responding to failure had become a business-critical capability - and one that demanded cultural, technical and organisational change in equal measure.

2020s

The 2020 pandemic normalised remote working and displaced standard modes of in-person communication, such as quick feedback loops and informal discussions (across 2019 to 2021, Slack’s value went up 400%).

Some incident responders suddenly found themselves facing a new challenge - working across different time zones. It was a sudden change that put real pressure on human communication.

That is not to say that incident response tools didn't progress. Observability platforms became more integrated into the IR lifecycle: better logs, metrics, traces, real-time dashboards, alongside tools that help to automatically generate incident timelines and summaries. There was also a rapid growth in tools that combine alerting, collaboration and postmortem reporting together, sometimes with generative AI components to write/suggest drafts of postmortems or reports.

Two notable incidents were the Southwest Airlines meltdown and the CrowdStrike outage.

In December 2023, the U.S. Department of Transportation hit Southwest Airlines with a $140 million penalty, the largest consumer protection fine in its history, after the airline’s holiday meltdown the year before. During the last week of December 2022, severe winter storms triggered cascading failures in Southwest’s crew scheduling systems.

While other airlines recovered in a few days, Southwest exacerbated the customer chaos. More than 16,900 flights were cancelled over 10 days, leaving millions of passengers stranded. The DOT’s investigation found that the airline failed to provide timely refunds, reimbursements and even basic customer support. Alongside the fine, regulators forced Southwest into a consent order requiring major system upgrades and corrective measures to ensure something like this doesn’t happen again. And so far, touch wood, it hasn’t.

In July 2024, a defective content update to CrowdStrike Falcon for Windows caused widespread system crashes (Blue Screens of Death). The incident disrupted critical services, prompted an emergency alert from the U.S. Cybersecurity and Infrastructure Security Agency (CISA), and led CrowdStrike to release technical details and remediation guidance to affected customers.

Together, these events underscored how the 2020s had raised the stakes yet again. With systems more interconnected than ever, failures could ripple globally in hours; with teams distributed, communication breakdowns were as dangerous as technical bugs. Incident response was no longer just about restoring services - it was about maintaining trust at scale, under the scrutiny of millions of customers, regulatory bodies and entire industries.

2025 (The Present)

I wanted to contrast decades-old practices of learning on-call 'by osmosis' with where we are today. The industry has evolved at breakneck speed, and that evolution has created the need for on-call training that is far more scalable, adaptable, and effective than the methods of the past. Old approaches simply don’t cut it anymore - it’s a bit like trying to copy a modern database using a floppy disk.

AI is now shaking every part of system development. Many major incident response software providers are rolling out AI-powered SRE agents that can detect anomalies, generate hypotheses, suggest fixes, and even kick off remediations faster than human teams can. That accelerates triage and tightens feedback loops, but it also shifts the balance: AI handles the routine noise, leaving the really complex, high-stakes incidents for human responders.

At the same time, AI is influencing how code itself is written. 'Vibe coding,' a term coined by Andrej Karpathy in 2025, describes an emerging style where engineers rely heavily on AI to generate infrastructure and application code. It raises an important question: when AI writes your systems, who is accountable when they fail?

We’ve already seen the risks play out in real life - including several high-profile outages in the UK retail sector, one of which has inspired our upcoming challenge drill this month. (If you’re an incident responder, this one’s for you!)

All of this points to the same conclusion: in 2025 the speed of change has suddenly spiked, and the implications for incident response training are massive.

Closing Thoughts

When I stroll through the history above, it’s obvious that incident response training practices have not evolved at the same pace; in fact, they are stuck serving the needs of 15 years ago. That’s why I feel a desperate need for revolution. Evolution won't cut it anymore at the pace we are going.

We need to move beyond static playbooks and tabletop exercises. These are simply not educational or engaging enough to teach new generations of incident responders, given how far incident response itself has evolved. The time for simply telling is over; practising now means plunging into immersive simulations that mirror real-world chaos and staging environments that reflect production realities. These approaches need to be safe - of course - safeguarding uptime while preparing teams for the unpredictable.

Our training is designed to accommodate the paradox of ‘preparing for the unpredictable’ via realistic incident response training. I’d invite you, after you’ve commented your thoughts, to try a free incident simulation. You might be surprised at the realism of it - but, in 2025, you shouldn’t expect anything less.

Hamed Silatani

Hamed is the co-founder and CEO of Uptime Labs. He has 20 years of experience in engineering leadership, reliability engineering and IT operations. Having spent the majority of his career at the sharp end of incident response in financial services, he's looking to help all companies master the unexpected.
