The short answer is 'a lot' and definitely goes far beyond this blog post. I’m writing this post mostly selfishly to start consolidating my own thoughts and as a platform to get others' help to enrich my learning. So comment away and share your thoughts please!
(I must confess upfront that I used AI to research this piece. I’m fundamentally not convinced yet that AI can generate reliable and valuable original content, but I used it to find resources to study. I checked out all the references and read the background stories to satisfy my scepticism about AI making things up.)
Part 1: Los Rodeos: Tenerife, 1977
A thick fog crawled across the runway at Los Rodeos, reducing visibility to a few hundred metres. Inside the cockpit of a KLM Boeing 747, Captain Jacob Veldhuyzen van Zanten, the airline’s most senior instructor, was focused on departing before his duty time expired. As he advanced the throttles, the Flight Engineer raised a concern: “Is he not clear then, that Pan Am?” The Captain replied, “Yes.”
There was no confirmation, no second-guessing of authority. Seconds later, the KLM jet lunged through the fog at full takeoff power, unaware that a Pan American 747 was taxiing directly in its path. The resulting collision remains the deadliest accident in aviation history, causing 583 fatalities.
The incident wasn’t caused by a mechanical failure, but by the novelty of the circumstances and systemic culture and practices ingrained in the crew’s operating model. It reminds us that all mechanical and software systems can work exactly as expected, and incidents will still happen.
In my opinion, the actual incident was not the collision of the two jumbo jets; that was the outcome of handling an existing incident that had gone badly wrong. The real incident started hours earlier at an airport 160 km away. A terrorist explosion at the nearby Gran Canaria Airport forced a massive diversion of international flights to the much smaller, ill-equipped Los Rodeos Airport on the island of Tenerife. By mid-afternoon, the tiny regional airport was choked with more than a dozen diverted airliners, including five large jets parked on the only parallel taxiway because the main apron was overflowing.
This congestion forced any departing aircraft to ‘backtrack’, taxiing down the active runway itself to reach the takeoff position. To complicate matters, the airport’s centreline lights were unserviceable, and a sudden dense bank of sea clouds began rolling across the runway, slashing visibility from kilometres to a few hundred metres within minutes.
Neither the air traffic controllers in the tower nor the pilots on the ground could see one another; they were operating ‘blind,’ relying entirely on a single overcrowded radio frequency prone to signal interference.
Tenerife remains a tragic accident in aviation history because it revealed that a so-called perfect aircraft (the KLM jet) is still vulnerable to a flawed human system. The official investigation identified Unusual Traffic Congestion and Rapidly Deteriorating Weather as the catalysts, but arguably the culprits were systemic: Authority Gradient (an aviation term for the perceived decision-making hierarchy in a team), and ambiguous communication.
Because the KLM Captain was the airline’s chief instructor, his junior crew felt unable to challenge his decision to roll without a clear takeoff clearance. On top of that, a critical radio transmission from the Pan Am crew - "We are still taxiing down the runway!" - was blocked by a ‘heterodyne’ (a high-pitched squeal) because two pilots spoke at the exact same time. This tragedy proved that safety cannot rely on the competence of one leader. It requires a protocol where information is shared openly and every crew member has the ‘right to pause’ the operation.
Part 2: Qantas Flight 32
Thirty-three years after Tenerife, the aviation industry faced a new test of its evolution. By November 4, 2010, the old way of autocratic command had been replaced by a rigorous, collaborative discipline known as Crew Resource Management (CRM). On that day, this new protocol was pushed to its breaking point by Qantas Flight 32.
Qantas Flight 32 - a massive Airbus A380, the world's largest passenger jet - departed Singapore for Sydney with 440 passengers and 29 crew. In the cockpit were five pilots (an unusually high number due to a check-captain being trained and a supervising captain observing). The surplus of human experience would soon become the aircraft's most critical redundant system.
1. The Event (The ‘Startle’)
Four minutes after takeoff, as the aircraft climbed through 7,000 feet, the Number 2 engine exploded. This was an ‘uncontained engine failure’: a turbine disc had disintegrated, firing shrapnel through the wing. The damage report was a pilot’s nightmare: 650 wires severed, fuel tanks punctured, hydraulic lines lost and the anti-lock braking system disabled. Worst of all, the fuel transfer system failed, trapping fuel in the tail and leaving the 400-ton aircraft dangerously tail-heavy and unstable.
2. Managing the Digital Chaos
In the Old Way, a crew might have succumbed to ‘get-there-itis,’ rushing a broken plane to the ground. But modern protocol dictates a different priority: Aviate, Navigate, Communicate. As the Electronic Centralised Aircraft Monitor (ECAM) began screaming with 54 simultaneous system failures, a volume of data no human could physically read, Captain Richard de Crespigny took a decisive step: he silenced the master warning.
To head off the kind of cognitive overload that had overwhelmed the crew at Tenerife, he issued a grounding command: "We must not get focused on what is wrong. We must focus on what is still working," quoting Gene Kranz, a NASA flight director during the Apollo 13 crisis.
3. Buying Time
Instead of heading immediately for the runway, the crew entered a holding pattern, flying loops near Singapore for two hours. They used this time to perform a ‘controllability check’ at high altitude, testing the limits of the damaged wing before they were committed to the ground. While de Crespigny focused on flying, the other four pilots functioned as a ‘board of directors’. They split the workload: the First Officer managed the radio, while the additional captains ran the calculations.
They calculated a terrifying reality: with no anti-skid brakes and a heavy fuel load, they would stop with only 100 metres of runway remaining. In other words, there was virtually no margin for error.
4. The Touchdown and the ‘Engine That Wouldn't Die’
The landing was a feat of precision. De Crespigny applied maximum manual braking, stopping the jet just 150 metres from the runway’s end. But the incident didn't end at the stop. Fluid leaked onto white-hot brakes as a fire risk loomed, yet Engine Number 1 refused to shut down; its control cables had been severed.
This led to a final, critical decision: should they evacuate? A panicked decision might have led to disaster. De Crespigny weighed the risks of the running engine sucking passengers into its intake versus the fire risk on board. He chose to keep the passengers on the aircraft, coordinating with ground fire crews to 'choke' the engine with foam. The result was that everyone walked off the plane safely.
Part 3: What Stands Out For Me From Studying Both Incidents
The Tenerife incident was a catalyst for the Human Factors revolution in aviation. (Hopefully, the IT industry won't wait for a Tenerife-scale disaster to take Human Factors seriously; the signs are positive. Check out RISF, the Resilience In Software Foundation).
Let’s start with comparing the handling of both incidents:
I also think the use of mnemonics to reduce the cost of coordination in high-stress situations is very effective. They serve two purposes: as shorthand for frameworks that create shared mental models, and as fast, easily recalled memory aids that help operators remember their priorities under stress. Crucially, everyone is trained to have the exact same understanding of the mnemonics:
- Priorities: Aviate, Navigate, Communicate
- TEM: The Strategic Framework (The ‘Why’): Threat and Error Management
- FOR-DEC: The Tactical Framework (The ‘How’): Facts, Options, Risks, Decisions, Execution, Check
During major incidents, I often get overwhelmed by the level of noise without even recognising it. It takes expertise to be conscious of saturation levels. The pilot team recognised the risk of saturation and information overload and made decisions to share the load and reduce noise by turning off alerts to gain cognitive capacity.
What really resonated with me is the fact that there is no 100% correct decision during a high-stakes incident. We are faced with trade-offs that make it really hard. De Crespigny had to weigh the risk of losing lives during an evacuation next to a running engine against the risk of fire engulfing the plane and killing everyone. It could have easily gone either way.
The high stakes and time pressure must have been almost incomprehensibly stressful for the pilot team, crew and ground staff. The only way to perform highly complex cognitive work under such high pressure is by having frequent team practice and training. It is technically impossible to practice every permutation of disaster; therefore, the aviation industry has shifted from scenario-based training (memorising a specific solution for a specific problem) to Resilience Engineering and Competency-Based Training and Assessment.
Beyond the brilliance of deliberate frameworks and practices, there was an element of luck. Regulations require three pilots on board an A380 for an 8-12 hour flight. QF32 had five pilots on board. The two extra pilots happened to be on board for training. What a training session it was!
Part 4: The Learnings I Think Apply to the IT industry
I have to quote my friend and colleague Stuart Rimell, who quoted Dr Richard Cook in answering a common challenge: are IT industry incidents as important as healthcare incidents? (The same challenge comes up in comparisons with other safety-critical industries.) The question is implicitly asking, “Does the IT industry need the same rigour?”
A common challenge to major investment in IT resilience is the claim that, compared to aviation, nuclear power, or the operating theatre, the stakes are lower, so IT is less ‘safety critical’. Listen to how Resilience Engineering pioneer Dr Richard Cook addressed this reasoning in 2012.
"What’s happening here is the lifeblood of commerce, the core of the economic system we now experience. Do you think that’s unimportant? The key question is whether healthcare’s importance matches that of web operations - not the reverse.”
Aviation’s flawed practices and training programmes before the 1977 Tenerife disaster were far more advanced than what we have in the IT industry today. There were at least regulations and structured, standardised training to help the crew deal with incidents, albeit flawed. In today’s IT industry, every operator and every organisation is developing and learning incident response skills in isolation and on the job.
Aviation safety is governed by rules that are internationally accepted. A key benefit of international regulation and standards is that they can reduce the cost of coordinating and maintaining common ground during incidents. If all on-call staff are trained on the same frameworks and principles, they have a good base to operate effectively under the high pressure of Sev 0s and Sev 1s. Imagine we had an equivalent of “Aviate, Navigate, Communicate”, FOR-DEC, or TEM that was universally understood the same way, no matter what company you are working for.
Aviation safety embraces Human Factors and Resilience Engineering. The principles are industry agnostic. Let’s look at a few examples that can inspire the IT industry:
- Evolution of leadership from autocratic to synergistic: in an IT incident bridge, what happens when a senior exec joins the call? Do junior staff feel comfortable challenging their CXO?
- Accepting that every mechanical and software part can work as expected, and yet incidents can happen. In our industry, we are immediately in pursuit of the root cause and the part of the system that misbehaved.
- The QF32 cockpit crew managed saturation well. They divided the load and reduced digital noise by turning off alerts.
Rigorous and frequent training and simulations are critical to enabling the crew to apply frameworks and principles under high pressure. None of the crew on QF32 had ever practised for an uncontained engine explosion that severs 600+ wires and damages hydraulic controls and anti-lock brakes.
Recognising that it is impossible to practice for every eventuality, aviation training has shifted from scenario-based training (memorising a specific solution for a specific problem) to Resilience Engineering and Competency-Based Training and Assessment. The training sessions are not just about flying; they are 50% technical and 50% Non-Technical Skills (NTS).
Pilots are placed in LOFT (Line-Oriented Flight Training), which is a bridge between technical skills and real-world disasters. Instead of practising a single manoeuvre (like a stall recovery), LOFT involves a full-mission, real-time simulation that replicates a flight from gate to gate.

This is an important learning for the IT industry. We too need to recognise that technical skills alone are not enough to resolve incidents, and we need to practice teamwork, adapt to surprises, communicate and manage cognitive load in a real-life environment.
The last point is very close to our hearts at Uptime Labs. It is exactly why all of us have dedicated our time to building a simulation and practice environment that allows IT operators to practice the skills that matter. I’ve no doubt that in the near future, Uptime Labs-style simulation will become not only the norm but a necessity for our industry. I sincerely hope to see more platforms like Uptime Labs emerge. We have a duty to humanity, which is increasingly relying on digital services.
My goal in this blog post was to consolidate my own thoughts and create a platform to continue this conversation with the IT community. However, if by this point you are thinking that some of the learnings can apply to your working environment, I’d be over the moon. Please let me know and share your thoughts.





