When AWS Sneezes: Reflections on Resilience and Reality

Joe Mckevitt

October 27, 2025

This week’s AWS regional outage has been well covered, including the technical cause, timelines and the long list of companies affected.
But what’s more interesting, at least to me, is what happens afterwards inside those thousands of companies that suddenly find themselves impacted.

Because right now, across engineering floors, Slack channels and exec reviews, there are a few recurring themes playing out. People are:

Asking ‘why aren’t we as resilient as we thought?’
Discovering unexpected third-party dependencies
Scrambling to understand escalation routes
Reflecting on how well (or poorly) they communicated with customers while the lights were out

These are the post-incident conversations that matter most, the ones that shape how resilient organisations really become. Every incident is a learning opportunity, even if it's someone else's.

1. The Surprise of Not Being as Resilient as We Thought

If you were in one of the companies affected, there’s a good chance someone in the room said:

“Wait… why aren’t we multi-regional?”
“Shouldn’t we have survived a single region outage?”

Those moments are humbling.
Many teams genuinely believed they were resilient until they weren’t.
Then came the internal investigations: why weren’t we more distributed? What assumptions did we make? Where are the hidden single points of failure that we never noticed?

The truth is, most organisations only discover the real shape of their resilience when it’s tested under pressure: in other words, when something breaks in Prod and we find out in real time.
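One low-cost way to start on the "hidden single points of failure" question before the next outage is to write down where each service actually runs and ask what disappears if one region does. The sketch below is only an illustration under stated assumptions: the `DEPLOYMENTS` inventory, the service names and the region assignments are all hypothetical, and a real inventory would come from your own infrastructure tooling.

```python
from typing import Dict, List, Set

# Hypothetical inventory: service name -> regions it is deployed in.
# Service names and region assignments are illustrative assumptions.
DEPLOYMENTS: Dict[str, Set[str]] = {
    "checkout-api": {"us-east-1", "eu-west-1"},
    "auth-service": {"us-east-1"},
    "image-resizer": {"eu-west-1"},
}


def single_region_services(deployments: Dict[str, Set[str]]) -> Dict[str, str]:
    """Return every service that runs in exactly one region, with that region."""
    return {
        name: next(iter(regions))
        for name, regions in sorted(deployments.items())
        if len(regions) == 1
    }


def services_lost_if_region_fails(
    deployments: Dict[str, Set[str]], region: str
) -> List[str]:
    """Return services that have no region left if `region` is unavailable."""
    return sorted(
        name
        for name, regions in deployments.items()
        if regions and not (regions - {region})
    )


if __name__ == "__main__":
    print(single_region_services(DEPLOYMENTS))
    print(services_lost_if_region_fails(DEPLOYMENTS, "us-east-1"))
```

In this made-up inventory, `auth-service` is the quiet single point of failure: everything else survives the loss of one region, but nothing works without authentication.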

2. The Domino Effect of Third-Party Dependencies

A few companies I know weren’t even using the affected AWS region directly, yet still went down.
Why? Because a third-party provider they relied on was.

That third-party dependency became the single point of failure.
They didn’t expect a regional issue at AWS to take their vendor offline, and they definitely didn’t expect that vendor’s outage to cascade into their own operations.

It’s a perfect illustration of how fragility travels through dependencies.
Even if your own systems are fault-tolerant, your suppliers’ architectures might not be, and that risk becomes yours the moment you integrate with them.
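To make "fragility travels through dependencies" concrete, here is a minimal sketch that walks a dependency map and lists every region a component relies on, directly or through a vendor. Everything in it is an assumption for illustration: the `DEPENDS_ON` map, the component and vendor names, and the `region:` prefix convention are hypothetical, and in practice the hardest part is getting vendors to tell you what sits behind them.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: component -> things it depends on directly.
# Entries starting with "region:" stand for a cloud region. All names are
# illustrative assumptions, not real vendors or architectures.
DEPENDS_ON: Dict[str, List[str]] = {
    "our-app": ["our-database", "vendor-payments"],
    "our-database": ["region:eu-west-1"],
    "vendor-payments": ["vendor-queue"],
    "vendor-queue": ["region:us-east-1"],
}


def transitive_dependencies(component: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Collect everything `component` depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(graph.get(component, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; this also guards against cycles
        seen.add(current)
        stack.extend(graph.get(current, []))
    return seen


def exposed_regions(component: str, graph: Dict[str, List[str]]) -> List[str]:
    """List the regions whose failure could reach `component`."""
    return sorted(
        dep.split(":", 1)[1]
        for dep in transitive_dependencies(component, graph)
        if dep.startswith("region:")
    )


if __name__ == "__main__":
    print(exposed_regions("our-app", DEPENDS_ON))
```

Here `our-app` never touches `us-east-1` itself, yet the region still shows up in its exposure list because the payments vendor's queue lives there.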

3. The Scramble for Contacts and Escalations

Inside many ‘control rooms’ this week, there were probably two conversations running in parallel.

One was the technical triage: teams trying to stabilise their own systems.
The other was the external scramble: people digging through old resources (emails? wiki pages?), trying to find who to call at the third party to get prioritised.

Some companies had this nailed. They had support channels, escalation paths and named account contacts ready. Others were, frankly, improvising under pressure, hunting for numbers, opening tickets, hoping someone would reply.

It’s a small detail, but it makes a huge difference when time really matters.

4. Customer Communication

AWS’ own communication cadence has been analysed endlessly, but I keep wondering:
How did the thousands of affected companies handle their own customer updates?

Did they acknowledge the issue early?
Did they provide regular updates even when there was no change?
Did they help customers understand what was happening, or did they go quiet until things recovered?

It’s not easy. Keeping customers informed while engineers are still diagnosing the issue is a balancing act.
Too little communication looks evasive. Too much can disrupt the technical work.

After events like this, every team tends to ask the same questions:

“Did we get the tone right?”
“Did we keep the flow steady?”
“Did we protect the customer relationship while staying focused on the fix?”

Those are hard questions, but they’re the right ones to ask.

5. Heroes Run to the Fire

It was also a moment for many people to shine.
Across those control rooms, engineers showed courage and creativity, thinking on their feet, running rapid troubleshooting threads, and improvising Plan B strategies when the playbook ran out.
Customer teams, meanwhile, were holding the line with empathy and calm, reassuring customers that service would be restored, even when the situation was uncertain and updates from downstream dependencies were thin or overly optimistic. It’s also worth asking whether our leadership team rose to the occasion, supporting incident responders and modelling steady, confident leadership. Calm is contagious.
In many ways, incidents like this reveal the best of people under pressure: the blend of technical problem-solving and human composure that keeps companies together in a crisis.

What I like to think is happening is that these heroes are getting the recognition they deserve! I’d also like to hope that people are calling out what was less helpful or effective, so that teams can grow and improve.

Come to think of it, these conversations are just as valuable for the lucky ones who narrowly missed the blast this time. Every incident is a learning opportunity for us all.

I've tried to build those moments of ‘it could’ve been us’ into Uptime Labs. Our simulated incident environments help teams experience real-world outages safely, building the instincts and communication needed when the real thing hits. You can try one yourself here and let me know what you think!

Joe Mckevitt

Joe is the co-founder and CTO of Uptime Labs. A passionate technologist and developer, he has 17 years’ experience in building and scaling high-performing products and teams. Also a marathon runner, he’s wired for high performance. He loves creating cultures of constant innovation, and coaching people to develop their full potential.
