
SREs, SWEs, and security teams increasingly share responsibility when it comes to breaches. These are the teams that build, maintain, and — critically — understand the systems. Whether security grows out of engineering or exists as a separate function, the collaboration is tight and getting tighter.
Everyone in the room is trying to do a great job. But, if you’ve got different mental models for what “doing a great job” looks like, you see tangible friction. I’ve been in rooms where that friction created frustration instead of collaboration — not because anyone was wrong, but because we were solving for different things and didn’t know it.
I started as a software engineer. Twenty minutes into my first on-call shift at 6:20 AM, I got my first alert. Over the years, I took on more responsibility: triage, then escalation, then incident command. I designed on-call workflows, wrote playbooks, and ran incident response for engineering teams. My path into cybersecurity started when our CTO pulled me into a live security incident and asked me to help respond. That experience got me onto the security team, and ultimately led me to write this article.
I brought my engineering on-call instincts with me and immediately hit friction. Process that felt slow. Communication that felt restricted. Decisions that seemed to optimise for the wrong things. It took mentors, mistakes, and a few “aha” moments before I understood: The process wasn’t inefficient — it was built for a different problem with different constraints. I just couldn’t see them yet.
This is hard to write about. Blue teams can’t easily share war stories — the specifics of your infrastructure, detection capabilities, and incidents are confidential, often with legal restrictions on disclosure. Sharing them publicly risks creating the very exposure your team worked to prevent. So the lessons stay locked in private retrospectives and tribal knowledge, and the frustrations stay unspoken. This series is an attempt to put them into words. Part 1 covers **why** security IR is different — the underlying mental model.
Incident Timeline
The scenario below is fictional, but close to reality. If you’ve been involved in a security incident, you’ll recognise the shape of it.
15:02 — Alert fires. CPU usage elevated across the application tier, latency creeping up. Not a full outage — but customers are seeing slower responses. You acknowledge the page and open `#inc-20260315-payments`. Post the alert, tag the team. SEV-2.
15:08 — First engineer triages. Dashboards, recent deployments, traffic patterns — nothing obvious. Escalates to the on-call senior.
15:20 — More engineers join the channel. Customer care too: “We’re getting tickets about slow payments. Any ETA?” Someone drafts a status page update.
15:55 — Almost an hour in, still no root cause. The usual suspects are clean. Customer care is pushing to post the status page update: tickets are climbing.
16:15 — Engineers have been going deeper — individual service metrics, database connections, third-party API response times. One of them spots a process running on two application servers. Not in any deployment manifest, not in any runbook. Consuming enough CPU to explain the degradation.
16:20 — Discussion picks up with findings, links to the servers, and process details. Engineers start sharing screenshots, metrics, theories. Someone suggests restarting the servers. Someone proposes a rollback. Another shares a stack trace and asks if it could be a crypto miner. Customer care wants to know what this means for the ETA.
16:35 — A DM from the security lead: “Hey — I saw the thread about that process. Can you hold off on restarting those servers for now? And hold the status page update too. I need to set up a private channel, I’ll explain there. Just please don’t post any more details about it in the public channel.”
16:40 — You get added to `#sec-20260315`. Private. Invite list: you, security lead, engineering manager, senior customer care lead, legal on-call. Boilerplate from the security lead: “This is being treated as a potential security incident. Do not share any details outside of this channel.” Legal echoes: “All communication about this issue stays here until further notice.” They tell you to post one message in the public channel: “Investigating with a smaller group to reduce noise — updates to follow.” Nothing else.
16:45 — The security lead sets up a Zoom with the private channel group. They want the process kept alive — it’s their best window into what’s happening, and the customer impact is degraded performance, not data loss. Killing it before they understand the scope costs more than it saves. You push back: “Can we at least update the status page?” Legal says no — not until they’ve assessed what can be disclosed. The customer care lead pushes too — enterprise clients are escalating. Legal holds the line. You start to understand this might not resolve today.
16:55 — DMs from the engineers: “What happened?” “Did you find the root cause?” “Want me to just restart those servers?” You reply to a few — “Security team is looking into it, hold tight on the servers for now.” You’re giving them something without giving them anything. One of them pushes: “Looking into what? Is it the process? What did you find?” You reply: “Can’t get into details right now, sorry. Will update when I can.”
Two hours ago you had degraded performance. Now you’re on a Zoom with too many people, your colleagues are asking what happened, customers are waiting on a status update that isn’t coming, and the legal team is correcting which words you use, which feels like it’s getting in the way of a productive conversation.
The rest of this article explains the principles behind each of those shifts — so that next time, it makes sense before you feel the friction.
Driver 1: Root Cause — Accidental vs. Adversarial
The fundamental difference between an outage and a breach is the root cause.
Outages are often caused by accidents. Misconfigurations. Resource constraints. Bad deployments. They can be devastating, but the causes don’t hide from you, and they don’t change their behaviour when you start investigating. Once you find the root of the problem, it stays found.
Breaches are caused by someone with intent and goals. They are caused by people — diverse, adaptive, and motivated to get something done. And, if they realise you’ve detected them, they accelerate their objectives, destroy evidence, or simply go quiet and wait.
Outages are PvE, but breaches are PvP.
What this drives: What can look like hesitation to act on the security team’s part. The playbook that works for a broken system doesn’t apply when you’re working against a human being.
What this means for you: You’re not ready to ship a fix until you can picture the person behind the breach: how they work, how they’ll react to your actions, what their goals are, and what collateral damage their reaction could cause you. You’re not working on a system. You’re working against a person.
Driver 2: Timeline — Starting at the Beginning vs. Dropped in the Middle
When an outage starts, you know you’re at the beginning. The alert is the starting point. You work forward from here.
If you think you’ve detected a breach, you’re in the middle of an unknown timeline — and you don’t even know it’s a breach yet. Maybe it’s mundane, a non-breach incident, or something worse. You might be three minutes in or three weeks in. Even an obvious spike of traffic may not mark the beginning — someone may have been poking around weeks earlier at lower volume.
The goal is the same as any outage: Investigate what happened so you can take action. But the chain of events is long, murky, and full of gaps. Trying to answer basic questions, like “who did it”, “why”, and “what did they do” for each incident is often an impossible ask. All you get are pieces.
Just as we assess severity by understanding who’s impacted and how badly, security teams draw on the clarity they *can* observe, identify gaps, and piece together the chain of attack. That chain, however incomplete, is your orientation for deciding the course of action.
What this drives: Not having answers to basic questions — and no one trying to answer them. You’re used to resolving that uncertainty. In breach response, we learn to act within it.
What this means for you: Your goal is not to resolve uncertainty. The chain you draw — even with gaps — is enough to make decisions. Trust the process of refining it as you go.
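One way to picture this is to treat the chain as a list of timestamped observations and make the gaps explicit rather than hiding them. The sketch below is illustrative only: the `Observation` type, the `find_gaps` helper, the one-day threshold, and every event in the sample data are hypothetical, not taken from a real incident or tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Observation:
    """One piece of evidence placed on the timeline (hypothetical structure)."""
    at: datetime
    source: str   # where the evidence came from, e.g. a log or a dashboard
    summary: str


def find_gaps(observations, threshold):
    """Return (start, end) pairs where consecutive observations are
    further apart than `threshold`, i.e. stretches with no evidence."""
    ordered = sorted(observations, key=lambda o: o.at)
    return [
        (earlier.at, later.at)
        for earlier, later in zip(ordered, ordered[1:])
        if later.at - earlier.at > threshold
    ]


# Made-up observations, deliberately listed out of order.
timeline = [
    Observation(datetime(2026, 3, 15, 16, 15), "host metrics",
                "unknown process found on two servers"),
    Observation(datetime(2026, 2, 24, 3, 40), "auth logs",
                "low-volume failed logins"),
    Observation(datetime(2026, 3, 15, 15, 2), "alerting",
                "CPU alert fires"),
]

for start, end in find_gaps(timeline, threshold=timedelta(days=1)):
    print(f"No evidence between {start:%Y-%m-%d %H:%M} and {end:%Y-%m-%d %H:%M}")
```

A gap in this list is not a failure of the investigation. It marks where to look next while decisions go ahead on what is already known.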
Driver 3: Recovery — Restoring a State vs. Disrupting a Chain
The goal is clear with outages: Get the status page green, stay within SLO, and restore the system to operate as expected, so people can get on with getting things done. Find the cause, fix it, confirm recovery. Done.
But, the goal is different for breaches: It’s to disrupt a chain of steps the intruder has been executing — at the right points. To contain, eradicate, and only then restore systems. Otherwise, you haven’t actually stopped anyone — they’re free to regroup and come back, potentially using the same approach that got them in the first time. The breach isn’t resolved.
These goals can conflict. Restoring the system doesn’t necessarily disrupt the intrusion. Disrupting the intrusion doesn’t necessarily restore the system.
What this drives: Working towards different definitions of “recovery” creates significant friction — especially when those differences aren’t acknowledged, which is often the case.
What this means for you: You’re not working towards a shared goal until you’ve said it out loud. There will be trade-offs, and someone needs to own the call. The best time to figure out all these details is during incident response training, not during the incident itself.
Driver 4: Response — Effective vs. Defensible
In resolving an outage, it’s about outcomes: Systems work as expected. Whether that’s internal SLOs, or contractual SLAs — the focus is on getting there. How you run your incident response, how you learn from it — that’s largely up to you.
Breach obligations result in a lot more emphasis on the process. Regulations define what constitutes a breach, who must be notified, and when. Contracts — including cyber insurance — may dictate specific actions, like hiring a particular forensics firm. Being able to defensibly demonstrate that you acted appropriately matters as much as the outcome itself.
The outcome still matters. But the weight shifts from outcome to process — and that shift drives everything else.
What this drives: What can feel like rigid, industry-standard processes. Tight control on communications. Legal in the room from the start — because regulations, contracts, and defensibility demand it.
What this means for you: Your actions need to be defensible under pressure. It’s not possible to prevent all damage, and accusations of not doing the right thing at the right time have real consequences, even if you justifiably took the best course of action.
Does this result in trade-offs between effective and defensible response, or does one help the other? I’ll leave that as an exercise for the reader. Share your experiences; I’d love to hear about them.
Driver 5: Core Principles — 99.999% vs. CIA
Site reliability is built around 99.999%. In security framing, that’s Integrity and Availability — keeping systems online and working as expected. Security adds one more principle: Confidentiality. Together, the CIA triad.
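For a sense of scale, an availability target translates directly into a downtime budget. Here is a minimal sketch of the arithmetic, assuming a 365-day year; the helper name `downtime_budget_minutes` is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime (in minutes) allowed by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_minutes * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    budget = downtime_budget_minutes(target)
    print(f"{target}% -> {budget:.2f} minutes of downtime per year")
# 99.999% works out to roughly 5.26 minutes per year
```

At five nines, the whole year's budget is shorter than most incident calls, which is part of why restoring service quickly is such a strong instinct.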
In practice, security teams care about uptime as much as SREs. And, SREs care about confidentiality — we’re all working on security every day, just through different lenses. So, what’s the actual difference?
It’s the focus of our jobs. SREs dedicate a lot of effort to uptime and system integrity. Security teams maximise one more thing — you’ve guessed it — confidentiality. That one difference shapes the way we work: Open by default — collaborative, transparent, all-hands — or closed by default — confidential, restricted, need-to-know. Think private channels instead of public. Tactical groups instead of broad collaboration.
What this drives: Which is the right default? That tension — between wanting more openness, collaboration, and transparency, or minimising exposure by keeping things confidential — is at the heart of how SRE and security teams experience incidents differently.
What this means for you: When things go quiet and access gets restricted during a breach, it’s not a lack of trust — it’s confidentiality as a default. Knowing that helps you work within it rather than against it.
What’s Next?
These five drivers (adversarial root cause, unknown timelines, competing definitions of recovery, defensibility over effectiveness, and confidentiality by default) are behind every moment of friction that comes from using the outage playbook to respond to security incidents. Naming them is the first step.
In Part 2, we’ll get into the practical differences: The incident response process, how communication changes, and which familiar words will mean different things in the context of breaches.
If you’ve been through a security incident and recognised the friction — share this with the people who were in that room with you. Half the battle is having shared vocabulary for what happened and why.





