
Incident management roles define who does what during an outage. The five core roles are Incident Commander, Incident Communications, Incident Triage (Tech Lead), Incident Documentation (Scribe), and Incident Liaison Owner (LNO). Small teams combine roles; larger teams separate them. The right structure depends on your headcount, not on copying what a 500-person org does.
Most guides on incident management roles describe a full roster and assume you have the headcount to fill it. You probably do not, and that is fine. The real question is not "what are all the roles?" but "which roles does my team need right now, and when do I add the next one?"
This article maps the five core incident management roles to the four growth stages most SRE and DevOps teams move through. By the end, you will know exactly which roles to combine at 3-5 engineers, when to split them at 8-15, what to formalise at 20-50, and when a Liaison Owner finally earns its place at 50+. For the complete practitioner-level guide to performing each role, including recommended behaviours, anti-patterns, and real incident stories, download the Uptime Labs Incident Management Roles and Responsibilities guide.
The 5 Core Incident Management Roles: Definitions and Summary
Before covering when to add each role, it helps to know what each one actually does. These definitions follow the taxonomy developed by Morgan Collins, former Incident Management Architect at Salesforce, in the Uptime Labs Roles and Responsibilities framework.
1. Incident Commander (IC)
The Incident Commander is the individual accountable for the overall management of an emergency response. This person leads the response team, sets immediate objectives, and manages the deployment of resources to resolve incidents efficiently. The IC does not touch the code. Their job is to hold the 30,000-foot view: directing people, sequencing decisions, and keeping the response moving. One important tension to understand early: the IC must have genuine organisational authority to match their responsibility. When the two are mismatched, the result is sometimes called the responsibility-authority double bind: the IC is accountable for the outcome but lacks the authority to make critical decisions under pressure. Without both, the role becomes a title without teeth.
2. Incident Communications
The Incident Communications role owns all stakeholder-facing messaging during an incident. This means status page posts, internal stakeholder messages, and anything that leaves the incident channel. The Communications Lead provides regular updates to stakeholders and acts as a point of contact for incoming communications. Separating this role from the IC is what stops the IC from getting pulled into a constant stream of "any update?" messages from executives and customer success teams.
3. Incident Triage (Tech Lead)
The Triage role, sometimes called the Tech Lead, owns the technical investigation. Triage works alongside the IC but in a distinct lane, applying operational tools to scope the problem, form working theories, and execute system changes. The operations team should be the only group modifying the system during an incident, keeping technical execution cleanly separated from the IC's coordination function.
4. Incident Documentation (Scribe)
The Scribe captures a real-time record of the incident: what was tried, when, by whom, and what the result was. This is not a passive note-taking job. A good Scribe surfaces timeline gaps, flags unanswered hypotheses, and produces the raw material that makes a post-incident review worth running. Without a dedicated Scribe, this record either falls on the IC (who cannot afford the distraction) or does not exist at all.
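As a rough illustration of the record described above, the sketch below models one timeline entry (what was tried, when, by whom, and the result) and lists the entries that still have no recorded result. The class name, field names, and sample data are hypothetical and not taken from the Uptime Labs guide.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class TimelineEntry:
    """One line of the Scribe's record (field names are illustrative)."""
    at: datetime                  # when the action happened
    actor: str                    # who did it
    action: str                   # what was tried
    result: Optional[str] = None  # outcome; None means not yet recorded


def open_items(timeline: list[TimelineEntry]) -> list[TimelineEntry]:
    """Return entries with no recorded result, oldest first."""
    return sorted((e for e in timeline if e.result is None), key=lambda e: e.at)


# Hypothetical sample data for illustration only.
timeline = [
    TimelineEntry(datetime(2024, 1, 6, 10, 5, tzinfo=timezone.utc),
                  "alex", "restarted cache node", "no change"),
    TimelineEntry(datetime(2024, 1, 6, 10, 20, tzinfo=timezone.utc),
                  "sam", "rolled back deploy"),
]

for entry in open_items(timeline):
    print(f"{entry.at:%H:%M} {entry.actor}: {entry.action} (result not recorded)")
```

A list like this is one way a Scribe could surface timeline gaps to the IC during the incident rather than discovering them in the post-incident review.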
5. Incident Liaison Owner (LNO)
The LNO is deployed when an incident generates enough parallel working groups that the IC can no longer coordinate them all from the primary communications channel. The LNO acts as the point of contact between Incident Command and a separated swimlane, filtering requests between the two and monitoring for miscommunications or work collisions across channels. This role only appears at scale and requires both strong attention management and enough technical knowledge to track what is happening on both sides of the separation.
Why Clear Incident Management Roles Matter
In a high-pressure enterprise outage, "who does what" is often more important than "what is broken." Without clearly defined roles, you end up with "heroics," where senior engineers burn out trying to fix, communicate, and manage simultaneously. One person is triaging the technical problem, posting status updates, fielding questions from leadership, and trying to keep a timeline of what has been tried. That is not a role. That is four roles collapsed into one person under pressure.
The heroics model has a ceiling. It works until your team grows, your systems get more complex, or your most experienced engineer leaves. Formulating rules about how to communicate and coordinate your efforts before disaster strikes allows your team to concentrate on resolving an incident when it occurs. The alternative is that every incident becomes a scramble where coordination overhead eats into resolution time.
Role separation is not bureaucracy. It is the mechanism that lets your engineers focus on the problem instead of managing the process at the same time. When roles are clearly defined and practised, incidents resolve faster, not because of any single structural change but because the people responding can actually think. For more on the structural factors behind on-call pressure, read How to Reduce On-Call Burnout in SRE Teams.
When to Add Each Incident Management Role as Your Team Scales
Incident Management Roles by Team Size: Quick Summary
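Here is a quick reference for role allocation by team size:
- Tier 1 (3-5 engineers): one named person holds every role.
- Tier 2 (8-15 engineers): a dedicated IC, a Triage lead (Tech Lead), and a rotating Scribe.
- Tier 3 (20-50 engineers): add a dedicated Incident Communications role and a formalised Scribe.
- Tier 4 (50+ engineers): add the Incident Liaison Owner (LNO).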
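The tiers in this article can also be sketched as a small lookup. This is a minimal illustration, not a prescribed tool: the function name is hypothetical, and two boundary choices are assumptions rather than anything the tiers state. Team sizes that fall between tiers (for example 6-7 or 16-19 engineers) stay in the lower tier, and exactly 50 engineers counts as Tier 4.

```python
def staffed_roles(engineers: int) -> tuple[int, list[str]]:
    """Return (tier, roles staffed by separate people) for an on-call team size.

    Assumptions: in-between sizes stay in the lower tier; 50 counts as Tier 4.
    """
    if engineers < 1:
        raise ValueError("team size must be at least 1")
    if engineers >= 50:
        return 4, ["IC", "Communications", "Triage", "Scribe", "LNO"]
    if engineers >= 20:
        return 3, ["IC", "Communications", "Triage", "Scribe"]
    if engineers >= 8:
        # The Scribe function rotates among whoever is not on primary Triage.
        return 2, ["IC", "Triage", "Scribe"]
    # Tier 1: one named IC also covers Triage, Scribe, and status updates.
    return 1, ["IC"]


for size in (4, 10, 30, 80):
    tier, roles = staffed_roles(size)
    print(f"{size} engineers -> Tier {tier}: {', '.join(roles)}")
```

Headcount is only a starting point here. The coordination signals described under each tier are what should trigger the move to the next one.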
Tier 1 (3-5 Engineers): One Person Holds Everything
At this stage, your on-call team is small enough that formal role separation is impractical. The engineer who picks up the alert is the IC, the Triage lead, the Scribe, and often the person posting the status update.
What to do:
- Designate one person as the named IC for each incident, even informally. The discipline of naming someone matters more than the title.
- Use a shared incident channel from the start. Even if it is just you and one colleague, a channel creates a searchable record.
- Accept that roles will collapse into one person. The goal here is not structure; it is the habit of declaring incidents and writing brief timelines.
What to watch for:
The risk at this stage is not role confusion; it is no structure at all. If every incident is handled differently, you will have nothing to build on when you scale. Start with a consistent declaration practice, even if the rest is improvised.
The signal that you have outgrown this stage is when incidents start taking longer not because of technical complexity but because one person is context-switching between investigating, communicating, and coordinating.
Tier 2 (8-15 Engineers): Split IC from Triage
This is the most important transition in incident management role design. Once you have 8 or more engineers, you have enough people to separate the person coordinating the response from the person doing the technical work. These two jobs are cognitively incompatible when combined under pressure.
What to add:
- A dedicated IC who does not touch the terminal during the incident.
- A Triage lead (Tech Lead) who owns the investigation and any system changes.
- A basic Scribe function, which can rotate between whoever is not on primary Triage.
At this tier, IC rotation also becomes practical. Long incidents can have multiple commanders working in shifts; we recommend rotations of 4-6 hours on major incidents. Beyond that window, cognitive fatigue degrades decision quality in ways that are hard to self-assess. At 8–15 engineers, you have enough people to make this rotation real.
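To make the rotation concrete, here is a minimal sketch that splits a planning window into IC shifts and assigns commanders round-robin. The function name, the commander names, and the idea of planning against an estimated end time are assumptions for illustration; the default shift length of 4 hours sits at the short end of the 4-6 hour recommendation.

```python
from datetime import datetime, timedelta, timezone


def ic_shifts(start, end, commanders, shift_hours=4):
    """Return (commander, shift_start, shift_end) tuples covering start..end."""
    if not commanders:
        raise ValueError("at least one commander is required")
    if shift_hours <= 0:
        raise ValueError("shift_hours must be positive")
    step = timedelta(hours=shift_hours)
    shifts = []
    cursor = start
    i = 0
    while cursor < end:
        shift_end = min(cursor + step, end)  # the final shift may be shorter
        shifts.append((commanders[i % len(commanders)], cursor, shift_end))
        cursor = shift_end
        i += 1
    return shifts


# Hypothetical 10-hour planning window with two commanders.
start = datetime(2024, 1, 6, 8, 0, tzinfo=timezone.utc)
end = start + timedelta(hours=10)
plan = ic_shifts(start, end, ["ic_a", "ic_b"])

for name, begins, finishes in plan:
    print(f"{name}: {begins:%H:%M} to {finishes:%H:%M} UTC")
```

In a live incident the end time is unknown, so a plan like this would be extended shift by shift, with each handoff announced in the incident channel.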
The split between IC and Triage also changes how escalation decisions work. When one person holds both roles, escalation is an internal judgment call. When they are separated, escalation becomes a coordination problem between two people with different views of the incident. Getting that handoff right is one of the first real tests of a scaling incident process. For more on the judgment calls involved, see Too Soon or Too Late: The Incident Escalation Dilemma.
A pattern from the field:
The Uptime Labs Roles and Responsibilities guide includes a story about a junior Triage engineer on a weekend shift who encountered an unfamiliar failure pattern in a legacy environment. The engineer had the technical ability but was new to the organisation and couldn't recognise what they were looking at. Rather than page their senior mentor, who was on vacation, they continued investigating alone. The incident extended until a routine comms update surfaced the confusion and the mentor called in. It is a useful read for any Head of SRE thinking about how knowledge gaps and hesitancy to escalate can compound when the Triage role is new to someone.
What to watch for:
The signal that you have outgrown this tier is when the IC starts spending more time fielding stakeholder questions and drafting status updates than actually coordinating the response. That is the point where Communications needs to become its own function.
Tier 3 (20-50 Engineers): Formalise Communications and Documentation
At 20–50 engineers, your incidents start crossing team boundaries. A Sev-1 now involves the platform team, the product team, a customer success manager, and possibly a third-party vendor. The IC cannot manage all of that and still run the response effectively.
What to add:
- A dedicated Incident Communications role, separated from the IC entirely.
- A formalised Scribe function, with a template and a named person on every major incident.
- Documented on-call structure and escalation paths, not just tribal knowledge.
The communications breakdown problem:
A specific failure mode appears at this scale: when incidents span multiple teams and time zones, each group develops its own update rhythm, its own channel, and its own version of what is happening. The Incident Communications role exists to collapse those parallel streams into one authoritative source of truth. Without it, the IC ends up fielding the same questions from different stakeholders instead of coordinating the response.
The Uptime Labs Roles and Responsibilities guide includes a detailed account of how this plays out in practice: a 16-hour incident where swimlane labelling fragmented (teams disagreeing on whether they were working swimlane 4b, 4b2, or swimlane 40) while timezone alignment collapsed simultaneously across UTC, PST and EST. The result was that nobody on the bridge could agree on what had been done, what was in progress, or what time anything was expected to complete. It is worth reading before you encounter the same problem live.
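One common mitigation for the timezone half of that problem is to normalise every reported time to UTC before it enters the timeline. The sketch below is an illustration under stated assumptions, not a description of what the team in the guide did: the function name and sample updates are hypothetical, and the fixed offsets cover standard time only (a real tool would need proper timezone data to handle daylight saving).

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for the zones named above. Assumption: standard time only.
ZONES = {
    "UTC": timezone.utc,
    "PST": timezone(timedelta(hours=-8), "PST"),
    "EST": timezone(timedelta(hours=-5), "EST"),
}


def to_utc(local_time: str, zone: str) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM' as reported in a named zone and return UTC."""
    naive = datetime.strptime(local_time, "%Y-%m-%d %H:%M")
    return naive.replace(tzinfo=ZONES[zone]).astimezone(timezone.utc)


# Hypothetical updates as they might arrive from three groups.
updates = [
    ("2024-01-06 09:30", "EST", "rollback expected complete"),
    ("2024-01-06 06:45", "PST", "database failover started"),
    ("2024-01-06 14:10", "UTC", "vendor joined the bridge"),
]

# Sort by the normalised time so the bridge sees one ordered timeline.
for when, note in sorted((to_utc(t, z), n) for t, z, n in updates):
    print(f"{when:%Y-%m-%d %H:%M}Z  {note}")
```

Agreeing on a single reference zone before the incident is a process decision; the code only enforces it once the decision has been made.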
On-call structure at this tier:
Incident roles do not follow reporting chains; they are based on knowledge and incident context. This is worth stating explicitly at the 20–50 tier, because it becomes a point of friction. Engineers who are more senior in the org chart will sometimes resist being directed by a less senior IC. The structure only works if the team has internalised that incident authority is situational, not hierarchical. This is a training problem as much as a process problem, and it compounds the on-call pressure that already builds at this team size.
What to watch for:
The signal that you have outgrown this tier is when the IC can no longer track all active swimlanes from the primary communications channel. When working groups need to break out into separate channels to avoid bottlenecking each other, you need someone bridging those channels back to Incident Command. That is the LNO role.
Tier 4 (50+ Engineers): Add the Liaison Owner
The Incident Liaison Owner (LNO) is the last role to add, and it earns its place only when your incidents are generating enough parallel swimlanes that the IC can no longer coordinate them all from the primary communications channel. When working groups break out into separate channels, someone needs to bridge those conversations back to Incident Command. That is the LNO.
What to add:
- A named LNO for every Sev-1 and Sev-2 incident where working groups are separated from the primary channel.
- Clear handoff protocols between the LNO and the IC so that requests between command and the working group are filtered and prioritised rather than passed through raw.
- A defined boundary: the LNO represents Incident Command on the separated channel and monitors for miscommunications or work collisions across swimlanes.
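The first rule in the list above can be expressed as a simple check: on a Sev-1 or Sev-2, every channel separated from the primary channel should have a named LNO. This is a minimal sketch; the function name, the severity labels as strings, and the channel names are hypothetical.

```python
from typing import Optional

# Severities for which separated working groups get a named LNO.
LNO_SEVERITIES = {"sev-1", "sev-2"}


def channels_missing_lno(severity: str,
                         lno_by_channel: dict[str, Optional[str]]) -> list[str]:
    """Return separated channels that still have no named LNO.

    lno_by_channel maps each breakout channel to its LNO, or None if unassigned.
    """
    if severity.lower() not in LNO_SEVERITIES:
        return []
    return sorted(ch for ch, lno in lno_by_channel.items() if not lno)


# Hypothetical breakout channels for illustration.
breakouts = {"#inc-db": "priya", "#inc-network": None}
print(channels_missing_lno("Sev-1", breakouts))
```

A check like this could run whenever a breakout channel is opened, so the IC sees an unstaffed swimlane before it becomes a coordination gap.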
A pattern from the field:
The Uptime Labs Roles and Responsibilities guide includes a story about an LNO who was asked to bring a separated working group back to the primary channel. The team pushed back because executives on the main bridge had been interrupting their investigation. The LNO had to navigate between the IC's coordination needs and the working group's legitimate concerns, ultimately working with the IC to move the executives to a dedicated briefing channel before the team would rejoin. It is the kind of political and coordination challenge that only surfaces at scale, and it requires both interpersonal skill and organisational authority to resolve.
What full staffing looks like:
At 50+ engineers, a fully staffed major incident response has five active roles: IC, Communications, Triage (Tech Lead), Scribe, and LNO. A clear separation of responsibilities allows individuals more autonomy than they might otherwise have, since they need not second-guess their colleagues. Each person knows their lane. The IC is not answering Slack messages from the CTO. The Triage lead is not writing status updates. The Scribe is not making technical decisions.
3 Mistakes Teams Make When Scaling Incident Management Roles
1. Defining all 5 roles before the team can fill them.
The fully staffed model described in Tier 4 is an end state, not a starting point. Imposing it on a team of 10 creates ghost roles that nobody owns. In practice, the IC ends up absorbing every unfilled role by default, which is worse than not having defined them at all because now there is an expectation of coverage that does not exist. Start with the roles your team can actually staff and add the next one when the coordination signals tell you to.
2. Splitting roles without building competency in them.
Adding "Incident Communications" to someone's on-call rotation is a structural decision. Whether that person can actually draft clear stakeholder updates under pressure, filter what information is appropriate for each audience, and maintain messaging cadence across a multi-hour incident is a competency problem. The same applies to every role in the framework. An LNO who has never managed attention across two channels simultaneously will become the bottleneck they were designed to prevent. Roles without trained people behind them are titles on a spreadsheet.
3. Treating the role structure as permanent.
The response structure that works at 15 engineers will break at 30. Teams that do not revisit their role model as they scale end up with the same coordination bottlenecks they tried to solve by formalising roles in the first place. The tiers above are a framework, not a checklist. Review your role structure after any incident where coordination was the primary source of delay, and be willing to split or consolidate roles based on what you find.
How Uptime Labs Trains Teams for Each Incident Management Role
Knowing which roles to staff is the structural problem. Knowing whether your people can actually perform those roles under pressure is the training problem. They are different problems, and most teams only solve the first one.
The Uptime Labs Roles and Responsibilities guide frames the training objective clearly: train for capabilities and competencies, and simulate the experience rather than the scenario. That means the goal of incident training is not to rehearse a specific outage type. It is to build the underlying competencies that transfer across all incident types: Identify Scope, Incident Mechanics, Internal Communications, External Communications, and Command Incident.
Uptime Labs maps these five competency categories to defined proficiency levels, from Practitioner through Expert, so you can assess where each engineer sits and track progression over time. An engineer who scores at Practitioner level on Command Incident is not ready to be the IC on a cross-team Sev-1. An engineer at Expert level on Identify Scope can be trusted to run Triage on an unfamiliar system. That precision is what separates measurable readiness from guesswork.
Explore Uptime Labs incident response training to see how simulations build role-specific competency, or run a free incident simulation and test how immersive it is for yourself.
FAQs:
What are the core incident management roles in SRE?
The five core incident management roles in SRE are: Incident Commander, Incident Communications, Incident Triage (Tech Lead), Incident Documentation (Scribe), and Incident Liaison Owner (LNO). Small teams spread these roles across fewer people; larger teams separate them. The IC is the single accountable decision-maker in all cases.
When should I separate the IC from the Triage lead?
Separate the IC from the Triage lead when your team reaches 8 or more engineers. Below that threshold, one person can cover both. Above it, combining the roles creates cognitive overload that extends resolution time. The IC should not be modifying systems during an incident; the Triage lead should not be managing stakeholders.
What is the incident commander role responsible for?
The Incident Commander is responsible for overall direction of the response: setting objectives, assigning roles, making escalation decisions, and maintaining situational awareness. Without an Incident Commander, incident response can quickly become chaotic. Multiple people may try to lead at once, critical tasks may be missed, and communication channels can break down, prolonging the incident unnecessarily. The IC does not perform technical remediation.
What is a Liaison Owner (LNO) in incident management?
The LNO is the point of contact between Incident Command and working groups that have been separated from the primary communications channel. The role appears when incidents are complex enough to require parallel swimlanes that the IC cannot coordinate directly. LNOs monitor for miscommunications and work collisions across channels. It is the last role to add as a team scales and is typically only needed at 50+ engineers handling Sev-1 and Sev-2 events.





