
A runbook is a step-by-step technical guide for resolving a specific, known failure. A playbook is a higher-level document that defines roles, communication protocols, and decision-making frameworks for a category of incident. Both are necessary, but neither prepares your team for the incidents that go off-script.
Every SRE team eventually asks the same question: "Should this be a runbook or a playbook?" The terms get used interchangeably, and that confusion creates real gaps in how teams respond when things go wrong. This article clarifies the distinction, explains when each document earns its place, and covers the critical limitation that neither document addresses: the novel incident your team has never seen before.
What Is a Runbook?
A runbook is a step-by-step technical guide for completing a specific operational task or resolving a specific failure mode. It provides procedures for known, repeatable operations: the database restart, the certificate rotation, the cache purge.
The defining characteristic of a runbook is its specificity. A good runbook for a database connection failure tells you exactly which commands to run, what healthy output looks like, and what to do if a step fails. The value is in removing diagnostic guesswork so the on-call engineer can act without improvising.
Runbooks originated in operations teams managing infrastructure, where repeatable procedures like restarts, failovers, and log purging were frequent and time-sensitive. In modern SRE practice, they live alongside alerts: when a PagerDuty alert fires, the linked runbook tells the on-call engineer exactly where to start.
What a runbook typically contains:
- A description of the alert or failure condition that triggers its use
- Step-by-step diagnostic commands with expected outputs
- Mitigation steps, ordered by likelihood of success
- Rollback instructions
- Escalation criteria: when to stop following the runbook and call for help
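One way to keep those sections honest is to treat a runbook as structured data rather than free text. The sketch below is illustrative only: the `Runbook` and `DiagnosticStep` classes, the field names, and the placeholder strings are assumptions made for this example, not a format prescribed by any tool.

```python
from dataclasses import dataclass, field


@dataclass
class DiagnosticStep:
    command: str          # what the on-call engineer runs
    expected_output: str  # what healthy output looks like


@dataclass
class Runbook:
    trigger: str  # alert or failure condition that triggers use
    diagnostics: list[DiagnosticStep] = field(default_factory=list)
    mitigations: list[str] = field(default_factory=list)  # ordered by likelihood of success
    rollback: list[str] = field(default_factory=list)
    escalation_criteria: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of sections that are still empty."""
        sections = {
            "diagnostics": self.diagnostics,
            "mitigations": self.mitigations,
            "rollback": self.rollback,
            "escalation_criteria": self.escalation_criteria,
        }
        return [name for name, content in sections.items() if not content]


# Hypothetical draft: the placeholder strings stand in for real commands.
draft = Runbook(
    trigger="Database connection failures",
    diagnostics=[DiagnosticStep("<connection check command>", "<healthy output>")],
    mitigations=["<most likely fix>", "<next most likely fix>"],
    escalation_criteria=["<condition for calling for help>"],
)

print(draft.missing_sections())  # ['rollback']
```

A check like `missing_sections()` could run in review or CI so that a runbook without rollback instructions is caught before it is linked to an alert.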
Runbooks need updating after every post-incident review that reveals a gap. The incident response runbook guide covers the full lifecycle of building and maintaining runbooks that hold up under pressure.
What Is a Playbook?
A playbook is a strategic coordination document that defines how a team responds to a category of incident. Where a runbook answers "how do I fix this specific thing?", a playbook answers "who does what, who decides, and who gets told?"
Playbooks outline roles, communication plans, and decision-making frameworks. A written playbook removes ambiguity when stakes are highest and delivers consistency regardless of who is on call. The canonical example in enterprise engineering is the ITIL Major Incident Management playbook, which Morgan Collins references in the Uptime Labs Incident Management Roles and Responsibilities framework as a foundational model for modern incident response.
What a major incident playbook typically contains:
- Severity definitions and the criteria for declaring a major incident
- Roles and responsibilities (Incident Commander, Communications Lead, Technical Lead, Scribe)
- Escalation paths and stakeholder notification schedules
- Communication templates for internal and external updates
- Decision frameworks for common judgment calls (when to roll back, when to failover, when to engage a vendor)
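The same contents can be sketched as a small data model. This is a minimal illustration under stated assumptions: the `Playbook` class, its field names, the 30-minute update interval, and the assigned names are all hypothetical, chosen only to show how the sections relate.

```python
from dataclasses import dataclass, field

# The four roles named in the list above.
REQUIRED_ROLES = (
    "Incident Commander",
    "Communications Lead",
    "Technical Lead",
    "Scribe",
)


@dataclass
class Playbook:
    incident_category: str
    severity_criteria: dict[str, str]        # severity -> when to declare it
    update_interval_minutes: dict[str, int]  # severity -> stakeholder update cadence
    role_assignments: dict[str, str] = field(default_factory=dict)  # role -> person
    linked_runbooks: list[str] = field(default_factory=list)        # referenced, not embedded

    def unfilled_roles(self) -> list[str]:
        """Return required roles that nobody currently holds."""
        return [role for role in REQUIRED_ROLES if not self.role_assignments.get(role)]


# Hypothetical example values, for illustration only.
major_incident = Playbook(
    incident_category="Major incident",
    severity_criteria={"Sev-1": "<criteria for declaring a major incident>"},
    update_interval_minutes={"Sev-1": 30},
    role_assignments={"Incident Commander": "on-call IC", "Scribe": "on-call scribe"},
)

print(major_incident.unfilled_roles())  # ['Communications Lead', 'Technical Lead']
```

Note that the playbook holds references to runbooks (`linked_runbooks`) rather than their steps, which keeps the coordination document and the technical procedures separate.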
Runbook vs Playbook: The Key Differences
The simplest way to hold the distinction is this: runbooks fix the technology, playbooks coordinate the humans.
In practice, playbooks contain or reference runbooks. A disaster recovery playbook might list every system that needs to come back online, who owns each system, and in what order. The runbooks sit underneath, providing the step-by-step procedure for restoring each individual service. The playbook is the container. The runbooks are the tools inside it.
When Should You Use a Runbook vs a Playbook?
The choice depends on what kind of problem you are solving. A technical failure with a known fix is a runbook problem. A coordination challenge involving multiple teams, stakeholders, and decisions is a playbook problem. Most major incidents involve both.
Use a runbook when:
- A specific alert has fired and the fix is known
- You are onboarding a junior engineer to the on-call rotation and need them to handle common failures independently
- You are automating a response and need a machine-readable procedure
- You are documenting a fix discovered during a post-incident review so it is reusable next time
Use a playbook when:
- A major incident has been declared and multiple teams need to coordinate
- You need to define who has authority to make decisions under pressure
- You are preparing for a known high-risk event (a product launch, a major sale, a migration)
- You need to deliver consistent stakeholder communication regardless of who is on call
Use both when:
- A Sev-1 is in progress. The playbook governs how the incident is run. Individual runbooks are pulled by the technical team as specific failure modes are identified.
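The selection rules above reduce to two questions: is the fix known, and does the response need coordination across teams? A minimal sketch of that logic follows; the function and parameter names are hypothetical and exist only for this example.

```python
def documents_to_use(fix_is_known: bool, needs_coordination: bool) -> list[str]:
    """Map the two questions above to the documents that apply."""
    documents = []
    if needs_coordination:
        documents.append("playbook")  # governs how the incident is run
    if fix_is_known:
        documents.append("runbook")   # provides the technical procedure
    return documents


print(documents_to_use(fix_is_known=True, needs_coordination=False))   # ['runbook']
print(documents_to_use(fix_is_known=False, needs_coordination=True))   # ['playbook']
print(documents_to_use(fix_is_known=True, needs_coordination=True))    # ['playbook', 'runbook']
print(documents_to_use(fix_is_known=False, needs_coordination=False))  # []
```

The empty result in the last case is the interesting one: an unfamiliar failure that no document covers.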
For a deeper look at the roles that appear in both documents, the incident management roles guide covers when each role is needed and what it is responsible for.
How Runbooks and Playbooks Work Together in Practice
Consider a payment service outage on a high-traffic day. The sequence looks like this:
- An alert fires. The on-call engineer opens the linked runbook for "Payment gateway timeout errors."
- The runbook steps do not resolve the issue. The engineer escalates.
- The Incident Commander declares a Sev-1. The major incident playbook activates.
- The playbook defines who joins the bridge call, who handles external communications, and who has authority to trigger a failover.
- Inside the bridge, individual engineers pull additional runbooks as they investigate specific subsystems.
- The playbook governs the cadence of stakeholder updates throughout. The runbooks govern the technical investigation.
This is the intended relationship: the playbook sets the structure of the response, and the runbooks supply the technical procedures within it.
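The first steps of that sequence can be sketched as a simple escalation flow. This is an illustration, not an implementation of any alerting tool: `handle_alert`, the `runbooks` mapping, and the stand-in runbook function are assumptions made for the example.

```python
from typing import Callable


def handle_alert(alert: str, runbooks: dict[str, Callable[[], bool]]) -> list[str]:
    """Return a timeline of the response to a single alert.

    Each runbook is modelled as a function that returns True if
    following its steps resolved the issue.
    """
    timeline = [f"alert fired: {alert}"]
    runbook = runbooks.get(alert)
    if runbook is None:
        timeline.append("no linked runbook; engineer escalates")
    elif runbook():
        timeline.append("runbook resolved the issue")
        return timeline
    else:
        timeline.append("runbook did not resolve the issue; engineer escalates")
    timeline.append("Incident Commander declares Sev-1; major incident playbook activates")
    return timeline


def gateway_timeout_runbook() -> bool:
    # Stand-in for the real steps; here they fail to fix the problem.
    return False


for entry in handle_alert(
    "Payment gateway timeout errors",
    {"Payment gateway timeout errors": gateway_timeout_runbook},
):
    print(entry)
```

The point of the sketch is the hand-off: the runbook is tried first, and the playbook takes over only once the technical procedure runs out.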
Why Runbooks and Playbooks Are Not Enough on Their Own
Both runbooks and playbooks are built on the same assumption: the incident you are facing is one you have seen before, or at least one you anticipated. A runbook for a database connection failure is only useful if the problem is a database connection failure. A playbook for a Sev-1 is only useful if the incident behaves the way you expected a Sev-1 to behave.
Novel incidents, the ones involving cascading failures across multiple services, unexpected interactions between systems, or failure modes you have never documented, do not follow the script. When that happens, your team needs something no document provides: the judgment to form a working theory under pressure, the communication instincts to keep a bridge call productive, and the confidence to make a call with incomplete information.
Stuart Cheverton, Product & Technology Lead at Uptime Labs, makes this case directly: when incidents hit genuine surprise territory, runbooks and playbooks become less effective, improvisation becomes more important than following a plan, and psychological safety matters more than ever. Documentation captures what your team already knows. It does not build the adaptive capacity to handle what they have never encountered.
This is the gap that incident response training fills.
How Incident Response Simulations Train What Documentation Cannot
Uptime Labs provides that training through incident response simulations. The simulations are not tabletop exercises where teams talk through a scenario. They are hands-on, technically accurate environments where engineers diagnose real failures using real tooling, under the same time pressure and stakeholder noise they would face in a live incident. The skills that build in simulation (triage discipline, communication clarity, working-theory formation) carry directly into the incidents that no runbook anticipated.
Teams also discover where their documentation breaks down. A runbook that reads clearly in a planning session may fall apart when an engineer is following it at 3 AM with a dozen Slack messages arriving in parallel. A playbook's escalation path may stall when the person it names is unavailable. Simulation surfaces those gaps before a real incident does, and the findings feed directly back into better runbooks and better playbooks.
If your team has solid documentation but has never practiced the scenarios those documents don't cover, explore Uptime Labs' incident simulations or read the complete guide to incident response training to understand what a structured training programme looks like.
FAQs: Runbook vs Playbook
What is the difference between a runbook and a playbook in incident response?
A runbook is a step-by-step technical guide for resolving a specific, known failure mode. A playbook is a strategic document that defines roles, communication protocols, and decision-making frameworks for a category of incident. Runbooks fix the technology. Playbooks coordinate the people.
Can a runbook and a playbook be the same document?
They can be, but it creates confusion in practice. Runbooks are written for engineers executing technical steps. Playbooks are written for Incident Commanders making coordination decisions. Mixing them produces a document that serves neither audience well. Keep them separate and reference runbooks from within the relevant playbook sections.
How does a runbook relate to MTTR?
A well-maintained runbook directly reduces MTTR by removing the diagnostic guesswork from known failure modes. Instead of an engineer starting from scratch, they follow a proven path to resolution. The caveat is that runbooks only help with failures you have already seen and documented. Novel failures require judgment that no runbook provides.
When should I create a new runbook vs updating an existing playbook?
Create a new runbook when a post-incident review reveals a repeatable failure mode that your team had to figure out from scratch. Update your playbook when the incident exposed a gap in coordination, communication, or decision authority rather than a gap in technical procedure.
How does Uptime Labs help teams get more value from their runbooks and playbooks?
Uptime Labs runs realistic incident simulations that put engineers into the scenarios their runbooks and playbooks are designed for. Teams discover which runbooks are incomplete, which playbook steps break down under pressure, and where the judgment gaps are. Those findings feed directly back into better documentation and better-prepared responders.





