What Is Chaos Engineering? How It Works & Core Principles

Edward Page (Community Contributor)

March 31, 2026

Chaos engineering is the discipline of deliberately injecting controlled failures - such as server shutdowns, network latency, and dependency outages - into production or pre-production systems. The goal is to discover weaknesses before real outages expose them, testing whether a system can maintain acceptable performance under turbulent, real-world conditions.

Modern software systems are too complex for any team to fully predict. With hundreds of microservices, third-party APIs, and multi-region cloud deployments, a single misconfigured timeout or a cloud provider's transient network partition isn't an edge case. It's Tuesday. Chaos engineering accepts that these failures are inevitable and shifts the focus from prevention to resilience: building systems that absorb faults and continue operating within acceptable thresholds.

Chaos engineering is a scientific practice - sometimes called resilience testing - built on hypotheses, measured observations, and incremental learning.

The Origins of Chaos Engineering

In 2008, Netflix experienced a major database corruption that halted DVD shipments for three days. The outage exposed a fundamental problem: its infrastructure had single points of failure that nobody had tested.

When Netflix committed to migrating to Amazon Web Services, things got harder. Cloud instances could be terminated at any moment. Network partitions between availability zones were a fact of life. Rather than trying to engineer all that unpredictability away, Netflix engineers built Chaos Monkey in 2010.

Chaos Monkey was built to randomly terminate production instances during business hours. The logic was simple: if instances were going to fail anyway, force engineers to design for it by default rather than as an afterthought.

Chaos Monkey proved the concept, but killing individual instances only tests one failure mode. Netflix expanded the approach into the Simian Army:

Latency Monkey injected artificial network delays between services
Conformity Monkey flagged instances that violated best practices
Chaos Gorilla simulated entire availability zone outages

The discipline progressed from "can we survive losing one server" to "can we survive losing an entire availability zone".

The practice spread beyond Netflix:

Amazon's GameDay programme introduced scheduled exercises where engineers triggered failure scenarios against production
Google's DiRT (Disaster Recovery Testing) tested both infrastructure and organisational response
principlesofchaos.org codified the methodology into a set of community-established principles
LitmusChaos became a CNCF-hosted project, signalling that the Kubernetes ecosystem considered chaos engineering a mainstream practice

What was once exclusive to tech giants became accessible to mid-market engineering teams through open-source tools and managed cloud services.

How Chaos Engineering Works

Every chaos experiment follows a structured, repeatable cycle.

1. Define steady state

Before you test how a system handles failure, you need to know what "normal" looks like. Steady state is defined through measurable outputs that reflect the customer experience:

Request latency at p50, p95, and p99
Error rates
Throughput
CPU and memory utilisation

These metrics are your control group. Focus on outputs that users actually feel rather than internal system metrics that might fluctuate without visible impact.
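As a rough illustration, a baseline like the one above could be computed from a window of request samples. This is a minimal sketch using only the Python standard library; the `RequestSample` type, the `steady_state` function, and the nearest-rank percentile method are assumptions made for this example, not the API of any particular monitoring tool.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestSample:
    """One observed request (hypothetical shape for this sketch)."""
    latency_ms: float
    ok: bool


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def steady_state(samples, window_seconds):
    """Summarise a window of samples into user-facing baseline metrics."""
    latencies = [s.latency_ms for s in samples]
    errors = sum(1 for s in samples if not s.ok)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "error_rate": errors / len(samples),
        "throughput_rps": len(samples) / window_seconds,
    }
```

In practice these numbers would usually come from your existing metrics platform; the point of the sketch is only that the baseline is a small, explicit set of values you can compare against later.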

2. Form a hypothesis

A chaos experiment starts with a specific, falsifiable prediction:

A strong hypothesis looks like this: "If the payment gateway returns 500 errors for 30 seconds, the checkout service will retry twice, fall back to queued processing, and p99 latency will increase by no more than 200ms. No customer-facing errors will surface."
A weak hypothesis looks like this: "The system should handle the payment gateway going down."

The difference is specificity. The strong version defines the failure condition, the expected response, and the acceptable performance envelope. When the experiment runs, you know exactly what to measure and what constitutes a pass or a fail.

If you can't articulate a hypothesis, that's a signal. It usually means you don't understand the system's expected degradation path well enough to test it yet - and that gap in understanding is itself a finding worth documenting.
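One way to keep a hypothesis falsifiable is to write its bounds down as data before the experiment runs. The sketch below encodes the payment gateway example from this section; the `Hypothesis` class, its field names, and the `evaluate` function are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A falsifiable prediction: the fault plus the acceptable performance envelope."""
    fault: str
    max_p99_increase_ms: float
    max_customer_errors: int


def evaluate(hypothesis, baseline_p99_ms, observed_p99_ms, customer_errors):
    """Return (passed, reasons); reasons lists every bound that was violated."""
    reasons = []
    increase = observed_p99_ms - baseline_p99_ms
    if increase > hypothesis.max_p99_increase_ms:
        reasons.append(
            f"p99 rose by {increase:.0f}ms, limit is {hypothesis.max_p99_increase_ms:.0f}ms"
        )
    if customer_errors > hypothesis.max_customer_errors:
        reasons.append(
            f"{customer_errors} customer-facing errors, limit is {hypothesis.max_customer_errors}"
        )
    return (not reasons, reasons)


# The strong hypothesis from the text, expressed as explicit bounds.
checkout = Hypothesis(
    fault="payment gateway returns 500 errors for 30 seconds",
    max_p99_increase_ms=200,
    max_customer_errors=0,
)
```

The weak hypothesis cannot be written this way at all, which is a quick check on whether a prediction is specific enough to test.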

3. Introduce a controlled fault

Inject a specific fault that mirrors a realistic failure scenario. Common categories include:

Infrastructure faults: Terminating instances or pods. Exhausting CPU, memory, or disk I/O. Simulating availability zone outages.
Network faults: Injecting latency between services. Dropping packets. Simulating DNS failures. Creating network partitions.
Application faults: Forcing dependency timeouts. Returning error responses from downstream services. Introducing clock skew. Corrupting data in transit.

The key constraint is blast radius. Start small - a single instance, a single service, a low percentage of traffic - and expand only after confirming your safety controls work and your observability captures the impact.
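To make the idea of blast radius concrete, here is a minimal application-level sketch that fails only a bounded fraction of calls to a dependency. The `FaultInjector` wrapper, the `InjectedFault` exception, and the `charge` stand-in are all assumptions for this example; dedicated chaos tools inject faults at the infrastructure or network layer instead.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real downstream failure (hypothetical name)."""


class FaultInjector:
    """Wraps a callable and fails a bounded fraction of calls while enabled."""

    def __init__(self, target, failure_rate, rng=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self._target = target
        self._failure_rate = failure_rate  # blast radius: fraction of calls affected
        self._rng = rng if rng is not None else random.Random()
        self.enabled = False  # safe default: no fault until explicitly switched on
        self.injected = 0  # count of calls that were failed on purpose

    def __call__(self, *args, **kwargs):
        if self.enabled and self._rng.random() < self._failure_rate:
            self.injected += 1
            raise InjectedFault("simulated downstream 500")
        return self._target(*args, **kwargs)


def charge(amount):
    """Stand-in for a real payment call; an assumption for this sketch."""
    return {"status": "ok", "amount": amount}
```

Starting with something like `FaultInjector(charge, failure_rate=0.01)` and raising the rate only after the safety controls have been exercised mirrors the "start small, then expand" constraint described above.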

4. Observe and measure

Compare live system behaviour against your steady-state baseline. Your observability stack does the heavy lifting here:

Distributed tracing shows how the fault propagated through your service graph
Log aggregation captures error messages and fallback behaviour
Metrics dashboards display whether latency, error rates, and throughput stayed within the bounds your hypothesis defined

Without strong observability, chaos experiments produce noise instead of signal. You inject a fault, something goes wrong, and you can't tell whether the degradation matches your prediction or is unrelated to the experiment entirely.
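The comparison against the baseline can be made explicit rather than read off a dashboard. The following sketch assumes metrics are held in plain dictionaries and covers only metrics where a higher value is worse (latency, error rate); a throughput check would need the opposite sign. The function name and dictionary keys are hypothetical.

```python
def compare_to_baseline(baseline, observed, tolerances):
    """Return the metrics whose increase over baseline exceeds the allowed tolerance.

    All three arguments are dicts keyed by metric name. tolerances holds the
    maximum acceptable absolute increase for each metric being checked.
    """
    breaches = {}
    for metric, allowed in tolerances.items():
        delta = observed[metric] - baseline[metric]
        if delta > allowed:
            breaches[metric] = {"delta": delta, "allowed": allowed}
    return breaches
```

An empty result means every checked metric stayed within the bounds the hypothesis defined; anything else names the metric and how far it moved.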

5. Analyse, fix, verify

If the system behaved as predicted, the experiment passed and you've built confidence.

If it deviated - latency spiked beyond the threshold, customer-facing errors appeared, a cascading failure propagated further than expected - you've found a weakness. Document the failure mode, prioritise a fix, and run the same experiment again after the fix ships to confirm the improvement.

Each cycle either validates existing resilience or reveals the next weakness to address. Log the following results consistently:

Baseline metrics before the experiment
Observed metrics during the experiment
The experiment outcome

This lets you track resilience improvements over successive runs and demonstrate the value of the practice to engineering leadership.
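A consistent log can be as simple as one JSON line per run. This is a minimal sketch using the standard library; the `ExperimentRecord` fields mirror the three items listed above, and the names and file format are assumptions rather than a convention of any chaos tool.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ExperimentRecord:
    name: str
    baseline: dict  # metrics before the experiment
    observed: dict  # metrics during the experiment
    outcome: str  # "passed" or "failed"


def append_record(path, record):
    """Append one record as a JSON line so successive runs stay comparable."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(asdict(record), sort_keys=True) + "\n")


def load_records(path):
    """Read every record back, oldest first."""
    with open(path, encoding="utf-8") as handle:
        return [ExperimentRecord(**json.loads(line)) for line in handle if line.strip()]
```

Because each run of the same experiment appends a new line, comparing the `observed` values across records shows whether a fix actually narrowed the gap to baseline.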

Core Principles of Chaos Engineering

The Principles of Chaos Engineering, published at principlesofchaos.org by practitioners who developed the discipline at Netflix and beyond, outline the rules that guide mature implementations.

Run experiments in production. Pre-production environments rarely replicate the complexity of live traffic patterns, data volumes, and infrastructure configurations. Production experiments give the highest-fidelity signal.
Minimise blast radius. Start with the smallest possible scope and expand only after building confidence. Every experiment should have automatic stop conditions tied to your observability alerts - if a key metric crosses a threshold, the experiment halts and the fault rolls back.
Automate and run continuously. Manual one-off experiments don't scale. Mature teams integrate chaos experiments into CI/CD pipelines so every deployment is tested. This is the progression from periodic GameDays to continuous chaos that catches regressions automatically.
Build hypotheses around steady-state behaviour. Focus on measurable system output (such as latency, throughput and error rates) rather than internal implementation details. Behaviour-level hypotheses survive code refactors. Implementation-level hypotheses break every time someone reorganises a module.
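The automatic stop condition described under "Minimise blast radius" can be sketched as a small guard loop. The function below is illustrative only: `start_fault`, `stop_fault`, and `read_metric` are caller-supplied callables invented for this example, and a real loop would wait between checks rather than polling back to back.

```python
def run_with_stop_condition(start_fault, stop_fault, read_metric, threshold, max_checks):
    """Run a fault while polling a metric; halt early if the threshold is crossed.

    Once the fault has started, stop_fault always runs, even if polling raises,
    so the fault is rolled back on every exit path.
    Returns "aborted" if the threshold was crossed, otherwise "completed".
    """
    start_fault()
    try:
        for _ in range(max_checks):
            if read_metric() > threshold:
                return "aborted"
        return "completed"
    finally:
        stop_fault()
```

Putting the rollback in a `finally` block is the design choice that matters here: the experiment cannot leave the fault active just because the monitoring call itself failed.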

Chaos Engineering vs Traditional Testing

Traditional testing (such as unit tests, integration tests, and end-to-end tests) validates that a system works correctly under known, expected conditions. Chaos engineering tests whether a system works well enough under unknown, unexpected conditions.

In practice, the distinction looks like this: traditional testing asks "does this function return the correct value?" Chaos engineering asks "what happens to the entire checkout flow when the payment gateway returns 500 errors for 30 seconds?"

One practical difference that catches teams off guard is observability. Traditional tests have built-in assertions. Chaos experiments depend entirely on external observability infrastructure to determine whether the system behaved acceptably. Without it, results are ambiguous.

Both belong in a mature reliability strategy. Traditional tests prevent known regressions. Chaos experiments surface unknown systemic risks. Teams that rely only on traditional testing typically discover their most critical failure modes during real incidents.

Getting Started With Your First Chaos Experiment

For SRE and DevOps teams that have observability, defined SLOs, and a baseline understanding of their system's failure modes, the path from zero to a running experiment is shorter than most people expect.

Start in staging. Build team confidence, verify observability coverage, and confirm your safety controls work.
Pick a well-understood service. Clear ownership, comprehensive monitoring, known architecture. Not the most critical or least understood system.
Use a simple fault. Terminate a single pod. Inject 200ms of latency on one network path. Watch whether alerts fire, dashboards show the impact, and recovery mechanisms activate.
Run a GameDay. Gather the team, run the experiment live, and discuss results immediately. GameDays surface gaps in runbooks, alerting, and incident response processes that documentation reviews alone don't catch.
Move to production. Once comfortable, transition to production with tight blast radius limits. Notify affected teams before running experiments - surprises during production fault injection erode trust in the practice. Then gradually increase scope and frequency, with the end goal of embedding chaos experiments into your CI/CD pipeline so every deployment gets tested automatically.
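For the "use a simple fault" step, a latency fault can be approximated in application code while you are still in staging. This is a minimal sketch, assuming the call you want to slow down is an ordinary Python function; the `with_latency` decorator and `lookup_price` function are hypothetical, and network-level tools would inject the delay outside the process instead.

```python
import functools
import time


def with_latency(delay_seconds, sleep=time.sleep):
    """Decorator that delays every call by delay_seconds before running it.

    The sleep function is injectable so the delay can be faked in tests.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            sleep(delay_seconds)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@with_latency(0.2)  # 200ms, matching the example in the list above
def lookup_price(item_id):
    """Stand-in for a call across one network path; an assumption for this sketch."""
    return {"item_id": item_id, "price": 100}
```

With the delay in place, the things to watch are the ones the step names: whether alerts fire, whether dashboards show the added latency, and whether timeouts and fallbacks behave as expected.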

Platforms like Gremlin, LitmusChaos, Chaos Mesh, and AWS Fault Injection Service each approach fault injection differently. For a detailed breakdown of how these tools compare and where incident response simulation fits into the picture, see our chaos engineering tools comparison guide.

What Chaos Engineering Doesn't Solve

Chaos engineering tests whether your systems can handle failure. It doesn't test whether your team can.

A well-designed experiment might confirm that your payment service degrades gracefully when a database replica goes offline. But it won't tell you whether the on-call engineer who gets paged at 3 AM knows how to interpret that alert. It won't reveal whether your incident commander can coordinate five responders across three time zones under pressure. And it won't surface the fact that your escalation runbook still references infrastructure you decommissioned two quarters ago.

System resilience and team resilience are different problems. Chaos engineering builds confidence in the first. Incident response simulations - whether tabletop exercises or live simulations - build confidence in the second. The most reliable organisations invest in both.

For a detailed breakdown of how these disciplines work together, see Chaos Engineering Tools vs Incident Response Simulations: What SRE Teams Actually Need.

Frequently Asked Questions

Is chaos engineering safe to run in production?

Yes, when done with proper safeguards. Every experiment should have automatic stop conditions that halt the fault if a key metric crosses a predefined threshold, plus rollback mechanisms that restore normal behaviour immediately. Start with a small blast radius - a single instance or a low percentage of traffic - and expand only as confidence grows. The risk of a well-controlled experiment is significantly lower than discovering the same failure mode during a real incident with no safety net.

How often should chaos experiments run?

That depends on maturity. Start with monthly or quarterly GameDay exercises. As confidence grows, increase to weekly experiments targeting specific services. The end goal is continuous automation - chaos experiments embedded in your CI/CD pipeline, catching regressions as soon as they're introduced.

What's the difference between chaos engineering and chaos testing?

The terms are sometimes used interchangeably, but there's a meaningful distinction. Chaos testing implies a discrete activity - inject a fault, see what breaks, move on. Chaos engineering implies the full scientific method: forming a hypothesis, running a controlled experiment, observing results against a steady-state baseline, and iterating. It's a continuous discipline, not a one-off test.
