Chaos engineering is the discipline of deliberately injecting controlled failures - such as server shutdowns, network latency, and dependency outages - into production or pre-production systems. The goal is to discover weaknesses before real outages expose them, testing whether a system can maintain acceptable performance under turbulent, real-world conditions.
Modern software systems are too complex for any team to fully predict. Hundreds of microservices, third-party APIs, multi-region cloud deployments - a single misconfigured timeout or a cloud provider's transient network partition isn't an edge case. It's Tuesday. Chaos engineering accepts that these failures are inevitable and shifts the focus from prevention to resilience: building systems that absorb faults and continue operating within acceptable thresholds. Chaos engineering is a scientific practice - sometimes called resilience testing - built on hypotheses, measured observations, and incremental learning.
The Origins of Chaos Engineering
In 2008, Netflix experienced a major database corruption that knocked out services for three days. The outage exposed a fundamental problem: their infrastructure had single points of failure that nobody had tested.
When Netflix committed to migrating to Amazon Web Services, things got harder. Cloud instances could be terminated at any moment. Network partitions between availability zones were a fact of life. Rather than trying to engineer all that unpredictability away, Netflix engineers built Chaos Monkey in 2010.
Chaos Monkey was built to randomly terminate production instances during business hours. The logic was simple: if instances were going to fail anyway, force engineers to design for it by default rather than as an afterthought.
Chaos Monkey proved the concept, but killing individual instances only tests one failure mode. Netflix expanded the approach into the Simian Army:
- Latency Monkey injected artificial network delays between services
- Conformity Monkey flagged instances that violated best practices
- Chaos Gorilla simulated entire availability zone outages
The discipline progressed from "can we survive losing one server" to "can we survive losing an entire region".
The practice spread beyond Netflix:
- Amazon's GameDay programme introduced scheduled exercises where engineers triggered failure scenarios against production
- Google's DiRT (Disaster Recovery Testing) tested both infrastructure and organisational response
- principlesofchaos.org codified the methodology into a set of community-established principles
- LitmusChaos was accepted into the CNCF and later promoted to incubating status, signalling that the Kubernetes ecosystem considered chaos engineering a mainstream practice
What was once exclusive to tech giants became accessible to mid-market engineering teams through open-source tools and managed cloud services.
How Chaos Engineering Works
Every chaos experiment follows a structured, repeatable cycle.
1. Define steady state
Before you test how a system handles failure, you need to know what "normal" looks like. Steady state is defined through measurable outputs that reflect the customer experience:
- Request latency at p50, p95, and p99
- Error rates
- Throughput
- CPU and memory utilisation
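These metrics are your baseline: the control against which every experiment is compared. Focus on outputs that users actually feel, such as latency, errors, and throughput. Treat CPU and memory utilisation as supporting context, since internal system metrics can fluctuate without any visible impact.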
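As a rough illustration, the baseline for one observation window could be reduced to a handful of numbers like this. The sketch uses only the Python standard library; the `SteadyState` and `summarise` names, and the assumption that every request (including failed ones) contributes a latency sample, are illustrative rather than taken from any particular monitoring tool.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class SteadyState:
    """Hypothetical container for one baseline window."""
    p50_ms: float
    p95_ms: float
    p99_ms: float
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    throughput_rps: float  # requests per second over the window


def summarise(latencies_ms, errors, window_seconds):
    """Reduce one observation window to a steady-state baseline.

    latencies_ms: one latency sample per request, in milliseconds
    errors: number of failed requests in the window
    window_seconds: length of the window in seconds
    """
    if len(latencies_ms) < 2:
        raise ValueError("need at least two samples to estimate percentiles")
    # quantiles(n=100) returns the 99 cut points p1..p99, so p50 is index 49.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    total = len(latencies_ms)
    return SteadyState(
        p50_ms=cuts[49],
        p95_ms=cuts[94],
        p99_ms=cuts[98],
        error_rate=errors / total,
        throughput_rps=total / window_seconds,
    )
```

In practice these values would more likely come from an existing metrics backend; the point is only that steady state should be a concrete, comparable record rather than an impression.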
2. Form a hypothesis
A chaos experiment starts with a specific, falsifiable prediction:
- A strong hypothesis looks like this: "If the payment gateway returns 500 errors for 30 seconds, the checkout service will retry twice, fall back to queued processing, and p99 latency will increase by no more than 200ms. No customer-facing errors will surface."
- A weak hypothesis looks like this: "The system should handle the payment gateway going down."
The difference is specificity. The strong version defines the failure condition, the expected response, and the acceptable performance envelope. When the experiment runs, you know exactly what to measure and what constitutes a pass or failure.
If you can't articulate a hypothesis, that's a signal. It usually means you don't understand the system's expected degradation path well enough to test it yet - and that gap in understanding is itself a finding worth documenting.
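One way to keep a hypothesis falsifiable is to write it down as data before the experiment runs. The sketch below encodes the payment gateway example; the `Hypothesis` class and its field names are hypothetical, and it covers only the latency and customer-error parts of the prediction, not the retry and fallback behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical record of a prediction made before the fault is injected."""
    fault: str                    # the failure condition being injected
    fault_duration_s: int         # how long the fault lasts
    max_p99_increase_ms: float    # acceptable latency envelope
    max_customer_errors: int      # customer-facing errors tolerated

    def evaluate(self, baseline_p99_ms, observed_p99_ms, customer_errors):
        """Return a list of violations; an empty list means the hypothesis held."""
        violations = []
        increase = observed_p99_ms - baseline_p99_ms
        if increase > self.max_p99_increase_ms:
            violations.append(
                f"p99 rose by {increase:.0f}ms, limit is {self.max_p99_increase_ms:.0f}ms"
            )
        if customer_errors > self.max_customer_errors:
            violations.append(
                f"{customer_errors} customer-facing errors, limit is {self.max_customer_errors}"
            )
        return violations


# The strong hypothesis from the text, expressed as data.
checkout = Hypothesis(
    fault="payment gateway returns HTTP 500",
    fault_duration_s=30,
    max_p99_increase_ms=200,
    max_customer_errors=0,
)
```

The weak version of the hypothesis cannot be written this way at all, because it names no duration and no threshold, which is a quick test of whether a hypothesis is specific enough.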
3. Introduce a controlled fault
Inject a specific fault that mirrors a realistic failure scenario. Common categories include:
- Infrastructure faults: Terminating instances or pods. Exhausting CPU, memory, or disk I/O. Simulating availability zone outages.
- Network faults: Injecting latency between services. Dropping packets. Simulating DNS failures. Creating network partitions.
- Application faults: Forcing dependency timeouts. Returning error responses from downstream services. Introducing clock skew. Corrupting data in transit.
The key constraint is blast radius. Start small - a single instance, a single service, a low percentage of traffic - and expand only after confirming your safety controls work and your observability captures the impact.
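To make "a low percentage of traffic" concrete, here is a minimal application-level sketch that adds latency to a fixed share of requests. It assumes each request carries a string ID; `in_blast_radius` and `call_with_latency_fault` are illustrative names, not the API of any chaos tool.

```python
import hashlib
import time


def in_blast_radius(request_id: str, percent: float) -> bool:
    """Deterministically select roughly `percent`% of requests by hashing the ID.

    Hashing (rather than random sampling) means the same request ID always
    gets the same decision, which makes affected requests easy to trace.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def call_with_latency_fault(func, request_id, percent, delay_s, sleep=time.sleep):
    """Call func(), first adding delay_s of latency if the request is selected."""
    if in_blast_radius(request_id, percent):
        sleep(delay_s)
    return func()
```

Starting with something like `percent=1` and raising it only after the safety controls have been exercised mirrors the "start small, then expand" constraint. Setting `percent=0` disables the fault entirely.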
4. Observe and measure
Compare live system behaviour against your steady-state baseline. Your observability stack does the heavy lifting here:
- Distributed tracing shows how the fault propagated through your service graph
- Log aggregation captures error messages and fallback behaviour
- Metrics dashboards display whether latency, error rates, and throughput stayed within the bounds your hypothesis defined
Without strong observability, chaos experiments produce noise instead of signal. You inject a fault, something goes wrong, and you can't tell whether the degradation matches your prediction or is unrelated to the experiment entirely.
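One common way to separate experiment impact from unrelated degradation is to compare requests that saw the fault with requests that did not, over the same time window. The sketch below assumes you can split outcomes into those two groups; the function names, the labels it returns, and the 1% tolerance are all illustrative choices.

```python
def error_rate(outcomes):
    """Fraction of failed requests; outcomes is a list of booleans (True = success)."""
    if not outcomes:
        raise ValueError("no requests observed")
    failures = sum(1 for ok in outcomes if not ok)
    return failures / len(outcomes)


def attribute_impact(control, experiment, tolerance=0.01):
    """Compare a control group (no fault) with an experiment group (fault injected).

    Both groups must cover the same time window. Returns one of:
    "unrelated degradation", "fault impact", or "within tolerance".
    """
    control_rate = error_rate(control)
    experiment_rate = error_rate(experiment)
    if control_rate > tolerance:
        # Requests that never saw the fault are failing too,
        # so something other than the experiment is going on.
        return "unrelated degradation"
    if experiment_rate - control_rate > tolerance:
        return "fault impact"
    return "within tolerance"
```

A production analysis would normally apply a proper statistical test and look at latency as well as errors; this only shows the shape of the comparison.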
5. Analyse, fix, verify
If the system behaved as predicted, the experiment passed and you've built confidence.
If it deviated - latency spiked beyond the threshold, customer-facing errors appeared, a cascading failure propagated further than expected - you've found a weakness. Document the failure mode, prioritise a fix, and run the same experiment again after the fix ships to confirm the improvement.
Each cycle either validates existing resilience or reveals the next weakness to address. Log the following results consistently:
- Baseline metrics before the experiment
- Observed metrics during the experiment
- The experiment outcome
This lets you track resilience improvements over successive runs and demonstrate the value of the practice to engineering leadership.
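A consistent log can be as simple as one JSON line per run. The following is a minimal sketch using the standard library; the record fields mirror the three items above, while the field names and the append-only file layout are assumptions for illustration.

```python
import json
from datetime import datetime, timezone


def experiment_record(name, baseline, observed, passed, now=None):
    """Build one log entry: baseline metrics, observed metrics, and the outcome."""
    now = now or datetime.now(timezone.utc)
    return {
        "experiment": name,
        "run_at": now.isoformat(),
        "baseline": baseline,   # metrics before the experiment
        "observed": observed,   # metrics during the experiment
        "outcome": "pass" if passed else "fail",
    }


def append_record(path, record):
    """Append one record as a JSON line so successive runs can be compared."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
```

Because every run of the same experiment shares a name and a schema, comparing the latest entry with earlier ones shows whether a fix actually moved the numbers.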
Core Principles of Chaos Engineering
The Principles of Chaos Engineering, published at principlesofchaos.org by practitioners who developed the discipline at Netflix and beyond, outline the rules that guide mature implementations.
- Run experiments in production. Pre-production environments rarely replicate the complexity of live traffic patterns, data volumes, and infrastructure configurations. Production experiments give the highest-fidelity signal.
- Minimise blast radius. Start with the smallest possible scope and expand only after building confidence. Every experiment should have automatic stop conditions tied to your observability alerts - if a key metric crosses a threshold, the experiment halts and the fault rolls back.
- Automate and run continuously. Manual one-off experiments don't scale. Mature teams integrate chaos experiments into CI/CD pipelines so every deployment is tested. This is the progression from periodic GameDays to continuous chaos that catches regressions automatically.
- Build hypotheses around steady-state behaviour. Focus on measurable system output (such as latency, throughput and error rates) rather than internal implementation details. Behaviour-level hypotheses survive code refactors. Implementation-level hypotheses break every time someone reorganises a module.
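The automatic stop condition described under "Minimise blast radius" can be sketched as a guard loop around the fault. This is an assumption-laden outline, not how any specific platform implements it: `inject`, `rollback`, `read_metric`, and `wait` are hypothetical callables you would supply.

```python
def run_with_stop_condition(inject, rollback, read_metric, threshold, checks, wait):
    """Run a fault for up to `checks` polling intervals, halting early on a breach.

    inject / rollback: callables that start and remove the fault
    read_metric: callable returning the current value of the guarded metric
    threshold: value above which the experiment must halt
    wait: callable that pauses between checks, e.g. lambda: time.sleep(5)
    Returns "halted" if the threshold was crossed, otherwise "completed".
    """
    inject()
    try:
        for _ in range(checks):
            if read_metric() > threshold:
                return "halted"
            wait()
        return "completed"
    finally:
        # Roll back on every exit path, including exceptions from read_metric.
        rollback()
```

Putting the rollback in a `finally` block means the fault is removed whether the experiment completes, halts, or the monitoring call itself fails, which is the property that keeps the blast radius bounded.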
Chaos Engineering vs Traditional Testing
Traditional testing (such as unit tests, integration tests, end-to-end tests) validates that a system works correctly under known, expected conditions. Chaos engineering tests whether a system works well enough under unknown, unexpected conditions.
In practice, the distinction looks like this: traditional testing asks "does this function return the correct value?" Chaos engineering asks "what happens to the entire checkout flow when the payment gateway returns 500 errors for 30 seconds?"
One practical difference that catches teams off guard is observability. Traditional tests have built-in assertions. Chaos experiments depend entirely on external observability infrastructure to determine whether the system behaved acceptably. Without it, results are ambiguous.
Both belong in a mature reliability strategy. Traditional tests prevent known regressions. Chaos experiments surface unknown systemic risks. Teams that rely only on traditional testing typically discover their most critical failure modes during real incidents.
Getting Started With Your First Chaos Experiment
For SRE and DevOps teams that have observability, defined SLOs, and a baseline understanding of their system's failure modes, the path from zero to a running experiment is shorter than most people expect.
- Start in staging. Build team confidence, verify observability coverage, and confirm your safety controls work.
- Pick a well-understood service. Clear ownership, comprehensive monitoring, known architecture. Not the most critical or least understood system.
- Use a simple fault. Terminate a single pod. Inject 200ms of latency on one network path. Watch whether alerts fire, dashboards show the impact, and recovery mechanisms activate.
- Run a GameDay. Gather the team, run the experiment live, and discuss results immediately. GameDays surface gaps in runbooks, alerting, and incident response processes that documentation reviews alone don't catch.
- Move to production. Once comfortable, transition to production with tight blast radius limits. Notify affected teams before running experiments - surprises during production fault injection erode trust in the practice. Then gradually increase scope and frequency, with the end goal of embedding chaos experiments into your CI/CD pipeline so every deployment gets tested automatically.
Platforms like Gremlin, LitmusChaos, Chaos Mesh, and AWS Fault Injection Service each approach fault injection differently. For a detailed breakdown of how these tools compare and where incident response simulation fits into the picture, see our chaos engineering tools comparison guide.
What Chaos Engineering Doesn't Solve
Chaos engineering tests whether your systems can handle failure. It doesn't test whether your team can.
A well-designed experiment might confirm that your payment service degrades gracefully when a database replica goes offline. But it won't tell you whether the on-call engineer who gets paged at 3 AM knows how to interpret that alert. It won't reveal whether your incident commander can coordinate five responders across three time zones under pressure. And it won't surface the fact that your escalation runbook still references infrastructure you decommissioned two quarters ago.
System resilience and team resilience are different problems. Chaos engineering builds confidence in the first. Incident response simulations - whether tabletop exercises or live simulations - build confidence in the second. The most reliable organisations invest in both.
For a detailed breakdown of how these disciplines work together, see Chaos Engineering Tools vs Incident Response Simulations: What SRE Teams Actually Need.
Frequently Asked Questions
Is chaos engineering safe to run in production?
Yes, when done with proper safeguards. Every experiment should have automatic stop conditions that halt the fault if a key metric crosses a predefined threshold, plus rollback mechanisms that restore normal behaviour immediately. Start with a small blast radius - a single instance or a low percentage of traffic - and expand only as confidence grows. The risk of a well-controlled experiment is significantly lower than discovering the same failure mode during a real incident with no safety net.
How often should chaos experiments run?
That depends on maturity. Start with monthly or quarterly GameDay exercises. As confidence grows, increase to weekly experiments targeting specific services. The end goal is continuous automation - chaos experiments embedded in your CI/CD pipeline, catching regressions as soon as they're introduced.
What's the difference between chaos engineering and chaos testing?
The terms are sometimes used interchangeably, but there's a meaningful distinction. Chaos testing implies a discrete activity - inject a fault, see what breaks, move on. Chaos engineering implies the full scientific method: forming a hypothesis, running a controlled experiment, observing results against a steady-state baseline, and iterating. It's a continuous discipline, not a one-off test.


