Ragan McGill

Practice: Design for Failure

Purpose and Strategic Importance

Design for Failure is an architectural mindset and engineering practice that embraces the inevitability of failure in complex systems. Rather than trying to eliminate every possible fault, it aims to build software and infrastructure that can detect, isolate, recover from, and even thrive through failure.

By designing for failure, teams build more resilient systems, reduce downtime, and deliver a consistent user experience under adverse conditions, making reliability a core feature rather than an afterthought.

Description of the Practice

Accept that failures (e.g. network latency, system crashes, service unavailability) will happen.
Architect systems to tolerate and recover from these failures gracefully.
Use patterns such as retries with backoff, circuit breakers, timeouts, and bulkheads.
Ensure observability and alerting are in place so that failures are detected and responded to quickly.
Apply fault injection techniques (e.g. chaos engineering) to test recovery capabilities.
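
As an illustration of the timeout pattern listed above, here is a minimal Python sketch that bounds how long a caller waits on a dependency and returns a fallback value if the call is too slow or fails. The names call_with_timeout, slow_dependency and fast_dependency are hypothetical, and the thread-based approach is only one assumption about how a timeout might be enforced; many client libraries expose their own timeout settings, which are usually preferable.

```python
import concurrent.futures
import time


def call_with_timeout(func, timeout_s, fallback):
    """Run func in a worker thread; return fallback if it is too slow or raises."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The caller stops waiting; the worker thread itself is not killed.
        return fallback
    except Exception:
        # Any other failure in the dependency also degrades to the fallback.
        return fallback
    finally:
        pool.shutdown(wait=False)


def slow_dependency():
    """Hypothetical dependency that responds too slowly."""
    time.sleep(0.5)
    return "live"


def fast_dependency():
    """Hypothetical dependency that responds immediately."""
    return "live"


if __name__ == "__main__":
    print(call_with_timeout(fast_dependency, 0.2, "cached"))
    print(call_with_timeout(slow_dependency, 0.05, "cached"))
```

Note that this sketch only stops the caller from waiting: the slow call keeps running in the background, so it limits latency for the user rather than freeing the underlying resource.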

How to Practise It (Playbook)

1. Getting Started

Identify critical components and potential single points of failure.
Add defensive programming mechanisms like timeouts and fallback logic.
Implement retries with exponential backoff and set sensible retry limits.
Ensure logs and metrics capture error conditions clearly and actionably.
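
The retry step above could be sketched as follows. This is a minimal Python example, not a definitive implementation: the function name retry_with_backoff, its parameters and the default values are assumptions chosen for illustration. It doubles the delay ceiling after each failed attempt, caps it, applies random jitter so that many clients do not retry in lockstep, and gives up after a fixed number of attempts.

```python
import random
import time


def retry_with_backoff(
    func,
    max_attempts=4,
    base_delay_s=0.1,
    max_delay_s=2.0,
    retry_on=(ConnectionError, TimeoutError),
    sleep=time.sleep,
):
    """Call func, retrying transient errors with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                # Retry limit reached: surface the error to the caller.
                raise
            # Delay ceiling doubles each attempt: base, 2*base, 4*base, ...
            ceiling = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Jitter: wait a random time between 0 and the ceiling.
            sleep(random.uniform(0, ceiling))
```

Only the exception types listed in retry_on are retried; anything else propagates immediately, which avoids retrying errors that a second attempt cannot fix. The sleep parameter is injectable so that tests can run without real delays.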

2. Scaling and Maturing

Use circuit breakers to prevent cascading failures in distributed systems.
Segment workloads using bulkheads to limit the blast radius of failures.
Test failover and recovery scenarios regularly through chaos testing or game days.
Design systems to degrade gracefully under load or partial outage.
Automate failovers and alerts to reduce time-to-recovery.
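
To make the circuit breaker idea above concrete, here is a simplified Python sketch. The class names, thresholds and the injectable clock are assumptions for illustration, and the sketch is not thread-safe. After a set number of consecutive failures the breaker opens and rejects calls immediately; once a reset timeout has elapsed it lets one trial call through and closes again only if that call succeeds.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                # Open: fail fast instead of adding load to a struggling service.
                raise CircuitOpenError("circuit open; failing fast")
            # Half-open: the timeout has elapsed, so allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens it.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers would typically catch CircuitOpenError and serve a fallback, which is one way to degrade gracefully while the downstream service recovers.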

3. Team Behaviours to Encourage

Ask "what happens when this fails?" during design and review sessions.
Embrace postmortems and blameless retrospectives to learn from incidents.
Make resilience an explicit part of the Definition of Done.
Share recovery playbooks and failure patterns across teams.

4. Watch Out For…

Building overly complex failure-handling logic that introduces new risks.
Relying solely on manual recovery steps - automate where possible.
Inconsistent practices across services - standardise where appropriate.
Lack of real-world testing that leaves recovery logic unverified.
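
One lightweight way to keep recovery logic from going unverified is to inject faults in a test and check that the fallback path really runs. The Python sketch below is a toy illustration under stated assumptions: FaultInjector, fetch_price, get_price and run_experiment are all hypothetical names, and the injected ConnectionError stands in for whatever failure a real dependency would produce. It is not a substitute for chaos testing against real infrastructure.

```python
import random


class FaultInjector:
    """Wraps a callable and raises an injected error on a share of calls."""

    def __init__(self, func, failure_rate, seed=0):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so experiments are repeatable
        self.injected = 0

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            self.injected += 1
            raise ConnectionError("injected fault")
        return self.func(*args, **kwargs)


def fetch_price(item_id):
    """Hypothetical dependency that normally succeeds."""
    return {"item": item_id, "price": 10, "source": "live"}


def get_price(item_id, fetch):
    """Hypothetical handler under test: degrades to a fallback on failure."""
    try:
        return fetch(item_id)
    except ConnectionError:
        return {"item": item_id, "price": None, "source": "fallback"}


def run_experiment(calls=200, failure_rate=0.3):
    """Return (responses served, fallbacks used, faults injected)."""
    flaky_fetch = FaultInjector(fetch_price, failure_rate, seed=42)
    results = [get_price(i, flaky_fetch) for i in range(calls)]
    fallbacks = sum(1 for r in results if r["source"] == "fallback")
    return len(results), fallbacks, flaky_fetch.injected


if __name__ == "__main__":
    total, fallbacks, injected = run_experiment()
    # Every request should get a response, and each fault should hit the fallback.
    assert total == 200 and fallbacks == injected
    print(f"{injected} faults injected, {fallbacks} fallbacks served")
```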

5. Signals of Success

Systems isolate and recover from failures with minimal disruption.
Time to detect and resolve incidents is decreasing.
Recovery patterns (e.g. retries, fallbacks) are observable and effective.
Engineers feel confident in system reliability under duress.
Customers experience fewer outages and more consistent service quality.