Standard: Services are restored quickly and safely following failure (MTTR)

Purpose and Strategic Importance

This standard ensures services are restored quickly and safely following failure by measuring and improving Mean Time to Recover (MTTR), a core DORA metric. MTTR reflects how well teams detect, respond to, and learn from incidents.

Aligned to our "Resilience Over Uptime" and "Post-Incident Learning Culture" policies, this standard builds confidence in recovery, supports system design for failure, and reduces the impact of outages. Without it, teams risk prolonged incidents, fragile systems, and burnout from unplanned work.

Strategic Impact

Limits the impact of failures on customers and internal teams
Builds trust in the engineering organisation’s reliability and responsiveness
Encourages design for failure, observability, and automation
Reinforces confidence in continuous delivery and experimentation
Reduces on-call burden and incident fatigue

Risks of Not Having This Standard

Prolonged outages or data incidents that erode user trust
Escalation fatigue and inconsistent recovery actions
Delays in root cause analysis and missed learning opportunities
Reduced system resilience and confidence in platform capabilities
Over-reliance on heroics and manual triage

CMMI Maturity Model

Level 1 – Initial: MTTR is not measured. Recovery efforts are reactive, manual, and vary significantly across teams. There is no standardised incident response process.
Level 2 – Managed: Teams begin tracking recovery actions post-incident. Some incidents are reviewed, and recovery steps are loosely documented, but consistency and automation are limited.
Level 3 – Defined: MTTR is tracked systematically across services. Recovery playbooks, escalation paths, and on-call procedures are defined and routinely followed during incidents.
Level 4 – Quantitatively Managed: MTTR is monitored as a key reliability metric. Recovery processes are automated and regularly tested. Post-incident reviews include detailed MTTR analysis.
Level 5 – Optimising: Recovery performance drives systemic improvement. Teams proactively simulate failure scenarios, improve resilience, and shorten MTTR through better tooling, architecture, and shared learning.

Key Measures

Mean Time to Recover (MTTR): Average time from incident detection to full recovery, across incidents in a reporting period
Time taken at each incident stage: detection, escalation, response, and resolution
Frequency of recovery testing (e.g. failover, backup restore, chaos experiments)
Coverage and usage of runbooks and automated response playbooks
Time between detection and meaningful communications with users or stakeholders
Incident impact severity vs. time to mitigation
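
As an illustration of the MTTR measure above, the following is a minimal sketch of how it could be calculated from incident records. It assumes each incident carries a detection timestamp and a full-recovery timestamp; the Incident class and its field names (detected_at, recovered_at) are hypothetical and do not refer to any particular incident-management tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    detected_at: datetime   # when the incident was detected
    recovered_at: datetime  # when the service was fully recovered


def recovery_time(incident: Incident) -> timedelta:
    """Time from detection to full recovery for a single incident."""
    if incident.recovered_at < incident.detected_at:
        raise ValueError("recovered_at must not be earlier than detected_at")
    return incident.recovered_at - incident.detected_at


def mean_time_to_recover(incidents: list[Incident]) -> timedelta:
    """Arithmetic mean of recovery times across the given incidents."""
    if not incidents:
        raise ValueError("MTTR is undefined when there are no incidents")
    seconds = mean(recovery_time(i).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)
```

For example, three incidents that took 30, 90, and 60 minutes to recover give an MTTR of 60 minutes. Because a mean can hide a single long outage, teams may also want to report the median or a high percentile alongside it.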