This standard ensures services are restored quickly and safely following failure by measuring and improving Mean Time to Recover (MTTR), a core DORA metric. MTTR reflects how well teams detect, respond to, and learn from incidents.
Aligned to our "Resilience Over Uptime" and "Post-Incident Learning Culture" policies, this standard builds confidence in recovery, supports system design for failure, and reduces the impact of outages. Without it, teams risk prolonged incidents, fragile systems, and burnout from unplanned work.
Level 1 – Initial: MTTR is not measured. Recovery efforts are reactive, manual, and vary significantly across teams. There is no standardised incident response process.
Level 2 – Managed: Teams begin tracking recovery actions post-incident. Some incidents are reviewed, and recovery steps are loosely documented, but consistency and automation are limited.
Level 3 – Defined: MTTR is tracked systematically across services. Recovery playbooks, escalation paths, and on-call procedures are defined and routinely followed during incidents.
Level 4 – Quantitatively Managed: MTTR is monitored as a key reliability metric. Recovery processes are automated and regularly tested. Post-incident reviews include detailed MTTR analysis.
Level 5 – Optimising: Recovery performance drives systemic improvement. Teams proactively simulate failure scenarios, improve resilience, and shorten MTTR through better tooling, architecture, and shared learning.
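As an illustration of the systematic, per-service MTTR tracking that Levels 3 and 4 call for, the following is a minimal sketch rather than a prescribed implementation. It assumes recovery time is measured from detection to restoration, a choice this standard does not itself specify. The `Incident` record, its field names, the `mttr_by_service` helper, and the sample data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    service: str
    detected_at: datetime   # assumption: the recovery clock starts at detection
    restored_at: datetime   # assumption: the clock stops when service is restored


def mttr_by_service(incidents):
    """Return the mean recovery duration per service as a timedelta."""
    durations = defaultdict(list)
    for inc in incidents:
        if inc.restored_at < inc.detected_at:
            raise ValueError(f"restored before detected: {inc}")
        durations[inc.service].append(inc.restored_at - inc.detected_at)
    return {
        service: sum(times, timedelta()) / len(times)
        for service, times in durations.items()
    }


# Made-up sample data, for illustration only.
incidents = [
    Incident("checkout", datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    Incident("checkout", datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 15, 30)),
    Incident("search", datetime(2024, 1, 7, 8, 0), datetime(2024, 1, 7, 8, 45)),
]

mttr = mttr_by_service(incidents)
for service, value in sorted(mttr.items()):
    print(f"{service}: {value}")
```

Grouping by service keeps one noisy system from hiding improvements elsewhere. A team could equally start the clock at the onset of customer impact instead of detection, provided the choice is applied consistently across services.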