Standard: Services are restored quickly and safely following failure (MTTR)

Purpose and Strategic Importance

This standard ensures services are restored quickly and safely following failure by measuring and improving Mean Time to Recover (MTTR), a core DORA metric. MTTR reflects how well teams detect, respond to, and learn from incidents.

Aligned to our "Resilience Over Uptime" and "Post-Incident Learning Culture" policies, this standard builds confidence in recovery, supports system design for failure, and reduces the impact of outages. Without it, teams risk prolonged incidents, fragile systems, and burnout from unplanned work.

Strategic Impact

  • Limits the impact of failures on customers and internal teams
  • Builds trust in the engineering organisation’s reliability and responsiveness
  • Encourages design for failure, observability, and automation
  • Reinforces confidence in continuous delivery and experimentation
  • Reduces on-call burden and incident fatigue

Risks of Not Having This Standard

  • Prolonged outages or data incidents that erode user trust
  • Escalation fatigue and inconsistent recovery actions
  • Delays in root cause analysis and missed learning opportunities
  • Reduced system resilience and confidence in platform capabilities
  • Over-reliance on heroics and manual triage

CMMI Maturity Model

  • Level 1 – Initial: MTTR is not measured. Recovery efforts are reactive, manual, and vary significantly across teams. There is no standardised incident response process.

  • Level 2 – Managed: Teams begin tracking recovery actions post-incident. Some incidents are reviewed, and recovery steps are loosely documented, but consistency and automation are limited.

  • Level 3 – Defined: MTTR is tracked systematically across services. Recovery playbooks, escalation paths, and on-call procedures are defined and routinely followed during incidents.

  • Level 4 – Quantitatively Managed: MTTR is monitored as a key reliability metric. Recovery processes are automated and regularly tested. Post-incident reviews include detailed MTTR analysis.

  • Level 5 – Optimising: Recovery performance drives systemic improvement. Teams proactively simulate failure scenarios, improve resilience, and shorten MTTR through better tooling, architecture, and shared learning.


Key Measures

  • Mean Time to Recover (MTTR): Average time from incident detection to full recovery, taken across incidents
  • Time to detect, escalate, respond, and resolve incidents
  • Frequency of recovery testing (e.g. failover, backup restore, chaos experiments)
  • Coverage and usage of runbooks and automated response playbooks
  • Time between detection and meaningful communications with users or stakeholders
  • Incident impact severity vs. time to mitigation
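
As a rough illustration of the first measure, MTTR can be computed as the average of each incident's recovery time minus its detection time. The sketch below assumes a hypothetical `Incident` record with `detected_at` and `recovered_at` timestamps; the field names, the service name, and the sample values are invented for illustration and do not come from any particular incident tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    service: str
    detected_at: datetime
    recovered_at: datetime


def time_to_recover(incident: Incident) -> timedelta:
    """Time from detection to full recovery for a single incident."""
    return incident.recovered_at - incident.detected_at


def mean_time_to_recover(incidents: list[Incident]) -> timedelta:
    """Average the per-incident recovery times to get MTTR."""
    if not incidents:
        raise ValueError("MTTR is undefined without at least one incident")
    seconds = mean(time_to_recover(i).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


# Made-up sample data: one 30-minute and one 90-minute recovery.
incidents = [
    Incident("checkout", datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    Incident("checkout", datetime(2024, 2, 1, 14, 0), datetime(2024, 2, 1, 15, 30)),
]
mttr = mean_time_to_recover(incidents)
print(mttr)  # 1:00:00
```

A mean hides outliers, so a team might also report the median or a high percentile alongside it. That is a design choice rather than something this standard prescribes.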

Associated Policies

  • Resilience Over Uptime
  • Post-Incident Learning Culture

Associated Practices

  • Auto-scaling Infrastructure
  • Chaos Engineering
  • Blue-Green Deployments
  • Canary Releases
  • Design for Failure

Associated Measures

  • Mean Time to Recovery (MTTR)
