• Home
  • BVSSH
  • Engineering Enablement
  • Playbooks
  • Frameworks
  • Good Reads
Search

What are you looking for?

Standard : Mean Time to Recovery (MTTR)

Description

Mean Time to Recovery (MTTR) is a key measure of operational resilience and engineering responsiveness. It tracks the average time taken to restore service after a production incident or degradation. MTTR is one of the four DORA metrics and reflects how quickly a team can detect, respond to, and resolve issues in production.

Lower MTTR means teams can recover from failures faster, reducing user impact, protecting customer trust, and enabling safe experimentation.

How to Use

What to Measure

  • Time between the start of a service degradation or incident and full restoration of normal operations.
  • MTTR can be calculated across services, teams, or the entire platform.

Formula

MTTR = Total downtime across incidents / Number of incidents

Instrumentation Tips

  • Use incident tracking tools (e.g. Opsgenie, PagerDuty, Jira Ops) to timestamp incident start and resolution.
  • Align monitoring and alerting systems with incident logs to validate recovery time.
  • Exclude false positives or planned maintenance from this metric.

MTTR Benchmarks (contextual—not absolute)

Type of System Target MTTR
Customer-facing APIs < 30 minutes
Internal batch systems < 2 hours
Infrastructure services < 15 minutes

Why It Matters

  • Resilience: Faster recovery reduces business impact.
  • Trust: Builds stakeholder and customer confidence in engineering operations.
  • Learning: Highlights opportunities to improve incident response and system design.
  • Safety: Enables teams to take calculated risks, knowing recovery is reliable.

Best Practices

  • Embed monitoring and alerting early in the design process.
  • Automate known remediation tasks (e.g. rollbacks, restarts).
  • Create and test incident response runbooks.
  • Conduct postmortems and track learnings as backlog items.
  • Train teams in chaos engineering and recovery drills.

Common Pitfalls

  • Measuring recovery from detection, not incident start.
  • Including unrelated incidents (e.g. user error or vendor outages).
  • Relying on manual steps that increase response time.
  • Focusing on MTTR alone without understanding root causes or team burnout.

Signals of Success

  • Incident response is fast, repeatable, and low-stress.
  • Teams learn from incidents and improve recovery time over successive cycles.
  • MTTR trends downward even as deployment frequency increases.
  • Automation and observability reduce manual debugging time.

Related Measures

  • [[Change Failure Rate]]
  • [[CoE/Measures/Delivery Performance/Deployment Frequency]]
  • [[Lead Time for Change]]
  • [[Time to Detect (MTTD)]]

Aligned Industry Research

  • State of DevOps / DORA (2021–2023): Elite performers recover in under an hour; low performers take days or more.
  • Accelerate (Forsgren et al.): MTTR is a key resilience indicator that supports faster innovation and improved system reliability.

Technical debt is like junk food - easy now, painful later.

Awesome Blogs
  • LinkedIn Engineering
  • Github Engineering
  • Uber Engineering
  • Code as Craft
  • Medium.engineering