Standard: Mean Time to Detect (MTTD)
Description
Mean Time to Detect (MTTD) measures the average time it takes for teams or systems to detect an issue after it begins. It is a leading indicator of observability, monitoring maturity, and operational awareness. The faster an issue is detected, the sooner it can be addressed, which minimises customer impact and reduces overall downtime.
MTTD is not one of the four core DORA metrics, but it is commonly used alongside them to assess incident response effectiveness and drive reliability improvements.
How to Use
What to Measure
- Time elapsed between the start of a production issue (e.g. system degradation or failure) and the moment the issue is first detected, whether by a monitoring system or by a person.
- Typically measured across all qualifying incidents during a defined period (e.g. per month or quarter).
MTTD = Total detection time across incidents / Number of incidents
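The formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming each incident is available as a pair of start and detection timestamps; the function name `mean_time_to_detect` and the sample incidents are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_detect(incidents):
    """Return MTTD as a timedelta for (started_at, detected_at) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = timedelta()
    for started_at, detected_at in incidents:
        if detected_at < started_at:
            raise ValueError("detection cannot precede issue start")
        total += detected_at - started_at
    return total / len(incidents)


# Hypothetical incidents for one reporting period.
incidents = [
    (datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 10, 12)),   # 12 min
    (datetime(2024, 3, 9, 2, 30), datetime(2024, 3, 9, 3, 15)),    # 45 min
    (datetime(2024, 3, 20, 16, 5), datetime(2024, 3, 20, 16, 8)),  # 3 min
]
mttd = mean_time_to_detect(incidents)
print(mttd)  # 0:20:00
```

Because a mean is sensitive to a single long-running outage, it may be worth reporting the median alongside it.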
Instrumentation Tips
- Timestamp the incident start from the earliest monitoring anomaly, log anomaly, or manual report.
- Timestamp detection from the first alert or the first acknowledged detection in tooling.
- Use incident tracking systems (e.g. PagerDuty, Opsgenie, Jira) to correlate events.
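One way to apply these tips is to collect every event correlated with an incident and take the earliest onset signal and the earliest detection signal. The sketch below assumes events have already been exported as `(timestamp, kind)` tuples; the event kinds and the function `detection_window` are hypothetical and do not correspond to any particular tool's schema.

```python
from datetime import datetime

# Hypothetical event kinds; map these to whatever your tooling records.
ONSET_KINDS = {"anomaly", "manual_report"}
DETECTION_KINDS = {"alert", "acknowledged"}


def detection_window(events):
    """Return (onset, detected) for one incident, or None if either is missing.

    events: iterable of (timestamp, kind) tuples, in any order.
    """
    onsets = [ts for ts, kind in events if kind in ONSET_KINDS]
    detections = [ts for ts, kind in events if kind in DETECTION_KINDS]
    if not onsets or not detections:
        return None
    # Earliest evidence of the issue, and earliest sign that someone noticed.
    return min(onsets), min(detections)


events = [
    (datetime(2024, 3, 1, 10, 12), "alert"),
    (datetime(2024, 3, 1, 10, 0), "anomaly"),
    (datetime(2024, 3, 1, 10, 20), "acknowledged"),
    (datetime(2024, 3, 1, 10, 5), "anomaly"),
]
onset, detected = detection_window(events)
print(detected - onset)  # 0:12:00
```

Returning `None` for incomplete records keeps incidents with unknown onset out of the average instead of silently treating them as instantly detected.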
Why It Matters
- Proactive detection: Reduces time to resolution and downstream impact.
- Customer protection: Faster detection leads to faster recovery and fewer unhappy users.
- System health insight: Reveals gaps in monitoring and alerting coverage.
- Team trust: Builds confidence that systems are under control, even at scale.
Best Practices
- Define and test meaningful alerts for key failure modes.
- Track alert precision and tune noisy alerts to reduce alert fatigue.
- Implement end-to-end observability: metrics, logs, traces.
- Include detection time in post-incident reviews.
- Automate anomaly detection to reduce reliance on manual reports.
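Alert precision, mentioned in the practices above, can be tracked with a simple ratio. The sketch below assumes precision means the share of fired alerts that turned out to be actionable, and that each alert has been labelled as such during triage; the function `alert_precision` and the alert names are hypothetical.

```python
from collections import Counter


def alert_precision(alerts):
    """Return {alert_name: actionable_count / fired_count}.

    alerts: iterable of (alert_name, was_actionable) tuples.
    """
    fired = Counter()
    actionable = Counter()
    for name, was_actionable in alerts:
        fired[name] += 1
        if was_actionable:
            actionable[name] += 1
    return {name: actionable[name] / fired[name] for name in fired}


# Hypothetical triage labels for one review period.
alerts = [
    ("high_latency", True),
    ("high_latency", True),
    ("disk_usage", False),
    ("disk_usage", False),
    ("disk_usage", True),
    ("disk_usage", False),
]
precision = alert_precision(alerts)
print(precision)  # {'high_latency': 1.0, 'disk_usage': 0.25}
```

Alerts with low precision are natural candidates for retuning or removal, since they contribute most to alert fatigue.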
Common Pitfalls
- Starting the clock at detection (for example, the first alert) rather than at issue onset, which understates MTTD.
- Ignoring “silent” failures discovered by customers.
- Alert fatigue masking true signals.
- Failing to update alerts when system behaviour changes.
Signals of Success
- Most issues are detected by automation, not users.
- Detection latency decreases over time.
- Monitoring dashboards align closely with actual incidents.
- Detection metrics are reviewed and improved during postmortems.
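The first two signals above can be checked with a per-month summary of MTTD and the share of incidents detected by automation. This is a sketch under stated assumptions: each incident carries a `detected_by` label whose value `"automation"` marks automated detection, and incidents are bucketed by the month in which they started. The function `monthly_detection_summary` and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def monthly_detection_summary(incidents):
    """Return {"YYYY-MM": (mttd, automated_share)} ordered by month.

    incidents: iterable of (started_at, detected_at, detected_by) tuples.
    """
    buckets = defaultdict(list)
    for started_at, detected_at, detected_by in incidents:
        month = started_at.strftime("%Y-%m")
        buckets[month].append((detected_at - started_at, detected_by == "automation"))

    summary = {}
    for month in sorted(buckets):
        rows = buckets[month]
        mttd = sum((latency for latency, _ in rows), timedelta()) / len(rows)
        share = sum(1 for _, automated in rows if automated) / len(rows)
        summary[month] = (mttd, share)
    return summary


# Hypothetical incidents across two months.
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 40), "user"),
    (datetime(2024, 1, 18, 14, 0), datetime(2024, 1, 18, 14, 20), "automation"),
    (datetime(2024, 2, 2, 11, 0), datetime(2024, 2, 2, 11, 10), "automation"),
    (datetime(2024, 2, 21, 22, 0), datetime(2024, 2, 21, 22, 6), "automation"),
]
for month, (mttd, share) in monthly_detection_summary(incidents).items():
    print(month, mttd, f"{share:.0%}")
# 2024-01 0:30:00 50%
# 2024-02 0:08:00 100%
```

A falling MTTD together with a rising automated share suggests that monitoring improvements are taking effect.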
Related Metrics
- [[Mean Time to Recovery (MTTR)]]
- [[Change Failure Rate]]
- [[Incident Frequency]]
- [[Monitoring Coverage]]