Standard: Monitoring is embedded in design and operations
Purpose and Strategic Importance
This standard ensures that monitoring is a first-class capability built into system design, development, and operations. By instrumenting services, infrastructure, and user workflows with real-time metrics, health checks, and user-experience indicators, teams gain the visibility needed to detect anomalies early, troubleshoot effectively, and maintain high levels of service reliability.
Strategic Impact
- Early Detection & Proactive Response: Continuous metrics underpin automated alerts and dashboards, enabling swift remediation before customer impact.
- Operational Excellence: Visibility into performance and capacity drives cost optimization, scaling decisions, and performance tuning.
- Data-Driven Improvement: Historical and real-time data inform release readiness, feature prioritization, and architectural refinements.
- SLA & Compliance Assurance: Monitoring validates adherence to service-level agreements (SLAs), the service-level objectives (SLOs) that underpin them, and regulatory requirements, supporting audits and governance.
Risks of Not Having This Standard
- Blind Spots in Production: Failures and degradations go undetected until they affect users or breach SLAs.
- Inefficient Troubleshooting: Lack of structured metrics increases mean time to diagnosis and extends outages.
- Customer Dissatisfaction: Unnoticed service drift or errors erode trust and satisfaction.
- Operational Debt: Piecemeal, ad hoc monitoring solutions become fragmented and costly to maintain as systems grow.
CMMI Maturity Model
- Level 1 – Initial: Monitoring is ad hoc or manual; teams rely on logs and occasional checks.
- Level 2 – Managed: Key infrastructure and service metrics are collected centrally; basic alerts exist but lack consistency.
- Level 3 – Defined: A standardized metrics schema and dashboards are mandated; service-level indicators (SLIs) and SLOs are defined and tracked.
- Level 4 – Quantitatively Managed: Coverage, accuracy, and latency of monitoring are measured; anomaly detection and dynamic thresholds are introduced.
- Level 5 – Optimizing: Predictive analytics and automated remediation leverage monitoring data; continuous refinement minimizes noise and maximizes relevance.
Key Measures
- Monitoring Coverage: Percentage of services and components emitting standardized metrics and health checks.
- Mean Time to Detect (MTTD): Average time from issue onset to detection via monitoring.
- Alert Precision: Ratio of actionable (true-positive) alerts to total alerts.
- SLO Compliance: Percentage of time services meet defined SLOs.
- Monitoring Latency: Time between event occurrence and metric availability in the central system.
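
The measures above reduce to simple ratios and averages. The following is a minimal Python sketch of how they might be computed, not a prescribed implementation: the function names, the `Incident` record, and the choice of seconds and minutes as units are assumptions made for illustration and are not defined by this standard.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    onset_s: float     # when the issue began (epoch seconds)
    detected_s: float  # when monitoring first flagged it (epoch seconds)


def monitoring_coverage(instrumented: int, total: int) -> float:
    """Percentage of services emitting standardized metrics and health checks."""
    return 100.0 * instrumented / total if total else 0.0


def mean_time_to_detect(incidents: list[Incident]) -> float:
    """Average seconds from issue onset to detection (needs >= 1 incident)."""
    return mean(i.detected_s - i.onset_s for i in incidents)


def alert_precision(actionable: int, total_alerts: int) -> float:
    """Ratio of actionable (true-positive) alerts to all alerts, in [0, 1]."""
    return actionable / total_alerts if total_alerts else 0.0


def slo_compliance(minutes_within_slo: int, total_minutes: int) -> float:
    """Percentage of the measurement window in which the SLO was met."""
    return 100.0 * minutes_within_slo / total_minutes if total_minutes else 0.0


def monitoring_latency(event_s: list[float], available_s: list[float]) -> float:
    """Average seconds between an event and its metric becoming available."""
    return mean(a - e for e, a in zip(event_s, available_s))
```

Under these definitions, a team with 45 of 60 services instrumented would report 75% coverage, and 30 actionable alerts out of 40 would give an alert precision of 0.75.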