Monitoring
Monitoring is the practice of collecting, analysing, and visualising metrics, logs, traces, and events to understand the behaviour and performance of systems.
Effective monitoring enables observability, supports rapid issue detection, and provides feedback for continuous improvement.
Level 1 – Initial (Ad Hoc)
Monitoring is limited, reactive, and team-specific.
Most issues are discovered through user reports or support tickets rather than through automated detection.
- No consistent monitoring across environments
- Teams rely on manual log inspection
- Alerts are noisy or absent
- Metrics, logs, and traces are not correlated
- Root cause analysis is slow and incomplete
Level 2 – Managed (Emerging Practice)
Basic monitoring is implemented for key components, often using off-the-shelf tools.
Teams start tracking system health, but coverage and depth vary.
- Dashboards and uptime monitors exist for some systems
- Alerts may be configured but are not always actionable
- Monitoring focuses on infrastructure (e.g. CPU, memory), not user experience
- Issues may still be detected late or missed entirely
- Monitoring is seen as an ops responsibility
Level 3 – Defined (Standardised)
Monitoring is integrated across environments, pipelines, and teams.
System-level and application-level metrics are captured and used consistently.
- Metrics cover performance, availability, and user impact
- Logs, traces, and metrics are centralised and correlated
- Alerting is rule-based and tied to business-impact thresholds
- Teams monitor SLIs/SLOs and can identify anomalies proactively
- Monitoring supports incident response and continuous delivery
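To make the SLI/SLO bullet concrete, here is a minimal sketch of an availability SLI checked against an SLO target. The names `RequestWindow`, `availability_sli`, and `breaches_slo`, and the 99.9% target, are illustrative assumptions rather than part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts observed over one evaluation window (hypothetical shape)."""
    total: int
    failed: int


def availability_sli(window: RequestWindow) -> float:
    """Return the fraction of successful requests in the window.

    A window with no traffic is treated as fully available, which is an
    assumption; some teams prefer to treat it as "no data" instead.
    """
    if window.total == 0:
        return 1.0
    return (window.total - window.failed) / window.total


def breaches_slo(window: RequestWindow, slo_target: float = 0.999) -> bool:
    """Return True when the measured SLI falls below the SLO target."""
    return availability_sli(window) < slo_target


if __name__ == "__main__":
    window = RequestWindow(total=10_000, failed=25)
    print(f"SLI: {availability_sli(window):.4f}, breach: {breaches_slo(window)}")
```

In practice the counts would come from the centralised metrics store, and a breach would feed the rule-based alerting described above rather than a print statement.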
Level 4 – Quantitatively Managed (Measured & Controlled)
Monitoring is data-driven, automated, and actionable.
System health is continuously evaluated against service and product objectives.
- Teams track mean time to detect (MTTD), mean time to resolve (MTTR), and alert fatigue
- Monitoring coverage is assessed regularly and gaps are addressed
- SLAs/SLOs are enforced with alerting tied to error budgets
- Monitoring informs capacity planning, release readiness, and risk assessments
- Instrumentation is treated as code and version-controlled
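The measures named at this level can be sketched in a few lines. The example below computes MTTD and MTTR from incident timestamps and the share of an error budget still unspent. The `Incident` record and function names are hypothetical, and MTTR is measured here from the start of the incident; some teams measure it from detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Key timestamps for one incident (hypothetical record shape)."""
    started: datetime
    detected: datetime
    resolved: datetime


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time from incident start to detection."""
    seconds = [(i.detected - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time from incident start to resolution (assumed definition)."""
    seconds = [(i.resolved - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget left; negative means it is overspent."""
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures == 0:
        return 0.0  # a 100% target or an empty window leaves no budget
    return 1.0 - failed / allowed_failures


if __name__ == "__main__":
    incidents = [
        Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 10),
                 datetime(2024, 1, 1, 11, 0)),
        Incident(datetime(2024, 1, 2, 12, 0), datetime(2024, 1, 2, 12, 20),
                 datetime(2024, 1, 2, 12, 40)),
    ]
    print("MTTD:", mttd(incidents))
    print("MTTR:", mttr(incidents))
    print("Budget left:", error_budget_remaining(0.99, total=10_000, failed=25))
```

Alerting tied to error budgets would typically fire when the remaining fraction drops below an agreed threshold, or when it is being consumed faster than the window allows.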
Level 5 – Optimising (Continuous Improvement)
Monitoring is part of a holistic observability strategy.
It enables predictive insights, real-time adaptation, and organisational learning.
- Teams use anomaly detection, adaptive alerting, and distributed tracing
- Monitoring data informs product, UX, and architectural decisions
- Feedback loops from monitoring fuel system and process improvements
- Monitoring is embedded into feature design, not just ops
- Observability enables teams to move fast, recover quickly, and experiment safely
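As a rough illustration of the anomaly detection mentioned at this level, the sketch below flags points that deviate sharply from a trailing window, using a simple z-score. This is one basic technique chosen for illustration, not a recommendation of a specific method. The function name, window size, and threshold are assumptions.

```python
from statistics import mean, pstdev


def detect_anomalies(values: list[float], window: int = 5,
                     threshold: float = 3.0) -> list[int]:
    """Return indices whose value lies more than `threshold` standard
    deviations from the mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Flat history: the z-score is undefined, so skip this point.
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    latencies_ms = [100, 102, 98, 101, 99, 100, 250, 101]
    print(detect_anomalies(latencies_ms))  # prints [6]
```

A fixed threshold like this is only a starting point. Adaptive alerting would adjust the window and threshold for seasonality and traffic patterns.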