Ragan McGill

Monitoring

Monitoring is the practice of collecting, analysing, and visualising metrics, logs, traces, and events to understand the behaviour and performance of systems.
Effective monitoring enables observability, supports rapid issue detection, and provides feedback for continuous improvement.

Level 1 – Initial (Ad Hoc)

Monitoring is limited, reactive, and team-specific.
Most issues are discovered via user reports or support tickets, not by the system itself.

No consistent monitoring across environments
Teams rely on manual log inspection
Alerts are noisy or absent
No correlation between metrics, logs, or traces
Root cause analysis is slow and incomplete

Level 2 – Managed (Emerging Practice)

Basic monitoring is implemented for key components, often using off-the-shelf tools.
Teams start tracking system health, but coverage and depth vary.

Dashboards and uptime monitors exist for some systems
Alerts may be configured but are not always actionable
Monitoring focuses on infrastructure (e.g. CPU, memory), not user experience
Issues may still be detected late or missed entirely
Monitoring is seen as an ops responsibility

Level 3 – Defined (Standardised)

Monitoring is integrated across environments, pipelines, and teams.
System and application-level metrics are captured and used consistently.

Metrics cover performance, availability, and user impact
Logs, traces, and metrics are centralised and correlated
Alerting is rule-based and tied to business-impact thresholds
Teams monitor SLIs/SLOs and can identify anomalies proactively
Monitoring supports incident response and continuous delivery
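
As an illustration of what monitoring SLIs against SLOs can look like in practice, the sketch below computes an availability SLI from request counts and compares it with an SLO target. The `RequestWindow` shape, the function names, and the 99.5% target are assumptions made for this example, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts for one evaluation window (hypothetical shape)."""
    total: int
    successful: int


def availability_sli(window: RequestWindow) -> float:
    """Fraction of requests that succeeded; treated as 1.0 when there was no traffic."""
    if window.total == 0:
        return 1.0
    return window.successful / window.total


def meets_slo(window: RequestWindow, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return availability_sli(window) >= slo_target


# Illustrative numbers only: 10 failed requests out of 10,000.
window = RequestWindow(total=10_000, successful=9_990)
print(f"SLI: {availability_sli(window):.4f}")
print("SLO met:", meets_slo(window, slo_target=0.995))
```

A real setup would typically read these counts from the centralised metrics store rather than hard-coding them, and would evaluate the SLI over a rolling window.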

Level 4 – Quantitatively Managed (Measured & Controlled)

Monitoring is data-driven, automated, and actionable.
System health is continuously evaluated against service and product objectives.

Teams track mean time to detect (MTTD), mean time to resolve (MTTR), and alert fatigue
Monitoring coverage is assessed regularly and gaps are addressed
SLAs/SLOs are enforced with alerting tied to error budgets
Monitoring informs capacity planning, release readiness, and risk assessments
Instrumentation is treated as code and version-controlled
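
The measures in this level reduce to simple arithmetic. The sketch below is a minimal example, assuming incidents are recorded as (started, detected, resolved) timestamps and that MTTR is measured from incident start (some teams measure it from detection instead). It also shows an error-budget burn rate, where the budget is 1 minus the SLO target. All names and figures are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_minutes(intervals):
    """Mean length, in minutes, of a list of (start, end) datetime pairs."""
    return mean((end - start).total_seconds() / 60 for start, end in intervals)


def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """Rate at which the error budget is being consumed.

    The budget is 1 - slo_target. A burn rate of 1.0 spends exactly the
    whole budget over the SLO period; higher values spend it sooner.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return observed_error_rate / (1.0 - slo_target)


# Hypothetical incident records: (started, detected, resolved).
t0 = datetime(2024, 1, 1, 12, 0)
incidents = [
    (t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=34)),
    (t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=70)),
]

mttd = mean_minutes([(started, detected) for started, detected, _ in incidents])
mttr = mean_minutes([(started, resolved) for started, _, resolved in incidents])
print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")

# 0.5% of requests failing against a 99.9% SLO burns the budget about 5x too fast.
print(f"Burn rate: {burn_rate(0.005, 0.999):.1f}")
```

Alerting on burn rate rather than on raw error counts ties the alert to how quickly the service is heading towards an SLO breach, which is one way to make alerts more actionable.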

Level 5 – Optimising (Continuous Improvement)

Monitoring is part of a holistic observability strategy.
It enables predictive insights, real-time adaptation, and organisational learning.

Teams use anomaly detection, adaptive alerting, and distributed tracing
Monitoring data informs product, UX, and architectural decisions
Feedback loops from monitoring fuel system and process improvements
Monitoring is embedded into feature design, not just ops
Observability enables teams to move fast, recover quickly, and experiment safely
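
Anomaly detection and adaptive alerting can take many forms. One simple possibility, sketched below, is a rolling z-score: each new value is compared with the mean and standard deviation of the preceding window, so the alert threshold follows the recent baseline instead of being fixed. The function name, window size, threshold, and latency figures are assumptions for illustration; production systems often use more robust methods.

```python
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Return indices of values that deviate from the preceding window's mean
    by more than `threshold` standard deviations."""
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(values):
        # Only score a value once a full window of history is available.
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # A flat window (sigma == 0) gives no usable baseline, so skip it.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
        history.append(value)
    return anomalies


# Hypothetical latency samples in milliseconds, with one obvious spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 103, 250, 101, 100]
print(detect_anomalies(latencies_ms))  # prints [7]
```

One limitation of this sketch is that a detected spike still enters the history and inflates the baseline for the next few samples; excluding flagged values or using a median-based estimate are common ways to reduce that effect.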