Ragan McGill

Proactive Notification

Proactive Notification refers to the ability of systems to detect anomalies, performance degradation, or failure conditions and alert the right people before customers are impacted.
It’s a key component of modern incident management and observability, ensuring that teams can respond swiftly, reduce downtime, and protect user trust.

Level 1 – Initial (Ad Hoc)

Notifications are either absent or based on manual observation.
Teams learn about issues from users, or discover them too late to prevent significant impact.

Alerts are missing, misconfigured, or ignored
No distinction between noise and meaningful signals
Teams lack confidence in alerts, leading to alert fatigue
Incidents are often escalated without context or actionable information
Critical issues go unnoticed until they cause business disruption

Level 2 – Managed (Emerging Practice)

Basic alerting exists, typically tied to infrastructure or uptime.
Some incidents are caught early, but coverage is limited and responses are inconsistent.

Static thresholds trigger alerts for known conditions (e.g. CPU > 90%)
Email or Slack notifications may be in place, but are uncoordinated
Alerts may be noisy or fire after customer impact has already occurred
Runbooks or escalation paths may exist but are not followed reliably
Alert ownership is unclear or siloed by role (e.g. ops only)
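
As a rough illustration of the static-threshold alerting described at this level, the sketch below checks a single metric reading against a fixed limit. The names `Alert` and `check_threshold` and the metric name `cpu_percent` are hypothetical, chosen for illustration; only the 90% CPU threshold comes from the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    """A minimal alert record (hypothetical structure)."""
    metric: str
    value: float
    threshold: float
    message: str


def check_threshold(metric: str, value: float, threshold: float) -> Optional[Alert]:
    """Return an Alert when the value is strictly above the threshold, else None."""
    if value > threshold:
        message = f"{metric} is {value:.1f}, above threshold {threshold:.1f}"
        return Alert(metric, value, threshold, message)
    return None


# Example: a CPU reading of 95% against the static 90% threshold.
alert = check_threshold("cpu_percent", 95.0, 90.0)
if alert is not None:
    print(alert.message)
```

A check like this only fires for conditions someone anticipated in advance, and it says nothing about whether customers are affected, which is why alerts at this level can be noisy or late.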

Level 3 – Defined (Standardised)

Notification practices are structured, actionable, and aligned to system health and business impact.
Teams receive meaningful alerts that support rapid triage and response.

Alerts are routed based on service ownership and severity
Notifications are tied to SLIs and SLOs, not just technical metrics
Alert fatigue is managed through tuning and suppression strategies
On-call rotas and escalation paths are clearly defined and followed
Alerts include rich context (logs, traces, linked dashboards)
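
One common way to tie notifications to an SLO rather than a raw technical metric is to alert on error-budget burn rate and route by service ownership and severity. The sketch below is a minimal version under stated assumptions: the function names, the `OWNERS` mapping, the team names, and the default thresholds (page at a burn rate of 14.4, ticket at 1.0) are illustrative choices, not values taken from this document.

```python
from typing import Optional

# Hypothetical service-to-team ownership map.
OWNERS = {"checkout": "payments-oncall", "search": "discovery-oncall"}


def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - SLO target)."""
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / requests) / error_budget


def classify(rate: float, page_at: float = 14.4, ticket_at: float = 1.0) -> Optional[str]:
    """Map a burn rate to a severity; thresholds are assumed defaults."""
    if rate >= page_at:
        return "page"
    if rate >= ticket_at:
        return "ticket"
    return None


def route(service: str, severity: str) -> str:
    """Pick the owning team, falling back to a hypothetical platform rota."""
    team = OWNERS.get(service, "platform-oncall")
    return f"{severity}:{team}"


# Example: 20 failed requests out of 1000 against a 99.9% availability SLO.
rate = burn_rate(errors=20, requests=1000, slo_target=0.999)
severity = classify(rate)
if severity is not None:
    print(route("checkout", severity))
```

A burn rate of 1 means the error budget would be used up exactly at the end of the SLO window, so higher values indicate how much faster than planned the budget is being spent.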

Level 4 – Quantitatively Managed (Measured & Controlled)

Notification systems are tuned based on performance data and feedback.
Alert quality, relevance, and response effectiveness are measured and optimised.

Metrics include false alert rate, alert-to-acknowledge time, and MTTR
Teams measure alert volume per service and per engineer
Intelligent alerting (e.g. anomaly detection, rate of change) is used
Automated escalations and incident triggers improve responsiveness
Post-incident reviews track whether alerts were timely and helpful
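
The alert-quality metrics listed above can be computed from a simple log of past alerts. The sketch below assumes a hypothetical `AlertRecord` shape in which each alert has been marked as actionable or not during review; the field names and sample data are illustrative only.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class AlertRecord:
    """One historical alert (hypothetical schema; times in seconds)."""
    service: str
    fired_at: float
    acked_at: float
    actionable: bool


def false_alert_rate(records: List[AlertRecord]) -> float:
    """Fraction of alerts that reviewers marked as not actionable."""
    if not records:
        return 0.0
    return sum(1 for r in records if not r.actionable) / len(records)


def mean_time_to_ack(records: List[AlertRecord]) -> float:
    """Average alert-to-acknowledge time in seconds."""
    if not records:
        return 0.0
    return mean(r.acked_at - r.fired_at for r in records)


def volume_per_service(records: List[AlertRecord]) -> Dict[str, int]:
    """Number of alerts fired per service."""
    return dict(Counter(r.service for r in records))


# Illustrative sample data, not real measurements.
history = [
    AlertRecord("checkout", 0.0, 60.0, True),
    AlertRecord("checkout", 100.0, 220.0, True),
    AlertRecord("search", 200.0, 380.0, False),
    AlertRecord("search", 300.0, 540.0, True),
]

print(false_alert_rate(history))
print(mean_time_to_ack(history))
print(volume_per_service(history))
```

Tracking these figures over time, rather than as one-off snapshots, is what lets a team tell whether tuning changes actually improved alert quality.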

Level 5 – Optimising (Continuous Improvement)

Proactive notifications are intelligent, adaptive, and fully integrated into incident and product feedback loops.
The system detects and responds to failure patterns before customers notice.

Predictive alerting identifies degradation before failure
Notifications adapt to usage patterns and business impact
Alerts are simulated and tested regularly (e.g. through chaos engineering)
Alert quality is continuously improved through feedback and automation
Notifications inform product priorities, engineering focus, and platform evolution
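
Predictive alerting can take many forms. As one simple illustration, the sketch below fits a straight line to recent readings of a metric and estimates how long until it crosses a limit, so a notification can go out before the failure occurs. The function names, the disk-usage scenario, and the four-hour horizon are assumptions made for this example; real systems often use more robust forecasting.

```python
from typing import Optional, Sequence


def seconds_until_threshold(
    times: Sequence[float], values: Sequence[float], threshold: float
) -> Optional[float]:
    """Estimate seconds from the last sample until a least-squares trend line
    reaches the threshold. Returns None if there is no upward trend."""
    n = len(times)
    if n < 2 or n != len(values):
        return None
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    denom = sum((t - mean_t) ** 2 for t in times)
    if denom == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values)) / denom
    if slope <= 0:
        return None  # flat or improving: no crossing predicted
    intercept = mean_v - slope * mean_t
    crossing_time = (threshold - intercept) / slope
    return max(0.0, crossing_time - times[-1])


def should_notify(remaining: Optional[float], horizon_seconds: float) -> bool:
    """Notify when a crossing is predicted within the horizon."""
    return remaining is not None and remaining <= horizon_seconds


# Hypothetical disk-usage readings (percent) taken one hour apart.
sample_times = [0.0, 3600.0, 7200.0]
sample_usage = [70.0, 75.0, 80.0]

remaining = seconds_until_threshold(sample_times, sample_usage, threshold=90.0)
if should_notify(remaining, horizon_seconds=4 * 3600):
    print("Disk usage is predicted to reach 90% within four hours")
```

A linear fit is only reasonable for metrics that grow steadily; for seasonal or bursty signals, the same notify-before-crossing idea applies but the forecasting model would need to change.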