Ragan McGill

Playbook: Observability & Monitoring

🌐 Purpose

To ensure our engineering systems are visible, diagnosable, and measurable in real time, so that teams can detect, understand, and resolve issues quickly. This playbook outlines how to embed observability into our products from day one, rather than bolting it on after incidents.

⚖️ Principles

You build it, you own it - including how it’s monitored
Observability is an engineering capability, not an ops afterthought
Monitoring tells you what is broken; observability helps explain why
Signal over noise: optimise for fast, actionable feedback
Make problems easier to find and faster to fix

✅ What Good Looks Like

All systems expose health metrics, logs, and traces by default
Alerts fire on symptoms users care about, not just on internal thresholds being crossed
Dashboards are actionable and owned by teams
Teams respond quickly, learn fast, and feed insight back into design

📊 Key Observability Components

1. Metrics

Quantitative indicators (e.g. latency, error rates, throughput)
Aggregated over time for trend analysis
Examples: request count, CPU usage, SLO compliance
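
As a rough illustration of how raw request records become the aggregated indicators listed above, the Python sketch below computes request rate, error ratio, and a nearest-rank 95th-percentile latency for one time window. The `Request` record, the `red_summary` function, and the decision to count only 5xx statuses as errors are assumptions made for this example; they are not taken from any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record shape)."""
    duration_ms: float
    status: int


def red_summary(requests: Sequence[Request], window_seconds: float) -> dict:
    """Aggregate one window of requests into rate, error ratio, and p95 latency."""
    total = len(requests)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": None}

    # Assumption: only server-side failures (HTTP 5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)

    # Nearest-rank percentile: the smallest sample with at least 95% of
    # all samples at or below it.
    durations = sorted(r.duration_ms for r in requests)
    rank = (95 * total + 99) // 100  # integer form of ceil(0.95 * total)
    p95 = durations[rank - 1]

    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total,
        "p95_ms": p95,
    }
```

In practice a metrics library or backend would normally do this aggregation; the sketch only shows what the numbers on a dashboard mean.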

2. Logs

Structured, queryable records of events
Centralised with standard fields (e.g. request ID, timestamp)
Use for forensic debugging and auditing
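
One possible shape for structured logs with standard fields, sketched with Python's standard `logging` module. The field names, the `checkout` logger name, and the `req-123` identifier are illustrative assumptions; teams should agree on their own schema.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set by callers through `extra={"request_id": ...}`.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def example_line() -> str:
    """Log one event into an in-memory buffer and return the JSON line."""
    buffer = io.StringIO()
    logger = build_logger(buffer)
    logger.info("payment authorised", extra={"request_id": "req-123"})
    return buffer.getvalue()
```

Because every line is a JSON object carrying the same keys, a central log store can filter on `request_id` to reconstruct everything that happened to one request.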

3. Traces

End-to-end request visibility across systems and services
Useful for identifying bottlenecks and latency hotspots

4. Dashboards

Real-time system views tied to key user journeys or services
Should tell a story at a glance (health, performance, risk)

5. Alerts

Triggered by user-impacting symptoms, not internal implementation details
Tied to SLOs or error budgets
Routed to the right team, with actionable context
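
A minimal sketch of an alert tied to an error budget: it pages only when both a long and a short window are burning budget quickly, and it attaches routing and context to the page. The default burn-rate threshold of 14.4 corresponds to spending 2% of a 30-day error budget in one hour; treat it, the 99.9% target, the team name, and the runbook path as assumptions to be replaced with your own values.

```python
from typing import Optional


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def evaluate_alert(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float = 0.999,
    threshold: float = 14.4,
) -> Optional[dict]:
    """Return a page payload when both windows exceed the burn-rate threshold."""
    long_burn = burn_rate(long_window_error_ratio, slo_target)
    short_burn = burn_rate(short_window_error_ratio, slo_target)

    # The long window shows the problem is significant; the short window
    # shows it is still happening, so recovered incidents stop paging.
    if long_burn < threshold or short_burn < threshold:
        return None

    return {
        "team": "checkout-oncall",  # hypothetical routing target
        "summary": "Checkout availability is burning error budget too fast",
        "runbook": "runbooks/checkout-availability",  # hypothetical path
        "long_window_burn_rate": round(long_burn, 1),
        "short_window_burn_rate": round(short_burn, 1),
    }
```

For example, `evaluate_alert(0.02, 0.03)` returns a page payload, while `evaluate_alert(0.02, 0.0005)` returns `None` because the short window shows the error rate has already dropped.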

⚡ Engineering Practices

Embed Observability Early

Define what success looks like from a user/system perspective
Instrument code with semantic logging, metrics, and tracing

Shift Left on Monitoring

Include observability in Definition of Done
Review telemetry coverage in code reviews
Monitor from the perspective of the end-user experience

Automate Where Possible

Use standard observability sidecars or libraries
Define alerts and dashboards as infrastructure-as-code
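
One way to treat alert definitions as code is to keep them as reviewed data and validate them in CI before they are deployed. The sketch below assumes a hypothetical `AlertRule` shape and a `validate` check that every rule has an owner, a runbook, and a linked SLO; the `expression` strings are placeholders, not a real query language.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class AlertRule:
    """One alert definition kept under version control (hypothetical shape)."""
    name: str
    expression: str  # query in whatever language the monitoring backend uses
    owner: str
    runbook: str
    slo: str


def validate(rules: Sequence[AlertRule]) -> List[str]:
    """Return human-readable problems; an empty list means the rules pass."""
    problems = []
    seen = set()
    for rule in rules:
        if rule.name in seen:
            problems.append(f"{rule.name}: duplicate rule name")
        seen.add(rule.name)
        # Every alert needs someone to route to, something to follow,
        # and an SLO that justifies waking someone up.
        for field in ("owner", "runbook", "slo"):
            if not getattr(rule, field).strip():
                problems.append(f"{rule.name}: missing {field}")
    return problems
```

A CI job could fail the build whenever `validate(...)` returns a non-empty list, so unowned or undocumented alerts never reach production.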

Run Failure & Alert Simulations

Test that dashboards and alerts are meaningful
Include observability gaps in incident reviews
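
A small sketch of an alert simulation: feed synthetic traffic through an alert rule and check that it fires for a real degradation and stays quiet otherwise. The rule, the 5% threshold, and the traffic generator are hypothetical stand-ins for whatever rules and load tooling a team actually uses.

```python
from typing import List


def error_ratio_alert(statuses: List[int], threshold: float = 0.05) -> bool:
    """Fire when the share of 5xx responses exceeds the threshold."""
    if not statuses:
        return False
    errors = sum(1 for s in statuses if s >= 500)
    return errors / len(statuses) > threshold


def synthetic_traffic(total: int, fail_every: int = 0) -> List[int]:
    """Generate statuses where every `fail_every`-th request fails (0 = none)."""
    return [
        503 if fail_every and i % fail_every == 0 else 200
        for i in range(1, total + 1)
    ]


def run_simulation() -> dict:
    """Check the rule against healthy, degraded, and minor-error traffic."""
    return {
        "healthy": error_ratio_alert(synthetic_traffic(1000)),
        "degraded": error_ratio_alert(synthetic_traffic(1000, fail_every=4)),
        "below_threshold": error_ratio_alert(synthetic_traffic(1000, fail_every=50)),
    }
```

The useful outcome is a mismatch: if the rule stays silent under the degraded scenario, or fires under the healthy one, that is an observability gap to record alongside incident findings.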

Make Observability Actionable

Avoid vanity dashboards or unactionable alerts
Focus on system health, degradation patterns, and performance trends

📈 Key Metrics to Track

Request rate, error rate, and duration/latency (RED metrics)
Saturation metrics (e.g. queue length, CPU usage)
SLO compliance and burn rate
Alert volume and signal-to-noise ratio
Mean time to recovery (MTTR) and time to detect incidents
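
The incident and alert-quality figures above can be derived from simple records. The sketch below assumes a hypothetical `Incident` record and measures both detection and recovery from the start of user impact; definitions vary between organisations, so agree on one before comparing numbers.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass(frozen=True)
class Incident:
    started: datetime   # when user impact began
    detected: datetime  # when an alert or a person first noticed
    resolved: datetime  # when user impact ended


def mean_time_to_detect(incidents: Sequence[Incident]) -> timedelta:
    """Average gap between the start of impact and detection."""
    total = sum((i.detected - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mean_time_to_recover(incidents: Sequence[Incident]) -> timedelta:
    """Average gap between the start of impact and resolution."""
    total = sum((i.resolved - i.started for i in incidents), timedelta())
    return total / len(incidents)


def actionable_alert_ratio(actionable: int, total: int) -> float:
    """Share of fired alerts that led to a human action (signal-to-noise proxy)."""
    return actionable / total if total else 0.0
```

Tracking these per team over time shows whether alert tuning and instrumentation work are shortening detection and recovery, or only adding noise.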

🔧 Tools & Platforms

Metrics/Monitoring: Prometheus, Datadog, New Relic, CloudWatch
Logging: ELK/EFK stack, Loki, Splunk, Fluentd
Tracing: OpenTelemetry, Jaeger, Honeycomb, Zipkin
Dashboards: Grafana, Kibana, Datadog
Alerting & Incident Routing: PagerDuty, Opsgenie, VictorOps (now Splunk On-Call)

🔑 Governance Link

This playbook supports:

Policy: Resilience Over Uptime, Data-Driven Decision-Making
Standards: Embed Telemetry in Every System, Implement Real-Time Alerts, Measure Engineering Outcomes, Build Observability from Day One

📖 Further Reading

Google SRE Book (Chapters on Monitoring & Alerting)
"Observability Engineering" by Charity Majors et al.
RED (Rate, Errors, Duration) and USE (Utilisation, Saturation, Errors) monitoring methods
OpenTelemetry project and documentation
Honeycomb’s Observability Maturity Model