Playbook: Observability & Monitoring
Purpose
To ensure our engineering systems are visible, diagnosable, and measurable in real time - enabling teams to detect, understand, and resolve issues quickly. This playbook outlines how to embed observability into our products from day one, not just bolt it on after incidents.
Principles
- You build it, you own it - including how it's monitored
- Observability is an engineering capability, not an ops afterthought
- Monitoring tells you what is broken; observability helps explain why
- Signal over noise: optimise for fast, actionable feedback
- Make problems easier to find and faster to fix
What Good Looks Like
- All systems expose health metrics, logs, and traces by default
- Alerts fire on symptoms users care about - not just on static resource thresholds
- Dashboards are actionable and owned by teams
- Teams respond quickly, learn fast, and feed insight back into design
Key Observability Components
1. Metrics
- Quantitative indicators (e.g. latency, error rates, throughput)
- Aggregated over time for trend analysis
- Examples: request count, CPU usage, SLO compliance
2. Logs
- Structured, queryable records of events
- Centralised with standard fields (e.g. request ID, timestamp)
- Use for forensic debugging and auditing
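As a minimal sketch of structured logs with standard fields, the following uses only Python's standard `logging` and `json` modules. The field names (`timestamp`, `request_id`, and so on), the `JsonFormatter` class, and the `checkout` logger name are illustrative assumptions, not a prescribed schema.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with standard fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # request_id arrives via `extra=`; it is None when not supplied.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str = "checkout") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr (hypothetical setup)."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("payment authorised", extra={"request_id": "req-123"})
```

Because every line is a JSON object carrying the same keys, a central log store can filter on `request_id` directly instead of parsing free text.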
3. Traces
- End-to-end request visibility across systems and services
- Useful for identifying bottlenecks and latency hotspots
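To illustrate how a trace links work across nested operations, here is a toy span recorder built on the standard library. It is a sketch for explanation only, not a substitute for a tracing library such as OpenTelemetry; `Span`, `span`, `FINISHED`, and `slowest_child` are hypothetical names.

```python
import time
import uuid
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float = 0.0


_current_span: ContextVar[Optional[Span]] = ContextVar("current_span", default=None)
FINISHED: List[Span] = []  # in-memory stand-in for a trace exporter


@contextmanager
def span(name: str) -> Iterator[Span]:
    """Open a span that inherits the trace ID of the enclosing span, if any."""
    parent = _current_span.get()
    current = Span(
        trace_id=parent.trace_id if parent else uuid.uuid4().hex,
        span_id=uuid.uuid4().hex[:16],
        parent_id=parent.span_id if parent else None,
        name=name,
    )
    token = _current_span.set(current)
    start = time.perf_counter()
    try:
        yield current
    finally:
        current.duration_ms = (time.perf_counter() - start) * 1000
        _current_span.reset(token)
        FINISHED.append(current)


def slowest_child(spans: List[Span]) -> Optional[Span]:
    """Return the non-root span with the longest duration (a crude hotspot finder)."""
    children = [s for s in spans if s.parent_id is not None]
    return max(children, key=lambda s: s.duration_ms, default=None)


if __name__ == "__main__":
    with span("GET /checkout"):
        with span("load_cart"):
            time.sleep(0.01)
        with span("charge_card"):
            time.sleep(0.03)
    print(slowest_child(FINISHED).name)
```

The shared `trace_id` is what lets a backend stitch spans from different services into one request view; the per-span durations are what expose the bottleneck.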
4. Dashboards
- Real-time system views tied to key user journeys or services
- Should tell a story at a glance (health, performance, risk)
5. Alerts
- Triggered by user-impacting symptoms, not internal implementation details
- Tied to SLOs or error budgets
- Routed to the right team, with actionable context
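One way to tie an alert to an error budget is a burn-rate check: the observed error ratio divided by the budget (1 minus the SLO target). The sketch below pages only when both a long and a short window are burning fast, so the alert clears soon after recovery. The `Window` type, the function names, and the default threshold of 14.4 (a commonly cited value for a one-hour window against a 30-day, 99.9% SLO) are assumptions to tune per service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one lookback window."""
    total: int
    errors: int

    @property
    def error_ratio(self) -> float:
        return self.errors / self.total if self.total else 0.0


def burn_rate(window: Window, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target
    return window.error_ratio / error_budget


def should_page(
    long_window: Window,
    short_window: Window,
    slo_target: float,
    threshold: float = 14.4,  # assumed default; tune per service
) -> bool:
    """Page only if both windows exceed the burn-rate threshold."""
    return (
        burn_rate(long_window, slo_target) >= threshold
        and burn_rate(short_window, slo_target) >= threshold
    )


if __name__ == "__main__":
    last_hour = Window(total=100_000, errors=2_000)
    last_5_min = Window(total=1_000, errors=30)
    print(should_page(last_hour, last_5_min, slo_target=0.999))
```

Requiring the short window to agree keeps the alert from firing for an hour after an incident has already been resolved.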
Engineering Practices
Embed Observability Early
- Define what success looks like from a user/system perspective
- Instrument code with semantic logging, metrics, and tracing
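A decorator is one lightweight way to instrument a code path with call counts, error counts, and durations. In this sketch the module-level dictionaries are an in-memory stand-in for a real metrics client, and `instrumented`, `place_order`, and the operation name `checkout.place_order` are hypothetical.

```python
import functools
import time
from collections import defaultdict
from typing import Any, Callable, Dict, List

# In-memory stand-ins for a metrics backend (assumption for illustration).
CALLS: Dict[str, int] = defaultdict(int)
ERRORS: Dict[str, int] = defaultdict(int)
DURATIONS_MS: Dict[str, List[float]] = defaultdict(list)


def instrumented(operation: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Record call count, error count, and duration under a semantic name."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            CALLS[operation] += 1
            try:
                return func(*args, **kwargs)
            except Exception:
                ERRORS[operation] += 1
                raise  # never swallow the error; only observe it
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                DURATIONS_MS[operation].append(elapsed_ms)

        return wrapper

    return decorator


@instrumented("checkout.place_order")
def place_order(cart_total: float) -> str:
    if cart_total <= 0:
        raise ValueError("cart is empty")
    return "order-accepted"
```

Naming the operation after the user-facing action (`checkout.place_order`) rather than the function keeps the telemetry meaningful when the code is refactored.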
Shift Left on Monitoring
- Include observability in Definition of Done
- Review telemetry coverage in code reviews
- Monitor from the perspective of the end-user experience
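Telemetry can be checked in the same test suite as behaviour, which makes "observability in the Definition of Done" something a reviewer can verify. A minimal sketch using `unittest.TestCase.assertLogs` follows; the `orders` logger and `handle_order` function are hypothetical.

```python
import logging
import unittest

logger = logging.getLogger("orders")


def handle_order(request_id: str) -> str:
    """Hypothetical handler that must log the request ID it processes."""
    logger.info("order received", extra={"request_id": request_id})
    return "ok"


class TelemetryCoverageTest(unittest.TestCase):
    def test_handler_logs_request_id(self) -> None:
        # Fails if the handler stops logging or drops the request_id field.
        with self.assertLogs("orders", level="INFO") as captured:
            handle_order("req-42")
        self.assertEqual(len(captured.records), 1)
        self.assertEqual(captured.records[0].request_id, "req-42")
```

Run it with `python -m unittest` alongside the rest of the suite; a change that removes the log line then fails in CI rather than surfacing during an incident.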
Automate Where Possible
- Standard observability sidecars or libraries
- Infra-as-code for alert definitions and dashboards
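Keeping alert definitions in code means they can be linted before deployment. The sketch below models rules as data and checks a few conventions (an owner, a runbook link, a known severity). The `AlertRule` fields, the allowed severities, and the example URL are assumptions; the `expression` is a placeholder for whatever query language the monitoring backend uses.

```python
from dataclasses import dataclass
from typing import List

ALLOWED_SEVERITIES = {"page", "ticket"}  # assumed convention


@dataclass(frozen=True)
class AlertRule:
    name: str
    expression: str   # backend-specific query, kept as an opaque string here
    owner_team: str
    runbook_url: str
    severity: str = "page"


def validate(rules: List[AlertRule]) -> List[str]:
    """Return human-readable problems; an empty list means the rules pass."""
    problems: List[str] = []
    seen = set()
    for rule in rules:
        if rule.name in seen:
            problems.append(f"{rule.name}: duplicate rule name")
        seen.add(rule.name)
        if not rule.owner_team:
            problems.append(f"{rule.name}: missing owner_team")
        if not rule.runbook_url.startswith("https://"):
            problems.append(f"{rule.name}: runbook_url must be an https link")
        if rule.severity not in ALLOWED_SEVERITIES:
            problems.append(f"{rule.name}: unknown severity {rule.severity!r}")
    return problems


if __name__ == "__main__":
    rules = [
        AlertRule(
            name="checkout-error-budget-burn",
            expression="<error ratio query goes here>",
            owner_team="payments",
            runbook_url="https://runbooks.example.com/checkout-errors",
        )
    ]
    for problem in validate(rules):
        print(problem)
```

Running `validate` in CI gives every alert an owner and a runbook before it can page anyone.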
Run Failure & Alert Simulations
- Test that dashboards and alerts are meaningful
- Include observability gaps in incident reviews
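An alert simulation can be as simple as feeding synthetic traffic to the alert condition and confirming it stays quiet when healthy and fires when degraded. In this sketch the 5% threshold, the function names, and the traffic shape are illustrative assumptions.

```python
from typing import Dict, Iterable, List


def error_ratio_alert(outcomes: Iterable[bool], threshold: float = 0.05) -> bool:
    """Return True when the share of failed requests exceeds the threshold."""
    results = list(outcomes)
    if not results:
        return False
    failures = sum(1 for ok in results if not ok)
    return failures / len(results) > threshold


def simulate_traffic(requests: int, fail_every: int = 0) -> List[bool]:
    """Synthetic outcomes: every `fail_every`-th request fails (0 = none fail)."""
    return [
        not (fail_every and i % fail_every == 0)
        for i in range(1, requests + 1)
    ]


def run_game_day() -> Dict[str, bool]:
    """Check the alert against a healthy and a degraded scenario."""
    healthy = simulate_traffic(1000)
    degraded = simulate_traffic(1000, fail_every=10)  # 10% of requests fail
    return {
        "fires_when_healthy": error_ratio_alert(healthy),
        "fires_when_degraded": error_ratio_alert(degraded),
    }


if __name__ == "__main__":
    print(run_game_day())
```

An alert that fires in the healthy scenario is noise, and one that stays silent in the degraded scenario is a gap worth recording in the next incident review.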
Make Observability Actionable
- Avoid vanity dashboards or unactionable alerts
- Focus on system health, degradation patterns, and performance trends
Key Metrics to Track
- Request rate, error rate, and duration (latency): the RED metrics
- Saturation metrics (e.g. queue length, CPU usage)
- SLO compliance and burn rate
- Alert volume and signal-to-noise ratio
- Mean time to recovery (MTTR) and time to detect incidents
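To make the RED metrics concrete, the sketch below derives request rate, error ratio, and 95th-percentile latency from a batch of request samples. The `RequestSample` type, the nearest-rank percentile method, and the choice of p95 are assumptions for illustration; a real metrics backend would normally compute these from histograms.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RequestSample:
    latency_ms: float
    ok: bool


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile; `values` must be non-empty."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def red_summary(samples: List[RequestSample], window_seconds: float) -> Dict[str, float]:
    """Rate, Errors, Duration for one observation window."""
    if not samples:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_latency_ms": 0.0}
    errors = sum(1 for s in samples if not s.ok)
    return {
        "rate_per_s": len(samples) / window_seconds,
        "error_ratio": errors / len(samples),
        "p95_latency_ms": percentile([s.latency_ms for s in samples], 95),
    }


if __name__ == "__main__":
    window = [RequestSample(10.0 * i, ok=(i not in (5, 15))) for i in range(1, 21)]
    print(red_summary(window, window_seconds=10.0))
```

A percentile is used for duration instead of a mean because a small number of very slow requests can hide behind a healthy-looking average.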
Example Tooling
- Metrics/Monitoring: Prometheus, Datadog, New Relic, CloudWatch
- Logging: ELK/EFK stack, Loki, Splunk, Fluentd
- Tracing: OpenTelemetry, Jaeger, Honeycomb, Zipkin
- Dashboards: Grafana, Kibana, Datadog
- Alerting & Incident Routing: PagerDuty, Opsgenie, VictorOps (now Splunk On-Call)
Governance Link
This playbook supports:
- Policy: Resilience Over Uptime, Data-Driven Decision-Making
- Standards: Embed Telemetry in Every System, Implement Real-Time Alerts, Measure Engineering Outcomes, Build Observability from Day One
Further Reading
- Google SRE Book (Chapters on Monitoring & Alerting)
- "Observability Engineering" by Charity Majors et al.
- RED & USE monitoring frameworks
- OpenTelemetry project and documentation
- Honeycomb's Observability Maturity Model