Playbook: Blameless Post-Mortems
🌐 Purpose
To establish a consistent, psychologically safe, and learning-focused approach to conducting post-incident reviews. This playbook ensures that teams turn failures into fuel for improvement, without fear, blame, or defensiveness.
⚖️ Principles
- Focus on systems and processes, not individuals
- Assume everyone did their best with the information available
- Surface and share learnings across teams
- Prioritise transparency, humility, and continuous improvement
- Translate insights into actionable change
✅ Outcomes We Expect
- Stronger systems through shared learning
- Reduced recurrence of similar incidents
- A culture of psychological safety and openness
- Better cross-team alignment on risk, ownership, and design
⚡ When to Run a Post-Mortem
- Severity 1 or 2 incidents (customer/business impact)
- Near misses or high-risk bugs caught before production
- Any event that uncovered a systemic failure or process gap
- At team discretion when learning potential is high
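The criteria above can be encoded as a simple check, for example inside an incident-tool webhook. This is a minimal sketch, not a prescribed implementation: the `Incident` fields and the `needs_post_mortem` function are hypothetical names chosen for illustration, and it assumes severity is an integer where 1 is the most severe.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    severity: Optional[int] = None    # 1 = most severe; None if never classified
    near_miss: bool = False           # caught before reaching production
    systemic_gap_found: bool = False  # exposed a systemic failure or process gap
    team_requested: bool = False      # team judged the learning potential high


def needs_post_mortem(incident: Incident) -> bool:
    """Return True if any of the playbook's trigger criteria applies."""
    if incident.severity is not None and incident.severity <= 2:
        return True
    return (
        incident.near_miss
        or incident.systemic_gap_found
        or incident.team_requested
    )
```

A Sev3 incident with no other flags returns `False`, while the same incident flagged as a near miss returns `True`.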
📊 Post-Mortem Process
1. Incident Review Triggered
- Auto-trigger from the incident management tool (e.g., when a Sev1 incident is closed)
- Assign a facilitator who was not one of the incident responders
2. Data Collection & Timeline
- Gather logs, monitoring data, Slack threads, call transcripts
- Build a shared timeline of the incident events (who, what, when)
3. Facilitated Review Meeting
- Invite relevant participants (including those outside the team)
- Create a psychologically safe space (restate blameless principle)
- Walk through the timeline collaboratively and ask open questions
4. Identify Contributing Factors
- Focus on conditions and decision-making, not on assigning blame or finding a single root cause
- Use techniques like “5 whys” or causal loop diagrams
- Capture gaps in process, tooling, communication, or design
5. Define Follow-Up Actions
- Identify both tactical remediations and systemic improvements
- Assign owners and due dates for each action
- Add to backlog, OKRs, or work tracker
6. Publish & Share Learnings
- Write a short, structured incident review (template below)
- Share in cross-team channels, learning repositories, or show & tells
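Step 2 of the process asks for a shared timeline built from several sources. One way to sketch that merge is shown below. The `TimelineEvent` type, the `build_timeline` function, and the sample events are all hypothetical and for illustration only. The sketch assumes every timestamp is timezone-aware, and it records roles rather than names to stay in keeping with the blameless principle.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List


@dataclass(frozen=True)
class TimelineEvent:
    """One entry in the shared timeline (hypothetical structure)."""
    at: datetime  # must be timezone-aware so sources can be compared
    source: str   # e.g. "monitoring", "chat", "call transcript"
    actor: str    # a role or system, not a named individual
    what: str     # short factual description of what happened


def build_timeline(*sources: Iterable[TimelineEvent]) -> List[TimelineEvent]:
    """Merge events from every source into one chronological list."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event.at)


# Illustrative data only; not taken from a real incident.
monitoring = [
    TimelineEvent(datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
                  "monitoring", "alerting system", "error-rate alert fired"),
]
chat = [
    TimelineEvent(datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc),
                  "chat", "on-call engineer", "acknowledged page"),
    TimelineEvent(datetime(2024, 1, 1, 9, 58, tzinfo=timezone.utc),
                  "chat", "deploy bot", "deploy started"),
]

timeline = build_timeline(monitoring, chat)
```

Keeping the `source` on each event lets the review meeting trace any entry back to the log, thread, or transcript it came from.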
🔹 Structured Review Template
- Title: Clear, descriptive title
- Summary: What happened, impact, response
- Timeline: Sequence of events
- Contributing Factors: Conditions that made the incident possible (not a single root cause)
- What Went Well: Acknowledge effective response elements
- Areas for Improvement: Systemic insights
- Actions: Concrete remediations and improvements
- Links: Related logs, dashboards, tickets, etc.
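To keep published reviews consistent, the template fields above could be captured in a small data structure and rendered to text. The sketch below is one possible shape, not a required format: `IncidentReview` and `render` are hypothetical names, and the plain-text output layout is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentReview:
    """Mirrors the template fields; the structure itself is illustrative."""
    title: str
    summary: str
    timeline: List[str] = field(default_factory=list)
    contributing_factors: List[str] = field(default_factory=list)
    what_went_well: List[str] = field(default_factory=list)
    areas_for_improvement: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)
    links: List[str] = field(default_factory=list)


def render(review: IncidentReview) -> str:
    """Render the review as plain text, one bulleted section per field."""
    sections = [
        ("Timeline", review.timeline),
        ("Contributing Factors", review.contributing_factors),
        ("What Went Well", review.what_went_well),
        ("Areas for Improvement", review.areas_for_improvement),
        ("Actions", review.actions),
        ("Links", review.links),
    ]
    lines = [f"Title: {review.title}", f"Summary: {review.summary}"]
    for heading, items in sections:
        lines.append(f"{heading}:")
        # Make empty sections visible so gaps are noticed before publishing.
        for item in (items or ["(none recorded)"]):
            lines.append(f"- {item}")
    return "\n".join(lines)
```

Rendering empty sections explicitly, rather than omitting them, makes it harder to publish a review that skips "What Went Well" or "Actions" by accident.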
🛠️ Supporting Tools
- Incident tools: PagerDuty, Opsgenie, FireHydrant
- Documentation: Confluence, Notion, Google Docs
- Workflow tracking: Jira, Linear, Trello
- Communication: Slack/Teams integrations to nudge post-mortem creation
🔄 Continuous Improvement
- Review follow-up actions in retrospectives
- Track action closure rate and time-to-learn (for example, the time from incident to published post-mortem)
- Regularly review themes and patterns across post-mortems
- Create summary digests or "quarterly incident learning" sessions
🔧 Key Roles
- Facilitator: Guides the session neutrally
- Incident Responder(s): Share their lived experience of the incident
- Engineering Owner: Accountable for follow-up
- Scribe: Captures notes, actions, and decisions
📈 Metrics to Monitor
- % of incidents with completed post-mortems
- % of actions closed within 30 days
- Number of repeat incident themes
- Average time from incident to published post-mortem
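The four metrics above can be computed from exported incident data. The sketch below shows one way to do it under stated assumptions: the `Action` and `IncidentRecord` types and the `post_mortem_metrics` function are hypothetical, each incident is assumed to carry a single theme label, and an action that is still open counts as not closed within the window.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Action:
    opened: date
    closed: Optional[date] = None  # None while the action is still open


@dataclass
class IncidentRecord:
    occurred: date
    theme: str                        # assumed single theme label per incident
    published: Optional[date] = None  # date the post-mortem was published
    actions: List[Action] = field(default_factory=list)


def post_mortem_metrics(incidents: List[IncidentRecord],
                        window_days: int = 30) -> Dict[str, Optional[float]]:
    """Compute the playbook's four metrics over a list of incidents."""
    published = [i for i in incidents if i.published is not None]
    actions = [a for i in incidents for a in i.actions]
    closed_in_window = [
        a for a in actions
        if a.closed is not None and (a.closed - a.opened).days <= window_days
    ]
    theme_counts = Counter(i.theme for i in incidents)
    return {
        "pct_with_post_mortem":
            100 * len(published) / len(incidents) if incidents else 0.0,
        "pct_actions_closed_in_window":
            100 * len(closed_in_window) / len(actions) if actions else 0.0,
        # Number of themes that appear in more than one incident.
        "repeat_themes": sum(1 for n in theme_counts.values() if n > 1),
        "avg_days_to_publish":
            sum((i.published - i.occurred).days for i in published) / len(published)
            if published else None,
    }
```

These numbers are best read as signals about the learning process itself, not as targets for individual teams or people.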
🔑 Governance Link
This playbook supports:
- Policy: Post-Incident Learning Culture, Psychological Safety First
- Standards: Conduct Blameless Post-Mortems, Classify Incidents, Ensure Every Post-Mortem Results in Concrete Actions, Share Learnings Across Teams
📖 Further Reading
- "Blameless PostMortems and a Just Culture" – Etsy
- "Site Reliability Engineering" – Google SRE Book (Chapter 15)
- "How to Run a Post-Mortem" – Incident.io Guide