Ragan McGill

Playbook: Blameless Post-Mortems

🌐 Purpose

To establish a consistent, psychologically safe, and learning-focused approach to conducting post-incident reviews. This playbook ensures that teams turn failures into fuel for improvement, without fear, blame, or defensiveness.

⚖️ Principles

Focus on systems and processes, not individuals
Assume everyone did their best with the information available
Surface and share learnings across teams
Prioritise transparency, humility, and continuous improvement
Translate insights into actionable change

✅ Outcomes We Expect

Stronger systems through shared learning
Reduced recurrence of similar incidents
A culture of psychological safety and openness
Better cross-team alignment on risk, ownership, and design

⚡ When to Run a Post-Mortem

Severity 1 or 2 incidents (customer/business impact)
Near misses or high-risk bugs caught before production
Any event that uncovered a systemic failure or process gap
At team discretion when learning potential is high
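
The triggers above can be read as a simple decision rule. The following is a minimal Python sketch, assuming a hypothetical `Incident` record. Its field names (`severity`, `near_miss`, `systemic_gap`, `team_requested`) are invented here for illustration and do not come from any incident tool's API.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical fields that mirror the four triggers listed above.
    severity: int                 # 1 is the most severe
    near_miss: bool = False       # near miss or high-risk bug caught before production
    systemic_gap: bool = False    # uncovered a systemic failure or process gap
    team_requested: bool = False  # team judges the learning potential to be high


def needs_post_mortem(incident: Incident) -> bool:
    """Return True if any one of the playbook's triggers applies."""
    return (
        incident.severity in (1, 2)
        or incident.near_miss
        or incident.systemic_gap
        or incident.team_requested
    )
```

A low-severity incident still qualifies when one of the other flags is set, for example `needs_post_mortem(Incident(severity=4, near_miss=True))`.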

📊 Post-Mortem Process

1. Incident Review Triggered

Auto-trigger from the incident management tool (e.g., when a Sev1 incident is closed)
Assign a facilitator who was not one of the incident responders

2. Data Collection & Timeline

Gather logs, monitoring data, Slack threads, call transcripts
Build a shared timeline of the incident events (who, what, when)
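
Building the shared timeline amounts to merging events from several sources into one chronological list. The sketch below is one way to do that in Python. The `TimelineEvent` type, its fields, and both helper functions are hypothetical names chosen for illustration, and the sketch assumes every source has already been parsed into events with timezone-aware timestamps.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEvent:
    at: datetime   # when (timezone-aware; mixing naive and aware values raises TypeError)
    who: str       # person or system involved
    what: str      # what happened
    source: str    # where the record came from, e.g. "logs" or "Slack"


def build_timeline(*sources):
    """Merge events from any number of sources into one list ordered by time."""
    merged = [event for source in sources for event in source]
    # sorted() is stable, so events with equal timestamps keep their input order.
    return sorted(merged, key=lambda event: event.at)


def format_timeline(events):
    """Render each event as one line: timestamp, source, who, what."""
    return [
        f"{e.at.isoformat()} [{e.source}] {e.who}: {e.what}"
        for e in events
    ]
```

Keeping the `source` on every event lets reviewers trace each timeline entry back to the log, thread, or transcript it came from.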

3. Facilitated Review Meeting

Invite relevant participants (including those outside the team)
Create a psychologically safe space (restate blameless principle)
Walk through timeline collaboratively, ask open questions

4. Identify Contributing Factors

Focus on conditions and decision-making context, not on assigning blame or finding a single root cause
Use techniques like “5 whys” or causal loop diagrams
Capture gaps in process, tooling, communication, or design

5. Define Follow-Up Actions

Identify both tactical remediations and systemic improvements
Assign owners and due dates for each action
Add to backlog, OKRs, or work tracker

6. Document & Share

Write a short, structured incident review (template below)
Share in cross-team channels, learning repositories, or show & tells

🔹 Structured Review Template

Title: Clear, descriptive title
Summary: What happened, impact, response
Timeline: Sequence of events
Contributing Factors: Conditions that contributed to the incident, rather than a single root cause
What Went Well: Acknowledge effective response elements
Areas for Improvement: Systemic insights
Actions: Concrete remediations and improvements
Links: Related logs, dashboards, tickets, etc.
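
If reviews are generated or checked by a script, the template above can be enforced in code. This is a minimal Python sketch under that assumption. The `render_review` function and the dictionary input format are hypothetical, and only the field labels come from the template itself.

```python
# Field labels, in the order given by the template above.
FIELDS = [
    "Title",
    "Summary",
    "Timeline",
    "Contributing Factors",
    "What Went Well",
    "Areas for Improvement",
    "Actions",
    "Links",
]


def render_review(content: dict[str, str]) -> str:
    """Render a review as plain text, rejecting any empty or missing field."""
    missing = [f for f in FIELDS if not content.get(f, "").strip()]
    if missing:
        raise ValueError("Missing review fields: " + ", ".join(missing))
    return "\n".join(f"{field}: {content[field].strip()}" for field in FIELDS)
```

Rejecting incomplete reviews up front is a design choice: it keeps sections such as "What Went Well" from being silently skipped.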

⚙️ Tooling & Automation

Incident tools: PagerDuty, Opsgenie, FireHydrant
Documentation: Confluence, Notion, Google Docs
Workflow tracking: Jira, Linear, Trello
Communication: Slack/Teams integrations to nudge post-mortem creation

🔄 Continuous Improvement

Review follow-up actions in retrospectives
Track action closure rate and time-to-learn (for example, time from incident to published post-mortem)
Regularly review themes and patterns across post-mortems
Create summary digests or "quarterly incident learning" sessions

🔧 Key Roles

Facilitator: Guides the session neutrally
Incident Responder(s): Share their lived experience of the incident
Engineering Owner: Accountable for follow-up
Scribe: Captures notes, actions, and decisions

📈 Metrics to Monitor

% of incidents with completed post-mortems
% of actions closed within 30 days
Number of repeat incident themes
Average time from incident to published post-mortem
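
The four metrics above can be computed from a simple export of incidents and their follow-up actions. The Python sketch below shows one possible calculation. The `IncidentRecord` and `Action` types are hypothetical, and two interpretations are assumptions: "closed within 30 days" is measured from the date the action was opened, and a "repeat theme" is any theme label that appears on more than one incident.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Action:
    opened: date
    closed: Optional[date] = None  # None while the action is still open


@dataclass
class IncidentRecord:
    occurred: date
    theme: str                        # hypothetical free-text theme label
    published: Optional[date] = None  # date the post-mortem was published
    actions: list = field(default_factory=list)


def post_mortem_metrics(records, window_days=30):
    """Compute the four playbook metrics; empty inputs yield 0.0 or None."""
    published = [r for r in records if r.published is not None]
    actions = [a for r in records for a in r.actions]
    # Assumption: the window runs from the day the action was opened.
    closed_in_window = [
        a for a in actions
        if a.closed is not None and (a.closed - a.opened).days <= window_days
    ]
    theme_counts = Counter(r.theme for r in records)
    return {
        "pct_with_post_mortem":
            100 * len(published) / len(records) if records else 0.0,
        "pct_actions_closed_in_window":
            100 * len(closed_in_window) / len(actions) if actions else 0.0,
        # Assumption: a theme repeats if more than one incident carries it.
        "repeat_themes": sum(1 for n in theme_counts.values() if n > 1),
        "avg_days_to_publish":
            sum((r.published - r.occurred).days for r in published) / len(published)
            if published else None,
    }
```

The result is a plain dictionary, so it can feed a dashboard or a quarterly digest without further conversion.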

🔑 Governance Link

This playbook supports:

Policy: Post-Incident Learning Culture, Psychological Safety First
Standards: Conduct Blameless Post-Mortems, Classify Incidents, Ensure Every Post-Mortem Results in Concrete Actions, Share Learnings Across Teams