Ragan McGill

Incident and Post-Mortem Practice

Incident and Post-Mortem Practice refers to how an organisation prepares for, responds to, and learns from incidents or outages.
It includes structured incident response, effective communication, and blameless post-incident reviews to drive systemic improvement and resilience.

Level 1 – Initial (Ad Hoc)

Incident response is reactive and uncoordinated.
There is no structured approach to resolving issues, and little or no learning follows the event.

Incidents are discovered late (often by users)
Response is improvised with unclear roles and communication
Fixes are tactical; root causes are not explored
Accountability falls on individuals, often through blame
Post-mortems (if they happen) are undocumented and inconsistent

Level 2 – Managed (Emerging Practice)

Some teams adopt incident response checklists or begin holding post-incident reviews, but practices vary widely.

A basic incident process exists for major issues
Some roles (e.g. incident commander) may be defined
Teams start capturing key timelines and decisions
Post-mortems occur for high-impact incidents, but not consistently
Lessons may be shared within the team but rarely acted upon

Level 3 – Defined (Standardised)

Incident response and post-mortem practices are standardised, repeatable, and integrated into the engineering workflow.

Structured incident response roles, protocols, and runbooks are in place
All major incidents trigger blameless post-mortems
Root cause analysis (RCA) covers contributing factors and identifies systemic improvements
Action items are tracked and prioritised alongside feature work
Learnings are documented and shared across teams

Level 4 – Quantitatively Managed (Measured & Controlled)

Incident response and learning are measured and improved continuously.
Teams track performance and follow through on commitments to reduce recurrence.

Metrics include MTTR (mean time to recovery), incident frequency, action item follow-through, and post-mortem participation
Incident reviews include trend analysis across systems and teams
Teams practise incident response (e.g. game days, chaos engineering)
Action items are reviewed for effectiveness and closure rate
Insights are used to improve observability, deployment safety, and resilience
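
The metrics listed above can be computed from very simple records. The following is a minimal sketch, not a prescribed implementation: the `Incident` and `ActionItem` types, their fields, and the function names are hypothetical, and MTTR is assumed here to run from detection to recovery (some organisations measure from the start of impact instead).

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; field names are illustrative assumptions.
    system: str
    detected_at: datetime
    recovered_at: datetime


@dataclass
class ActionItem:
    # Hypothetical post-mortem action item.
    title: str
    closed: bool


def mean_time_to_recovery(incidents: list[Incident]) -> timedelta:
    """Average of (recovered_at - detected_at) across incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((i.recovered_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


def incidents_per_system(incidents: list[Incident]) -> Counter:
    """Incident frequency by system, a starting point for trend analysis."""
    return Counter(i.system for i in incidents)


def closure_rate(items: list[ActionItem]) -> float:
    """Fraction of action items closed; 0.0 when there are none."""
    if not items:
        return 0.0
    return sum(1 for item in items if item.closed) / len(items)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    incidents = [
        Incident("checkout", datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 30)),
        Incident("checkout", datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 15, 30)),
        Incident("search", datetime(2024, 1, 20, 8, 0), datetime(2024, 1, 20, 9, 0)),
    ]
    items = [
        ActionItem("Add latency alert", True),
        ActionItem("Fix retry logic", True),
        ActionItem("Update runbook", False),
        ActionItem("Add canary stage", True),
    ]
    print(mean_time_to_recovery(incidents))  # 1:00:00
    print(incidents_per_system(incidents))
    print(closure_rate(items))  # 0.75
```

Closure rate alone does not show whether an action item was effective. Comparing incident frequency per system before and after an item closes is one way to check that.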

Level 5 – Optimising (Continuous Improvement)

Incidents are embraced as opportunities to strengthen systems, teams, and culture.
Learning is shared widely and feeds into design, planning, and strategic risk reduction.

Post-mortems are used to teach, onboard, and inform roadmap priorities
A culture of psychological safety lets people speak candidly during incident analysis
Trends across incidents inform cross-cutting technical investments
The organisation celebrates learning and resilience, not perfection
Incident practices drive systemic improvement, innovation, and cultural strength