Incident and Post-Mortem Practice
Incident and Post-Mortem Practice refers to how an organisation prepares for, responds to, and learns from incidents or outages.
It includes structured incident response, effective communication, and blameless post-incident reviews to drive systemic improvement and resilience.
Level 1 – Initial (Ad Hoc)
Incident response is reactive and uncoordinated.
There is no structured approach to resolving issues, and little or no learning follows the event.
- Incidents are discovered late (often by users)
- Response is improvised with unclear roles and communication
- Fixes are tactical; root causes are not explored
- Accountability falls on individuals, often through blame
- Post-mortems (if they happen) are undocumented and inconsistent
Level 2 – Managed (Emerging Practice)
Some teams adopt incident response checklists or begin holding post-incident reviews, but practices vary widely.
- A basic incident process exists for major issues
- Some roles (e.g. incident commander) may be defined
- Teams start capturing key timelines and decisions
- Post-mortems occur for high-impact incidents, but not consistently
- Lessons may be shared within the team but rarely acted upon
Level 3 – Defined (Standardised)
Incident response and post-mortem practices are standardised, repeatable, and integrated into the engineering workflow.
- Structured incident response roles, protocols, and runbooks are in place
- All major incidents trigger blameless post-mortems
- Root cause analysis (RCA) includes contributing factors and systemic improvements
- Action items are tracked and prioritised alongside feature work
- Learnings are documented and shared across teams
Level 4 – Quantitatively Managed (Measured & Controlled)
Incident response and learning are measured and improved continuously.
Teams track performance and follow through on commitments to reduce recurrence.
- Metrics include MTTR (mean time to recovery), incident frequency, action item follow-through, and post-mortem participation
- Incident reviews include trend analysis across systems and teams
- Teams practise incident response (e.g. game days, chaos engineering)
- Action items are reviewed for effectiveness and closure rate
- Insights are used to improve observability, deployment safety, and resilience
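As a rough illustration of the metrics listed above, the sketch below computes MTTR, incident frequency, and action item closure rate from a list of incident records. The `Incident` and `ActionItem` types, their field names, and the choice to measure recovery from detection time are assumptions made for this example; a real organisation would pull the same data from its own incident tracker and may define each metric differently.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ActionItem:
    # Hypothetical record of a post-mortem follow-up task.
    description: str
    closed: bool = False


@dataclass
class Incident:
    # Hypothetical incident record; field names are illustrative only.
    detected_at: datetime
    recovered_at: datetime
    action_items: list[ActionItem] = field(default_factory=list)


def mean_time_to_recovery(incidents):
    """Average of (recovered_at - detected_at), or None if there are no incidents."""
    if not incidents:
        return None
    total = sum((i.recovered_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


def incidents_per_week(incidents, window_start, window_end):
    """Incidents detected in [window_start, window_end) divided by the window length in weeks."""
    if window_end <= window_start:
        raise ValueError("window_end must be after window_start")
    weeks = (window_end - window_start) / timedelta(weeks=1)
    in_window = [i for i in incidents if window_start <= i.detected_at < window_end]
    return len(in_window) / weeks


def action_item_closure_rate(incidents):
    """Fraction of all action items that are closed, or None if there are none."""
    items = [a for i in incidents for a in i.action_items]
    if not items:
        return None
    return sum(a.closed for a in items) / len(items)
```

Given two incidents in a two-week window that took 30 and 90 minutes to recover, with three of four action items closed, these functions would report an MTTR of 60 minutes, one incident per week, and a closure rate of 0.75. Tracking the same figures per quarter or per system is one way to support the trend analysis described above.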
Level 5 – Optimising (Continuous Improvement)
Incidents are embraced as opportunities to strengthen systems, teams, and culture.
Learning is shared widely and feeds into design, planning, and strategic risk reduction.
- Post-mortems are used to teach, onboard, and inform roadmap priorities
- A culture of psychological safety lets people speak candidly during incident analysis
- Trends across incidents inform cross-cutting technical investments
- The organisation celebrates learning and resilience, not perfection
- Incident practices drive systemic improvement, innovation, and cultural strength