Standard: Changes are introduced with minimal failures and maximum resilience (CFR)

Purpose and Strategic Importance

This standard ensures changes are introduced with minimal failures and maximum resilience by measuring and managing Change Failure Rate (CFR), a core DORA metric. It enables high-velocity delivery without compromising quality, stability, or trust.

Aligned to our "Resilience Over Uptime" and "Secure by Design" policies, this standard drives investment in robust testing, observability, and safe deployment practices. Without it, changes carry unmeasured risk, confidence in delivery erodes, and the ability to innovate at pace is limited.

Strategic Impact

Increases delivery confidence and system stability
Encourages robust testing, observability, and rollback strategies
Enhances reliability across software, data, and infrastructure
Builds trust with stakeholders and users through quality and resilience
Supports high deployment frequency without accumulating operational risk

Risks of Not Having This Standard

Increased operational incidents due to fragile changes
Hidden bugs, regressions, or data issues impact users and downstream teams
Diminished trust in the reliability of engineering delivery
Platform teams become bottlenecks due to fear of risk
A high CFR inhibits experimentation and continuous improvement

CMMI Maturity Model

Level 1 – Initial: Change failures are not tracked systematically. Most issues are only identified post-release through user reports or major outages. Root cause analysis is rare or informal.
Level 2 – Managed: Some changes are linked to incidents or rollbacks, but criteria are unclear and processes are inconsistent. Post-incident reviews are ad hoc and learning is not widely shared.
Level 3 – Defined: Change Failure Rate is consistently tracked across systems. Teams agree on what constitutes a failed change and incorporate this into retrospectives and quality reviews.
Level 4 – Quantitatively Managed: CFR is a visible delivery health metric. Teams use CFR data to inform test coverage, deployment safety, and risk mitigation. Proactive techniques (e.g., automated rollbacks, staging validation) are applied.
Level 5 – Optimising: CFR insights drive systemic improvements in testing, observability, and architectural resilience. Failure trends shape platform capabilities, and shared learning reduces risk across the engineering organisation.

Key Measures

Change Failure Rate (%): Percentage of production changes that result in a P1/P2 incident, significant degradation, or a known security risk (building on the DORA definition of changes that degrade service and require remediation)
Number of rollbacks or patches per change
Post-release incident rate attributable to recent changes
Number of deployments with follow-up remediation activity
Time from change deployment to detection of failure
Incident root causes linked to recent changes
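
The first and fifth measures above can be computed directly from a log of production changes. The following is a minimal sketch in Python, not a prescribed implementation: the Change record, its field names, and the choice of the median as the summary statistic are all assumptions made for illustration, and it presumes each change has already been marked as failed or not using the definition in this standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Change:
    """One production change. All field names are illustrative assumptions."""
    change_id: str
    deployed_at: datetime
    failed: bool = False  # True if the change met the failed-change definition
    failure_detected_at: Optional[datetime] = None


def change_failure_rate(changes: list[Change]) -> float:
    """CFR (%) = failed production changes / total production changes * 100."""
    if not changes:
        return 0.0  # no changes in the window, so nothing could fail
    failed = sum(1 for c in changes if c.failed)
    return 100.0 * failed / len(changes)


def median_time_to_detection(changes: list[Change]) -> Optional[timedelta]:
    """Median time from deployment to detection, over failed changes only."""
    delays = [
        c.failure_detected_at - c.deployed_at
        for c in changes
        if c.failed and c.failure_detected_at is not None
    ]
    return median(delays) if delays else None


# Usage with made-up data: four changes, two of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
changes = [
    Change("c1", t0),
    Change("c2", t0, failed=True, failure_detected_at=t0 + timedelta(minutes=30)),
    Change("c3", t0),
    Change("c4", t0, failed=True, failure_detected_at=t0 + timedelta(minutes=90)),
]
print(change_failure_rate(changes))       # 50.0
print(median_time_to_detection(changes))  # 1:00:00
```

Returning 0.0 for an empty window is a design choice in this sketch; a team may prefer to report "no data" instead, so that a quiet period is not mistaken for a healthy one.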

Definition of a Failed Change

Specifically, a failed change is defined as one that introduces:

A P1 or P2 incident requiring immediate or high-priority response
A material degradation in user experience (e.g. slowness, flakiness, broken functionality, data quality issues)
A security vulnerability or exposure, such as access misconfiguration or exploitable code

This definition applies equally across software releases, infrastructure rollouts, data platform updates, and operational configuration changes.
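
The three criteria above can be expressed as a single predicate, which helps teams apply the same definition consistently. The sketch below is illustrative only: the ChangeOutcome record, its field names, and the change-type labels are assumptions, and deciding whether a degradation is "material" is left to whoever records the outcome.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeOutcome:
    """Observed outcome of one change. Field names are illustrative assumptions."""
    change_type: str  # e.g. "software", "infrastructure", "data", "config"
    incident_priority: Optional[str] = None  # "P1", "P2", "P3", ... or None
    material_degradation: bool = False
    security_exposure: bool = False


def is_failed_change(outcome: ChangeOutcome) -> bool:
    """Apply the three criteria; change_type deliberately plays no part."""
    return (
        outcome.incident_priority in {"P1", "P2"}
        or outcome.material_degradation
        or outcome.security_exposure
    )


# Usage: a data platform update that degraded data quality counts as failed,
# while a software release that only triggered a P3 incident does not.
print(is_failed_change(ChangeOutcome("data", material_degradation=True)))      # True
print(is_failed_change(ChangeOutcome("software", incident_priority="P3")))     # False
```

Keeping change_type out of the predicate mirrors the statement that the definition applies equally to every kind of change.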