Standard : Changes are introduced with minimal failures and maximum resilience (CFR)

Purpose and Strategic Importance

This standard ensures changes are introduced with minimal failures and maximum resilience by measuring and managing Change Failure Rate (CFR), a core DORA metric. It enables high-velocity delivery without compromising quality, stability, or trust.

Aligned to our "Resilience Over Uptime" and "Secure by Design" policies, this standard drives investment in robust testing, observability, and safe deployment practices. Without it, teams ship changes without knowing how often they fail, which erodes confidence and limits the ability to innovate at pace.

Strategic Impact

  • Increases delivery confidence and system stability
  • Encourages robust testing, observability, and rollback strategies
  • Enhances reliability across software, data, and infrastructure
  • Builds trust with stakeholders and users through quality and resilience
  • Supports high deployment frequency without accumulating operational risk

Risks of Not Having This Standard

  • Increased operational incidents due to fragile changes
  • Hidden bugs, regressions, or data issues impact users and downstream teams
  • Diminished trust in the reliability of engineering delivery
  • Platform teams become bottlenecks due to fear of risk
  • A high CFR inhibits experimentation and continuous improvement

CMMI Maturity Model

  • Level 1 – Initial: Change failures are not tracked systematically. Most issues are only identified post-release through user reports or major outages. Root cause analysis is rare or informal.

  • Level 2 – Managed: Some changes are linked to incidents or rollbacks, but criteria are unclear and processes are inconsistent. Post-incident reviews are ad hoc and learning is not widely shared.

  • Level 3 – Defined: Change Failure Rate is consistently tracked across systems. Teams agree on what constitutes a failed change and incorporate this into retrospectives and quality reviews.

  • Level 4 – Quantitatively Managed: CFR is a visible delivery health metric. Teams use CFR data to inform test coverage, deployment safety, and risk mitigation. Proactive techniques (e.g., automated rollbacks, staging validation) are applied.

  • Level 5 – Optimising: CFR insights drive systemic improvements in testing, observability, and architectural resilience. Failure trends shape platform capabilities, and shared learning reduces risk across the engineering organisation.


Key Measures

  • Change Failure Rate (%): Percentage of production changes that result in a P1/P2 incident, significant degradation, or a known security risk (building on the DORA definition of a change that degrades service and requires remediation)
  • Number of rollbacks or patches per change
  • Post-release incident rate attributable to recent changes
  • Number of deployments with follow-up remediation activity
  • Time from change deployment to detection of failure
  • Incident root causes linked to recent changes

Definition of a Failed Change

A failed change is one that introduces any of the following:

  • A P1 or P2 incident requiring immediate or high-priority response
  • A material degradation in user experience (e.g. slowness, flakiness, broken functionality, data quality issues)
  • A security vulnerability or exposure, such as access misconfiguration or exploitable code

This definition applies equally across software releases, infrastructure rollouts, data platform updates, and operational configuration changes.
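The three criteria can be expressed as a single predicate, which helps teams classify changes consistently. The sketch below is one possible encoding; `ChangeType`, `ChangeOutcome`, and `is_failed_change` are hypothetical names, and the priority labels are assumed to be plain strings such as "P1" or "P3".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ChangeType(Enum):
    """The kinds of change the definition covers."""
    SOFTWARE_RELEASE = "software release"
    INFRASTRUCTURE_ROLLOUT = "infrastructure rollout"
    DATA_PLATFORM_UPDATE = "data platform update"
    CONFIGURATION_CHANGE = "operational configuration change"


@dataclass
class ChangeOutcome:
    """What was observed after a change went live (hypothetical fields)."""
    change_type: ChangeType
    incident_priority: Optional[str] = None  # e.g. "P1", "P2", "P3"
    material_degradation: bool = False       # slowness, flakiness, bad data
    security_exposure: bool = False          # misconfigured access, exploitable code


def is_failed_change(outcome: ChangeOutcome) -> bool:
    """Apply the three failure criteria; any one of them is sufficient."""
    return (
        outcome.incident_priority in {"P1", "P2"}
        or outcome.material_degradation
        or outcome.security_exposure
    )


# Illustrative outcomes only.
print(is_failed_change(ChangeOutcome(ChangeType.SOFTWARE_RELEASE, "P2")))      # prints "True"
print(is_failed_change(ChangeOutcome(ChangeType.CONFIGURATION_CHANGE, "P3")))  # prints "False"
```

The predicate deliberately ignores `change_type`: the same criteria apply whether the change is a software release, an infrastructure rollout, a data platform update, or a configuration change.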

Associated Policies

  • Resilience Over Uptime
  • Secure by Design
  • Post-Incident Learning Culture

Associated Practices

  • Auto-scaling Infrastructure
  • Static Code Analysis
  • Chaos Engineering
  • Self-Healing Systems
  • Blue-Green Deployments
  • Canary Releases
  • Design for Failure

Associated Measures

  • Change Failure Rate (CFR)
  • Automated Remediation Rate
  • Error Budget Consumption
  • Incident Frequency
  • Service Availability (Uptime)
