Standard: Percentage of Incidents Linked to Known Architectural Risks

Description

% of Incidents Linked to Known Architectural Risks measures the proportion of production incidents that originate from components or patterns previously identified as architectural weaknesses or high-risk areas. It shows whether the system's known weaknesses are being addressed or simply tolerated over time.

A persistently high percentage may indicate underinvestment in known problem areas. Conversely, a falling percentage suggests that architectural investments are targeting and resolving systemic weaknesses.

How to Use

What to Measure

  • Total number of incidents in a given period.

  • Number of those incidents linked to components or patterns on your architectural risk register.

Formula

% Linked to Architectural Risk = (Incidents linked to known risks / Total incidents) × 100
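
As a worked illustration of the formula, the following minimal Python sketch computes the percentage from two counts. The function name and the choice to report 0% for a period with no incidents are assumptions made for this example, not part of the standard.

```python
def pct_linked_to_known_risks(linked_incidents: int, total_incidents: int) -> float:
    """Return the percentage of incidents linked to known architectural risks."""
    if linked_incidents < 0 or total_incidents < 0:
        raise ValueError("incident counts must be non-negative")
    if linked_incidents > total_incidents:
        raise ValueError("linked incidents cannot exceed total incidents")
    if total_incidents == 0:
        # Assumption: a period with no incidents is reported as 0%.
        return 0.0
    return 100 * linked_incidents / total_incidents


# Example: 12 of 40 incidents in the period were linked to known risks.
print(pct_linked_to_known_risks(12, 40))  # 30.0
```

With 12 linked incidents out of 40 in a period, the metric is (12 / 40) × 100 = 30%.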

Instrumentation Tips

  • Maintain an up-to-date architecture risk register with clear mappings to systems/components.

  • During postmortems or triage, classify incidents as related or unrelated to known risks.

  • Use tagging in your incident management system (e.g. "known-risk") to support automation.
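
The tips above could be automated roughly as follows. This is a minimal sketch, assuming the risk register can be represented as a mapping from component name to risk ID. The component names, risk IDs, and the `Incident` structure are hypothetical and do not correspond to any particular incident management tool.

```python
from dataclasses import dataclass, field

# Hypothetical risk register: component name -> risk ID (illustrative values).
RISK_REGISTER = {
    "payments-gateway": "RISK-001",
    "legacy-auth": "RISK-002",
}


@dataclass
class Incident:
    incident_id: str
    component: str
    tags: set[str] = field(default_factory=set)


def tag_known_risk(incident: Incident, register: dict[str, str]) -> Incident:
    """Add the "known-risk" tag and the risk ID if the component is registered."""
    risk_id = register.get(incident.component)
    if risk_id is not None:
        incident.tags.add("known-risk")
        incident.tags.add(risk_id)
    return incident


tagged = tag_known_risk(Incident("INC-1", "legacy-auth"), RISK_REGISTER)
print(sorted(tagged.tags))  # ['RISK-002', 'known-risk']
```

A component match only suggests a link. The postmortem should still confirm that the known risk was the actual root cause before the tag is counted.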

Why It Matters

  • Informs architectural investment: Helps prioritize high-value refactoring work.

  • Improves resilience: Targeted fixes reduce recurring failures.

  • Drives accountability: Ensures known risks are not ignored indefinitely.

  • Supports learning culture: Encourages teams to reflect and improve based on incident trends.

Best Practices

  • Regularly review and update your architectural risk register.

  • Establish clear ownership of high-risk areas.

  • Include “known risk?” as a question in incident postmortems.

  • Use dashboards to visualize incident clustering by component or system boundary.

  • Align risk remediation efforts with OKRs or capacity planning.
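
Before building a dashboard, incident clustering by component can be explored with a simple count. The sketch below assumes incidents are available as (incident ID, component) pairs; the sample data is invented for illustration.

```python
from collections import Counter

# Hypothetical sample: (incident ID, component) pairs.
SAMPLE_INCIDENTS = [
    ("INC-101", "payments-gateway"),
    ("INC-102", "legacy-auth"),
    ("INC-103", "payments-gateway"),
    ("INC-104", "search"),
    ("INC-105", "payments-gateway"),
]


def incidents_by_component(incidents):
    """Count incidents per component to reveal clustering."""
    return Counter(component for _, component in incidents)


counts = incidents_by_component(SAMPLE_INCIDENTS)
for component, count in counts.most_common():
    print(f"{component}: {count}")
```

Components at the top of this list that also appear on the risk register are natural candidates for remediation planning.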

Common Pitfalls

  • Risk register is outdated or too vague to be useful.

  • No clear link between incident root causes and architecture components.

  • Underreporting due to lack of consistent incident analysis.

  • Over-attribution to known risks without validating root cause.

Signals of Success

  • Decreasing % of incidents linked to known risks over time.

  • Architectural work is visibly informed by incident trends.

  • Fewer repeat incidents in high-risk components.

  • Engineering teams proactively discuss risk patterns during planning and retrospectives.
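
To check the first signal, the percentage can be tracked per reporting period. The following is a minimal sketch, assuming each incident record carries a period label and a flag saying whether it was linked to a known risk. The period labels and data are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (period, linked_to_known_risk).
SAMPLE_RECORDS = [
    ("2024-Q1", True), ("2024-Q1", True), ("2024-Q1", False), ("2024-Q1", False),
    ("2024-Q2", True), ("2024-Q2", True), ("2024-Q2", False), ("2024-Q2", False),
    ("2024-Q2", False),
    ("2024-Q3", True), ("2024-Q3", False), ("2024-Q3", False), ("2024-Q3", False),
]


def percent_by_period(records):
    """Return {period: % of incidents linked to known risks}, sorted by period."""
    totals = defaultdict(int)
    linked = defaultdict(int)
    for period, is_linked in records:
        totals[period] += 1
        if is_linked:
            linked[period] += 1
    return {p: 100 * linked[p] / totals[p] for p in sorted(totals)}


def is_decreasing(values):
    """True if every value is strictly lower than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))


trend = percent_by_period(SAMPLE_RECORDS)
print(trend)  # {'2024-Q1': 50.0, '2024-Q2': 40.0, '2024-Q3': 25.0}
print(is_decreasing(list(trend.values())))  # True
```

A strict period-over-period decrease is a deliberately simple test. With small incident counts the percentage can swing widely, so a longer window may give a steadier picture.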

Related Measures

  • Recurring Incident Rate

  • Postmortem Completion Rate

  • Mean Time to Detect (MTTD)

  • Technical Debt Remediation Throughput

  • Architecture Refactoring Frequency
