Ragan McGill

Practice: Automated Rollbacks

Purpose and Strategic Importance

Automated rollbacks allow systems to quickly and safely revert to a known-good state when a deployment causes instability, performance issues, or user impact. By automating rollback logic, teams reduce mean time to recovery (MTTR), minimise manual intervention, and increase confidence in deploying at speed.

This practice supports resilience, continuous delivery, and customer trust - enabling teams to move fast without fear of long outages or complex undo procedures.

Description of the Practice

Rollback is implemented as part of the deployment pipeline and can be triggered automatically.
Trigger conditions may include failed health checks, performance degradation, error spikes, or a surge in customer complaints.
The previous working version is retained and ready for immediate re-deployment.
Rollback logic is tested regularly, like any other code path.
Monitoring and alerting systems provide real-time signals to initiate rollbacks when needed.

How to Practise It (Playbook)

1. Getting Started

Ensure your deployment process supports versioned artefacts and history.
Implement post-deploy health checks that validate application behaviour and dependencies.
Create a “rollback” step in your pipeline that reverts to the last successful deployment.
Use toggles or config switches to disable new features quickly if needed.
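
The steps above can be sketched as a single pipeline step. This is a minimal illustration, not a definitive implementation: DeploymentHistory, deploy_with_rollback, and the deploy and health_check callables are hypothetical names standing in for whatever your pipeline tooling provides.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class DeploymentHistory:
    """Hypothetical in-memory record of versions that passed post-deploy checks."""
    successful: List[str] = field(default_factory=list)

    def last_successful(self) -> Optional[str]:
        return self.successful[-1] if self.successful else None


def deploy_with_rollback(
    version: str,
    history: DeploymentHistory,
    deploy: Callable[[str], None],
    health_check: Callable[[], bool],
) -> str:
    """Deploy `version`; if the health check fails, redeploy the last good version.

    Returns the version that is live when the function returns.
    """
    previous = history.last_successful()
    deploy(version)
    if health_check():
        # Only versions that pass the check become rollback targets.
        history.successful.append(version)
        return version
    if previous is None:
        raise RuntimeError(
            f"{version} is unhealthy and there is no earlier version to roll back to"
        )
    deploy(previous)  # the rollback step
    return previous
```

In a real pipeline the history would come from the artefact store or deployment tool rather than memory, and the health check would probably be retried over a short window before a rollback is declared.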

2. Scaling and Maturing

Define clear rollback policies based on thresholds (e.g. 5xx errors, latency, failed canary tests).
Automate rollback triggers using observability tools (e.g. Datadog, Prometheus, New Relic).
Build rollback dashboards to provide visibility into current and previous versions.
Regularly test rollbacks in staging and production to ensure they work as expected.
Align rollback capabilities with business-critical SLAs and incident response playbooks.
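
A threshold-based rollback policy could be expressed as data and evaluated against a metrics snapshot, roughly as below. The field names and the idea of a single snapshot are assumptions for illustration; in practice the numbers would be queried from your observability tool, and this sketch does not use any vendor API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RollbackPolicy:
    """Hypothetical policy: limits that, when exceeded, justify a rollback."""
    max_5xx_rate: float          # fraction of requests, e.g. 0.02 means 2%
    max_p95_latency_ms: float
    require_canary_pass: bool = True


@dataclass(frozen=True)
class MetricsSnapshot:
    """Hypothetical post-deploy measurements over some observation window."""
    total_requests: int
    responses_5xx: int
    p95_latency_ms: float
    canary_passed: bool


def breached_thresholds(policy: RollbackPolicy, snapshot: MetricsSnapshot) -> List[str]:
    """Return a human-readable reason for every threshold the snapshot breaches."""
    reasons: List[str] = []
    if snapshot.total_requests > 0:
        rate = snapshot.responses_5xx / snapshot.total_requests
        if rate > policy.max_5xx_rate:
            reasons.append(f"5xx rate {rate:.1%} exceeds {policy.max_5xx_rate:.1%}")
    if snapshot.p95_latency_ms > policy.max_p95_latency_ms:
        reasons.append(
            f"p95 latency {snapshot.p95_latency_ms:.0f} ms exceeds "
            f"{policy.max_p95_latency_ms:.0f} ms"
        )
    if policy.require_canary_pass and not snapshot.canary_passed:
        reasons.append("canary tests failed")
    return reasons


def should_roll_back(policy: RollbackPolicy, snapshot: MetricsSnapshot) -> bool:
    """A rollback is warranted as soon as any threshold is breached."""
    return bool(breached_thresholds(policy, snapshot))
```

Returning the reasons rather than a bare boolean gives rollback dashboards and post-incident reviews something concrete to display.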

3. Team Behaviours to Encourage

Practise rolling back regularly - not just during failure.
Treat rollback readiness as part of the Definition of Done.
Review and learn from every rollback to improve signals and automation.
Build confidence through simulation - run chaos drills or failure rehearsals.

4. Watch Out For…

Assuming rollbacks will “just work” without testing them.
State changes (e.g. database migrations) that make rollback unsafe or impossible.
Lack of monitoring granularity - if you can’t detect the problem, rollback won’t help.
Manual rollback steps that delay response or require specialist knowledge.
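
One way to guard against unsafe state changes is a pre-deploy check that flags migrations which would block a rollback. The sketch below assumes each migration carries two hand-maintained flags; the Migration type and its fields are hypothetical and not tied to any particular migration framework.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Migration:
    """Hypothetical description of a pending schema change."""
    name: str
    has_down_step: bool   # a tested reverse migration exists
    destructive: bool     # e.g. drops a column or table the previous version still reads


def rollback_blockers(pending: List[Migration]) -> List[str]:
    """Names of pending migrations that would make an automated rollback unsafe."""
    return [m.name for m in pending if m.destructive or not m.has_down_step]
```

A pipeline could refuse to enable automated rollback, or require explicit sign-off, whenever this list is non-empty.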

5. Signals of Success

Failed deployments trigger automated rollback within minutes.
Customer impact is reduced through rapid response and minimal downtime.
Rollbacks are observed, rehearsed, and improved over time.
Post-incident reviews include rollback effectiveness and readiness.
Teams deploy with confidence, knowing recovery is fast and safe.
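
To check whether rollbacks really complete "within minutes", recovery time can be computed from recorded rollback events. This is a rough sketch that assumes each event records when the failure was detected and when the rollback finished; RollbackEvent and its fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List


@dataclass(frozen=True)
class RollbackEvent:
    """Hypothetical record of one automated rollback."""
    failure_detected_at: datetime
    rollback_completed_at: datetime

    @property
    def recovery_time(self) -> timedelta:
        return self.rollback_completed_at - self.failure_detected_at


def mean_time_to_recovery(events: List[RollbackEvent]) -> timedelta:
    """Average time from failure detection to completed rollback."""
    if not events:
        raise ValueError("at least one rollback event is required")
    seconds = mean([e.recovery_time.total_seconds() for e in events])
    return timedelta(seconds=seconds)


def slow_rollbacks(events: List[RollbackEvent], target: timedelta) -> List[RollbackEvent]:
    """Events whose recovery took longer than the agreed target."""
    return [e for e in events if e.recovery_time > target]
```

Tracking the slow cases alongside the mean keeps a single long recovery from being hidden by many fast ones.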