Ragan McGill

Jun 03, 2025

Designing for Trust - Why SLOs, Error Budgets, and Toil Matter in Platform Engineering

In my last article, I argued that internal platforms should be treated as products - thoughtfully designed, user-centred, and built to empower engineers to move quickly and confidently. And once you adopt this mindset, a natural question follows:

How do you build a platform that earns trust, not just usage? The answer lies in how you define and manage reliability.

It’s not enough to say “we’re up” - modern teams need a shared understanding of what “good enough” looks like, how much failure is acceptable, and how operational pain is addressed. That’s where service level indicators (SLIs), service level objectives (SLOs), error budgets, and a deliberate approach to toil come in.

These aren’t just operational metrics - they’re the foundations of dependable, developer-friendly platforms.

Start with clarity: Key terms that shape reliability

Before we can manage reliability, we need to define it in clear, measurable terms:

SLI (Service Level Indicator): A specific metric that tells you how a system is performing from a user’s perspective. Examples might include request success rate, build time, or environment provisioning latency.
SLO (Service Level Objective): A target or threshold for that metric, typically expressed as a percentage over a rolling window (e.g. “95% of builds complete in under 10 minutes over the last 30 days”). This is your agreed definition of “good enough”.
SLA (Service Level Agreement): A formal, often contractual commitment to customers or stakeholders, typically accompanied by penalties if not met. Unlike SLOs, which are internal alignment tools, SLAs are external guarantees.
Error Budget: The difference between perfect performance (100%) and your SLO target. If your SLO is 99.9%, your error budget is 0.1%. It’s a buffer for experimentation and change - it can be considered a safe space for failure.
Toil: Manual, repetitive work that adds no lasting value, is automatable, and tends to scale linearly with system growth. It distracts engineers from higher-impact work and erodes team energy over time.

Reliability isn’t perfection - It’s predictability

Let’s step out of the tech world for a moment. Think about your daily train commute.

You take the same 7:43 service every morning. Most days, it arrives on time, and your routine flows without friction. You grab a coffee. You arrive at work just as planned.

But imagine if - without warning - that same train starts arriving unpredictably. Some days it’s early. Others, 10–15 minutes late. And no one tells you why.

Naturally, your trust in the service erodes. You start adjusting:

Arriving earlier, “just in case”
Cancelling plans you’d otherwise keep
Feeling anxious, not efficient

Here, the SLI is simple: on-time arrival.

The SLO might be: “the train should arrive within five minutes of schedule, 97% of the time.”

And that 3% wiggle room? That’s your error budget - a margin for signal failures, bad weather, or the occasional unavoidable delay. More on this later.

The difference between those two commutes isn’t just performance. It’s confidence. And confidence is what lets users plan, trust, and build on top of your platform. This is what SLOs do - they turn performance into a promise. Not one of perfection, but of predictability.

Because users don’t need your platform to be flawless. They need to know what to expect - and that when it drifts, someone’s paying attention.

Error Budgets: Where innovation meets risk

When we talk about SLOs, we’re not just setting a target - we’re defining a tolerance for imperfection. That tolerance is known as the error budget, and it represents the small amount of failure we’re willing to accept in exchange for faster delivery, innovation, or operational flexibility.

Calculating it is simple: Error Budget = 100% – SLO target. Then apply that percentage to the total time window you’re measuring against.

Let’s look at what that means in a typical 30-day month (720 hours):

99.9% SLO ("three nines") gives you an error budget of 0.1%, or 43.2 minutes of allowable downtime.
99.99% SLO ("four nines") tightens that to just 4.32 minutes.
99.999% SLO ("five nines") leaves only 25.9 seconds.

The difference is staggering. Moving from three nines to five nines doesn’t just mean more reliability - it means 100x less room for failure. That’s a major strategic decision. Every extra nine you commit to requires more investment in automation, redundancy, observability, and incident response.
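To make the arithmetic concrete, here is a minimal Python sketch of the calculation described above. The function name `error_budget` and its parameters are illustrative choices rather than part of any standard tooling, and it assumes a simple time-based availability SLO.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Return the allowable downtime for an availability SLO over a window.

    Error budget = (100% - SLO target), applied to the measurement window.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be greater than 0 and at most 100")
    budget_fraction = (100 - slo_percent) / 100
    return window * budget_fraction


# A 30-day month, as in the figures above (720 hours).
month = timedelta(days=30)

for slo in (99.9, 99.99, 99.999):
    budget = error_budget(slo, month)
    print(f"{slo}% SLO -> {budget.total_seconds() / 60:.2f} minutes of budget")
```

Run against a 30-day window, this should reproduce the figures listed above: 43.2 minutes, 4.32 minutes, and roughly 25.9 seconds.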

Error budgets help teams balance resilience and velocity. They give you the breathing room to release frequently, experiment safely, and build trust - without over-engineering for unreachable perfection.

Toil: The silent killer of engineering progress

Imagine you're in a small rowboat, headed for a clear destination. You have a map, a plan, and a crew. But there’s a leak - not a crisis, just a slow, steady trickle.

So every day, before you can row, you have to bail. Bucket after bucket. It’s mindless, repetitive, and thankless. At first, you talk about fixing the hole. But over time, bailing becomes normal. You optimise for it. You assign people to it. You build rituals around it.

You’ve accepted the leak - and slowed the journey. This is what toil feels like in digital engineering. Toil is the manual, repetitive work that:

Doesn’t scale
Doesn’t teach
Doesn’t improve outcomes

But it does consume time, burn out good people, and quietly slow teams down. The worst part? The longer you tolerate it, the more invisible it becomes - absorbed into culture, accepted as "just the way things work."

Examples of toil include:

Manually re-running flaky CI pipelines
Restarting services that crash silently
Repeatedly fixing the same environment issue
Triaging alerts you know will self-resolve

Toil doesn’t show up on roadmaps. It’s not demoed. But it accumulates. Fix the leak, not just the symptoms. That’s how you reclaim velocity, morale, and meaningful progress.

Why reliability tools belong in every platform team’s toolbox

As platform engineers, our job is to provide reliable infrastructure, tools, and workflows that enable delivery teams to move faster, safer, and with less friction. But reliability, like usability, is experienced - not just measured.

Without clear SLOs, platform reliability becomes a vague aspiration, not a managed outcome.
Without error budgets, teams can struggle to balance speed and stability.
Without addressing toil, reliability becomes brittle - and team morale suffers.

These mechanisms bring structure to how we think about service health, how we respond to incidents, and how we prioritise work.

Applying This in Practice

Here’s how to make this real:

Define Meaningful SLIs

Start by identifying where reliability matters most to your users. For example:

Time to provision a developer environment
Success rate of CI/CD jobs
Time to respond to an access request

Choose indicators that reflect what engineers feel when things go wrong.

Agree on SLOs That Reflect Reality

SLOs shouldn’t be aspirational - they should be achievable and useful. A good SLO gives teams confidence to act and sets the right expectations with users.

For instance:

“95% of builds succeed on first attempt within 10 minutes”
“99.5% of test environments are ready within 3 minutes of request”

Start with just one or two per service. Make them visible. Review them regularly.
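As a rough illustration, the first example SLO could be checked with a few lines of Python. This is a sketch under assumptions: the `Build` record, its field names, and the sample data are all hypothetical, and in practice the records would come from whatever your CI system exports.

```python
from dataclasses import dataclass


@dataclass
class Build:
    passed_first_attempt: bool  # True if the build succeeded without a re-run
    duration_minutes: float     # wall-clock time of that first attempt


def build_sli(builds: list[Build], max_minutes: float = 10.0) -> float:
    """Percentage of builds that succeeded on first attempt within max_minutes."""
    if not builds:
        raise ValueError("no builds recorded in this window")
    good = sum(
        1 for b in builds
        if b.passed_first_attempt and b.duration_minutes <= max_minutes
    )
    return 100.0 * good / len(builds)


def meets_slo(sli_percent: float, target_percent: float = 95.0) -> bool:
    """Compare the measured SLI with the agreed SLO target."""
    return sli_percent >= target_percent


# Hypothetical window: 18 good builds, one slow build, one failed build.
sample = (
    [Build(True, 6.5)] * 18
    + [Build(True, 14.0)]   # succeeded, but took longer than 10 minutes
    + [Build(False, 4.0)]   # failed on first attempt
)

sli = build_sli(sample)
print(f"SLI: {sli:.1f}% - SLO met: {meets_slo(sli)}")
```

Note that a build which passes but takes too long counts against the SLO just as a failed build does. That is deliberate: the indicator is meant to reflect what engineers actually experience.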

Use Error Budgets to Guide Decision-Making

If your error budget is intact, keep shipping. If you’re burning through it, slow down and focus on stability. This encourages healthy tension between innovation and resilience, without relying on vague instincts or subjective judgment.

Error budgets also help reset the conversation during retrospectives. Instead of asking “did we break anything?”, try asking “did we operate within our agreed risk tolerance?”
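One way to turn this into a repeatable check is sketched below in Python. The function names and the 25% caution threshold are assumptions made for illustration; where you draw the line between "intact" and "burning through it" is a policy decision for your team.

```python
def budget_remaining(slo_percent: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = total_events * (100 - slo_percent) / 100
    if allowed_bad <= 0:
        raise ValueError("no error budget: check the SLO target and event count")
    return 1 - bad_events / allowed_bad


def release_guidance(remaining: float, caution_threshold: float = 0.25) -> str:
    """Map remaining budget to the two stances described above.

    caution_threshold is a hypothetical policy value, not a standard.
    """
    if remaining < caution_threshold:
        return "slow down and focus on stability"
    return "keep shipping"


# Hypothetical window: 99.5% SLO, 10,000 requests, 20 of them failed.
remaining = budget_remaining(slo_percent=99.5, total_events=10_000, bad_events=20)
print(f"{remaining:.0%} of budget left -> {release_guidance(remaining)}")
```

In a retrospective, the same number answers the question “did we operate within our agreed risk tolerance?” directly: a negative value means the answer is no.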

Track and Reduce Toil Proactively

Toil creeps in silently. A small manual step here, a repeated workaround there - over time, they steal capacity from your roadmap.

Make toil visible by asking:

What tasks are repeated frequently?
What causes unnecessary handoffs or delays?
What could be automated if someone had time?

Allocate time every sprint or cycle to remove or reduce toil. Treat it as you would technical debt: it’s not always urgent, but it’s always important.
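Making toil visible can start with something as small as a shared log of task names and minutes spent. The Python sketch below assumes such a log exists; the `rank_toil` helper and the sample entries are hypothetical and only show how the totals might be ranked to decide what to automate first.

```python
from collections import defaultdict


def rank_toil(log: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Total the minutes spent per task and return them, largest first."""
    totals: dict[str, float] = defaultdict(float)
    for task, minutes in log:
        totals[task] += minutes
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical entries recorded by the team over one sprint.
toil_log = [
    ("re-run flaky CI pipeline", 15),
    ("restart crashed service", 10),
    ("re-run flaky CI pipeline", 20),
    ("fix environment issue", 45),
    ("re-run flaky CI pipeline", 25),
    ("restart crashed service", 10),
]

for task, minutes in rank_toil(toil_log):
    print(f"{minutes:>5.0f} min  {task}")
```

Even a rough tally like this gives the sprint conversation something specific to point at, rather than a general sense that "things are slow".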

Rethinking What “Reliable” Means

One of the biggest mindset shifts when adopting these practices is this:

Perfection is not the goal. Predictability is.

Your platform doesn’t have to be flawless. But it should behave in ways that are understandable, recoverable, and fair. When something goes wrong, users should feel confident that it will be resolved quickly, that the team is already working to prevent it happening again, and that they will be kept informed openly and proactively.

That’s what builds trust. And trust is the true UX of any internal platform.

In Summary

If you're serious about treating your platform as a product, reliability can’t be an afterthought - it must be designed in.

SLIs and SLOs give teams a shared definition of “good enough”
Error budgets create guardrails for fast, safe iteration
Toil reduction protects energy, creativity, and long-term sustainability

These aren’t just tools - they’re how great teams balance reliability and agility, trust and speed, autonomy and alignment.

Ragan McGill

Engineering leader blending strategy, culture, and craft to build high-performing teams and future-ready platforms. I drive transformation through autonomy, continuous improvement, and data-driven excellence - creating environments where people thrive, innovation flourishes, and outcomes matter. Passionate about empowering others and reshaping engineering for impact at scale. Let’s build better, together.