In my last article, I argued that internal platforms should be treated as products - thoughtfully designed, user-centred, and built to empower engineers to move quickly and confidently. And once you adopt this mindset, a natural question follows:
How do you build a platform that earns trust, not just usage? The answer lies in how you define and manage reliability.
It’s not enough to say “we’re up” - modern teams need a shared understanding of what “good enough” looks like, how much failure is acceptable, and how operational pain is addressed. That’s where service level indicators (SLIs), service level objectives (SLOs), error budgets, and a deliberate approach to toil come in.
These aren’t just operational metrics - they’re the foundations of dependable, developer-friendly platforms.
Before we can manage reliability, we need to define it in clear, measurable terms:
Let’s step out of the tech world for a moment. Think about your daily train commute.
You take the same 7:43 service every morning. Most days, it arrives on time, and your routine flows without friction. You grab a coffee. You arrive at work just as planned.
But imagine if - without warning - that same train starts arriving unpredictably. Some days it’s early. Others, 10–15 minutes late. And no one tells you why.
Naturally, your trust in the service erodes. You start adjusting:
Here, the SLI is simple: on-time arrival.
The SLO might be: “the train should arrive within five minutes of schedule, 97% of the time.”
And that 3% wiggle room? That’s your error budget - a margin for signal failures, bad weather, or the occasional unavoidable delay. More on this later.
The difference isn’t just performance. It’s confidence. And confidence is what lets users plan, trust, and build on top of your platform. This is what SLOs do - they turn performance into a promise. Not one of perfection, but of predictability.
Because users don’t need your platform to be flawless. They need to know what to expect - and that when it drifts, someone’s paying attention.
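To make the train example concrete, here is a minimal Python sketch of how the on-time SLI could be measured and checked against the 97% SLO. The function names, the `SloReport` type, and the sample delays are all hypothetical and chosen for illustration; a real platform would draw its events from monitoring data rather than a hard-coded list.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float     # measured fraction of "good" events
    target: float  # the SLO target, as a fraction
    met: bool      # whether the SLI meets or exceeds the target


def on_time_sli(delays_minutes, threshold_minutes=5.0):
    """Fraction of arrivals within the threshold of schedule.

    Negative delays are early arrivals; both early and late count
    against the SLI once they exceed the threshold.
    """
    if not delays_minutes:
        raise ValueError("need at least one observation")
    good = sum(1 for d in delays_minutes if abs(d) <= threshold_minutes)
    return good / len(delays_minutes)


def evaluate_slo(delays_minutes, target=0.97, threshold_minutes=5.0):
    """Compare the measured SLI with the SLO target."""
    sli = on_time_sli(delays_minutes, threshold_minutes)
    return SloReport(sli=sli, target=target, met=sli >= target)


# Hypothetical sample: 100 journeys, 4 of them more than five minutes off schedule.
delays = [0.0] * 90 + [3.0] * 6 + [12.0] * 3 + [-8.0]
report = evaluate_slo(delays)
print(f"SLI {report.sli:.2%}, target {report.target:.0%}, met: {report.met}")
```

In this made-up sample, 96 of 100 journeys are within five minutes of schedule, so the SLI falls just short of the 97% target. The same shape applies to platform SLIs: count the good events, divide by the total, and compare against the objective.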
When we talk about SLOs, we’re not just setting a target - we’re defining a tolerance for imperfection. That tolerance is known as the error budget, and it represents the small amount of failure we’re willing to accept in exchange for faster delivery, innovation, or operational flexibility.
Calculating it is simple: Error Budget = 100% – SLO target. Then apply that percentage to the total time window you’re measuring against.
Let’s look at what that means in a typical 30-day month (720 hours):
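The figures for common targets can be reproduced with a few lines of Python. This is a rough sketch that applies the formula above to a 720-hour window; the function name and the choice of targets are assumptions for illustration.

```python
def error_budget_seconds(slo_target_percent, window_hours=720):
    """Allowed failure time in seconds: (100% - SLO target) applied to the window."""
    budget_fraction = (100.0 - slo_target_percent) / 100.0
    return window_hours * 3600 * budget_fraction


# Three, four, and five nines over a 30-day month (720 hours).
for target in (99.9, 99.99, 99.999):
    seconds = error_budget_seconds(target)
    print(f"{target}% -> {seconds / 60:.2f} minutes ({seconds:.0f} seconds)")
```

That works out to roughly 43 minutes of allowed failure at 99.9%, a little over 4 minutes at 99.99%, and about 26 seconds at 99.999%.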
The difference is staggering. Over a 30-day month, a 99.9% target allows roughly 43 minutes of failure, while a 99.999% target allows about 26 seconds. Moving from three nines to five nines doesn’t just mean more reliability - it means 100 times less room for failure. That’s a major strategic decision. Every extra nine you commit to requires more investment in automation, redundancy, observability, and incident response.
Error budgets help teams balance resilience and velocity. They give you the breathing room to release frequently, experiment safely, and build trust - without over-engineering for unreachable perfection.
Imagine you're in a small rowboat, headed for a clear destination. You have a map, a plan, and a crew. But there’s a leak - not a crisis, just a slow, steady trickle.
So every day, before you can row, you have to bail. Bucket after bucket. It’s mindless, repetitive, and thankless. At first, you talk about fixing the hole. But over time, bailing becomes normal. You optimise for it. You assign people to it. You build rituals around it.
You’ve accepted the leak - and slowed the journey. This is what toil feels like in digital engineering. Toil is the manual, repetitive work that:
But it does consume time, burn out good people, and quietly slow teams down. The worst part? The longer you tolerate it, the more invisible it becomes - absorbed into culture, accepted as "just the way things work."
Examples of toil include:
Toil doesn’t show up on roadmaps. It’s not demoed. But it accumulates. Fix the leak, not just the symptoms. That’s how you reclaim velocity, morale, and meaningful progress.
As platform engineers, our job is to provide reliable infrastructure, tools, and workflows that enable delivery teams to move faster, more safely, and with less friction. But reliability, like usability, is experienced - not just measured.
SLIs, SLOs, error budgets, and a deliberate approach to toil bring structure to how we think about service health, how we respond to incidents, and how we prioritise work.
Here’s how to make this real:
Start by identifying where reliability matters most to your users. For example:
Choose indicators that reflect what engineers feel when things go wrong.
SLOs shouldn’t be aspirational - they should be achievable and useful. A good SLO gives teams confidence to act and sets the right expectations with users.
For instance:
Start with just one or two per service. Make them visible. Review them regularly.
If your error budget is intact, keep shipping. If you’re burning through it, slow down and focus on stability. This encourages healthy tension between innovation and resilience, without relying on vague instincts or subjective judgment.
Error budgets also help reset the conversation during retrospectives. Instead of asking “did we break anything?”, try asking “did we operate within our agreed risk tolerance?”
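One way to turn this into a repeatable decision is a small policy check. The sketch below is an assumption about how such a policy might look, not a prescribed standard: the function names, the 25% caution threshold, and the three outcomes are all hypothetical and should be agreed with your own teams.

```python
def budget_remaining(total_events, failed_events, slo_target=0.999):
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = total_events * (1 - slo_target)
    if allowed_failures <= 0:
        raise ValueError("SLO target leaves no error budget to spend")
    return 1 - failed_events / allowed_failures


def release_guidance(remaining, caution_threshold=0.25):
    """Map the remaining budget to a hypothetical release policy."""
    if remaining <= 0:
        return "freeze"      # budget exhausted: prioritise stability work
    if remaining < caution_threshold:
        return "slow down"   # budget nearly spent: ship with extra care
    return "keep shipping"   # budget healthy: continue as normal


# Hypothetical month: one million requests against a 99.9% SLO,
# which allows about 1,000 failed requests.
for failed in (400, 900, 1200):
    remaining = budget_remaining(1_000_000, failed)
    print(f"{failed} failures -> {remaining:.0%} budget left -> {release_guidance(remaining)}")
```

The exact thresholds matter less than agreeing on them in advance, so the retrospective question becomes a lookup rather than a debate.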
Toil creeps in silently. A small manual step here, a repeated workaround there - over time, they steal capacity from your roadmap.
Make toil visible by asking:
Allocate time every sprint or cycle to remove or reduce toil. Treat it as you would technical debt: it’s not always urgent, but it’s always important.
One of the biggest mindset shifts when adopting these practices is this:
Perfection is not the goal. Predictability is.
Your platform doesn’t have to be flawless. But it should behave in ways that are understandable, recoverable, and fair. When something goes wrong, users should feel confident that it will be resolved quickly, that the team is already working to prevent it happening again, and that they will be kept informed openly and proactively.
That’s what builds trust. And trust is the true UX of any internal platform.
If you're serious about treating your platform as a product, reliability can’t be an afterthought - it must be designed in.
SLIs, SLOs, error budgets, and toil reduction aren’t just tools - they’re how great teams balance reliability and agility, trust and speed, autonomy and alignment.