Using Error Budgets as a Business Tool
Business leaders like the sound of 100% uptime. It feels clean and absolute: your service never stops, money never stops flowing, and customers never get frustrated. But here’s the truth: 100% reliability is impossible, and chasing it will bleed your company dry. Each extra “nine” of availability costs more than the last, and past a certain point it costs more than it will ever return.
This article — the first in a three‑part series on Site Reliability Engineering (SRE) for business owners — explains why 100% is a myth, how to think about the true cost of reliability, and why an error budget is the tool every product team should own.
Myth of 100%
Think about uptime as “nines”: 99%, 99.9%, 99.99%. Each extra nine cuts the allowed downtime by a factor of ten (figures below assume a 30‑day month):
99% (“two nines”) ≈ 7 hours of downtime per month
99.9% (“three nines”) ≈ 43 minutes per month
99.99% (“four nines”) ≈ 4 minutes per month
99.999% (“five nines”) ≈ 26 seconds per month
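For readers who want to check the arithmetic, the figures above can be reproduced with a few lines of Python. This is a minimal sketch that assumes a 30‑day month; the function name downtime_minutes is made up for this illustration.

```python
# Allowed downtime per month for a given availability target.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes, assuming a 30-day month


def downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per month permitted by an availability target."""
    return MINUTES_PER_MONTH * (1 - availability_pct / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} min ({minutes * 60:.0f} s)")
```

A longer or shorter month shifts the numbers slightly, which is why the list rounds them.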
Getting from three to four nines requires serious investment: multi‑region systems, automatic failover, people on call 24/7. Chasing five nines? That’s aviation‑grade reliability, and it’s priced like aviation: extremely costly to engineer at scale.
Now stretch that logic to 100%. That means no downtime, ever, across data centers, clouds, networks, humans, and third‑party services. It’s simply not going to happen.
So there is a point at which spending more money to shave off seconds of downtime is a worse deal than simply accepting the risk.
SRE: Reliability as a Business Function
This is where Site Reliability Engineering comes in. Born at Google, SRE is about making reliability measurable and manageable. It gives you a framework to ask: What level of reliability do our customers actually need, and how much are we willing to pay for it?
Think of SRE like the finance department, but for uptime. Just as finance manages dollars and cents, SRE manages minutes of downtime and performance thresholds.
A Simple Contract
The most practical tool in SRE is the error budget. It’s just a way of saying: here’s how much downtime or failure we can live with.
Let’s say your company promises 99.9% uptime. That’s about 43 minutes of downtime in a 30‑day month. Those 43 minutes are your budget.
If the budget has room left: the product team can move fast, ship new features, and take risks. Innovation gets the green light.
If the budget is spent: no more gambling. The priority shifts to fixing problems, hardening the system, and earning trust back.
Ship it or chill — the budget makes the call.
And here’s the important part: the product team, not just engineering, should own that budget. Why? Because product is the one deciding whether to push a risky release to meet a market demand, or to hold back for stability. The budget forces product and engineering to be on the same page, with numbers instead of opinions.
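The budget logic described in this section can be sketched in code. This is an illustration only, not a standard tool: the ErrorBudget class, its field names, and the 30 minutes of recorded downtime are all hypothetical, and a 30‑day month is assumed.

```python
from dataclasses import dataclass

MINUTES_PER_MONTH = 30 * 24 * 60  # assumes a 30-day month


@dataclass
class ErrorBudget:
    """Hypothetical monthly error budget for one service."""

    slo_pct: float             # promised availability, e.g. 99.9
    downtime_spent_min: float  # downtime recorded so far this month

    @property
    def total_min(self) -> float:
        # Total downtime the SLO allows per month.
        return MINUTES_PER_MONTH * (1 - self.slo_pct / 100)

    @property
    def remaining_min(self) -> float:
        # Negative once the budget is overspent.
        return self.total_min - self.downtime_spent_min

    def can_ship_features(self) -> bool:
        # Room left: features may ship. Budget spent: fixes only.
        return self.remaining_min > 0


# Example values are invented for illustration.
budget = ErrorBudget(slo_pct=99.9, downtime_spent_min=30)
print(f"Budget: {budget.total_min:.1f} min, remaining: {budget.remaining_min:.1f} min")
print("Ship features" if budget.can_ship_features() else "Fixes only")
```

The point of the sketch is that the ship-or-hold decision comes from two numbers that product and engineering can both read, rather than from a debate.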
Downtime Hurts
Downtime isn’t abstract. It’s lost opportunities, unhappy customers, and a dent in your reputation:
E‑commerce: 30 minutes down on Black Friday can wipe out a month’s profit.
Fintech: at a volume of tens of millions of payments, even a 0.01% error rate means thousands of failures, plus angry calls and maybe fines.
SaaS: repeated outages lower NPS and drive churn faster than any discount campaign can repair.
An error budget tells you exactly how much pain you’re willing to risk — and whether you’re still inside safe territory.
A Common Language
Error budgets do more than guide engineers. They give everyone a shared reality:
Executives see reliability as a line item, right next to revenue and churn.
Product teams understand the trade‑off between speed and stability, and they control when to push and when to pause.
Engineers escape blame, because an outage isn’t “your fault”; it’s spending against a budget everyone agreed on.
It’s a simple contract: here’s the tolerance we agreed on. If we cross it, we change our behavior. No drama, no finger‑pointing.
What This Means
If you run the P&L, here’s how to use reliability like a lever — not a prayer:
Set the target, in minutes. Pick an SLO per critical journey (sign‑in, checkout, payments) and approve the monthly error budget in minutes. Example: “Checkout SLO 99.9% -> 43 minutes.”
Make Product own the budget. Tie release gates to it: green = ship, amber (less than 50% of the budget left by mid‑month) = reduce risk, red (budget exhausted) = ship fixes only until stability is restored.
Put it on your exec dashboard. Show SLO attainment, budget remaining (min), cost of incidents, and top risks. Review weekly like revenue.
Align promises with spend. Don’t sell an SLA you won’t fund. If you add a “nine,” add budget for people, tooling, and redundancy.
Treat outages as learning, not blame. Mandate blameless postmortems with clear owners and due dates; track completion as a KPI.
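The green, amber, and red release gate from the list above could be automated along these lines. This is a sketch under stated assumptions: the function release_gate and its parameters are hypothetical, and the amber rule is one possible reading that generalizes “less than 50% left mid‑month” to “a smaller share of budget left than of the month left.”

```python
def release_gate(budget_min: float, spent_min: float, month_elapsed: float) -> str:
    """Return 'green', 'amber', or 'red' for the current release gate.

    month_elapsed is the fraction of the month that has passed (0.0 to 1.0).
    """
    remaining = budget_min - spent_min
    if remaining <= 0:
        return "red"    # budget exhausted: ship fixes only
    budget_left = remaining / budget_min  # share of the budget still unspent
    month_left = 1 - month_elapsed        # share of the month still to come
    if budget_left < month_left:
        return "amber"  # burning faster than the calendar: reduce risk
    return "green"      # on or ahead of plan: ship


# Checkout SLO 99.9% -> 43.2 minutes; spend figures are invented examples.
print(release_gate(43.2, 10, 0.5))
print(release_gate(43.2, 30, 0.5))
print(release_gate(43.2, 45, 0.8))
```

Under this reading the gate is strict early in the month, since any early outage puts spending ahead of the calendar. A team might prefer a fixed threshold instead; the thresholds are a policy choice, not a technical one.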
Bottom line: reliability is a business decision. Define how much imperfection you can afford, fund it properly, and let the error budget steer speed versus safety.
What’s Next
This was the foundation. In the next post, we’ll go deeper into error budgets, SLAs, and SLOs — the full toolkit for translating reliability into dollars and customer satisfaction. We’ll cover simple ways to measure reliability, keep costs sane, and shape a roadmap that matches what your customers actually notice.
Because SRE isn’t about technical perfection. It’s about building systems, and companies, that know exactly how much imperfection they can afford.