SLIs, SLOs, and Error Budgets (The Mathematics of Reliability)
Audience: SRE teams, product managers, and engineering leaders defining and measuring service reliability.
This article assumes:
- Perfect reliability (100% uptime) is impossible and economically irrational.
- Users care about their experience, not your internal metrics.
- Reliability work competes with feature development for engineering time.
- What you measure shapes what you optimize (Goodhart's Law applies).
Challenge: Your "five 9s" promise is meaningless
Scenario
Your sales team promises customers "99.999% uptime" (five 9s).
Then reality hits:
- Your monitoring shows 99.999% uptime ✓
- But customers complain the service is "always down"
- Your CEO asks: "Why are customers angry if we're meeting our SLA?"
What went wrong?
Interactive question (pause and think)
Which is the most likely culprit?
1. Customers are wrong (service was actually up)
2. Monitoring measures the wrong thing
3. 99.999% doesn't mean what sales thinks it means
4. All of the above
Take 10 seconds.
Progressive reveal (question -> think -> answer)
Answer: (2) and (3).
Classic failures:
- Monitoring checks health endpoint (returns 200), but payment processing is broken
- "Uptime" measured as "server responds" not "user can complete their task"
- 99.999% allows 5 minutes downtime/year, but if it all happens during Black Friday...
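The first failure mode above is easy to reproduce: a liveness probe that only asks whether the process is alive stays green while the user journey is broken. A minimal sketch (the endpoint names and dict-based fake service are illustrative, not a real monitoring API):

```python
def naive_uptime_check(status_by_endpoint: dict) -> bool:
    """Only asks: does the health endpoint answer 200?"""
    return status_by_endpoint.get("/health") == 200

def user_journey_check(status_by_endpoint: dict) -> bool:
    """Exercises the path a user actually needs to complete a task."""
    return all(status_by_endpoint.get(p) == 200 for p in ("/cart", "/checkout"))

# Payment backend is down: monitoring says "up", users say "down"
service = {"/health": 200, "/cart": 200, "/checkout": 502}
print(naive_uptime_check(service), user_journey_check(service))  # True False
```

The gap between those two booleans is exactly the gap between the CEO's dashboard and the customer complaints.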
Real-world analogy (airline on-time performance)
An airline claims "99% on-time departure."
But what if:
- They measure "door closed on time" not "wheels up on time"
- Flights delayed by 59 minutes count as "on time" (60 min = late)
- They only measure the specific route you don't fly
The metric is technically accurate but completely misleading.
Key insight
SLIs (Service Level Indicators) must measure what users actually care about, not what's easy to measure.
Challenge question
If you could only measure ONE metric to represent reliability, what would it be?
Mental model - Error budgets as a reliability currency
Scenario
Your product team wants to ship a new feature. Your SRE team says "We can't deploy - we're out of error budget."
Product team: "What? We have a deadline!"
SRE team: "We agreed to 99.9% uptime. We've already had 99.85% this month. We're out of budget."
Interactive question (pause and think)
Who's right?
A. Product team (business needs matter more than arbitrary targets)
B. SRE team (reliability is non-negotiable)
C. Both (they need a better framework for trade-offs)
Progressive reveal
Answer: C.
This is precisely what error budgets solve: a shared framework for reliability vs velocity trade-offs.
Mental model
Think of error budgets as:
- Financial budget: You have $X to spend; spend wisely
- Risk budget: You can afford Y failures; don't waste them
- Innovation currency: Reliability headroom enables experimentation
The goal: Make reliability a measurable trade-off, not a religious war.
Real-world parallel (construction project buffers)
A building project has a time buffer. If weather delays use up the buffer, the project stops adding risky features (complex facades, custom materials). They focus on finishing reliably.
If the buffer is intact, they can afford some experimentation.
Key insight
Error budgets formalize the trade-off: reliability is not "maximize uptime" but "maintain agreed reliability while moving fast."
Challenge question
If you have 99.9% SLO (43 minutes downtime/month), should you spend your error budget on:
- Testing risky new features in production
- Planned maintenance windows
- Neither (save it for unexpected failures)
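The "43 minutes" in the question is just arithmetic on the SLO; a quick sketch:

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime per window implied by an availability SLO."""
    return days * 24 * 60 * (1 - slo)

# 99.9% over a 30-day month -> ~43.2 minutes; 99.999% -> ~26 seconds
print(round(error_budget_minutes(0.999), 1))  # 43.2
```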
What SLI, SLO, SLA actually mean (and why people confuse them)
Scenario
Your team debates reliability targets. Three acronyms get thrown around interchangeably.
Let's clarify.
Definitions
SLI (Service Level Indicator):
- What it is: A quantitative measure of service behavior
- Examples: Request success rate, latency percentile, throughput
- Key point: The metric itself, not the target
SLO (Service Level Objective):
- What it is: A target value or range for an SLI
- Examples: "99.9% of requests succeed", "p99 latency < 500ms"
- Key point: Internal goal, sets expectations
SLA (Service Level Agreement):
- What it is: A business contract with consequences for missing SLO
- Examples: "99.9% uptime or credits", "p99 < 500ms or refund"
- Key point: Legal/financial commitment
Visual: relationship between SLI, SLO, SLA
SLI (what you measure) -> SLO (internal target on that measurement) -> SLA (external contract, deliberately looser than the SLO)
Interactive question
Your SLI shows 99.6% uptime. Your SLO is 99.9%. Your SLA is 99.5%.
What happens?
A. You violated the SLA (customer gets refund)
B. You missed the SLO (internal alarm)
C. Nothing (still above SLA)
D. B and C
Progressive reveal
Answer: D.
- Missed SLO → internal response (freeze risky changes, focus on reliability)
- Still met SLA → no customer impact, no refunds
- The buffer between SLO and SLA is intentional
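The reveal's decision logic is small enough to write down; a sketch using the thresholds from the question:

```python
def reliability_status(sli: float, slo: float, sla: float) -> str:
    """Classify a measured SLI against the internal SLO and contractual SLA."""
    if sli < sla:
        return "SLA violated: contractual consequences (credits/refunds)"
    if sli < slo:
        return "SLO missed: internal response (freeze risky changes), SLA still met"
    return "within SLO: error budget available"

print(reliability_status(0.996, slo=0.999, sla=0.995))
```

Note the ordering: the SLA check comes first because it is the stricter consequence, even though it is the looser number.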
Common confusions
Mistake 1: SLO = SLA
- Wrong: "Our SLO is our customer promise"
- Right: "Our SLA is our customer promise; our SLO is stricter to give us early warning"
Mistake 2: More nines = better
- Wrong: "99.999% is better than 99.9%"
- Right: "99.999% costs exponentially more and may not improve user experience"
Mistake 3: Uptime is the only SLI that matters
- Wrong: "We have 99.9% uptime so we're reliable"
- Right: "Uptime doesn't measure latency, data correctness, or user-perceived availability"
Key insight
SLI is the thermometer. SLO is the target temperature. SLA is the contract saying "we'll keep it above freezing."
Challenge question
Should your SLO be tighter than your SLA? By how much? What's the cost of the buffer?
Choosing good SLIs - what to measure
Scenario
You're defining SLIs for an e-commerce checkout service.
Bad SLI: "Server CPU < 80%"
- Users don't care about your CPU
- CPU can be fine while checkout is broken
Good SLI: "Checkout success rate > 99.9%"
- Directly maps to user experience
Interactive question (pause and think)
For a payment processing service, which SLI is most valuable?
A. Server uptime percentage
B. Request success rate (2xx HTTP responses)
C. Payment authorization success rate
D. Database query latency
Progressive reveal
Answer: C.
- A: Servers can be "up" while payments fail
- B: 200 OK doesn't mean payment succeeded (could be 200 {"error": "declined"})
- C: Directly measures user intent
- D: Internal metric, doesn't map to user experience
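Option B's trap, a 200 OK wrapping a failed payment, is worth making concrete. A sketch with illustrative response shapes (whether a bank decline should count against *your* SLI is a judgment call; here it does, matching the question's framing):

```python
def payment_auth_sli(responses: list[dict]) -> float:
    """Count a request as good only if the payment was approved,
    not merely because the HTTP layer returned 200."""
    good = sum(1 for r in responses
               if r["status"] == 200 and r["body"].get("result") == "approved")
    return good / len(responses)

responses = [
    {"status": 200, "body": {"result": "approved"}},
    {"status": 200, "body": {"result": "declined"}},  # HTTP says fine, user failed
    {"status": 500, "body": {}},
]
# HTTP-level success rate: 2/3. Payment authorization SLI: 1/3.
```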
SLI selection framework
The "user-centric" test:
Ask: "Would a user notice if this SLI degraded?"
✓ Good SLIs (user-visible):
- Request success rate (from user's perspective)
- Request latency (time to complete user action)
- Data freshness (stale data = bad UX)
- Throughput (can system handle user load)
✗ Bad SLIs (invisible to users):
- CPU utilization (infrastructure metric)
- Memory usage (internal resource)
- Log error rate (unless it correlates with user impact)
The "actionable" test:
Ask: "If this SLI degrades, can we take meaningful action?"
✓ Good: "Checkout success rate dropped to 95%" → investigate payment gateway
✗ Bad: "System health score dropped to 80%" → what does that even mean?
SLI categories
- Availability: fraction of requests that succeed
- Latency: distribution of response times (p50/p95/p99)
- Freshness/correctness: is the data recent and right?
- Throughput: can the system keep up with demand?
Example: Multi-SLI service
The checkout service above might track three SLIs at once: checkout success rate (availability), p99 checkout latency, and inventory freshness. One service, three user-visible dimensions.
Production insight: Google's "The Four Golden Signals"
Google SRE recommends focusing on:
- Latency: How long requests take
- Traffic: How much demand
- Errors: Rate of failed requests
- Saturation: How "full" the service is
Start simple. Add complexity only when needed.
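Three of the four signals fall straight out of a request log; saturation needs resource metrics and is omitted here. A sketch (the log field names are illustrative):

```python
def golden_signals(requests: list[dict], window_s: float) -> dict:
    """Latency (p99), traffic (req/s), and error rate for one window."""
    latencies = sorted(r["latency_ms"] for r in requests)
    p99_index = min(len(latencies) - 1, int(0.99 * len(latencies)))
    errors = sum(1 for r in requests if r["error"])
    return {
        "latency_p99_ms": latencies[p99_index],
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
    }

log = [{"latency_ms": 10 * i, "error": i < 2} for i in range(100)]
print(golden_signals(log, window_s=10.0))
```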
Key insight
The best SLI is one that, when it degrades, a user has a bad experience. If users don't care, it's not an SLI - it's an internal metric.
Challenge question
For a video streaming service, should you have separate SLIs for "start time" and "buffering rate" or combine them into one SLI?
Setting SLO targets - the Goldilocks problem
Scenario
You need to set an SLO target for your API.
Too high (99.999%): Engineers burn out chasing impossible reliability. Costs skyrocket.
Too low (95%): Customers churn. Product team can't sell.
How do you find the right target?
Think about it
What's the cost of each additional "9" of reliability?
Interactive question (pause and think)
Your current reliability is 99.5%. Customers are complaining. What's your next SLO target?
A. 99.99% (skip ahead, be the best)
B. 99.9% (incremental improvement)
C. 99.5% (match current reality, then improve)
Progressive reveal
Answer: Usually B, sometimes C.
Why not A? Going from 99.5% → 99.99% is ~10x harder than 99.5% → 99.9%. Start with achievable targets.
The cost curve of reliability
Roughly, each additional nine costs an order of magnitude more engineering effort while removing only a tenth as much remaining downtime. 99% to 99.9% might mean better deploys and alerting; 99.9% to 99.99% usually means redundancy and automated failover; beyond that, humans can no longer be in the recovery loop at all.
SLO target selection framework
Step 1: Measure current performance (4 weeks)
Step 2: Talk to users
Step 3: Analyze impact of outages
Step 4: Set targets with buffer
Step 5: Calculate error budget
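Step 5 is mechanical once traffic volume is known; a request-based sketch:

```python
def error_budget_requests(slo: float, monthly_requests: int) -> int:
    """Failed requests the SLO tolerates over the month."""
    return round(monthly_requests * (1 - slo))

def budget_remaining(slo: float, total: int, failed: int) -> float:
    """Fraction of the budget still unspent (negative means overspent)."""
    budget = total * (1 - slo)
    return (budget - failed) / budget

# 99.9% SLO over 10M requests/month tolerates ~10,000 failures
print(error_budget_requests(0.999, 10_000_000))  # 10000
```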
Common SLO target mistakes
Mistake 1: Matching competitor SLOs blindly
- Wrong: "AWS promises 99.99%, we should too"
- Right: "Our users need 99.9%, and we can deliver it reliably"
Mistake 2: One-size-fits-all SLOs
- Wrong: "All our services have 99.9% uptime SLO"
- Right: "Critical: 99.9%, important: 99%, nice-to-have: 95%"
Mistake 3: Aspirational SLOs
- Wrong: "We're at 99%, let's set SLO to 99.99% to motivate the team"
- Right: "We're at 99%, let's set SLO to 99.5% and actually achieve it, then increase"
Mistake 4: Forgetting dependencies
- Wrong: "We'll be 99.99% reliable" (but your database vendor only promises 99.9%)
- Right: "Our SLO must account for dependency SLOs" (can't exceed weakest link)
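Mistake 4 has concrete arithmetic behind it: if every request needs every dependency, availabilities multiply (assuming independent failures), so your ceiling sits below your weakest link. A sketch:

```python
from math import prod

def serial_availability(dependency_slos: list[float]) -> float:
    """Best-case availability of a service that needs every dependency
    on every request (failures assumed independent)."""
    return prod(dependency_slos)

# Perfect code on a 99.9% DB plus a 99.95% queue caps you at ~99.85%
print(round(serial_availability([0.999, 0.9995]), 4))  # 0.9985
```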
Key insight
Your SLO should be: (1) Achievable with current architecture, (2) Meaningful to users, (3) Tighter than your SLA. If it's not all three, revisit.
Challenge question
Your SLO is 99.9%, but you're consistently achieving 99.95%. Should you tighten your SLO or keep the buffer?
Error budgets - the engine of risk-taking
Scenario
Month starts:
- SLO: 99.9% success rate
- Error budget: 43 minutes of downtime allowed
By mid-month:
- 30 minutes of downtime already burned (bad deployment)
- 13 minutes remaining
Product team wants to ship a major feature.
Interactive question (pause and think)
What should you do?
A. Ship anyway (business pressure)
B. Freeze deploys until next month (protect SLO)
C. Ship only low-risk changes (burn budget carefully)
D. Revise SLO downward (adjust expectations)
Progressive reveal
Answer: C, with team agreement.
Error budgets are meant to be spent, not hoarded. But spend wisely.
Error budget mechanics
Definition:
Consumption tracking:
Policy framework:
Example: Error budget in practice
Burn rate alerting
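Burn rate measures how fast you spend budget relative to plan: 1.0 means on pace to spend exactly the whole budget by the end of the window. Multi-window alerts page only when both a long and a short window burn fast, so incidents that have already recovered stop paging. A sketch (the 14.4 threshold follows the commonly cited Google SRE Workbook fast-burn policy; treat it as an assumption to tune for your service):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Observed error rate divided by the allowed error rate."""
    return error_rate / (1 - slo)

def fast_burn_page(err_1h: float, err_5m: float, slo: float,
                   threshold: float = 14.4) -> bool:
    """Page when both windows exceed the threshold
    (at 14.4x, a 30-day budget is gone in about 2 days)."""
    return (burn_rate(err_1h, slo) >= threshold
            and burn_rate(err_5m, slo) >= threshold)

# 99.9% SLO: a sustained 1.5% error rate burns the budget ~15x too fast
print(fast_burn_page(0.015, 0.015, slo=0.999))  # True
```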
Production insight: How Netflix uses error budgets
Netflix:
- Each service has error budget
- Teams can spend budget on experiments (chaos, canary, A/B tests)
- When budget low, automated policies kick in (deploy freeze)
- Incentivizes reliability: more reliable = more freedom to innovate
Key insight
Error budgets convert reliability from a vague goal into a concrete currency. You can spend it on innovation or hoard it. Choose wisely.
Challenge question
Your error budget is 43 min/month. Should you spend it all on: (1) one big risky feature launch, or (2) many small experiments? What's the trade-off?
Common SLO anti-patterns
Scenario
You've set SLOs. But things feel off. Users are still unhappy. Engineers are frustrated.
SLO anti-patterns catalog
Anti-pattern 1: Vanity SLOs
- Targets chosen to look good in reports rather than to reflect user experience
Anti-pattern 2: Invisible SLOs
- Defined once, then never dashboarded or consulted in decisions
Anti-pattern 3: Inside-out SLOs
- Built from internal metrics (CPU, queue depth) instead of user outcomes
Anti-pattern 4: One-size-fits-all
- Every service gets the same target regardless of criticality
Anti-pattern 5: No SLO enforcement
- The budget is exhausted, but risky deploys continue as if nothing happened
Anti-pattern 6: Missing user-journey SLOs
- Each service meets its own SLO, yet the end-to-end journey still fails
Real-world failure: Compound SLO violations
Five serial services each at 99.9% compound to roughly 99.5% end-to-end (0.999^5 ≈ 0.995): every team's dashboard is green while the user's journey is red.
Key insight
Bad SLOs are worse than no SLOs. They create false confidence and misallocate engineering effort.
Challenge question
If your SLO is consistently exceeded (always 99.99% when target is 99.9%), is that good or bad?
Measuring SLIs in practice - implementation patterns
Scenario
You've defined your SLIs. Now you need to actually measure them.
Where do you measure? How often? What's the source of truth?
Measurement approaches
Client-side measurement:
✓ Pros:
- Sees actual user experience (network latency, client-side rendering)
- Accounts for CDN, geographic distribution
✗ Cons:
- Can't track requests that never arrive (network failures)
- Sampling bias (only successful page loads report)
- User privacy concerns
Server-side measurement:
✓ Pros:
- Complete picture (all requests, including failures)
- No sampling bias
- Easier to implement at scale
✗ Cons:
- Doesn't see client-side latency (network, rendering)
- May miss user-perceived failures (e.g., JS errors)
Synthetic monitoring (probes):
✓ Pros:
- Detects issues before users do
- Consistent baseline (not affected by traffic patterns)
- Works even when no user traffic
✗ Cons:
- May not match real user behavior
- Can't measure user-specific features (auth, personalization)
- Creates artificial load
Measurement Approaches Comparison
| Approach | Sees real user experience | Complete coverage | Detects issues proactively | Best for |
|---|---|---|---|---|
| Client-side (RUM) | Yes | No (only successful loads) | No | User-facing latency, Core Web Vitals |
| Server-side | No (misses network/render) | Yes (all requests) | No | API reliability, error rates |
| Synthetic monitoring | No (artificial) | No (sampled) | Yes | Uptime monitoring, multi-region checks |
Recommended: Multi-layer measurement
Use server-side metrics as the SLO source of truth, client-side RUM to validate them against real user experience, and synthetic probes for proactive coverage when traffic is low.
SLI calculation methods
Method 1: Request-based (most common)
Method 2: Windows-based (for batch/streaming)
Method 3: Burn rate (for error budgets)
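The three methods differ only in what counts as a unit of success; a sketch (event and window shapes are illustrative):

```python
def request_based_sli(good_events: int, total_events: int) -> float:
    """Method 1: fraction of individual requests that were good."""
    return good_events / total_events

def windows_based_sli(window_ok: list[bool]) -> float:
    """Method 2: fraction of time windows that met their per-window
    goal, a natural fit for batch and streaming pipelines."""
    return sum(window_ok) / len(window_ok)

def budget_burn_rate(error_rate: float, slo: float) -> float:
    """Method 3: observed error rate relative to the allowed rate."""
    return error_rate / (1 - slo)

# 997/1000 requests good; 287 of 288 five-minute windows healthy
print(request_based_sli(997, 1000),
      round(windows_based_sli([True] * 287 + [False]), 4))
```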
Key insight
SLI measurement is not "set it and forget it." Continuously validate: does this metric reflect user experience? If not, adjust.
Challenge question
Your server-side SLI shows 99.9% success rate, but your client-side RUM shows 95% success rate. What's happening?
Final synthesis - Design a complete SLO framework
Synthesis challenge
You're the SRE lead for a ride-sharing platform.
Requirements:
- 50 million rides/month
- Mobile app (iOS + Android)
- Backend services: ride matching, payments, notifications, maps
- Users care about: finding a ride quickly, accurate pricing, reliable payment
Current state:
- No SLOs defined
- Monitoring exists but ad-hoc
- Incidents happen but no clear severity criteria
- Engineers debate "is this reliable enough?" constantly
Your tasks (pause and think)
- Define 3-5 key SLIs (what to measure?)
- Set SLO targets for each SLI (what's good enough?)
- Calculate error budgets (how much failure is acceptable?)
- Design error budget policy (what happens when budget runs low?)
- Implement measurement strategy (how to track?)
- Define escalation triggers (when to page someone?)
Write down your framework.
Progressive reveal (one possible solution)
1. Key SLIs (illustrative choices):
- Ride request success rate (user asked for a ride and got matched)
- Time-to-match, p95 (how quickly a driver is found)
- Payment authorization success rate
- Notification delivery rate (driver-arrival and receipt pushes)
2. SLO targets (example values, to be tuned from measured baselines):
- Ride request success: 99.9%
- Time-to-match p95 under 30 seconds
- Payment authorization: 99.95%
- Notifications delivered within 1 minute: 99%
3. Error budgets: at 50 million rides/month, a 99.9% SLO tolerates about 50,000 failed ride requests per month.
4. Error budget policy: above 50% budget remaining, normal release cadence; below 50%, low-risk changes only; exhausted, deploy freeze plus a reliability-focused sprint.
5. Measurement strategy: server-side metrics as the SLO source of truth, mobile RUM to validate user-perceived latency, synthetic probes per region for proactive coverage.
6. Escalation triggers: fast-burn alerts page immediately (budget would be gone in days); slow-burn alerts open tickets (budget would be gone by month-end).
Key insight
A complete SLO framework turns reliability from art to science: measurable, predictable, and blameless.
Final challenge question
Your SLOs are met (99.9% success rate), but your customer NPS (Net Promoter Score) is dropping. What's wrong with your SLO framework?
Appendix: Quick checklist (printable)
SLI definition checklist:
- Measures something a user would notice if it degraded
- Actionable when it degrades
- Measured close to the user, or validated against real-user data
SLO target checklist:
- Achievable with current architecture (based on at least 4 weeks of data)
- Meaningful to users
- Tighter than the SLA
- Accounts for dependency SLOs
Error budget checklist:
- Budget calculated per window (time-based or request-based)
- Consumption tracked continuously and visible to the team
- Policy agreed with product before the budget is needed
Measurement checklist:
- Source of truth defined for each SLI
- Multi-layer coverage (server-side, client-side, synthetic) where it matters
Operational checklist:
- Burn-rate alerts configured (fast and slow windows)
- Budget status on a dashboard everyone can see
- SLOs reviewed on a regular cadence
Red flags (reassess SLOs):
- SLO always met with a huge margin: target may be too loose, or users now expect more
- SLO met but users unhappy: the SLI measures the wrong thing
- SLO chronically missed: the target is aspirational, not achievable
Key Takeaways
- SLIs (Service Level Indicators) are the metrics that define service health — request latency, error rate, and availability are the most common
- SLOs (Service Level Objectives) set targets for SLIs — e.g., 99.9% of requests complete in under 200ms
- Error budgets are the allowed amount of unreliability — a 99.9% SLO leaves roughly 43 minutes of downtime budget per month
- When error budget is exhausted, freeze feature releases and focus on reliability — this creates a natural balance between velocity and stability
- SLAs are contractual commitments with financial penalties — SLOs are internal targets; SLAs should be looser than SLOs to provide a safety margin