Reliability And Resilience · Chapter 48 of 51

Akhil Sharma
20 min

SLIs, SLOs, and Error Budgets (The Mathematics of Reliability)

Audience: SRE teams, product managers, and engineering leaders defining and measuring service reliability.

This article assumes:

  • Perfect reliability (100% uptime) is impossible and economically irrational.
  • Users care about their experience, not your internal metrics.
  • Reliability work competes with feature development for engineering time.
  • What you measure shapes what you optimize (Goodhart's Law applies).

Challenge: Your "five 9s" promise is meaningless

Scenario

Your sales team promises customers "99.999% uptime" (five 9s).

Then reality hits:

  • Your monitoring shows 99.999% uptime ✓
  • But customers complain the service is "always down"
  • Your CEO asks: "Why are customers angry if we're meeting our SLA?"

What went wrong?

Interactive question (pause and think)

Which is the most likely culprit?

  1. Customers are wrong (service was actually up)
  2. Monitoring measures the wrong thing
  3. 99.999% doesn't mean what sales thinks it means
  4. All of the above

Take 10 seconds.

Progressive reveal (question -> think -> answer)

Answer: (2) and (3).

Classic failures:

  • Monitoring checks health endpoint (returns 200), but payment processing is broken
  • "Uptime" measured as "server responds" not "user can complete their task"
  • 99.999% allows 5 minutes downtime/year, but if it all happens during Black Friday...

Real-world analogy (airline on-time performance)

An airline claims "99% on-time departure."

But what if:

  • They measure "door closed on time" not "wheels up on time"
  • Flights delayed by 59 minutes count as "on time" (60 min = late)
  • They only measure the specific route you don't fly

The metric is technically accurate but completely misleading.

Key insight

SLIs (Service Level Indicators) must measure what users actually care about, not what's easy to measure.

Challenge question

If you could only measure ONE metric to represent reliability, what would it be?


Mental model - Error budgets as a reliability currency

Scenario

Your product team wants to ship a new feature. Your SRE team says "We can't deploy - we're out of error budget."

Product team: "What? We have a deadline!"

SRE team: "We agreed to 99.9% uptime. We've already had 99.85% this month. We're out of budget."

Interactive question (pause and think)

Who's right?

A. Product team (business needs matter more than arbitrary targets)
B. SRE team (reliability is non-negotiable)
C. Both (they need a better framework for trade-offs)

Progressive reveal

Answer: C.

This is precisely what error budgets solve: a shared framework for reliability vs velocity trade-offs.

Mental model

Think of error budgets as:

  • Financial budget: You have $X to spend; spend wisely
  • Risk budget: You can afford Y failures; don't waste them
  • Innovation currency: Reliability headroom enables experimentation

The goal: Make reliability a measurable trade-off, not a religious war.

Real-world parallel (construction project buffers)

A building project has a time buffer. If weather delays use up the buffer, the project stops adding risky features (complex facades, custom materials). They focus on finishing reliably.

If the buffer is intact, they can afford some experimentation.

Key insight

Error budgets formalize the trade-off: reliability is not "maximize uptime" but "maintain agreed reliability while moving fast."

Challenge question

If you have 99.9% SLO (43 minutes downtime/month), should you spend your error budget on:

  • Testing risky new features in production
  • Planned maintenance windows
  • Neither (save it for unexpected failures)

What SLI, SLO, SLA actually mean (and why people confuse them)

Scenario

Your team debates reliability targets. Three acronyms get thrown around interchangeably.

Let's clarify.

Definitions

SLI (Service Level Indicator):

  • What it is: A quantitative measure of service behavior
  • Examples: Request success rate, latency percentile, throughput
  • Key point: The metric itself, not the target

SLO (Service Level Objective):

  • What it is: A target value or range for an SLI
  • Examples: "99.9% of requests succeed", "p99 latency < 500ms"
  • Key point: Internal goal, sets expectations

SLA (Service Level Agreement):

  • What it is: A business contract with consequences for missing SLO
  • Examples: "99.9% uptime or credits", "p99 < 500ms or refund"
  • Key point: Legal/financial commitment

Visual: relationship between SLI, SLO, SLA

  SLI → the measurement        e.g., 99.6% of requests succeeded this month
  SLO → target for the SLI     e.g., "keep the SLI at or above 99.9%" (internal goal)
  SLA → contract on the SLO    e.g., "99.5% or service credits" (external promise)

  The SLO (99.9%) is deliberately stricter than the SLA (99.5%): the gap is your early-warning buffer.

Interactive question

Your SLI shows 99.6% uptime. Your SLO is 99.9%. Your SLA is 99.5%.

What happens?

A. You violated the SLA (customer gets refund)
B. You missed the SLO (internal alarm)
C. Nothing (still above SLA)
D. B and C

Progressive reveal

Answer: D.

  • Missed SLO → internal response (freeze risky changes, focus on reliability)
  • Still met SLA → no customer impact, no refunds
  • The buffer between SLO and SLA is intentional
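The three-way comparison above can be sketched as a tiny decision function (names and messages are illustrative):

```go
package main

import "fmt"

// status classifies a measured SLI against an internal SLO and a
// contractual SLA (the SLO is stricter than the SLA by design).
func status(sli, slo, sla float64) string {
	switch {
	case sli < sla:
		return "SLA violated: customer remedies owed"
	case sli < slo:
		return "SLO missed: internal response (freeze risky changes)"
	default:
		return "healthy: within SLO"
	}
}

func main() {
	// The scenario: SLI 99.6%, SLO 99.9%, SLA 99.5%.
	fmt.Println(status(0.996, 0.999, 0.995)) // SLO missed, SLA intact
}
```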

Common confusions

Mistake 1: SLO = SLA

  • Wrong: "Our SLO is our customer promise"
  • Right: "Our SLA is our customer promise; our SLO is stricter to give us early warning"

Mistake 2: More nines = better

  • Wrong: "99.999% is better than 99.9%"
  • Right: "99.999% costs exponentially more and may not improve user experience"

Mistake 3: Uptime is the only SLI that matters

  • Wrong: "We have 99.9% uptime so we're reliable"
  • Right: "Uptime doesn't measure latency, data correctness, or user-perceived availability"

Key insight

SLI is the thermometer. SLO is the target temperature. SLA is the contract saying "we'll keep it above freezing."

Challenge question

Should your SLO be tighter than your SLA? By how much? What's the cost of the buffer?


Choosing good SLIs - what to measure

Scenario

You're defining SLIs for an e-commerce checkout service.

Bad SLI: "Server CPU < 80%"

  • Users don't care about your CPU
  • CPU can be fine while checkout is broken

Good SLI: "Checkout success rate > 99.9%"

  • Directly maps to user experience

Interactive question (pause and think)

For a payment processing service, which SLI is most valuable?

A. Server uptime percentage
B. Request success rate (2xx HTTP responses)
C. Payment authorization success rate
D. Database query latency

Progressive reveal

Answer: C.

  • A: Servers can be "up" while payments fail
  • B: 200 OK doesn't mean payment succeeded (could be 200 {"error": "declined"})
  • C: Directly measures user intent
  • D: Internal metric, doesn't map to user experience

SLI selection framework

The "user-centric" test: Ask: "Would a user notice if this SLI degraded?"

✓ Good SLIs (user-visible):

  • Request success rate (from user's perspective)
  • Request latency (time to complete user action)
  • Data freshness (stale data = bad UX)
  • Throughput (can system handle user load)

✗ Bad SLIs (invisible to users):

  • CPU utilization (infrastructure metric)
  • Memory usage (internal resource)
  • Log error rate (unless it correlates with user impact)

The "actionable" test: Ask: "If this SLI degrades, can we take meaningful action?"

✓ Good: "Checkout success rate dropped to 95%" → investigate payment gateway
✗ Bad: "System health score dropped to 80%" → what does that even mean?

SLI categories

  • Availability: proportion of valid requests served successfully
  • Latency: proportion of requests served faster than a threshold
  • Freshness: proportion of data updated more recently than a threshold
  • Throughput: rate of successful work the system sustains
  • Correctness: proportion of responses containing the right answer

Example: Multi-SLI service

An e-commerce checkout service might track several SLIs per 30-day window:

  • Availability: 99.9% of checkout requests return a non-5xx response
  • Latency: 99% of checkout requests complete in under 500ms
  • Correctness: 99.99% of completed orders charge the quoted price

Production insight: Google's "The Four Golden Signals"

Google SRE recommends focusing on:

  1. Latency: How long requests take
  2. Traffic: How much demand
  3. Errors: Rate of failed requests
  4. Saturation: How "full" the service is

Start simple. Add complexity only when needed.

Key insight

The best SLI is one that, when it degrades, a user has a bad experience. If users don't care, it's not an SLI - it's an internal metric.

Challenge question

For a video streaming service, should you have separate SLIs for "start time" and "buffering rate" or combine them into one SLI?


Setting SLO targets - the Goldilocks problem

Scenario

You need to set an SLO target for your API.

Too high (99.999%): Engineers burn out chasing impossible reliability. Costs skyrocket.
Too low (95%): Customers churn. Product team can't sell.

How do you find the right target?

Think about it

What's the cost of each additional "9" of reliability?

Interactive question (pause and think)

Your current reliability is 99.5%. Customers are complaining. What's your next SLO target?

A. 99.99% (skip ahead, be the best)
B. 99.9% (incremental improvement)
C. 99.5% (match current reality, then improve)

Progressive reveal

Answer: Usually B, sometimes C.

Why not A? Going from 99.5% → 99.99% is ~10x harder than 99.5% → 99.9%. Start with achievable targets.

The cost curve of reliability

Each additional nine costs roughly an order of magnitude more effort:

  99%     → baseline engineering effort
  99.9%   → ~10x (redundancy, automated failover, on-call)
  99.99%  → ~100x (multi-region, automated remediation)
  99.999% → ~1000x (and most users can no longer tell the difference)

SLO target selection framework

Step 1: Measure current performance (collect at least 4 weeks of real data)

Step 2: Talk to users (what level of failure do they actually notice and tolerate?)

Step 3: Analyze impact of outages (revenue, churn, support load per incident)

Step 4: Set targets with buffer (achievable today, tighter than your SLA)

Step 5: Calculate error budget (1 − SLO, converted to minutes or failed requests)
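Step 5 is plain arithmetic. A sketch, assuming a hypothetical 100M requests/month:

```go
package main

import (
	"fmt"
	"math"
)

// errorBudget returns how many requests are allowed to fail in a
// window, given a success-rate SLO and expected traffic.
func errorBudget(slo float64, totalRequests int64) int64 {
	return int64(math.Round((1 - slo) * float64(totalRequests)))
}

func main() {
	// 99.9% SLO on 100M monthly requests: 0.1% may fail.
	fmt.Println(errorBudget(0.999, 100_000_000)) // 100000
}
```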

Common SLO target mistakes

Mistake 1: Matching competitor SLOs blindly

  • Wrong: "AWS promises 99.99%, we should too"
  • Right: "Our users need 99.9%, and we can deliver it reliably"

Mistake 2: One-size-fits-all SLOs

  • Wrong: "All our services have 99.9% uptime SLO"
  • Right: "Critical: 99.9%, important: 99%, nice-to-have: 95%"

Mistake 3: Aspirational SLOs

  • Wrong: "We're at 99%, let's set SLO to 99.99% to motivate the team"
  • Right: "We're at 99%, let's set SLO to 99.5% and actually achieve it, then increase"

Mistake 4: Forgetting dependencies

  • Wrong: "We'll be 99.99% reliable" (but your database vendor only promises 99.9%)
  • Right: "Our SLO must account for dependency SLOs" (can't exceed weakest link)

Key insight

Your SLO should be: (1) Achievable with current architecture, (2) Meaningful to users, (3) Tighter than your SLA. If it's not all three, revisit.

Challenge question

Your SLO is 99.9%, but you're consistently achieving 99.95%. Should you tighten your SLO or keep the buffer?


Error budgets - the engine of risk-taking

Scenario

Month starts:

  • SLO: 99.9% success rate
  • Error budget: 43 minutes of downtime allowed

By mid-month:

  • 30 minutes of downtime already burned (bad deployment)
  • 13 minutes remaining

Product team wants to ship a major feature.

Interactive question (pause and think)

What should you do?

A. Ship anyway (business pressure)
B. Freeze deploys until next month (protect SLO)
C. Ship only low-risk changes (burn budget carefully)
D. Revise SLO downward (adjust expectations)

Progressive reveal

Answer: C, with team agreement.

Error budgets are meant to be spent, not hoarded. But spend wisely.

Error budget mechanics

Definition: error budget = 1 − SLO. A 99.9% SLO over 30 days allows 0.1% of requests to fail, roughly 43 minutes of full downtime.

Consumption tracking: budget consumed = failures so far ÷ failures allowed for the window. Track it continuously, not just at month end.

Policy framework (example thresholds, agreed in advance):

  • More than 50% of budget remaining → normal release velocity; experiments allowed
  • 25-50% remaining → risky deploys need extra review; prioritize reliability fixes
  • Less than 25% remaining → feature freeze on the service; reliability work only
  • Exhausted → deploy freeze until the budget recovers

Example: Error budget in practice

Take the scenario above: a 43-minute monthly budget with 30 minutes already burned leaves 13 minutes of budget (about 70% consumed) at mid-month, which is faster than a sustainable pace.

Burn rate alerting

Burn rate is how fast you consume budget relative to the window. A burn rate of 1 exhausts the budget exactly at the window's end; a rate of 14.4 on a 30-day window exhausts it in about 2 days.

  • Fast-burn alert (e.g., rate above 14 sustained for an hour) → page immediately
  • Slow-burn alert (e.g., rate above 3 sustained for six hours) → ticket for working hours

Production insight: How Netflix uses error budgets

Netflix:

  • Each service has error budget
  • Teams can spend budget on experiments (chaos, canary, A/B tests)
  • When budget low, automated policies kick in (deploy freeze)
  • Incentivizes reliability: more reliable = more freedom to innovate

Key insight

Error budgets convert reliability from a vague goal into a concrete currency. You can spend it on innovation or hoard it. Choose wisely.

Challenge question

Your error budget is 43 min/month. Should you spend it all on: (1) one big risky feature launch, or (2) many small experiments? What's the trade-off?


Common SLO anti-patterns

Scenario

You've set SLOs. But things feel off. Users are still unhappy. Engineers are frustrated.

SLO anti-patterns catalog

Anti-pattern 1: Vanity SLOs

  • Dashboards show impressive numbers (server uptime: 99.99%!) while user-facing journeys quietly fail

Anti-pattern 2: Invisible SLOs

  • SLOs live in a document nobody reads: no dashboard, no alerts, no policy attached

Anti-pattern 3: Inside-out SLOs

  • Targets built on internal metrics (CPU, queue depth) instead of user-visible outcomes

Anti-pattern 4: One-size-fits-all

  • Every service gets the same 99.9% regardless of criticality, cost, or user impact

Anti-pattern 5: No SLO enforcement

  • The SLO is missed month after month and nothing changes: no freeze, no reprioritization

Anti-pattern 6: Missing user-journey SLOs

  • Every component meets its SLO, yet the end-to-end journey (login → search → checkout) still fails

Real-world failure: Compound SLO violations

A user journey spans five services, each meeting its 99.9% SLO. Assuming independent failures, end-to-end availability is about 0.999^5 ≈ 99.5%: a failure rate five times higher than any single team sees on its own dashboard.
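The compounding is easy to verify, assuming independent failures:

```go
package main

import (
	"fmt"
	"math"
)

// journeyAvailability multiplies per-service availabilities, assuming
// independent failures (real dependencies are often correlated, which
// can make the true number better or worse).
func journeyAvailability(perService float64, n int) float64 {
	return math.Pow(perService, float64(n))
}

func main() {
	// Five services, each meeting a 99.9% SLO.
	fmt.Printf("%.4f\n", journeyAvailability(0.999, 5)) // 0.9950
}
```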

Key insight

Bad SLOs are worse than no SLOs. They create false confidence and misallocate engineering effort.

Challenge question

If your SLO is consistently exceeded (always 99.99% when target is 99.9%), is that good or bad?


Measuring SLIs in practice - implementation patterns

Scenario

You've defined your SLIs. Now you need to actually measure them.

Where do you measure? How often? What's the source of truth?

Measurement approaches

Client-side measurement (real user monitoring, RUM):

Instrument the browser or mobile app to report the outcome and timing of each user action back to a collection endpoint.

Pros:

  • Sees actual user experience (network latency, client-side rendering)
  • Accounts for CDN, geographic distribution

Cons:

  • Can't track requests that never arrive (network failures)
  • Sampling bias (only successful page loads report)
  • User privacy concerns

Server-side measurement:

Record every request's outcome and latency at the service itself, typically via middleware in the request path.

Pros:

  • Complete picture (all requests, including failures)
  • No sampling bias
  • Easier to implement at scale

Cons:

  • Doesn't see client-side latency (network, rendering)
  • May miss user-perceived failures (e.g., JS errors)

Synthetic monitoring (probes):

Run scheduled, scripted requests (e.g., one probe per region per minute) against key endpoints and record success and latency.

Pros:

  • Detects issues before users do
  • Consistent baseline (not affected by traffic patterns)
  • Works even when no user traffic

Cons:

  • May not match real user behavior
  • Can't measure user-specific features (auth, personalization)
  • Creates artificial load

Measurement Approaches Comparison

  Approach             | Sees real user experience  | Complete coverage          | Proactive detection | Best for
  ---------------------|----------------------------|----------------------------|---------------------|--------------------------------------
  Client-side (RUM)    | Yes                        | No (only successful loads) | No                  | User-facing latency, Core Web Vitals
  Server-side          | No (misses network/render) | Yes (all requests)         | No                  | API reliability, error rates
  Synthetic monitoring | No (artificial traffic)    | No (sampled)               | Yes                 | Uptime monitoring, multi-region checks

Recommended: Multi-layer measurement

Use all three layers together: synthetic probes for a proactive baseline, server-side metrics as the SLO source of truth, and client-side RUM to confirm the server-side numbers match real user experience.

SLI calculation methods

Method 1: Request-based (most common)

SLI = good events ÷ total valid events over the window: for example, successful requests ÷ all requests, typically excluding client errors (4xx) that aren't the service's fault.

Method 2: Windows-based (for batch/streaming)

SLI = good time windows ÷ total windows: for example, the fraction of one-minute windows in which a pipeline met its freshness target. Useful where individual "requests" don't exist.

Method 3: Burn rate (for error budgets)

Burn rate = observed error rate ÷ error rate allowed by the SLO. A burn rate of 1 exhausts the budget exactly at the end of the window; higher burns it sooner.
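The burn-rate formula reduces to one division; a sketch:

```go
package main

import "fmt"

// burnRate is the observed error rate divided by the error rate the
// SLO allows. 1.0 exhausts the budget exactly at the window's end.
func burnRate(observedErrorRate, slo float64) float64 {
	allowed := 1 - slo
	return observedErrorRate / allowed
}

func main() {
	// 1.44% errors against a 99.9% SLO (which allows 0.1%).
	br := burnRate(0.0144, 0.999)
	fmt.Printf("burn rate: %.1fx\n", br)            // 14.4x
	fmt.Printf("budget gone in %.1f days\n", 30/br) // ~2.1 days of a 30-day window
}
```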

Key insight

SLI measurement is not "set it and forget it." Continuously validate: does this metric reflect user experience? If not, adjust.

Challenge question

Your server-side SLI shows 99.9% success rate, but your client-side RUM shows 95% success rate. What's happening?


Final synthesis - Design a complete SLO framework

Synthesis challenge

You're the SRE lead for a ride-sharing platform.

Requirements:

  • 50 million rides/month
  • Mobile app (iOS + Android)
  • Backend services: ride matching, payments, notifications, maps
  • Users care about: finding a ride quickly, accurate pricing, reliable payment

Current state:

  • No SLOs defined
  • Monitoring exists but ad-hoc
  • Incidents happen but no clear severity criteria
  • Engineers debate "is this reliable enough?" constantly

Your tasks (pause and think)

  1. Define 3-5 key SLIs (what to measure?)
  2. Set SLO targets for each SLI (what's good enough?)
  3. Calculate error budgets (how much failure is acceptable?)
  4. Design error budget policy (what happens when budget runs low?)
  5. Implement measurement strategy (how to track?)
  6. Define escalation triggers (when to page someone?)

Write down your framework.

Progressive reveal (one possible solution)

1. Key SLIs:

  • Ride-match success rate: % of ride requests matched to a driver within 60 seconds
  • Ride-match latency: p95 time from request to driver assignment
  • Payment success rate: % of ride payments authorized and captured
  • Pricing accuracy: % of upfront quotes within tolerance of the final charge
  • App availability: % of app sessions that load the home screen successfully

2. SLO targets (anchored to measured baselines, not aspirations):

  • Ride-match success ≥ 99.5%, p95 match latency < 30s
  • Payment success ≥ 99.9%, pricing accuracy ≥ 99.95%, app availability ≥ 99.9%

3. Error budgets (30-day window, ~50M rides):

  • Payments at 99.9% → up to ~50,000 failed payments/month before the budget is gone
  • App availability at 99.9% → ~43 minutes of full unavailability/month

4. Error budget policy:

  • More than 50% remaining → normal releases; experiments allowed
  • 25-50% remaining → risky deploys require SRE review
  • Less than 25% remaining → feature freeze on the affected service; reliability work only
  • Exhausted → deploy freeze; post-mortem before resuming

5. Measurement strategy:

  • Server-side metrics as the SLO source of truth
  • Mobile RUM from both iOS and Android apps to validate user-perceived success
  • Synthetic ride-request probes per region, every minute

6. Escalation triggers:

  • Fast burn (budget exhausted in under 2 days at current rate) → page on-call
  • Slow burn (budget exhausted in under 10 days) → ticket, review next business day

Key insight

A complete SLO framework turns reliability from art to science: measurable, predictable, and blameless.

Final challenge question

Your SLOs are met (99.9% success rate), but your customer NPS (Net Promoter Score) is dropping. What's wrong with your SLO framework?


Appendix: Quick checklist (printable)

SLI definition checklist:

  • Measures something users care about (not just infrastructure)
  • Quantifiable (can calculate a percentage or latency)
  • Actionable (degradation leads to clear debugging)
  • Fresh enough (measured with acceptable latency)
  • Covers the right scope (end-to-end, not just component)

SLO target checklist:

  • Based on current performance (not aspirational)
  • Informed by user research (what do users tolerate?)
  • Tighter than SLA (buffer for early warning)
  • Achievable with current architecture
  • Different for different criticality features

Error budget checklist:

  • Calculated from SLO (formula applied correctly)
  • Tracked in real-time (dashboard + alerts)
  • Policy defined (what happens at 50%, 25%, 0%?)
  • Team agrees on policy (not imposed top-down)
  • Budget is actually spent (not hoarded)

Measurement checklist:

  • Multi-layer (server-side + client-side + synthetic)
  • Sufficient granularity (at least hourly, ideally per-request)
  • Handles edge cases (what about 500 errors? timeouts?)
  • Survives component failures (logging doesn't depend on service being up)
  • Privacy-compliant (no PII in metrics)

Operational checklist:

  • SLO dashboard accessible to all engineers
  • Burn rate alerts configured (fast, medium, slow windows)
  • Error budget policy documented and followed
  • Incident post-mortems reference SLO impact
  • Quarterly SLO review (adjust targets based on reality)

Red flags (reassess SLOs):

  • SLO consistently exceeded by >1% (too loose, or lucky)
  • SLO consistently missed (too tight, or systemic issue)
  • Error budget always exhausted (targets unrealistic)
  • Error budget never spent (targets too loose, or culture problem)
  • Users complain despite meeting SLOs (measuring wrong thing)

Key Takeaways

  1. SLIs (Service Level Indicators) are the metrics that define service health — request latency, error rate, and availability are the most common
  2. SLOs (Service Level Objectives) set targets for SLIs — e.g., 99.9% of requests complete in under 200ms
  3. Error budgets are the allowed amount of unreliability — a 99.9% SLO means you have 43.8 minutes of downtime budget per month
  4. When error budget is exhausted, freeze feature releases and focus on reliability — this creates a natural balance between velocity and stability
  5. SLAs are contractual commitments with financial penalties — SLOs are internal targets; SLAs should be looser than SLOs to provide a safety margin

Chapter complete!