Reliability And Resilience · Chapter 48 of 51

Akhil Sharma
20 min

SLIs, SLOs, and Error Budgets (The Mathematics of Reliability)

Audience: SRE teams, product managers, and engineering leaders defining and measuring service reliability.

This article assumes:

  • Perfect reliability (100% uptime) is impossible and economically irrational.
  • Users care about their experience, not your internal metrics.
  • Reliability work competes with feature development for engineering time.
  • What you measure shapes what you optimize (Goodhart's Law applies).

Challenge: Your "five 9s" promise is meaningless

Scenario

Your sales team promises customers "99.999% uptime" (five 9s).

Then reality hits:

  • Your monitoring shows 99.999% uptime ✓
  • But customers complain the service is "always down"
  • Your CEO asks: "Why are customers angry if we're meeting our SLA?"

What went wrong?

Interactive question (pause and think)

Which is the most likely culprit?

  1. Customers are wrong (service was actually up)
  2. Monitoring measures the wrong thing
  3. 99.999% doesn't mean what sales thinks it means
  4. All of the above

Take 10 seconds.

Progressive reveal (question -> think -> answer)

Answer: (2) and (3).

Classic failures:

  • Monitoring checks health endpoint (returns 200), but payment processing is broken
  • "Uptime" measured as "server responds" not "user can complete their task"
  • 99.999% allows 5 minutes downtime/year, but if it all happens during Black Friday...

Real-world analogy (airline on-time performance)

An airline claims "99% on-time departure."

But what if:

  • They measure "door closed on time" not "wheels up on time"
  • Flights delayed by 59 minutes count as "on time" (60 min = late)
  • They only measure the specific route you don't fly

The metric is technically accurate but completely misleading.

Key insight

SLIs (Service Level Indicators) must measure what users actually care about, not what's easy to measure.

Challenge question

If you could only measure ONE metric to represent reliability, what would it be?


Mental model - Error budgets as a reliability currency

Scenario

Your product team wants to ship a new feature. Your SRE team says "We can't deploy - we're out of error budget."

Product team: "What? We have a deadline!"

SRE team: "We agreed to 99.9% uptime. We've already had 99.85% this month. We're out of budget."

Interactive question (pause and think)

Who's right?

A. Product team (business needs matter more than arbitrary targets)
B. SRE team (reliability is non-negotiable)
C. Both (they need a better framework for trade-offs)

Progressive reveal

Answer: C.

This is precisely what error budgets solve: a shared framework for reliability vs velocity trade-offs.

Mental model

Think of error budgets as:

  • Financial budget: You have $X to spend; spend wisely
  • Risk budget: You can afford Y failures; don't waste them
  • Innovation currency: Reliability headroom enables experimentation

The goal: Make reliability a measurable trade-off, not a religious war.

Real-world parallel (construction project buffers)

A building project has a time buffer. If weather delays use up the buffer, the project stops adding risky features (complex facades, custom materials). They focus on finishing reliably.

If the buffer is intact, they can afford some experimentation.

Key insight

Error budgets formalize the trade-off: reliability is not "maximize uptime" but "maintain agreed reliability while moving fast."

Challenge question

If you have 99.9% SLO (43 minutes downtime/month), should you spend your error budget on:

  • Testing risky new features in production
  • Planned maintenance windows
  • Neither (save it for unexpected failures)

What SLI, SLO, SLA actually mean (and why people confuse them)

Scenario

Your team debates reliability targets. Three acronyms get thrown around interchangeably.

Let's clarify.

Definitions

SLI (Service Level Indicator):

  • What it is: A quantitative measure of service behavior
  • Examples: Request success rate, latency percentile, throughput
  • Key point: The metric itself, not the target

SLO (Service Level Objective):

  • What it is: A target value or range for an SLI
  • Examples: "99.9% of requests succeed", "p99 latency < 500ms"
  • Key point: Internal goal, sets expectations

SLA (Service Level Agreement):

  • What it is: A business contract with consequences for missing SLO
  • Examples: "99.9% uptime or credits", "p99 < 500ms or refund"
  • Key point: Legal/financial commitment

Visual: relationship between SLI, SLO, SLA

  SLI → the measurement        e.g., 99.6% of requests succeeded this month
  SLO → target for the SLI     e.g., "keep the SLI at or above 99.9%" (internal goal)
  SLA → contract on the SLO    e.g., "99.5% or service credits" (external promise)

  The SLO (99.9%) is deliberately stricter than the SLA (99.5%): the gap is your early-warning buffer.

Interactive question

Your SLI shows 99.6% uptime. Your SLO is 99.9%. Your SLA is 99.5%.

What happens?

A. You violated the SLA (customer gets refund)
B. You missed the SLO (internal alarm)
C. Nothing (still above SLA)
D. B and C

Progressive reveal

Answer: D.

  • Missed SLO → internal response (freeze risky changes, focus on reliability)
  • Still met SLA → no customer impact, no refunds
  • The buffer between SLO and SLA is intentional
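The three-way comparison above can be sketched as a tiny decision function (names and messages are illustrative):

```go
package main

import "fmt"

// status classifies a measured SLI against an internal SLO and a
// contractual SLA (the SLO is stricter than the SLA by design).
func status(sli, slo, sla float64) string {
	switch {
	case sli < sla:
		return "SLA violated: customer remedies owed"
	case sli < slo:
		return "SLO missed: internal response (freeze risky changes)"
	default:
		return "healthy: within SLO"
	}
}

func main() {
	// The scenario: SLI 99.6%, SLO 99.9%, SLA 99.5%.
	fmt.Println(status(0.996, 0.999, 0.995)) // SLO missed, SLA intact
}
```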

Common confusions

Mistake 1: SLO = SLA

  • Wrong: "Our SLO is our customer promise"
  • Right: "Our SLA is our customer promise; our SLO is stricter to give us early warning"

Mistake 2: More nines = better

  • Wrong: "99.999% is better than 99.9%"
  • Right: "99.999% costs exponentially more and may not improve user experience"

Mistake 3: Uptime is the only SLI that matters

  • Wrong: "We have 99.9% uptime so we're reliable"
  • Right: "Uptime doesn't measure latency, data correctness, or user-perceived availability"

Key insight

SLI is the thermometer. SLO is the target temperature. SLA is the contract saying "we'll keep it above freezing."

Challenge question

Should your SLO be tighter than your SLA? By how much? What's the cost of the buffer?


Choosing good SLIs - what to measure

Scenario

You're defining SLIs for an e-commerce checkout service.

Bad SLI: "Server CPU < 80%"

  • Users don't care about your CPU
  • CPU can be fine while checkout is broken

Good SLI: "Checkout success rate > 99.9%"

  • Directly maps to user experience

Interactive question (pause and think)

For a payment processing service, which SLI is most valuable?

A. Server uptime percentage
B. Request success rate (2xx HTTP responses)
C. Payment authorization success rate
D. Database query latency

Progressive reveal

Answer: C.

  • A: Servers can be "up" while payments fail
  • B: 200 OK doesn't mean payment succeeded (could be 200 {"error": "declined"})
  • C: Directly measures user intent
  • D: Internal metric, doesn't map to user experience

SLI selection framework

The "user-centric" test: Ask: "Would a user notice if this SLI degraded?"

✓ Good SLIs (user-visible):

  • Request success rate (from user's perspective)
  • Request latency (time to complete user action)
  • Data freshness (stale data = bad UX)
  • Throughput (can system handle user load)

✗ Bad SLIs (invisible to users):

  • CPU utilization (infrastructure metric)
  • Memory usage (internal resource)
  • Log error rate (unless it correlates with user impact)

The "actionable" test: Ask: "If this SLI degrades, can we take meaningful action?"

✓ Good: "Checkout success rate dropped to 95%" → investigate payment gateway
✗ Bad: "System health score dropped to 80%" → what does that even mean?

SLI categories

  • Availability: proportion of valid requests served successfully
  • Latency: proportion of requests served faster than a threshold
  • Freshness: proportion of data updated more recently than a threshold
  • Throughput: rate of successful work the system sustains
  • Correctness: proportion of responses containing the right answer

Example: Multi-SLI service

An e-commerce checkout service might track several SLIs per 30-day window:

  • Availability: 99.9% of checkout requests return a non-5xx response
  • Latency: 99% of checkout requests complete in under 500ms
  • Correctness: 99.99% of completed orders charge the quoted price

Production insight: Google's "The Four Golden Signals"

Google SRE recommends focusing on:

  1. Latency: How long requests take
  2. Traffic: How much demand
  3. Errors: Rate of failed requests
  4. Saturation: How "full" the service is

Start simple. Add complexity only when needed.

Key insight

The best SLI is one that, when it degrades, a user has a bad experience. If users don't care, it's not an SLI - it's an internal metric.

Challenge question

For a video streaming service, should you have separate SLIs for "start time" and "buffering rate" or combine them into one SLI?


Setting SLO targets - the Goldilocks problem

Scenario

You need to set an SLO target for your API.

Too high (99.999%): Engineers burn out chasing impossible reliability. Costs skyrocket.
Too low (95%): Customers churn. Product team can't sell.

How do you find the right target?

Think about it

What's the cost of each additional "9" of reliability?

Interactive question (pause and think)

Your current reliability is 99.5%. Customers are complaining. What's your next SLO target?

A. 99.99% (skip ahead, be the best)
B. 99.9% (incremental improvement)
C. 99.5% (match current reality, then improve)

Progressive reveal

Answer: Usually B, sometimes C.

Why not A? Going from 99.5% → 99.99% is ~10x harder than 99.5% → 99.9%. Start with achievable targets.

The cost curve of reliability

Each additional nine costs roughly an order of magnitude more effort:

  99%     → baseline engineering effort
  99.9%   → ~10x (redundancy, automated failover, on-call)
  99.99%  → ~100x (multi-region, automated remediation)
  99.999% → ~1000x (and most users can no longer tell the difference)

SLO target selection framework

Step 1: Measure current performance (collect at least 4 weeks of real data)

Step 2: Talk to users (what level of failure do they actually notice and tolerate?)

Step 3: Analyze impact of outages (revenue, churn, support load per incident)

Step 4: Set targets with buffer (achievable today, tighter than your SLA)

Step 5: Calculate error budget (1 − SLO, converted to minutes or failed requests)
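Step 5 is plain arithmetic. A sketch, assuming a hypothetical 100M requests/month:

```go
package main

import (
	"fmt"
	"math"
)

// errorBudget returns how many requests are allowed to fail in a
// window, given a success-rate SLO and expected traffic.
func errorBudget(slo float64, totalRequests int64) int64 {
	return int64(math.Round((1 - slo) * float64(totalRequests)))
}

func main() {
	// 99.9% SLO on 100M monthly requests: 0.1% may fail.
	fmt.Println(errorBudget(0.999, 100_000_000)) // 100000
}
```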

Common SLO target mistakes

Mistake 1: Matching competitor SLOs blindly

  • Wrong: "AWS promises 99.99%, we should too"
  • Right: "Our users need 99.9%, and we can deliver it reliably"

Mistake 2: One-size-fits-all SLOs

  • Wrong: "All our services have 99.9% uptime SLO"
  • Right: "Critical: 99.9%, important: 99%, nice-to-have: 95%"

Mistake 3: Aspirational SLOs

  • Wrong: "We're at 99%, let's set SLO to 99.99% to motivate the team"
  • Right: "We're at 99%, let's set SLO to 99.5% and actually achieve it, then increase"

Mistake 4: Forgetting dependencies

  • Wrong: "We'll be 99.99% reliable" (but your database vendor only promises 99.9%)
  • Right: "Our SLO must account for dependency SLOs" (can't exceed weakest link)

Key insight

Your SLO should be: (1) Achievable with current architecture, (2) Meaningful to users, (3) Tighter than your SLA. If it's not all three, revisit.

Challenge question

Your SLO is 99.9%, but you're consistently achieving 99.95%. Should you tighten your SLO or keep the buffer?


Error budgets - the engine of risk-taking

Scenario

Month starts:

  • SLO: 99.9% success rate
  • Error budget: 43 minutes of downtime allowed

By mid-month:

  • 30 minutes of downtime already burned (bad deployment)
  • 13 minutes remaining

Product team wants to ship a major feature.

Interactive question (pause and think)

What should you do?

A. Ship anyway (business pressure)
B. Freeze deploys until next month (protect SLO)
C. Ship only low-risk changes (burn budget carefully)
D. Revise SLO downward (adjust expectations)

Progressive reveal

Answer: C, with team agreement.

Error budgets are meant to be spent, not hoarded. But spend wisely.

Error budget mechanics

Definition: error budget = 1 − SLO. A 99.9% SLO over 30 days allows 0.1% of requests to fail, roughly 43 minutes of full downtime.

Consumption tracking: budget consumed = failures so far ÷ failures allowed for the window. Track it continuously, not just at month end.

Policy framework (example thresholds, agreed in advance):

  • More than 50% of budget remaining → normal release velocity; experiments allowed
  • 25-50% remaining → risky deploys need extra review; prioritize reliability fixes
  • Less than 25% remaining → feature freeze on the service; reliability work only
  • Exhausted → deploy freeze until the budget recovers

Example: Error budget in practice

Take the scenario above: a 43-minute monthly budget with 30 minutes already burned leaves 13 minutes of budget (about 70% consumed) at mid-month, which is faster than a sustainable pace.

Burn rate alerting

Burn rate is how fast you consume budget relative to the window. A burn rate of 1 exhausts the budget exactly at the window's end; a rate of 14.4 on a 30-day window exhausts it in about 2 days.

  • Fast-burn alert (e.g., rate above 14 sustained for an hour) → page immediately
  • Slow-burn alert (e.g., rate above 3 sustained for six hours) → ticket for working hours

Production insight: How Netflix uses error budgets

Netflix:

  • Each service has error budget
  • Teams can spend budget on experiments (chaos, canary, A/B tests)
  • When budget low, automated policies kick in (deploy freeze)
  • Incentivizes reliability: more reliable = more freedom to innovate

Key insight

Error budgets convert reliability from a vague goal into a concrete currency. You can spend it on innovation or hoard it. Choose wisely.

Challenge question

Your error budget is 43 min/month. Should you spend it all on: (1) one big risky feature launch, or (2) many small experiments? What's the trade-off?


Common SLO anti-patterns

Scenario

You've set SLOs. But things feel off. Users are still unhappy. Engineers are frustrated.

SLO anti-patterns catalog

Anti-pattern 1: Vanity SLOs

  • Dashboards show impressive numbers (server uptime: 99.99%!) while user-facing journeys quietly fail

Anti-pattern 2: Invisible SLOs

  • SLOs live in a document nobody reads: no dashboard, no alerts, no policy attached

Anti-pattern 3: Inside-out SLOs

  • Targets built on internal metrics (CPU, queue depth) instead of user-visible outcomes

Anti-pattern 4: One-size-fits-all

  • Every service gets the same 99.9% regardless of criticality, cost, or user impact

Anti-pattern 5: No SLO enforcement

  • The SLO is missed month after month and nothing changes: no freeze, no reprioritization

Anti-pattern 6: Missing user-journey SLOs

  • Every component meets its SLO, yet the end-to-end journey (login → search → checkout) still fails

Real-world failure: Compound SLO violations

A user journey spans five services, each meeting its 99.9% SLO. Assuming independent failures, end-to-end availability is about 0.999^5 ≈ 99.5%: a failure rate five times higher than any single team sees on its own dashboard.
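The compounding is easy to verify, assuming independent failures:

```go
package main

import (
	"fmt"
	"math"
)

// journeyAvailability multiplies per-service availabilities, assuming
// independent failures (real dependencies are often correlated, which
// can make the true number better or worse).
func journeyAvailability(perService float64, n int) float64 {
	return math.Pow(perService, float64(n))
}

func main() {
	// Five services, each meeting a 99.9% SLO.
	fmt.Printf("%.4f\n", journeyAvailability(0.999, 5)) // 0.9950
}
```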

Key insight

Bad SLOs are worse than no SLOs. They create false confidence and misallocate engineering effort.

Challenge question

If your SLO is consistently exceeded (always 99.99% when target is 99.9%), is that good or bad?


Measuring SLIs in practice - implementation patterns

Scenario

You've defined your SLIs. Now you need to actually measure them.

Where do you measure? How often? What's the source of truth?

Measurement approaches

Client-side measurement (real user monitoring, RUM):

Instrument the browser or mobile app to report the outcome and timing of each user action back to a collection endpoint.

Pros:

  • Sees actual user experience (network latency, client-side rendering)
  • Accounts for CDN, geographic distribution

Cons:

  • Can't track requests that never arrive (network failures)
  • Sampling bias (only successful page loads report)
  • User privacy concerns

Server-side measurement:

Record every request's outcome and latency at the service itself, typically via middleware in the request path.

Pros:

  • Complete picture (all requests, including failures)
  • No sampling bias
  • Easier to implement at scale

Cons:

  • Doesn't see client-side latency (network, rendering)
  • May miss user-perceived failures (e.g., JS errors)

Synthetic monitoring (probes):

Run scheduled, scripted requests (e.g., one probe per region per minute) against key endpoints and record success and latency.

Pros:

  • Detects issues before users do
  • Consistent baseline (not affected by traffic patterns)
  • Works even when no user traffic

Cons:

  • May not match real user behavior
  • Can't measure user-specific features (auth, personalization)
  • Creates artificial load

Measurement Approaches Comparison

  Approach             | Sees real user experience  | Complete coverage          | Proactive detection | Best for
  ---------------------|----------------------------|----------------------------|---------------------|--------------------------------------
  Client-side (RUM)    | Yes                        | No (only successful loads) | No                  | User-facing latency, Core Web Vitals
  Server-side          | No (misses network/render) | Yes (all requests)         | No                  | API reliability, error rates
  Synthetic monitoring | No (artificial traffic)    | No (sampled)               | Yes                 | Uptime monitoring, multi-region checks

Recommended: Multi-layer measurement

Use all three layers together: synthetic probes for a proactive baseline, server-side metrics as the SLO source of truth, and client-side RUM to confirm the server-side numbers match real user experience.

SLI calculation methods

Method 1: Request-based (most common)

SLI = good events ÷ total valid events over the window: for example, successful requests ÷ all requests, typically excluding client errors (4xx) that aren't the service's fault.

Method 2: Windows-based (for batch/streaming)

SLI = good time windows ÷ total windows: for example, the fraction of one-minute windows in which a pipeline met its freshness target. Useful where individual "requests" don't exist.

Method 3: Burn rate (for error budgets)

Burn rate = observed error rate ÷ error rate allowed by the SLO. A burn rate of 1 exhausts the budget exactly at the end of the window; higher burns it sooner.
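The burn-rate formula reduces to one division; a sketch:

```go
package main

import "fmt"

// burnRate is the observed error rate divided by the error rate the
// SLO allows. 1.0 exhausts the budget exactly at the window's end.
func burnRate(observedErrorRate, slo float64) float64 {
	allowed := 1 - slo
	return observedErrorRate / allowed
}

func main() {
	// 1.44% errors against a 99.9% SLO (which allows 0.1%).
	br := burnRate(0.0144, 0.999)
	fmt.Printf("burn rate: %.1fx\n", br)            // 14.4x
	fmt.Printf("budget gone in %.1f days\n", 30/br) // ~2.1 days of a 30-day window
}
```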

Key insight

SLI measurement is not "set it and forget it." Continuously validate: does this metric reflect user experience? If not, adjust.

Challenge question

Your server-side SLI shows 99.9% success rate, but your client-side RUM shows 95% success rate. What's happening?


Final synthesis - Design a complete SLO framework

Synthesis challenge

You're the SRE lead for a ride-sharing platform.

Requirements:

  • 50 million rides/month
  • Mobile app (iOS + Android)
  • Backend services: ride matching, payments, notifications, maps
  • Users care about: finding a ride quickly, accurate pricing, reliable payment

Current state:

  • No SLOs defined
  • Monitoring exists but ad-hoc
  • Incidents happen but no clear severity criteria
  • Engineers debate "is this reliable enough?" constantly

Your tasks (pause and think)

  1. Define 3-5 key SLIs (what to measure?)
  2. Set SLO targets for each SLI (what's good enough?)
  3. Calculate error budgets (how much failure is acceptable?)
  4. Design error budget policy (what happens when budget runs low?)
  5. Implement measurement strategy (how to track?)
  6. Define escalation triggers (when to page someone?)

Write down your framework.

Progressive reveal (one possible solution)

1. Key SLIs:

  • Ride-match success rate: % of ride requests matched to a driver within 60 seconds
  • Ride-match latency: p95 time from request to driver assignment
  • Payment success rate: % of ride payments authorized and captured
  • Pricing accuracy: % of upfront quotes within tolerance of the final charge
  • App availability: % of app sessions that load the home screen successfully

2. SLO targets (anchored to measured baselines, not aspirations):

  • Ride-match success ≥ 99.5%, p95 match latency < 30s
  • Payment success ≥ 99.9%, pricing accuracy ≥ 99.95%, app availability ≥ 99.9%

3. Error budgets (30-day window, ~50M rides):

  • Payments at 99.9% → up to ~50,000 failed payments/month before the budget is gone
  • App availability at 99.9% → ~43 minutes of full unavailability/month

4. Error budget policy:

  • More than 50% remaining → normal releases; experiments allowed
  • 25-50% remaining → risky deploys require SRE review
  • Less than 25% remaining → feature freeze on the affected service; reliability work only
  • Exhausted → deploy freeze; post-mortem before resuming

5. Measurement strategy:

  • Server-side metrics as the SLO source of truth
  • Mobile RUM from both iOS and Android apps to validate user-perceived success
  • Synthetic ride-request probes per region, every minute

6. Escalation triggers:

  • Fast burn (budget exhausted in under 2 days at current rate) → page on-call
  • Slow burn (budget exhausted in under 10 days) → ticket, review next business day

Key insight

A complete SLO framework turns reliability from art to science: measurable, predictable, and blameless.

Final challenge question

Your SLOs are met (99.9% success rate), but your customer NPS (Net Promoter Score) is dropping. What's wrong with your SLO framework?


Appendix: Quick checklist (printable)

SLI definition checklist:

  • Measures something users care about (not just infrastructure)
  • Quantifiable (can calculate a percentage or latency)
  • Actionable (degradation leads to clear debugging)
  • Fresh enough (measured with acceptable latency)
  • Covers the right scope (end-to-end, not just component)

SLO target checklist:

  • Based on current performance (not aspirational)
  • Informed by user research (what do users tolerate?)
  • Tighter than SLA (buffer for early warning)
  • Achievable with current architecture
  • Different for different criticality features

Error budget checklist:

  • Calculated from SLO (formula applied correctly)
  • Tracked in real-time (dashboard + alerts)
  • Policy defined (what happens at 50%, 25%, 0%?)
  • Team agrees on policy (not imposed top-down)
  • Budget is actually spent (not hoarded)

Measurement checklist:

  • Multi-layer (server-side + client-side + synthetic)
  • Sufficient granularity (at least hourly, ideally per-request)
  • Handles edge cases (what about 500 errors? timeouts?)
  • Survives component failures (logging doesn't depend on service being up)
  • Privacy-compliant (no PII in metrics)

Operational checklist:

  • SLO dashboard accessible to all engineers
  • Burn rate alerts configured (fast, medium, slow windows)
  • Error budget policy documented and followed
  • Incident post-mortems reference SLO impact
  • Quarterly SLO review (adjust targets based on reality)

Red flags (reassess SLOs):

  • SLO consistently exceeded by >1% (too loose, or lucky)
  • SLO consistently missed (too tight, or systemic issue)
  • Error budget always exhausted (targets unrealistic)
  • Error budget never spent (targets too loose, or culture problem)
  • Users complain despite meeting SLOs (measuring wrong thing)

Key Takeaways

  1. SLIs (Service Level Indicators) are the metrics that define service health — request latency, error rate, and availability are the most common
  2. SLOs (Service Level Objectives) set targets for SLIs — e.g., 99.9% of requests complete in under 200ms
  3. Error budgets are the allowed amount of unreliability — a 99.9% SLO means you have 43.8 minutes of downtime budget per month
  4. When error budget is exhausted, freeze feature releases and focus on reliability — this creates a natural balance between velocity and stability
  5. SLAs are contractual commitments with financial penalties — SLOs are internal targets; SLAs should be looser than SLOs to provide a safety margin

Chapter complete!