SLOs, SLIs, and SLAs Explained: Measuring and Guaranteeing Reliability
The difference between SLOs, SLIs, and SLAs — how to define reliability targets, measure them with error budgets, and use them in system design interviews.
SLOs, SLIs, and SLAs
SLIs (Service Level Indicators) are metrics that measure service reliability. SLOs (Service Level Objectives) are targets for those metrics. SLAs (Service Level Agreements) are contractual commitments with consequences for missing targets.
What It Really Means
Every system fails. The question is not "will it fail?" but "how much failure is acceptable?" SLOs, SLIs, and SLAs provide a framework for answering this question with data instead of gut feelings.
SLI (Service Level Indicator): A quantitative measure of a specific aspect of service reliability. "What are we measuring?" Examples: request latency, error rate, availability, throughput.
SLO (Service Level Objective): A target value or range for an SLI. "What is our goal?" Examples: p99 latency < 200ms, availability > 99.9%, error rate < 0.1%.
SLA (Service Level Agreement): A contract with customers that specifies SLOs and the consequences (usually financial) of missing them. "What did we promise?" Examples: "99.9% uptime or we credit your account."
The relationship is: SLIs are measured, SLOs are targeted, and SLAs are promised. Your SLOs should be stricter than your SLAs (internal target of 99.95% availability when you promise 99.9% to customers), giving you a buffer.
How It Works in Practice
Defining Good SLIs
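One common way to define an SLI is as the ratio of good events to all valid events, which puts every indicator on the same 0 to 1 scale. The sketch below illustrates that idea for availability and latency. The `RequestRecord` type, the 300 ms threshold, and the choice to count only 5xx responses as failures are assumptions made for the example, not definitions from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    """Hypothetical record of one served request."""
    status_code: int
    latency_ms: float


def availability_sli(requests):
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not requests:
        return None  # no traffic: the SLI is undefined rather than 100%
    good = sum(1 for r in requests if r.status_code < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms=300.0):
    """Fraction of requests served faster than threshold_ms (assumed value)."""
    if not requests:
        return None
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


sample = [
    RequestRecord(200, 120.0),
    RequestRecord(200, 450.0),  # successful but slow
    RequestRecord(503, 80.0),   # fast but failed
    RequestRecord(404, 95.0),   # client error: not counted against the service
]
print(availability_sli(sample))  # 3 of 4 requests are non-5xx
print(latency_sli(sample))       # 3 of 4 requests are under 300 ms
```

Note that the two indicators disagree about which request was bad, which is why a single SLI rarely describes user experience on its own.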
Availability in Nines
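Each availability target translates directly into an amount of allowed downtime. The following sketch does that arithmetic for a few common targets, assuming an average month of 365.25 / 12 days; that convention reproduces the 4.38 minutes per month quoted for 99.99% in the interview section of this article. The function and constant names are illustrative.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60        # 525,960 minutes
MINUTES_PER_MONTH = MINUTES_PER_YEAR / 12  # 43,830 minutes (average month)


def allowed_downtime_minutes(availability_pct, period_minutes):
    """Downtime permitted in a period at the given availability percentage."""
    return (1 - availability_pct / 100) * period_minutes


for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    per_month = allowed_downtime_minutes(target, MINUTES_PER_MONTH)
    per_year = allowed_downtime_minutes(target, MINUTES_PER_YEAR)
    print(f"{target:>7}%  {per_month:9.2f} min/month  {per_year:9.2f} min/year")
```

Each extra nine divides the allowance by ten, so 99.9% leaves roughly 43.8 minutes per month and 99.99% roughly 4.4.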
Error Budgets
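An error budget is the complement of the SLO: 100% minus the target, which is how much unreliability is acceptable over the measurement window. A 99.9% availability SLO, for example, leaves an error budget of 0.1% of requests (or of time) in that window.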
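As a worked example, the budget can be expressed as a number of failed requests and tracked as it is spent. This is a minimal sketch for a request-based SLO; the function names and the traffic figures are made up for illustration.

```python
def error_budget(slo_target, total_requests):
    """Number of failed requests the SLO permits in the window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    return (1 - slo_target) * total_requests


def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    return 1 - failed_requests / budget


# Illustrative numbers: 99.9% SLO, 10 million requests in the window.
print(error_budget(0.999, 10_000_000))             # roughly 10,000 allowed failures
print(budget_remaining(0.999, 10_000_000, 2_500))  # roughly 0.75 of the budget left
```

A negative result from `budget_remaining` means the SLO has already been missed for the window.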
Implementation
Calculating SLI from Prometheus metrics:
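The original example is not available here, so the following is only a sketch of the underlying arithmetic. It assumes the service exports two monotonically increasing counters, one for all requests and one for failed requests, and that you have a sample of each at the start and end of the window. The reset handling is a simplification written for this example and is not a reproduction of how Prometheus itself processes counters.

```python
def counter_increase(start, end):
    """Increase of a monotonic counter between two samples.

    If end < start, assume the counter was reset to zero in between
    (for example by a process restart) and count from zero. This is a
    simplifying assumption for the sketch.
    """
    return end - start if end >= start else end


def availability_from_counters(total_start, total_end, errors_start, errors_end):
    """Availability SLI over the window: 1 - (failed requests / all requests)."""
    total = counter_increase(total_start, total_end)
    errors = counter_increase(errors_start, errors_end)
    if total == 0:
        return None  # no traffic in the window: SLI is undefined
    return 1 - errors / total


# Hypothetical samples: 250,000 requests in the window, 250 of them failed.
sli = availability_from_counters(
    total_start=1_000_000, total_end=1_250_000,
    errors_start=400, errors_end=650,
)
print(sli)  # roughly 0.999
```

Working from counter increases rather than raw counter values is what makes the SLI specific to a time window.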
SLO monitoring dashboard (conceptual):
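The original dashboard is not reproduced here. As a stand-in, this sketch renders the kind of rows such a dashboard might show: target, current SLI, share of error budget consumed, and a state. The `SloStatus` type, the service names, the sample values, and the 75% "at risk" threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloStatus:
    name: str
    target: float   # SLO target as a fraction below 1, e.g. 0.999
    current: float  # measured SLI over the window

    def budget_consumed(self):
        """Observed failure rate divided by the failure rate the SLO allows."""
        return (1 - self.current) / (1 - self.target)

    def state(self):
        used = self.budget_consumed()
        if used >= 1.0:
            return "BREACHED"
        if used >= 0.75:  # assumed warning threshold
            return "AT RISK"
        return "OK"


def render(rows):
    """Format the rows as a fixed-width text table."""
    lines = [f"{'SLO':<22}{'target':>9}{'current':>9}{'budget used':>13}  state"]
    for r in rows:
        lines.append(
            f"{r.name:<22}{r.target:>9.3%}{r.current:>9.3%}"
            f"{r.budget_consumed():>13.0%}  {r.state()}"
        )
    return "\n".join(lines)


rows = [
    SloStatus("checkout availability", 0.999, 0.9995),
    SloStatus("search availability", 0.995, 0.996),
    SloStatus("upload availability", 0.999, 0.9985),
]
print(render(rows))
```

Showing budget consumed next to the raw SLI makes it easier to see how close a service is to its limit than the percentage alone.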
Trade-offs
Setting SLO targets:
- Too strict (99.99%): Slows down feature development, expensive to maintain, may be unnecessary for your users
- Too lenient (99%): Users experience too many errors, competitive disadvantage
- Just right: Based on user expectations and business requirements
SLO vs SLA:
- SLOs should be stricter than SLAs (internal 99.95% when SLA promises 99.9%)
- The gap is your safety margin for catching issues before they affect SLA compliance
- SLAs have financial penalties; SLOs have engineering consequences (freeze deployments)
Multiple SLOs per service:
- Different SLOs for different request types (reads vs writes)
- Different SLOs for different user tiers (free vs paid)
- Different SLOs for different regions
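Per-segment SLOs like those listed above can be kept in a simple lookup keyed by the dimensions you split on. The sketch below uses request type and user tier; every target value and the fallback default are invented for illustration.

```python
# Hypothetical targets keyed by (request_type, tier).
SLO_TARGETS = {
    ("read", "paid"): 0.9995,
    ("read", "free"): 0.999,
    ("write", "paid"): 0.999,
    ("write", "free"): 0.995,
}
DEFAULT_TARGET = 0.99  # assumed fallback for segments without an explicit SLO


def slo_target(request_type, tier):
    """Look up the availability target for a traffic segment."""
    return SLO_TARGETS.get((request_type, tier), DEFAULT_TARGET)


def meets_slo(request_type, tier, good, total):
    """True if the segment's measured success ratio is at or above its target."""
    if total == 0:
        return True  # no traffic, nothing to violate
    return good / total >= slo_target(request_type, tier)


print(meets_slo("read", "paid", good=9_996, total=10_000))   # 0.9996 vs 0.9995
print(meets_slo("write", "free", good=9_940, total=10_000))  # 0.9940 vs 0.9950
```

Evaluating each segment separately stops a large, healthy segment from masking failures in a smaller one.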
Common Misconceptions
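- "100% availability is the goal" — 100% availability is neither achievable nor desirable. Each additional nine cuts the allowed downtime by a factor of ten, and the cost of achieving it (redundancy, failover, on-call load) tends to rise steeply. Beyond the point where users can notice the difference, that spend buys nothing.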
- "SLOs are just for ops teams" — SLOs should drive product decisions. If error budget is exhausted, product teams should delay features and invest in reliability.
- "Availability is the only SLI that matters" — A service that returns 200 OK in 30 seconds is technically "available" but unusable. Latency, error rate, and throughput SLIs are equally important.
- "SLAs and SLOs are the same thing" — SLOs are internal engineering targets. SLAs are external contractual obligations. You can (and should) have SLOs without SLAs.
- "You should alert on every SLO violation" — Brief dips below SLO are normal. Alert on error budget burn rate ("at this rate, we will exhaust the budget in 2 hours") rather than instantaneous violations.
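The burn-rate idea in the last point can be sketched as follows. Burn rate is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends exactly the whole budget over the window. The function names, the 30-day window, and the two-window paging rule are assumptions for this example; the 14.4 threshold is a commonly cited value that corresponds to spending 2% of a 30-day budget in one hour.

```python
def burn_rate(error_rate, slo_target):
    """How many times faster than 'sustainable' the budget is being spent."""
    return error_rate / (1 - slo_target)


def hours_to_exhaustion(error_rate, slo_target, window_hours, budget_left=1.0):
    """Hours until the remaining budget is gone if the current rate continues."""
    rate = burn_rate(error_rate, slo_target)
    if rate <= 0:
        return float("inf")  # no errors: the budget is never exhausted
    return budget_left * window_hours / rate


def should_page(short_window_error_rate, long_window_error_rate,
                slo_target, threshold=14.4):
    """Page only if both a short and a long window are burning fast.

    Requiring both windows avoids paging on a brief spike that has
    already ended. The threshold is an assumed, tunable value.
    """
    return (burn_rate(short_window_error_rate, slo_target) >= threshold
            and burn_rate(long_window_error_rate, slo_target) >= threshold)


WINDOW_HOURS = 30 * 24  # assumed 30-day SLO window
print(hours_to_exhaustion(0.02, 0.999, WINDOW_HOURS))  # roughly 36 hours
print(should_page(0.02, 0.018, 0.999))   # both windows hot
print(should_page(0.02, 0.0005, 0.999))  # spike only in the short window
```

With a 99.9% SLO, a sustained 2% error rate is a burn rate of about 20, which would use up an untouched 30-day budget in about a day and a half.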
How This Appears in Interviews
- "How do you define reliability for your system?" — Define SLIs based on user-facing metrics, set SLOs based on user expectations, implement error budget tracking.
- "What availability target would you set for a payment system?" — 99.99% (4.38 minutes downtime/month). Justify by the financial impact of downtime.
- "How do you balance feature velocity with reliability?" — Error budgets. When budget is healthy, ship fast. When budget is low, focus on reliability.
- "Design a monitoring system" — SLI-based dashboards, error budget alerting, burn-rate alerts, automated rollback on budget consumption.
Related Concepts
- Tail Latency — latency percentiles are core SLIs
- Chaos Engineering — validates that SLOs hold under failure
- Blue-Green vs Canary Deployments — deployment strategies that protect SLOs
- Back-of-Envelope Estimation — estimate whether SLOs are achievable
- Time-Series Data Modeling — storing SLI metrics over time
- System Design Interview Guide