The difference between SLOs, SLIs, and SLAs — how to define reliability targets, measure them with error budgets, and use them in system design interviews.

SLOs, SLIs, and SLAs

SLIs (Service Level Indicators) are metrics that measure service reliability. SLOs (Service Level Objectives) are targets for those metrics. SLAs (Service Level Agreements) are contractual commitments with consequences for missing targets.

What It Really Means

Every system fails. The question is not "will it fail?" but "how much failure is acceptable?" SLOs, SLIs, and SLAs provide a framework for answering this question with data instead of gut feelings.

SLI (Service Level Indicator): A quantitative measure of a specific aspect of service reliability. "What are we measuring?" Examples: request latency, error rate, availability, throughput.

SLO (Service Level Objective): A target value or range for an SLI. "What is our goal?" Examples: p99 latency < 200ms, availability > 99.9%, error rate < 0.1%.

SLA (Service Level Agreement): A contract with customers that specifies service levels and the consequences (usually financial) of missing them. "What did we promise?" Example: "99.9% uptime or we credit your account."

The relationship is: SLIs are measured, SLOs are targeted, and SLAs are promised. Your SLOs should be stricter than your SLAs (internal target of 99.95% availability when you promise 99.9% to customers), giving you a buffer.

How It Works in Practice

Defining Good SLIs

Availability in Nines
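
As a rough sketch of what each additional "nine" means in practice, the allowed downtime for a target can be computed directly from the availability percentage. The function and constant names below are illustrative, and the window sizes are assumptions: a 365-day year and an average month of one twelfth of that, which is the convention that yields about 4.38 minutes per month at 99.99%.

```python
# Allowed downtime for common availability targets ("nines").
# Assumption: 365-day year, average month = year / 12.
MINUTES_PER_YEAR = 365 * 24 * 60            # 525,600
MINUTES_PER_MONTH = MINUTES_PER_YEAR / 12   # 43,800


def allowed_downtime_minutes(availability_pct: float, window_minutes: float) -> float:
    """Return the minutes of downtime an availability target permits in a window."""
    return window_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.95, 99.99, 99.999):
        per_month = allowed_downtime_minutes(target, MINUTES_PER_MONTH)
        per_year = allowed_downtime_minutes(target, MINUTES_PER_YEAR)
        print(f"{target:>7}%  {per_month:8.2f} min/month  {per_year / 60:7.2f} h/year")
```

Each extra nine divides the allowed downtime by ten, which is why the step from 99.9% to 99.99% is far harder than the numbers suggest at first glance.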

Error Budgets

An error budget is the complement of the SLO, the amount of unreliability you can tolerate over the measurement window: a 99.9% availability SLO leaves a 0.1% error budget.
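
A minimal sketch of request-based error budget accounting, assuming the SLI is a ratio of good requests to total requests over the SLO window. The `ErrorBudget` class and its field names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_target: float      # SLO as a fraction, e.g. 0.999 for 99.9%
    total_requests: int    # requests observed in the SLO window so far
    failed_requests: int   # requests that counted against the SLO

    @property
    def allowed_failures(self) -> float:
        """Failures the SLO tolerates for the traffic seen so far."""
        return (1 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent; negative once the SLO is blown."""
        if self.allowed_failures == 0:
            return 0.0
        return 1 - self.failed_requests / self.allowed_failures


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"allowed failures: {budget.allowed_failures:.0f}")
    print(f"budget remaining: {budget.remaining_fraction:.0%}")
```

With a 99.9% target and one million requests, roughly 1,000 failures are tolerable, so 250 failures leaves about three quarters of the budget.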

Implementation

Calculating SLI from Prometheus metrics:
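
The original example is not preserved here, so the following is only a sketch of the underlying arithmetic, not the author's configuration. It assumes a request counter named `http_requests_total` with a `status` label (a common convention, but an assumption about your instrumentation) and computes an availability SLI from counter values sampled at the start and end of a window. The reset handling is deliberately simplified.

```python
# A PromQL expression for the same ratio might look like this
# (metric and label names are assumptions):
#   sum(rate(http_requests_total{status!~"5.."}[5m]))
#     / sum(rate(http_requests_total[5m]))


def counter_increase(start: float, end: float) -> float:
    """Increase of a monotonic counter between two samples.

    Simplification: if the counter went down, assume a single reset to zero
    and count only what accumulated after the reset.
    """
    return end - start if end >= start else end


def availability_sli(total_start: float, total_end: float,
                     errors_start: float, errors_end: float) -> float:
    """Fraction of requests in the window that did not fail."""
    total = counter_increase(total_start, total_end)
    errors = counter_increase(errors_start, errors_end)
    if total == 0:
        return 1.0  # no traffic; treating this as "good" is a policy choice
    return (total - errors) / total


if __name__ == "__main__":
    sli = availability_sli(total_start=10_000, total_end=110_000,
                           errors_start=40, errors_end=90)
    print(f"availability SLI: {sli:.4%}")
```

Expressing the SLI as good events over total events keeps it comparable across services and makes the error budget a simple count.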

SLO monitoring dashboard (conceptual):
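
The original listing is not preserved, so this is a conceptual sketch under stated assumptions rather than a real dashboard. Each row carries an SLO target plus good and total event counts, and the summary reports the SLI, the share of error budget consumed, and a coarse status. `SloRow`, `summarize`, the 75% warning threshold, and the sample rows are all hypothetical.

```python
from typing import NamedTuple


class SloRow(NamedTuple):
    name: str
    target: float  # SLO target as a fraction, e.g. 0.999
    good: int      # good events in the window
    total: int     # all events in the window


def summarize(row: SloRow, warn_at: float = 0.75) -> dict:
    """Compute the SLI, budget consumption, and a status label for one SLO."""
    bad = row.total - row.good
    sli = row.good / row.total if row.total else 1.0
    budget = (1 - row.target) * row.total  # bad events the SLO tolerates
    if budget > 0:
        consumed = bad / budget
    else:
        consumed = 0.0 if bad == 0 else float("inf")
    if consumed >= 1.0:
        status = "BREACHED"
    elif consumed >= warn_at:
        status = "AT RISK"
    else:
        status = "OK"
    return {"name": row.name, "sli": sli, "budget_consumed": consumed, "status": status}


if __name__ == "__main__":
    rows = [
        SloRow("checkout availability", 0.999, 999_400, 1_000_000),
        SloRow("search latency < 200ms", 0.99, 495_500, 500_000),
    ]
    for row in rows:
        s = summarize(row)
        print(f"{s['name']:<24} SLI={s['sli']:.4%} "
              f"budget used={s['budget_consumed']:.0%} {s['status']}")
```

Showing budget consumed alongside the raw SLI is the point of the design: the SLI says where you are, the budget says how much room is left.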

Trade-offs

Setting SLO targets:

Too strict (99.99%): Slows down feature development, expensive to maintain, may be unnecessary for your users
Too lenient (99%): Users experience too many errors, competitive disadvantage
Just right: Based on user expectations and business requirements

SLO vs SLA:

SLOs should be stricter than SLAs (internal 99.95% when SLA promises 99.9%)
The gap is your safety margin for catching issues before they affect SLA compliance
SLAs have financial penalties; SLOs have engineering consequences (freeze deployments)

Multiple SLOs per service:

Different SLOs for different request types (reads vs writes)
Different SLOs for different user tiers (free vs paid)
Different SLOs for different regions
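
One way to picture multiple SLOs per service is a lookup keyed by request class. This sketch assumes two dimensions, request type and user tier, and the targets in `SLO_TARGETS` are invented for illustration only.

```python
# Hypothetical per-class targets; the numbers are illustrative, not recommendations.
SLO_TARGETS = {
    ("read", "paid"): 0.9995,
    ("read", "free"): 0.999,
    ("write", "paid"): 0.999,
    ("write", "free"): 0.995,
}


def meets_slo(request_type: str, tier: str, good: int, total: int) -> bool:
    """Check a good/total ratio against the target for its request class."""
    target = SLO_TARGETS[(request_type, tier)]
    if total == 0:
        return True  # no traffic in the window; nothing to violate
    return good / total >= target


if __name__ == "__main__":
    # The same measured ratio can pass one class and fail another.
    print(meets_slo("read", "paid", good=9_994, total=10_000))
    print(meets_slo("write", "free", good=9_994, total=10_000))
```

Splitting targets this way keeps a strict promise for the traffic that needs it without forcing the same cost onto everything else.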

Common Misconceptions

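"100% availability is the goal" — 100% availability is neither achievable nor desirable. Each additional nine cuts the allowed downtime by a factor of ten, and the cost of achieving it tends to rise steeply. As a rough rule of thumb, 99.999% can cost on the order of 100x more to maintain than 99.9%.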
"SLOs are just for ops teams" — SLOs should drive product decisions. If error budget is exhausted, product teams should delay features and invest in reliability.
"Availability is the only SLI that matters" — A service that returns 200 OK in 30 seconds is technically "available" but unusable. Latency, error rate, and throughput SLIs are equally important.
"SLAs and SLOs are the same thing" — SLOs are internal engineering targets. SLAs are external contractual obligations. You can (and should) have SLOs without SLAs.
"You should alert on every SLO violation" — Brief dips below SLO are normal. Alert on error budget burn rate ("at this rate, we will exhaust the budget in 2 hours") rather than instantaneous violations.
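
Burn rate, mentioned in the last point, can be sketched as the observed error rate divided by the error rate the SLO allows. A burn rate of 1 spends the budget exactly over the SLO window; higher values exhaust it sooner. The function names and the 30-day window are assumptions for illustration.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    return error_rate / (1 - slo_target)


def hours_to_exhaustion(rate: float, window_hours: float,
                        budget_remaining: float = 1.0) -> float:
    """Hours until the budget runs out if the current burn rate continues."""
    if rate <= 0:
        return float("inf")
    return window_hours * budget_remaining / rate


if __name__ == "__main__":
    WINDOW_HOURS = 30 * 24  # assumed 30-day SLO window
    rate = burn_rate(error_rate=0.36, slo_target=0.999)
    print(f"burn rate: {rate:.0f}x")
    print(f"budget exhausted in about {hours_to_exhaustion(rate, WINDOW_HOURS):.1f} h")
```

In this example a 36% error rate against a 99.9% SLO burns the budget about 360 times too fast, which exhausts a full 30-day budget in roughly two hours. A brief blip at the same error rate would barely dent it, which is why the alert keys on sustained burn rather than instantaneous violation.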

How This Appears in Interviews

"How do you define reliability for your system?" — Define SLIs based on user-facing metrics, set SLOs based on user expectations, implement error budget tracking.
"What availability target would you set for a payment system?" — 99.99% (4.38 minutes downtime/month). Justify by the financial impact of downtime.
"How do you balance feature velocity with reliability?" — Error budgets. When budget is healthy, ship fast. When budget is low, focus on reliability.
"Design a monitoring system" — SLI-based dashboards, error budget alerting, burn-rate alerts, automated rollback on budget consumption.

Related Concepts

Tail Latency — latency percentiles are core SLIs
Chaos Engineering — validates that SLOs hold under failure
Blue-Green vs Canary Deployments — deployment strategies that protect SLOs
Back-of-Envelope Estimation — estimate whether SLOs are achievable
Time-Series Data Modeling — storing SLI metrics over time
System Design Interview Guide
