System Design: Spotify Music Streaming
Design a scalable music streaming platform like Spotify supporting 500 million users, lossless audio delivery, personalized recommendations, collaborative playlists, and real-time listening sessions.
Requirements
Functional Requirements:
- Stream audio tracks in multiple quality levels (96 Kbps to lossless 1411 Kbps) with seamless gapless playback
- Personalized recommendations: Discover Weekly, Daily Mixes, and real-time radio based on listening history
- Collaborative playlists, social following, and shared listening sessions (Spotify Jam)
- Offline mode: download tracks for playback without internet connectivity
- Podcasts and audiobooks with chapter navigation and variable playback speed
- Real-time "what's playing" status visible to followers
Non-Functional Requirements:
- 500 million users, 200 million daily active
- Audio stream start latency under 200ms
- Recommendation model trained on 50 billion daily listening events
- 99.99% uptime for audio streaming
- Global catalog of 100 million tracks, 5 million podcasts
Scale Estimation
200M DAU each listen ~30 minutes/day at 160 Kbps (standard quality). Assume ~30% of DAU stream concurrently at peak: 60M × 160 Kbps = 9.6 Tbps of peak audio egress. This is almost entirely CDN traffic. At a 95% CDN hit rate, origin egress is ~480 Gbps — served by ~48 origin servers at 10 Gbps each. Track catalog: 100M tracks × 5 minutes average × 5 quality variants × 1 MB/minute (average across variants) = 2.5 PB of audio storage in S3. Recommendations: 200M DAU generating 50B listening events/day = 578k events/second — ingested via Kafka for real-time model updates and batch model training.
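The estimates above can be checked with a short back-of-envelope script. The 30% peak-concurrency factor is an assumption made for illustration:

```python
DAU = 200_000_000
PEAK_CONCURRENCY = 0.30          # assumed fraction of DAU listening at peak
BITRATE_KBPS = 160
CDN_HIT_RATE = 0.95
ORIGIN_NIC_GBPS = 10

peak_egress_tbps = DAU * PEAK_CONCURRENCY * BITRATE_KBPS / 1e9   # Kbps -> Tbps
origin_gbps = peak_egress_tbps * (1 - CDN_HIT_RATE) * 1000       # Tbps -> Gbps
origin_servers = origin_gbps / ORIGIN_NIC_GBPS

tracks, minutes, variants, mb_per_min = 100_000_000, 5, 5, 1
storage_pb = tracks * minutes * variants * mb_per_min / 1e9      # MB -> PB

events_per_sec = 50_000_000_000 / 86_400

print(f"{peak_egress_tbps:.1f} Tbps peak, {origin_gbps:.0f} Gbps origin, "
      f"{origin_servers:.0f} origins, {storage_pb:.1f} PB, {events_per_sec:,.0f} ev/s")
```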
High-Level Architecture
The platform divides into four planes: the catalog plane (track metadata, audio files, licensing), the streaming plane (audio delivery, CDN, offline sync), the social and discovery plane (playlists, following, recommendations), and the analytics plane (listening history, ML pipelines).
Catalog: tracks are uploaded by labels/distributors, transcoded to OGG Vorbis (Spotify's primary format) at 5 quality levels, and stored in S3. A CDN (Akamai and Cloudflare in tandem) caches audio segments globally. Track metadata (title, artist, album, duration, ISRC) is stored in PostgreSQL; lyrics sync data in a separate service. Licensing data (which tracks are available in which countries) is cached in Redis per user session.
Streaming: when a user plays a track, the client requests audio from the CDN directly. For premium users, the client prefetches the next 2-3 tracks in the queue while the current track plays (reducing perceived latency for track changes to zero). For free users, ad insertion is handled by a server-side ad stitching service that dynamically inserts audio ads into the stream. Offline downloads: tracks are encrypted with a per-device key (DRM via Widevine) and stored in the device's local file system. Playback requires an active premium subscription verified at session start.
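The client-side prefetch logic described above reduces to "fetch the next few queued tracks that are not already cached". A minimal sketch, where the queue representation and the two-track lookahead are illustrative assumptions:

```python
PREFETCH_AHEAD = 2  # assumed lookahead; the text says 2-3 tracks

def tracks_to_prefetch(queue, current_index, cached):
    """Return the next few queued track IDs not yet in the local cache."""
    upcoming = queue[current_index + 1 : current_index + 1 + PREFETCH_AHEAD]
    return [t for t in upcoming if t not in cached]

queue = ["t1", "t2", "t3", "t4", "t5"]
print(tracks_to_prefetch(queue, 0, cached={"t2"}))  # -> ['t3']
```

In practice the client would kick off CDN fetches for the returned IDs while the current track plays, which is what makes track changes feel instantaneous.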
Recommendations: Spotify's recommendation engine is the platform's core differentiator. Collaborative filtering (matrix factorization on the 500M-user × 100M-track listening matrix) generates user embeddings and track embeddings. Cosine similarity between the user embedding and track embeddings identifies recommended tracks. The embedding models (deployed via TorchServe) are queried at low latency to power real-time radio (next-track prediction) and are batch-computed weekly for Discover Weekly playlist generation.
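The retrieval step boils down to a cosine-similarity lookup in embedding space. A minimal sketch with toy 3-dimensional vectors (real embeddings are 128-dimensional per the database section; the track names and values here are made up):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

user = [0.9, 0.1, 0.0]
tracks = {"indie_song": [0.8, 0.2, 0.1], "metal_song": [0.0, 0.1, 0.9]}

# Rank candidate tracks by similarity to the user embedding, best first.
ranked = sorted(tracks, key=lambda t: cosine(user, tracks[t]), reverse=True)
print(ranked)
```

At catalog scale this brute-force scan is replaced by the ANN index described under Core Components.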
Core Components
Audio Transcoding Pipeline
New tracks from label uploads go through an automated quality control and transcoding pipeline. FFmpeg produces OGG Vorbis variants at 24 Kbps (lowest), 96 Kbps, 160 Kbps (standard), and 320 Kbps (high), plus a FLAC (lossless) variant. Each variant is split into 30-second segments (for CDN caching efficiency and random access within a track). A loudness normalization pass (EBU R 128 standard) ensures consistent perceived volume across tracks. Transcoding jobs run on a Spot instance fleet, processing ~500k new tracks/day. The transcoded segments are stored in S3 behind the CDN; premium-only qualities (320 Kbps and lossless) are served via signed URLs.
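The variant ladder above could be driven by a small job script that emits one FFmpeg invocation per quality level. A sketch under assumed paths and naming conventions (`loudnorm` is FFmpeg's EBU R 128 loudness filter; segmenting and upload are omitted):

```python
# Variant ladder from the text: four Vorbis bitrates plus lossless FLAC.
VARIANTS = [("ogg", "libvorbis", "24k"), ("ogg", "libvorbis", "96k"),
            ("ogg", "libvorbis", "160k"), ("ogg", "libvorbis", "320k"),
            ("flac", "flac", None)]

def transcode_commands(src, out_prefix):
    """Build one ffmpeg command per quality variant, with loudness normalization."""
    cmds = []
    for ext, codec, bitrate in VARIANTS:
        cmd = ["ffmpeg", "-i", src, "-af", "loudnorm", "-c:a", codec]
        if bitrate:
            cmd += ["-b:a", bitrate]
        cmd.append(f"{out_prefix}_{bitrate or 'lossless'}.{ext}")
        cmds.append(cmd)
    return cmds

for cmd in transcode_commands("master.wav", "track123"):
    print(" ".join(cmd))
```

A real pipeline would run these via a job queue on the Spot fleet and follow each with the 30-second segmenting step.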
Recommendation Engine
The recommendation engine uses a two-stage approach: retrieval (find candidate tracks) and ranking (sort candidates by predicted user preference). Retrieval uses Approximate Nearest Neighbor search (FAISS index) over track embeddings — given a user's embedding vector, find the 500 nearest tracks in embedding space in <10ms. Ranking applies a neural network (MLP with user context features: time of day, device type, recent listening history) to score the 500 candidates and return the top 20. The user embedding is updated in real time using an online learning signal (positive: listened >30 seconds; negative: skipped in first 10 seconds), stored in Redis and merged with the batch-trained embedding nightly.
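The online learning signal can be merged into the user embedding with a simple weighted update. A sketch where the learning rate, reward values, and update rule are assumptions, not the production model:

```python
LEARNING_RATE = 0.05  # assumed; controls how fast the online vector drifts

def online_update(user_vec, track_vec, listened_ms, skipped_early):
    """Nudge the user embedding toward (positive) or away from (negative) a track."""
    if listened_ms > 30_000:
        reward = 1.0        # positive signal: listened past 30 seconds
    elif skipped_early:
        reward = -1.0       # negative signal: skipped in the first 10 seconds
    else:
        return user_vec     # ambiguous signal: leave the embedding unchanged
    return [u + LEARNING_RATE * reward * (t - u)
            for u, t in zip(user_vec, track_vec)]

updated = online_update([0.5, 0.5], [1.0, 0.0],
                        listened_ms=45_000, skipped_early=False)
print(updated)
```

The updated vector would live in Redis (per the database section) and be reconciled with the batch-trained embedding in the nightly merge.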
Collaborative Playlist Service
Collaborative playlists allow multiple users to add/remove/reorder tracks. Concurrency is handled with operational transform: each edit is a delta (ADD track at index, REMOVE track at index, MOVE track from index A to B). The service applies deltas in server-assigned timestamp order and broadcasts the result to all connected editors via WebSocket. Playlist state is stored in Redis (active editing sessions) and PostgreSQL (authoritative, versioned). For shared listening sessions (Spotify Jam), one user acts as the "host" and all participants follow the host's playback position in real time. The Jam service uses the same WebSocket infrastructure as collaborative playlists but also synchronizes playback state (current track, position, play/pause) across all devices.
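Applying the edit deltas in server-assigned timestamp order can be sketched as follows (the delta dictionary shape is an assumption matching the ADD/REMOVE/MOVE operations named above):

```python
def apply_deltas(tracks, deltas):
    """Apply ADD/REMOVE/MOVE playlist deltas in server-assigned timestamp order."""
    tracks = list(tracks)
    for delta in sorted(deltas, key=lambda d: d["ts"]):
        if delta["op"] == "ADD":
            tracks.insert(delta["index"], delta["track_id"])
        elif delta["op"] == "REMOVE":
            tracks.pop(delta["index"])
        elif delta["op"] == "MOVE":
            tracks.insert(delta["to"], tracks.pop(delta["from"]))
    return tracks

# Two concurrent edits arrive out of order; the ts sort makes the result deterministic.
state = apply_deltas(["a", "b", "c"], [
    {"ts": 2, "op": "MOVE", "from": 0, "to": 2},
    {"ts": 1, "op": "ADD", "index": 1, "track_id": "x"},
])
print(state)
```

The server would broadcast the resulting state (or the ordered deltas) to all connected editors over the WebSocket channel.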
Database Design
PostgreSQL: tracks (track_id, isrc, title, artist_ids[], album_id, duration_ms, release_date, explicit), playlists (playlist_id, owner_id, name, is_public, is_collaborative, track_ids[]), user_library (user_id, liked_tracks[], followed_artists[], saved_albums[], followed_playlists[]). Redis Cluster: user:{user_id}:queue (current playback queue), user:{user_id}:embedding (latest recommendation vector, 128-dim float array), track:{track_id}:metadata (cached metadata, TTL 1h), streaming:{user_id}:state (current track, position, device_id for cross-device handoff). Cassandra: listening_history (user_id, track_id, listened_at, listen_duration_ms, source) — wide-column, optimized for append and time-range reads. S3: audio segments (CDN-backed), album artwork, podcast media. Kafka: listening-events (real-time listening data for recommendations and analytics).
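For the Cassandra listening_history table, keeping partitions bounded matters for the append and time-range access pattern. A sketch of one possible key design, bucketing by day — this bucketing scheme is an assumption, not stated in the schema above:

```python
from datetime import datetime, timezone

def history_keys(user_id, listened_at):
    """Partition by (user_id, day bucket); cluster by timestamp for range reads."""
    day = listened_at.strftime("%Y-%m-%d")
    partition_key = (user_id, day)            # bounds each partition to one day of listens
    clustering_key = listened_at.isoformat()  # time-ordered within the partition
    return partition_key, clustering_key

pk, ck = history_keys("u42", datetime(2024, 6, 1, 12, 30, tzinfo=timezone.utc))
print(pk, ck)
```

A time-range read ("what did u42 play in June?") then touches a known, small set of partitions instead of one unbounded row per user.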
API Design
- GET /tracks/{track_id}/stream — returns signed CDN URLs for audio segments at requested quality; checks licensing for user's country; premium-only for lossless
- GET /recommendations/radio?seed_track_ids[]={...}&seed_artist_ids[]={...} — returns 20 recommended tracks using embedding similarity; <100ms response
- POST /playlists/{playlist_id}/tracks — body: {track_id, position}, adds track to playlist; broadcasts delta to collaborative editors via WebSocket
- GET /users/{user_id}/top-tracks?period={short|medium|long} — returns listening history statistics; popular feature for Spotify Wrapped
- POST /jam/sessions — body: {host_id, initial_queue}, creates Jam session, returns session_id and WebSocket endpoint
Scaling & Bottlenecks
CDN audio delivery is the dominant scaling concern. At 9.6 Tbps aggregate egress, Spotify relies on a multi-CDN strategy (Akamai + Cloudflare + GCP CDN) to distribute load and ensure geo-redundancy. Popular tracks (top 1% account for ~80% of streams) have near-100% CDN hit rates; long-tail tracks may have 0% CDN cache after initial popularity fades, requiring efficient origin serving (S3 with Transfer Acceleration for cross-region delivery).
Recommendation latency: ANN search over 100M track embeddings (128 dimensions, FAISS HNSW index) takes ~5ms per query. With 200M DAU generating 600M track starts/day = 7k recommendation calls/second, a cluster of 50 TorchServe instances handles the load. The FAISS index (100M × 128 × 4 bytes = 51 GB) fits in memory on a large instance — served from a read-replicated embedding cache (not reloaded per request).
Key Trade-offs
- OGG Vorbis vs. AAC/MP3: Spotify uses OGG Vorbis (royalty-free, good quality at low bitrates) as its primary format; AAC is better supported on Apple hardware (no transcoding needed on iOS/macOS); serving multiple formats doubles storage cost but reduces client-side transcoding. Spotify historically prioritized OGG for cost reasons.
- Batch recommendations (Discover Weekly) vs. real-time radio: Weekly batch recommendations (Discover Weekly, Release Radar) allow rich, computationally expensive models to run on the full user graph; real-time radio requires sub-100ms inference, forcing simpler approximate models. The combination of both covers different discovery use cases.
- DRM vs. no DRM for offline: DRM (Widevine/FairPlay) protects label licensing requirements but adds client complexity and creates playback failures when subscription lapses; no DRM simplifies implementation but would require individual track licensing agreements (impractical at scale).
- Centralized vs. P2P audio delivery: Spotify experimented with peer-to-peer delivery (Spotify P2P, discontinued in 2014) to reduce CDN costs; P2P reduces egress costs but introduces variable latency, NAT traversal complexity, and privacy concerns. CDN is now universally preferred in the industry.