
System Design: Video Lesson Platform

Design a scalable video lesson platform supporting instructor video uploads, adaptive streaming delivery, interactive chapter markers, per-second watch progress tracking, and video search. Covers transcoding pipelines, CDN architecture, and engagement analytics.

15 min read · Updated Jan 15, 2025
Tags: system-design · education · video-streaming · hls · transcoding

Requirements

Functional Requirements:

  • Instructors upload raw video files (up to 10 GB); the platform transcodes them to multiple qualities and makes them available within 20 minutes
  • Learners stream videos with adaptive bitrate (HLS/DASH), with chapter markers, speed control, and closed captions
  • Platform tracks per-second watch progress, resume position, and completion status per learner per video
  • Video search includes full-text search of auto-generated transcripts (search within video)
  • Instructors add interactive elements: inline quizzes at timestamps, clickable links, and resource attachments
  • Video engagement analytics: heatmaps of replay/skip behavior, average watch percentage per video

Non-Functional Requirements:

  • Video start latency under 2 seconds for 95% of viewers globally
  • Support 1 million concurrent viewers
  • Transcoding pipeline processes 10,000 new video uploads per day
  • Captions generated automatically via speech-to-text within 5 minutes of upload
  • Storage: ~100 PB of video content in multiple resolutions

Scale Estimation

1M concurrent viewers at 720p (2 Mbps average) = 2 Tbps of video egress. This is entirely CDN traffic; no single origin can serve it. With a 98% CDN cache hit rate for popular content, origin egress is ~40 Gbps, manageable across 20 origin servers.

Transcoding: 10k new uploads/day at an average of 30 minutes each = 5k hours of raw video/day. Transcoding takes roughly 0.5× real-time per quality level on a GPU instance (15 minutes per 30-minute video), so 5k hours × 4 qualities × 0.5 = 10k GPU-hours/day, requiring a pool of ~420 GPU instances running continuously.

Progress tracking: 1M viewers each sending a heartbeat every 5 seconds = 200k writes/second, which requires a write-optimized storage layer.
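
These numbers are easy to re-derive when the assumptions change. A back-of-envelope sketch in Python, restating only the figures above:

```python
# Back-of-envelope capacity estimates (constants restate the assumptions above).

CONCURRENT_VIEWERS = 1_000_000
AVG_BITRATE_MBPS = 2              # 720p average
CDN_HIT_RATE = 0.98

egress_tbps = CONCURRENT_VIEWERS * AVG_BITRATE_MBPS / 1_000_000     # 2.0 Tbps
origin_gbps = egress_tbps * (1 - CDN_HIT_RATE) * 1_000              # ~40 Gbps

UPLOADS_PER_DAY = 10_000
AVG_VIDEO_HOURS = 0.5             # 30-minute average
QUALITY_LEVELS = 4
TRANSCODE_RATIO = 0.5             # GPU-hours per source hour, per quality level

gpu_hours_per_day = UPLOADS_PER_DAY * AVG_VIDEO_HOURS * QUALITY_LEVELS * TRANSCODE_RATIO
gpu_instances = gpu_hours_per_day / 24                              # ~417, round up to ~420

HEARTBEAT_INTERVAL_S = 5
progress_writes_per_s = CONCURRENT_VIEWERS / HEARTBEAT_INTERVAL_S   # 200k/s

print(f"egress={egress_tbps:.1f} Tbps, origin={origin_gbps:.0f} Gbps, "
      f"gpus={gpu_instances:.0f}, progress_writes={progress_writes_per_s:,.0f}/s")
```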

High-Level Architecture

The platform divides into three pipelines: the upload and processing pipeline, the streaming delivery pipeline, and the engagement tracking pipeline. Upload and processing: instructors upload raw video to a multipart S3 upload endpoint (with client-side chunking for large files). S3 triggers a Lambda to enqueue a transcoding job in SQS. Transcoding workers (EC2 GPU instances running FFmpeg) pull jobs, produce HLS segments at 360p/720p/1080p/4K, generate WebVTT caption files via a speech-to-text API (AWS Transcribe or Whisper), extract a thumbnail grid, and write all outputs back to S3. A job tracker (DynamoDB) reports progress to the instructor's dashboard via polling.
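
The hand-off from upload to transcoding is small enough to sketch. A rough Lambda handler for the S3 ObjectCreated trigger, assuming an SQS queue URL in an environment variable and keys shaped raw/{video_id}/{filename} (both are illustrative, not a fixed contract):

```python
import json
import os
import urllib.parse

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["TRANSCODE_QUEUE_URL"]  # assumed configuration


def handler(event, context):
    """S3 ObjectCreated trigger: enqueue one transcoding job per uploaded file."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        job = {
            "video_id": key.split("/")[1],     # assumes keys shaped raw/{video_id}/{filename}
            "source": {"bucket": bucket, "key": key},
            "outputs": ["360p", "720p", "1080p", "2160p"],
            "priority": "normal",              # live-event uploads would use "live"
        }
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(job))
```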

Streaming delivery is a pure CDN problem. HLS manifests and video segments are served from S3 via CloudFront. The master manifest (.m3u8) references a variant playlist per quality; each variant playlist lists that quality's segments, allowing the player to switch quality per segment based on measured bandwidth. Edge caching keeps popular video segments warm globally. For long-tail content (rarely viewed videos), an origin shield in each continent prevents cold-miss cascades from reaching S3. The video player (Video.js or Shaka Player) handles ABR quality selection, DRM (Widevine/FairPlay for premium content), and chapter navigation entirely client-side.
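
To make the manifest structure concrete, a sketch that renders an HLS master playlist from the variant metadata the transcoder writes out (bandwidths, resolutions, and paths are illustrative, not the platform's real ladder):

```python
# Sketch: render an HLS master playlist that points at one media playlist per quality.
VARIANTS = [
    (800_000,    "640x360",   "360p/index.m3u8"),
    (2_800_000,  "1280x720",  "720p/index.m3u8"),
    (5_000_000,  "1920x1080", "1080p/index.m3u8"),
    (16_000_000, "3840x2160", "2160p/index.m3u8"),
]


def build_master_playlist(variants=VARIANTS) -> str:
    lines = ["#EXTM3U", "#EXT-X-VERSION:3"]
    for bandwidth, resolution, uri in variants:
        lines.append(f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={resolution}")
        lines.append(uri)
    return "\n".join(lines) + "\n"


# The player fetches this once, then picks a variant per segment based on throughput.
print(build_master_playlist())
```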

Engagement tracking is a write-heavy append pipeline. Every 5 seconds, the player sends a heartbeat to the progress API: {video_id, user_id, position_seconds, quality}. The API writes to Kafka (durable, low-latency). A Kafka Streams job aggregates per-user completion state (for the resume position feature) and writes every 30 seconds to Redis (for fast dashboard reads) and every 5 minutes to PostgreSQL (durable progress store). A separate ClickHouse pipeline ingests all heartbeat events for engagement analytics — building per-second heatmaps of viewer behavior across the entire learner population.
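
A minimal sketch of the ingest side of that pipeline, assuming the confluent-kafka Python client and a watch-heartbeats topic keyed by (user, video) so a viewer's events stay ordered:

```python
import json
import time

from confluent_kafka import Producer  # assumed client library

producer = Producer({"bootstrap.servers": "kafka:9092"})


def record_heartbeat(video_id: str, user_id: str, position_seconds: int, quality: str) -> None:
    """Called by the progress API for each 5-second player heartbeat."""
    event = {
        "video_id": video_id,
        "user_id": user_id,
        "position_seconds": position_seconds,
        "quality": quality,
        "client_ts": int(time.time() * 1000),
    }
    # Keying by (user, video) keeps one viewer's events ordered within a partition.
    producer.produce(
        topic="watch-heartbeats",
        key=f"{user_id}:{video_id}",
        value=json.dumps(event),
    )
    producer.poll(0)  # serve delivery callbacks without blocking the request
```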

Core Components

Adaptive Transcoding Service

The transcoding service uses a two-pass architecture. Pass 1 (analysis): FFprobe analyzes the source file for resolution, frame rate, codec, dynamic range, and audio tracks. This determines the optimal output profiles (e.g., a 480p source is not upscaled to 1080p). Pass 2 (transcoding): FFmpeg runs with H.264 (for broad compatibility) and H.265/AV1 (for bandwidth-efficient streaming) codecs, producing HLS segments of exactly 6 seconds each (aligned to keyframes). The HLS master playlist references all quality variants. GPU instances use NVENC hardware encoding, which reaches 10-20× real-time for H.264/H.265; the more conservative 0.5× per-quality figure in the scale estimate leaves headroom for slower codecs such as AV1 and for 4K outputs. A priority queue ensures live-event uploads (webinars) jump ahead of regular uploads.
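
As an illustration of pass 2, a single-variant invocation for the 720p H.264 rendition might look like the sketch below; the flags are standard FFmpeg/NVENC options, while the bitrates, paths, and keyframe expression are assumptions:

```python
import subprocess


def transcode_720p_hls(source: str, out_dir: str) -> None:
    """One output variant: 720p H.264 via NVENC, 6-second keyframe-aligned HLS segments.

    Assumes {out_dir}/720p already exists; bitrates are illustrative.
    """
    cmd = [
        "ffmpeg", "-y", "-i", source,
        "-vf", "scale=-2:720",                           # keep aspect ratio, 720 lines
        "-c:v", "h264_nvenc", "-b:v", "2800k",           # NVENC hardware encode
        "-force_key_frames", "expr:gte(t,n_forced*6)",   # a keyframe every 6 s
        "-c:a", "aac", "-b:a", "128k",
        "-hls_time", "6",                                # 6-second segments
        "-hls_playlist_type", "vod",
        "-hls_segment_filename", f"{out_dir}/720p/seg_%05d.ts",
        f"{out_dir}/720p/index.m3u8",
    ]
    subprocess.run(cmd, check=True)
```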

Caption and Transcript Service

After transcoding, the audio track is extracted and sent to a speech-to-text service (AWS Transcribe, Google STT, or a self-hosted Whisper model). The returned transcript with word-level timestamps is stored as a WebVTT file (for browser caption rendering) and indexed in Elasticsearch (for in-video search). Full-text search queries are handled by Elasticsearch — a query like "explain gradient descent" returns a list of videos with timestamps where those words appear in the transcript. Result snippets show the transcript text surrounding the match with a direct link to that timestamp.
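
A hedged sketch of the in-video search query, assuming each Elasticsearch document is one transcript segment with a text field and a start_ms timestamp (index name and mapping are illustrative):

```python
from elasticsearch import Elasticsearch  # index name and document shape are assumptions

es = Elasticsearch("http://localhost:9200")


def search_in_videos(query: str, size: int = 10):
    """Return (video_id, timestamp, snippet) hits for a transcript phrase."""
    resp = es.search(
        index="video_transcripts",
        body={
            "query": {"match": {"text": {"query": query, "operator": "and"}}},
            "highlight": {"fields": {"text": {}}},   # snippet around the match
            "size": size,
        },
    )
    return [
        {
            "video_id": hit["_source"]["video_id"],
            "timestamp_ms": hit["_source"]["start_ms"],
            "snippet": hit["highlight"]["text"][0],
        }
        for hit in resp["hits"]["hits"]
    ]
```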

Interactive Elements Service

Instructors add interactive overlays via a timeline editor in the authoring UI. Interactive elements are stored as time-coded metadata objects in PostgreSQL: (video_id, timestamp_ms, type, payload_json). Types include: quiz (an inline question that pauses the video), link (a clickable annotation), chapter (a named navigation marker), and resource (a file attachment to download). The video player fetches these metadata objects at play start and uses a cue point system to trigger overlays at the correct playback position. Quiz responses are captured and sent through the same progress-tracking pipeline as regular heartbeat events.
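
The cue-point logic itself is small. A conceptual sketch (in Python rather than the player's JavaScript) of finding the elements whose timestamps were crossed during the last playback tick:

```python
import bisect

# Time-coded elements as fetched at play start, sorted by timestamp.
# Each entry: (timestamp_ms, type, payload) -- illustrative data.
elements = sorted([
    (30_000,  "chapter", {"title": "Introduction"}),
    (95_000,  "quiz",    {"question_id": "q1"}),
    (240_000, "link",    {"url": "https://example.com/notes"}),
])
timestamps = [t for t, _, _ in elements]


def due_elements(prev_ms: int, now_ms: int):
    """Elements whose cue point was crossed since the previous playback tick."""
    lo = bisect.bisect_right(timestamps, prev_ms)
    hi = bisect.bisect_right(timestamps, now_ms)
    return elements[lo:hi]


# A tick from 94.5 s to 95.5 s triggers the inline quiz, which pauses the video.
for _, kind, payload in due_elements(94_500, 95_500):
    print(kind, payload)
```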

Database Design

  • PostgreSQL: videos (video_id, instructor_id, title, description, duration_seconds, status, uploaded_at); video_assets (asset_id, video_id, type [hls_manifest / segment / thumbnail / caption], s3_key, quality); interactive_elements (element_id, video_id, timestamp_ms, type, payload_json); watch_progress (progress_id, video_id, user_id, position_seconds, completed, last_watched_at), indexed on (user_id, video_id)
  • ClickHouse: watch_events (video_id, user_id, position_seconds, quality, client_ts, server_ts), used for heatmap analytics, partitioned by video_id
  • Elasticsearch: video_transcripts (video_id, transcript segments with word-level timestamps), for full-text in-video search

API Design

  • POST /videos/upload-url — body: {filename, size_bytes, content_type}, returns a multipart S3 pre-signed upload URL and video_id; upload goes directly to S3 bypassing the server
  • GET /videos/{video_id}/manifest — returns a CloudFront-signed HLS manifest URL (valid 24 hours, DRM token embedded; see the signing sketch after this list); the player streams directly from the CDN
  • POST /videos/{video_id}/progress — body: {position_seconds, quality}, writes heartbeat to Kafka; rate-limited to 1 call/5 seconds per user
  • GET /videos/search?q={query} — full-text search of transcripts, returns {video_id, title, matches: [{timestamp_ms, snippet}]}
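
For the manifest endpoint above, a sketch of producing the CloudFront-signed playlist URL with botocore's CloudFrontSigner; the key pair id, key file, domain, and path are placeholders, and the 24-hour expiry follows the trade-off discussion below:

```python
import datetime

from botocore.signers import CloudFrontSigner
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

KEY_PAIR_ID = "KXXXXXXXXXXXXX"                       # placeholder CloudFront key pair id
PRIVATE_KEY_PEM = open("cloudfront_private_key.pem", "rb").read()


def rsa_signer(message: bytes) -> bytes:
    key = serialization.load_pem_private_key(PRIVATE_KEY_PEM, password=None)
    return key.sign(message, padding.PKCS1v15(), hashes.SHA1())  # CloudFront expects SHA-1


def signed_manifest_url(video_id: str, hours: int = 24) -> str:
    """Time-limited CloudFront URL for a video's HLS master playlist."""
    signer = CloudFrontSigner(KEY_PAIR_ID, rsa_signer)
    url = f"https://cdn.example.com/videos/{video_id}/master.m3u8"  # placeholder domain/path
    expires = datetime.datetime.utcnow() + datetime.timedelta(hours=hours)
    return signer.generate_presigned_url(url, date_less_than=expires)
```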

Scaling & Bottlenecks

The 200k progress writes/second is the dominant write bottleneck. Writing directly to PostgreSQL at this rate would need on the order of 200 write nodes, which is impractical. The Kafka buffer plus batch-write approach cuts PostgreSQL traffic to roughly 1k batched upserts per second: the stream processor flushes only each user's latest position on the cadences described earlier, rather than one row per heartbeat. Redis caches the latest position for each (user_id, video_id) pair for fast resume-position reads, refreshed by the stream processor. Redis memory: 1M active viewers × 100 bytes/entry = 100 MB, trivial.
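
A minimal consumer-side sketch of that batching, assuming the confluent-kafka, redis-py, and psycopg2 clients; for brevity it uses one flush interval for both stores, where the design above uses separate cadences, and it assumes a unique index on watch_progress(user_id, video_id):

```python
import json
import time

import psycopg2
import redis
from confluent_kafka import Consumer

r = redis.Redis(host="redis", port=6379)
pg = psycopg2.connect("dbname=lessons")
consumer = Consumer({"bootstrap.servers": "kafka:9092",
                     "group.id": "progress-aggregator",
                     "auto.offset.reset": "latest"})
consumer.subscribe(["watch-heartbeats"])

FLUSH_INTERVAL_S = 30          # single interval for brevity; see text for per-store cadences
latest = {}                    # (user_id, video_id) -> newest position since last flush
last_flush = time.time()

while True:
    msg = consumer.poll(1.0)
    if msg is not None and msg.error() is None:
        ev = json.loads(msg.value())
        latest[(ev["user_id"], ev["video_id"])] = ev["position_seconds"]

    if latest and time.time() - last_flush >= FLUSH_INTERVAL_S:
        # Redis: freshest resume position per (user, video) for fast reads.
        r.mset({f"progress:{u}:{v}": pos for (u, v), pos in latest.items()})
        # PostgreSQL: one multi-row upsert per flush instead of one write per heartbeat.
        with pg.cursor() as cur:
            cur.executemany(
                """INSERT INTO watch_progress (user_id, video_id, position_seconds, last_watched_at)
                   VALUES (%s, %s, %s, now())
                   ON CONFLICT (user_id, video_id) DO UPDATE
                   SET position_seconds = EXCLUDED.position_seconds, last_watched_at = now()""",
                [(u, v, pos) for (u, v), pos in latest.items()],
            )
        pg.commit()
        latest.clear()
        last_flush = time.time()
```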

CDN cold start for new viral content is the latency risk: a new video that suddenly gets 100k concurrent viewers will have low CDN cache hit rates in the first few minutes, causing all requests to hit origin S3. CloudFront's origin shield mitigates this by funneling cache-miss requests to a regional PoP before hitting S3, but popular new content should be pre-warmed by triggering CDN cache fill requests across all edge nodes at publish time.

Key Trade-offs

  • HLS vs. DASH: HLS has native support in Safari/iOS without JavaScript; DASH is more flexible (e.g., multiple audio tracks, trickplay thumbnails) but requires a JS player. Using HLS for broad compatibility and DASH for premium features is a reasonable hybrid.
  • Whisper (self-hosted) vs. managed STT API: Managed APIs (AWS Transcribe) are operationally simpler and scale automatically but cost 10× more per minute of audio; self-hosted Whisper on GPU instances is cheaper at scale but requires ML infrastructure management.
  • Per-second progress tracking vs. per-session: Per-second heartbeats enable accurate engagement heatmaps but generate 200k writes/second; per-session (write on pause/close) loses granularity and misses tab-crash data. Batching client-side (accumulate 30 seconds, flush) is a good middle ground.
  • CDN-signed URLs vs. token auth: Signed CDN URLs prevent hotlinking without requiring the player to re-authenticate, but URL expiry creates edge cases if a learner pauses for hours and the URL expires mid-session; a 24-hour TTL with background refresh handles this gracefully.
