
System Design: Video Upload & Processing Service

System design of a video upload and processing service covering resumable uploads, virus scanning, content moderation, metadata extraction, and reliable processing for millions of daily uploads.


Requirements

Functional Requirements:

  • Accept video uploads up to 10GB in size via resumable, chunked upload protocol
  • Validate video format, codec, and integrity before processing
  • Scan uploaded content for malware and prohibited material
  • Extract metadata (duration, resolution, codec, bitrate, audio tracks, subtitles)
  • Generate preview thumbnails at multiple timestamps
  • Trigger downstream transcoding pipeline upon successful processing

Non-Functional Requirements:

  • Handle 5 million uploads per day (60 uploads/sec average, 200/sec peak)
  • Resumable upload: clients can resume after network failure without re-uploading completed chunks
  • End-to-end processing (upload complete → ready for transcoding) in under 5 minutes for a 1GB file
  • Zero data loss: every successfully acknowledged upload must be durably stored
  • 99.9% upload success rate (retries included)

Scale Estimation

5M uploads/day at an average of 500MB per file = 2.5PB/day of raw upload data. At 60 uploads/sec with chunked upload (5MB chunks, ~100 chunks per file), that is ~6,000 chunk writes/sec to object storage. Metadata extraction: 5M ffprobe operations/day ≈ 58 ops/sec. Virus scanning: 5M scans/day at an average of 30 seconds/scan ≈ 1,740 concurrent scan workers needed. Thumbnail generation: 5M × 5 thumbnails = 25M thumbnails/day. Total object storage writes (assembled videos + thumbnails + metadata objects): roughly 30M objects/day.
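
These figures can be reproduced with a short back-of-envelope script; the averages (500MB files, 5MB chunks, 30-second scans, 5 thumbnails per video) are the assumptions stated above, not measured values.

```python
# Back-of-envelope reproduction of the scale estimates above (assumed averages).
SECONDS_PER_DAY = 86_400

uploads_per_day = 5_000_000
avg_size_mb = 500
chunk_size_mb = 5
scan_seconds = 30
thumbs_per_video = 5

raw_pb_per_day = uploads_per_day * avg_size_mb / 1e9                      # MB -> PB
chunk_writes_per_sec = (uploads_per_day / SECONDS_PER_DAY) * (avg_size_mb / chunk_size_mb)
ffprobe_ops_per_sec = uploads_per_day / SECONDS_PER_DAY
scan_workers = uploads_per_day * scan_seconds / SECONDS_PER_DAY
thumbnails_per_day = uploads_per_day * thumbs_per_video

print(f"{raw_pb_per_day:.1f} PB/day raw upload data")         # ~2.5
print(f"{chunk_writes_per_sec:.0f} chunk PUTs/sec")            # ~5,800 (rounded to 6,000 above)
print(f"{ffprobe_ops_per_sec:.0f} ffprobe ops/sec")            # ~58
print(f"{scan_workers:.0f} concurrent scan workers")           # ~1,740
print(f"{thumbnails_per_day / 1e6:.0f}M thumbnails/day")       # 25M
```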

High-Level Architecture

The upload service is divided into three layers: the Upload Layer, the Processing Layer, and the Notification Layer. The Upload Layer exposes a chunked upload API (compatible with the tus resumable upload protocol). The client initiates an upload session, receives an upload_id and a list of pre-signed S3 URLs (one per chunk). The client uploads chunks in parallel (up to 5 concurrent) directly to S3, bypassing the application server for data transfer. As each chunk completes, S3 sends an event notification to an SQS queue. A Chunk Tracker Service consumes these events and maintains chunk completion state in Redis (a bitmap of completed chunks per upload_id).
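
The Chunk Tracker's core loop is small. A minimal sketch, assuming the S3 event notifications land in an SQS queue, chunk keys follow the prefix layout described under Scaling & Bottlenecks, and the expected chunk count was written to Redis when the upload session was created; the queue URL and the start_processing hand-off are illustrative stubs, not part of any real deployment.

```python
# Chunk Tracker sketch: consume S3 "chunk uploaded" events from SQS and record
# completion in a per-upload Redis bitmap. All names are illustrative.
import json
import boto3
import redis

sqs = boto3.client("sqs")
r = redis.Redis()
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/chunk-events"  # assumed

def start_processing(upload_id: str) -> None:
    ...  # hand off to the Processing Orchestrator (stubbed here)

def handle_chunk_event(s3_key: str) -> None:
    # Keys look like uploads/{prefix}/{upload_id}/chunk_{n} (see Scaling & Bottlenecks).
    _, _, upload_id, chunk_name = s3_key.rsplit("/", 3)
    chunk_index = int(chunk_name.removeprefix("chunk_"))

    r.setbit(f"upload:{upload_id}:chunks", chunk_index, 1)        # mark chunk received

    expected = int(r.get(f"upload:{upload_id}:chunk_count") or 0)
    if expected and r.bitcount(f"upload:{upload_id}:chunks") == expected:
        start_processing(upload_id)                                # all chunks present

def poll_forever() -> None:
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            for record in json.loads(msg["Body"]).get("Records", []):
                handle_chunk_event(record["s3"]["object"]["key"])
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```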

Once all chunks are received, the Chunk Tracker triggers the Processing Layer. A Processing Orchestrator (Step Functions or Temporal) runs a workflow: (1) Assemble — S3 multipart upload complete API merges chunks into the final object; (2) Validate — ffprobe extracts codec info and verifies the file is a valid video container; (3) Scan — the file is sent to a virus/malware scanner (ClamAV cluster) and a content safety classifier (ML model for NSFW detection); (4) Metadata — extract duration, resolution, bitrate, audio/subtitle tracks; (5) Thumbnails — FFmpeg extracts frames at 10%, 30%, 50%, 70%, 90% of duration and generates 640×360 JPEGs; (6) Register — write metadata to the Video Metadata DB and emit an event to trigger downstream transcoding.
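
Independent of the orchestration engine, the workflow is a linear sequence of retryable steps. A minimal sketch of that shape, with step bodies stubbed and the completion ledger kept in memory for brevity (a real deployment would rely on Step Functions or Temporal to persist per-step state):

```python
# Six-step processing workflow sketch: ordering plus the idempotency contract
# (a failed run resumes without redoing completed steps). Step bodies are stubs.
_done: set[tuple[str, str]] = set()   # stand-in for the workflow engine's state

def assemble_chunks(upload_id: str): ...      # S3 multipart-upload completion
def validate_container(upload_id: str): ...   # ffprobe: valid video container?
def run_safety_checks(upload_id: str): ...    # ClamAV + NSFW classifier + audio fingerprint
def extract_metadata(upload_id: str): ...     # duration, resolution, bitrate, tracks
def generate_thumbnails(upload_id: str): ...  # frames at 10/30/50/70/90% of duration
def register_video(upload_id: str): ...       # metadata DB row + downstream event

STEPS = [assemble_chunks, validate_container, run_safety_checks,
         extract_metadata, generate_thumbnails, register_video]

def process_upload(upload_id: str) -> None:
    for step in STEPS:
        if (upload_id, step.__name__) in _done:
            continue                           # resume after a failure: skip completed steps
        step(upload_id)
        _done.add((upload_id, step.__name__))
```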

The Notification Layer provides upload progress and processing status. A WebSocket connection or SSE stream pushes real-time progress (chunk upload %, processing stage) to the client. Alternatively, the client can poll a status endpoint.
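
For the SSE variant, a minimal sketch, assuming the upload and processing services write the current stage and percentage into a Redis hash; the framework (FastAPI), the endpoint path, and the key names are illustrative and are not part of the API defined below.

```python
# SSE progress stream sketch: poll a Redis hash and push updates to the client.
import asyncio
import json
import redis.asyncio as aioredis
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
r = aioredis.Redis()

@app.get("/api/v1/uploads/{upload_id}/events")   # illustrative path
async def upload_events(upload_id: str):
    async def stream():
        while True:
            state = await r.hgetall(f"upload:{upload_id}:progress")  # e.g. {stage, pct}
            payload = {k.decode(): v.decode() for k, v in state.items()}
            yield f"data: {json.dumps(payload)}\n\n"
            if state.get(b"stage") == b"complete":
                break
            await asyncio.sleep(1)
    return StreamingResponse(stream(), media_type="text/event-stream")
```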

Core Components

Resumable Upload Manager

The upload protocol follows the tus specification. On POST /uploads, the server creates an upload_id, computes the chunk plan (file_size / 5MB = N chunks), stores the upload metadata in DynamoDB (upload_id, user_id, file_name, file_size, chunk_count, status, created_at, expires_at), and returns pre-signed S3 URLs for each chunk. The client uploads chunks via PUT to S3 directly. If the client disconnects, it can call HEAD /uploads/{upload_id} to get the bitmap of completed chunks and resume only the missing ones. Uploads expire after 24 hours; a cleanup Lambda deletes orphaned chunks from S3.
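
A sketch of the initiation handler under these assumptions: chunks are written as separate objects via pre-signed put_object URLs and assembled server-side, the session table is DynamoDB with a TTL on expires_at, and the bucket/table names are made up for illustration.

```python
# Upload initiation sketch: chunk plan, session record, and per-chunk pre-signed URLs.
import math
import time
import uuid
import boto3

s3 = boto3.client("s3")
sessions = boto3.resource("dynamodb").Table("upload-sessions")   # assumed table name

CHUNK_SIZE = 5 * 1024 * 1024           # 5MB chunks
UPLOAD_BUCKET = "video-upload-chunks"  # assumed bucket
URL_TTL_S = 24 * 3600                  # uploads expire after 24 hours

def initiate_upload(user_id: str, file_name: str, file_size: int) -> dict:
    upload_id = uuid.uuid4().hex
    chunk_count = math.ceil(file_size / CHUNK_SIZE)
    prefix = upload_id[:2]             # two hex chars; see the hash-based layout under Scaling

    sessions.put_item(Item={
        "upload_id": upload_id, "user_id": user_id, "file_name": file_name,
        "file_size": file_size, "chunk_count": chunk_count, "status": "uploading",
        "created_at": int(time.time()),
        "expires_at": int(time.time()) + URL_TTL_S,   # DynamoDB TTL attribute
    })

    urls = [
        s3.generate_presigned_url(
            "put_object",
            Params={"Bucket": UPLOAD_BUCKET, "Key": f"uploads/{prefix}/{upload_id}/chunk_{i}"},
            ExpiresIn=URL_TTL_S,
        )
        for i in range(chunk_count)
    ]
    return {"upload_id": upload_id, "chunk_size": CHUNK_SIZE,
            "chunk_count": chunk_count, "presigned_urls": urls}
```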

Content Safety Pipeline

Every upload passes through a content safety pipeline before being accepted. The pipeline runs three checks in parallel: (1) ClamAV virus scan — the file is streamed through a ClamAV daemon for signature-based malware detection; (2) NSFW classifier — a ResNet-based image classifier runs on 10 sampled frames; if any frame exceeds the NSFW threshold, the video is flagged for human review; (3) Audio fingerprinting — the audio track is fingerprinted against a database of copyrighted music (similar to YouTube's Content ID). If the virus scan fails, the upload is rejected immediately. NSFW and copyright flags are recorded but do not block processing — human moderators review flagged content asynchronously.
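
The fan-out/fan-in of the three checks can be sketched with a thread pool; the check bodies are stubs (in practice the virus scan streams the file to a ClamAV daemon, the classifier runs on sampled frames, and the fingerprint lookup hits a separate index), and the rejection/flagging behaviour mirrors the policy described above.

```python
# Parallel safety checks sketch: only a positive virus scan blocks the upload.
from concurrent.futures import ThreadPoolExecutor

def clamav_scan(path: str) -> dict: ...        # stream file through a ClamAV daemon
def nsfw_classify(path: str) -> dict: ...      # classify 10 sampled frames
def audio_fingerprint(path: str) -> dict: ...  # match against copyrighted-audio index

def run_safety_checks(path: str) -> dict:
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(clamav_scan, path),
                   pool.submit(nsfw_classify, path),
                   pool.submit(audio_fingerprint, path)]
        virus, nsfw, copyright_ = (f.result() for f in futures)

    if virus and virus.get("infected"):
        raise ValueError("upload rejected: malware detected")   # hard failure
    # NSFW / copyright results become flags for asynchronous human review, not failures.
    return {"nsfw": nsfw, "copyright": copyright_}
```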

Thumbnail Generator

The Thumbnail Generator extracts frames at 5 configurable timestamps using FFmpeg's -ss (seek) and -vframes 1 flags. Extracted frames are resized to multiple dimensions (160×90 for grid view, 320×180 for cards, 640×360 for preview, 1280×720 for detail page) using libvips for high-performance image processing. Thumbnails are uploaded to S3 with immutable cache-control headers and registered in the metadata database. A sprite sheet (contact sheet of all thumbnails at 160×90) is generated for scrub preview during video playback.
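
A sketch of the extraction step only, invoking FFmpeg with the -ss/-vframes flags described above; the output path layout is illustrative, and the libvips resizing to the other dimensions and the sprite sheet are omitted.

```python
# Extract one 640x360 JPEG at each configured percentage of the video's duration.
import subprocess

def extract_thumbnails(video_path: str, duration_s: float, out_dir: str) -> list[str]:
    outputs = []
    for pct in (10, 30, 50, 70, 90):
        out = f"{out_dir}/thumb_{pct}.jpg"
        subprocess.run(
            [
                "ffmpeg", "-y",
                "-ss", str(duration_s * pct / 100),   # seek before -i for fast keyframe seek
                "-i", video_path,
                "-vframes", "1",                      # grab a single frame
                "-vf", "scale=640:360",               # preview size
                out,
            ],
            check=True,
            capture_output=True,
        )
        outputs.append(out)
    return outputs
```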

Database Design

Upload session data is stored in DynamoDB (upload_id as partition key) with fields: user_id, file_name, file_size, chunk_count, chunks_completed (bitmap stored as base64), status (uploading/assembling/processing/complete/failed), created_at, expires_at, error_message. DynamoDB's TTL feature automatically deletes expired sessions. The table has a GSI on user_id for listing a user's uploads.
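
The two access patterns on this table are a point lookup by upload_id and a listing of a user's uploads via the GSI. A minimal sketch; the table and index names are assumptions.

```python
# Session table access sketch: point lookup by partition key, listing via the user_id GSI.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("upload-sessions")   # assumed table name

def get_session(upload_id: str) -> dict | None:
    return table.get_item(Key={"upload_id": upload_id}).get("Item")

def list_user_uploads(user_id: str) -> list[dict]:
    resp = table.query(
        IndexName="user_id-index",                            # assumed GSI name
        KeyConditionExpression=Key("user_id").eq(user_id),
    )
    return resp["Items"]
```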

Video metadata after processing is stored in PostgreSQL: Videos table (video_id, user_id, title, description, duration_ms, width, height, codec, bitrate_kbps, audio_codec, subtitle_tracks JSON, file_size_bytes, s3_raw_path, status, safety_flags JSON, created_at). Thumbnail URLs are stored in a separate Thumbnails table (video_id, timestamp_pct, s3_urls JSON containing all resolution variants). A Kafka topic video-uploads-completed receives an event for each successfully processed video, consumed by the transcoding pipeline.
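
The registration step's event emission might look like the sketch below, using confluent-kafka; the broker address is an assumption and the payload fields simply mirror a subset of the Videos table columns.

```python
# Emit a video-uploads-completed event for the transcoding pipeline to consume.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka:9092"})   # assumed brokers

def emit_upload_completed(video: dict) -> None:
    event = {
        "video_id": video["video_id"],
        "user_id": video["user_id"],
        "s3_raw_path": video["s3_raw_path"],
        "duration_ms": video["duration_ms"],
        "width": video["width"],
        "height": video["height"],
        "codec": video["codec"],
    }
    producer.produce("video-uploads-completed", key=video["video_id"], value=json.dumps(event))
    producer.flush()   # block until delivered before marking the workflow step complete
```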

API Design

  • POST /api/v1/uploads — Initiate an upload; body contains file_name, file_size, content_type; returns upload_id, chunk_size, chunk_count, presigned_urls[]
  • HEAD /api/v1/uploads/{upload_id} — Check upload progress; returns completed chunk bitmap and status
  • POST /api/v1/uploads/{upload_id}/complete — Signal all chunks uploaded; triggers assembly and processing pipeline
  • GET /api/v1/uploads/{upload_id}/status — Poll processing status; returns current stage, progress percentage, and any safety flags
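
A happy-path client sketch against these endpoints: initiate, PUT chunks directly to S3, then signal completion. The base URL and bearer-token auth are assumptions; on reconnect a real client would first call HEAD /uploads/{upload_id} to find missing chunks and re-upload only those.

```python
# Client-side upload flow sketch using the API above (happy path, no resume logic).
import requests

API = "https://api.example.com/api/v1"   # assumed base URL

def upload_file(path: str, size: int, token: str) -> str:
    headers = {"Authorization": f"Bearer {token}"}
    init = requests.post(f"{API}/uploads", headers=headers, json={
        "file_name": path.rsplit("/", 1)[-1],
        "file_size": size,
        "content_type": "video/mp4",
    }).json()

    with open(path, "rb") as f:
        for i, url in enumerate(init["presigned_urls"]):
            f.seek(i * init["chunk_size"])
            requests.put(url, data=f.read(init["chunk_size"])).raise_for_status()

    requests.post(f"{API}/uploads/{init['upload_id']}/complete", headers=headers).raise_for_status()
    return init["upload_id"]
```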

Scaling & Bottlenecks

The primary bottleneck is S3 throughput during chunk uploads. At 6,000 chunk writes/sec, a single key prefix would be throttled (S3 supports roughly 3,500 PUT/sec per prefix), so chunk keys are distributed across multiple prefixes using hash-based key distribution: s3://uploads/{hash(upload_id)[0:2]}/{upload_id}/chunk_{n}. Two hex characters of the hash spread writes across 256 prefixes, each handling only ~24 writes/sec.
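
The key layout is a one-liner; a sketch of it, with SHA-256 chosen arbitrarily as the hash:

```python
# Hash-based prefix layout: two hex characters spread keys across 256 S3 prefixes.
import hashlib

def chunk_key(upload_id: str, chunk_index: int) -> str:
    prefix = hashlib.sha256(upload_id.encode()).hexdigest()[:2]   # 256 possible prefixes
    return f"uploads/{prefix}/{upload_id}/chunk_{chunk_index}"

# e.g. chunk_key("a1b2c3", 7) -> "uploads/<xx>/a1b2c3/chunk_7", where <xx> depends on the hash
```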

The content safety pipeline is the second bottleneck. ClamAV scanning at 30 seconds per file requires 1,740 concurrent workers for 5M files/day. Running ClamAV as a horizontally scaled service (containerized daemons behind an internal load balancer) with auto-scaling based on queue depth keeps scan latency predictable. The NSFW classifier runs on GPU instances (g5.xlarge) and processes 10 frames in ~2 seconds — much faster than virus scanning, so it is not the bottleneck. The entire processing pipeline is designed to be idempotent: if any step fails, the Orchestrator retries from that step without re-running completed steps.
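
One way to drive the queue-depth-based scaling is to compute the desired ClamAV fleet size from the scan backlog; the queue URL, latency target, and metric choice below are illustrative assumptions, not the article's prescription.

```python
# Desired scan-worker count sketch: backlog divided by how many scans one worker
# can clear within the latency target.
import math
import boto3

sqs = boto3.client("sqs")
SCAN_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/scan-jobs"  # assumed
SCAN_SECONDS = 30        # average scan duration (from the estimates above)
TARGET_LATENCY_S = 120   # assumed: a file should start scanning within 2 minutes

def desired_scan_workers() -> int:
    attrs = sqs.get_queue_attributes(
        QueueUrl=SCAN_QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )["Attributes"]
    backlog = int(attrs["ApproximateNumberOfMessages"])
    scans_per_worker = max(TARGET_LATENCY_S // SCAN_SECONDS, 1)   # scans one worker clears in the window
    return max(1, math.ceil(backlog / scans_per_worker))
```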

Key Trade-offs

  • Direct-to-S3 upload vs proxy through application: Direct upload eliminates the application server as a bandwidth bottleneck (saving 2.5PB/day of proxy traffic) but requires pre-signed URLs and client-side retry logic
  • Chunked upload vs single PUT: Chunks enable resumability and parallel upload (critical for mobile on flaky networks) but add complexity in chunk tracking and assembly — the reliability benefit is non-negotiable for large files
  • Synchronous vs async content safety: Running safety checks synchronously blocks the upload from being available but prevents harmful content from ever being served; async allows faster availability but risks brief exposure — the hybrid approach (block on virus, async on NSFW) balances both
  • FFmpeg for thumbnails vs dedicated image extraction library: FFmpeg is slower for seek-and-extract but handles all container formats universally; a specialized library (e.g., GStreamer) could be faster but requires format-specific plugins
