
System Design: Video Upload & Processing Service

System design of a video upload and processing service covering resumable uploads, virus scanning, content moderation, metadata extraction, and reliable processing for millions of daily uploads.


Requirements

Functional Requirements:

  • Accept video uploads up to 10GB in size via resumable, chunked upload protocol
  • Validate video format, codec, and integrity before processing
  • Scan uploaded content for malware and prohibited material
  • Extract metadata (duration, resolution, codec, bitrate, audio tracks, subtitles)
  • Generate preview thumbnails at multiple timestamps
  • Trigger downstream transcoding pipeline upon successful processing

Non-Functional Requirements:

  • Handle 5 million uploads per day (60 uploads/sec average, 200/sec peak)
  • Resumable upload: clients can resume after network failure without re-uploading completed chunks
  • End-to-end processing (upload complete → ready for transcoding) in under 5 minutes for a 1GB file
  • Zero data loss: every successfully acknowledged upload must be durably stored
  • 99.9% upload success rate (retries included)

Scale Estimation

5M uploads/day at an average of 500MB per file = 2.5PB/day of raw upload data. At 60 uploads/sec with chunked upload (5MB chunks, ~100 chunks per file), that is ~6,000 chunk writes/sec to object storage. Metadata extraction: 5M ffprobe operations/day ≈ 58 ops/sec. Virus scanning: 5M scans/day at an average of 30 seconds/scan ≈ 1,740 concurrent scan workers needed. Thumbnail generation: 5M × 5 thumbnails = 25M thumbnails/day. Total object storage writes (assembled videos + thumbnails + metadata objects): roughly 30M objects/day.
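
These figures can be reproduced with a short back-of-envelope script; the averages (500MB files, 5MB chunks, 30-second scans, 5 thumbnails per video) are the assumptions stated above, not measured values.

```python
# Back-of-envelope reproduction of the scale estimates above (assumed averages).
SECONDS_PER_DAY = 86_400

uploads_per_day = 5_000_000
avg_size_mb = 500
chunk_size_mb = 5
scan_seconds = 30
thumbs_per_video = 5

raw_pb_per_day = uploads_per_day * avg_size_mb / 1e9                      # MB -> PB
chunk_writes_per_sec = (uploads_per_day / SECONDS_PER_DAY) * (avg_size_mb / chunk_size_mb)
ffprobe_ops_per_sec = uploads_per_day / SECONDS_PER_DAY
scan_workers = uploads_per_day * scan_seconds / SECONDS_PER_DAY
thumbnails_per_day = uploads_per_day * thumbs_per_video

print(f"{raw_pb_per_day:.1f} PB/day raw upload data")         # ~2.5
print(f"{chunk_writes_per_sec:.0f} chunk PUTs/sec")            # ~5,800 (rounded to 6,000 above)
print(f"{ffprobe_ops_per_sec:.0f} ffprobe ops/sec")            # ~58
print(f"{scan_workers:.0f} concurrent scan workers")           # ~1,740
print(f"{thumbnails_per_day / 1e6:.0f}M thumbnails/day")       # 25M
```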

High-Level Architecture

The upload service is divided into three layers: the Upload Layer, the Processing Layer, and the Notification Layer. The Upload Layer exposes a chunked upload API (compatible with the tus resumable upload protocol). The client initiates an upload session, receives an upload_id and a list of pre-signed S3 URLs (one per chunk). The client uploads chunks in parallel (up to 5 concurrent) directly to S3, bypassing the application server for data transfer. As each chunk completes, S3 sends an event notification to an SQS queue. A Chunk Tracker Service consumes these events and maintains chunk completion state in Redis (a bitmap of completed chunks per upload_id).
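
The Chunk Tracker's core loop is small. A minimal sketch, assuming the S3 event notifications land in an SQS queue, chunk keys follow the prefix layout described under Scaling & Bottlenecks, and the expected chunk count was written to Redis when the upload session was created; the queue URL and the start_processing hand-off are illustrative stubs, not part of any real deployment.

```python
# Chunk Tracker sketch: consume S3 "chunk uploaded" events from SQS and record
# completion in a per-upload Redis bitmap. All names are illustrative.
import json
import boto3
import redis

sqs = boto3.client("sqs")
r = redis.Redis()
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/chunk-events"  # assumed

def start_processing(upload_id: str) -> None:
    ...  # hand off to the Processing Orchestrator (stubbed here)

def handle_chunk_event(s3_key: str) -> None:
    # Keys look like uploads/{prefix}/{upload_id}/chunk_{n} (see Scaling & Bottlenecks).
    _, _, upload_id, chunk_name = s3_key.rsplit("/", 3)
    chunk_index = int(chunk_name.removeprefix("chunk_"))

    r.setbit(f"upload:{upload_id}:chunks", chunk_index, 1)        # mark chunk received

    expected = int(r.get(f"upload:{upload_id}:chunk_count") or 0)
    if expected and r.bitcount(f"upload:{upload_id}:chunks") == expected:
        start_processing(upload_id)                                # all chunks present

def poll_forever() -> None:
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            for record in json.loads(msg["Body"]).get("Records", []):
                handle_chunk_event(record["s3"]["object"]["key"])
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```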

Once all chunks are received, the Chunk Tracker triggers the Processing Layer. A Processing Orchestrator (Step Functions or Temporal) runs a workflow: (1) Assemble — S3 multipart upload complete API merges chunks into the final object; (2) Validate — ffprobe extracts codec info and verifies the file is a valid video container; (3) Scan — the file is sent to a virus/malware scanner (ClamAV cluster) and a content safety classifier (ML model for NSFW detection); (4) Metadata — extract duration, resolution, bitrate, audio/subtitle tracks; (5) Thumbnails — FFmpeg extracts frames at 10%, 30%, 50%, 70%, 90% of duration and generates 640×360 JPEGs; (6) Register — write metadata to the Video Metadata DB and emit an event to trigger downstream transcoding.
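
Independent of the orchestration engine, the workflow is a linear sequence of retryable steps. A minimal sketch of that shape, with step bodies stubbed and the completion ledger kept in memory for brevity (a real deployment would rely on Step Functions or Temporal to persist per-step state):

```python
# Six-step processing workflow sketch: ordering plus the idempotency contract
# (a failed run resumes without redoing completed steps). Step bodies are stubs.
_done: set[tuple[str, str]] = set()   # stand-in for the workflow engine's state

def assemble_chunks(upload_id: str): ...      # S3 multipart-upload completion
def validate_container(upload_id: str): ...   # ffprobe: valid video container?
def run_safety_checks(upload_id: str): ...    # ClamAV + NSFW classifier + audio fingerprint
def extract_metadata(upload_id: str): ...     # duration, resolution, bitrate, tracks
def generate_thumbnails(upload_id: str): ...  # frames at 10/30/50/70/90% of duration
def register_video(upload_id: str): ...       # metadata DB row + downstream event

STEPS = [assemble_chunks, validate_container, run_safety_checks,
         extract_metadata, generate_thumbnails, register_video]

def process_upload(upload_id: str) -> None:
    for step in STEPS:
        if (upload_id, step.__name__) in _done:
            continue                           # resume after a failure: skip completed steps
        step(upload_id)
        _done.add((upload_id, step.__name__))
```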

The Notification Layer provides upload progress and processing status. A WebSocket connection or SSE stream pushes real-time progress (chunk upload %, processing stage) to the client. Alternatively, the client can poll a status endpoint.
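
For the SSE variant, a minimal sketch, assuming the upload and processing services write the current stage and percentage into a Redis hash; the framework (FastAPI), the endpoint path, and the key names are illustrative and are not part of the API defined below.

```python
# SSE progress stream sketch: poll a Redis hash and push updates to the client.
import asyncio
import json
import redis.asyncio as aioredis
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
r = aioredis.Redis()

@app.get("/api/v1/uploads/{upload_id}/events")   # illustrative path
async def upload_events(upload_id: str):
    async def stream():
        while True:
            state = await r.hgetall(f"upload:{upload_id}:progress")  # e.g. {stage, pct}
            payload = {k.decode(): v.decode() for k, v in state.items()}
            yield f"data: {json.dumps(payload)}\n\n"
            if state.get(b"stage") == b"complete":
                break
            await asyncio.sleep(1)
    return StreamingResponse(stream(), media_type="text/event-stream")
```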

Core Components

Resumable Upload Manager

The upload protocol follows the tus specification. On POST /uploads, the server creates an upload_id, computes the chunk plan (file_size / 5MB = N chunks), stores the upload metadata in DynamoDB (upload_id, user_id, file_name, file_size, chunk_count, status, created_at, expires_at), and returns pre-signed S3 URLs for each chunk. The client uploads chunks via PUT to S3 directly. If the client disconnects, it can call HEAD /uploads/{upload_id} to get the bitmap of completed chunks and resume only the missing ones. Uploads expire after 24 hours; a cleanup Lambda deletes orphaned chunks from S3.
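
A sketch of the initiation handler under these assumptions: chunks are written as separate objects via pre-signed put_object URLs and assembled server-side, the session table is DynamoDB with a TTL on expires_at, and the bucket/table names are made up for illustration.

```python
# Upload initiation sketch: chunk plan, session record, and per-chunk pre-signed URLs.
import math
import time
import uuid
import boto3

s3 = boto3.client("s3")
sessions = boto3.resource("dynamodb").Table("upload-sessions")   # assumed table name

CHUNK_SIZE = 5 * 1024 * 1024           # 5MB chunks
UPLOAD_BUCKET = "video-upload-chunks"  # assumed bucket
URL_TTL_S = 24 * 3600                  # uploads expire after 24 hours

def initiate_upload(user_id: str, file_name: str, file_size: int) -> dict:
    upload_id = uuid.uuid4().hex
    chunk_count = math.ceil(file_size / CHUNK_SIZE)
    prefix = upload_id[:2]             # two hex chars; see the hash-based layout under Scaling

    sessions.put_item(Item={
        "upload_id": upload_id, "user_id": user_id, "file_name": file_name,
        "file_size": file_size, "chunk_count": chunk_count, "status": "uploading",
        "created_at": int(time.time()),
        "expires_at": int(time.time()) + URL_TTL_S,   # DynamoDB TTL attribute
    })

    urls = [
        s3.generate_presigned_url(
            "put_object",
            Params={"Bucket": UPLOAD_BUCKET, "Key": f"uploads/{prefix}/{upload_id}/chunk_{i}"},
            ExpiresIn=URL_TTL_S,
        )
        for i in range(chunk_count)
    ]
    return {"upload_id": upload_id, "chunk_size": CHUNK_SIZE,
            "chunk_count": chunk_count, "presigned_urls": urls}
```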

Content Safety Pipeline

Every upload passes through a content safety pipeline before being accepted. The pipeline runs three checks in parallel: (1) ClamAV virus scan — the file is streamed through a ClamAV daemon for signature-based malware detection; (2) NSFW classifier — a ResNet-based image classifier runs on 10 sampled frames; if any frame exceeds the NSFW threshold, the video is flagged for human review; (3) Audio fingerprinting — the audio track is fingerprinted against a database of copyrighted music (similar to YouTube's Content ID). If the virus scan fails, the upload is rejected immediately. NSFW and copyright flags are recorded but do not block processing — human moderators review flagged content asynchronously.
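
The fan-out/fan-in of the three checks can be sketched with a thread pool; the check bodies are stubs (in practice the virus scan streams the file to a ClamAV daemon, the classifier runs on sampled frames, and the fingerprint lookup hits a separate index), and the rejection/flagging behaviour mirrors the policy described above.

```python
# Parallel safety checks sketch: only a positive virus scan blocks the upload.
from concurrent.futures import ThreadPoolExecutor

def clamav_scan(path: str) -> dict: ...        # stream file through a ClamAV daemon
def nsfw_classify(path: str) -> dict: ...      # classify 10 sampled frames
def audio_fingerprint(path: str) -> dict: ...  # match against copyrighted-audio index

def run_safety_checks(path: str) -> dict:
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(clamav_scan, path),
                   pool.submit(nsfw_classify, path),
                   pool.submit(audio_fingerprint, path)]
        virus, nsfw, copyright_ = (f.result() for f in futures)

    if virus and virus.get("infected"):
        raise ValueError("upload rejected: malware detected")   # hard failure
    # NSFW / copyright results become flags for asynchronous human review, not failures.
    return {"nsfw": nsfw, "copyright": copyright_}
```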

Thumbnail Generator

The Thumbnail Generator extracts frames at 5 configurable timestamps using FFmpeg's -ss (seek) and -vframes 1 flags. Extracted frames are resized to multiple dimensions (160×90 for grid view, 320×180 for cards, 640×360 for preview, 1280×720 for detail page) using libvips for high-performance image processing. Thumbnails are uploaded to S3 with immutable cache-control headers and registered in the metadata database. A sprite sheet (contact sheet of all thumbnails at 160×90) is generated for scrub preview during video playback.
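
A sketch of the extraction step only, invoking FFmpeg with the -ss/-vframes flags described above; the output path layout is illustrative, and the libvips resizing to the other dimensions and the sprite sheet are omitted.

```python
# Extract one 640x360 JPEG at each configured percentage of the video's duration.
import subprocess

def extract_thumbnails(video_path: str, duration_s: float, out_dir: str) -> list[str]:
    outputs = []
    for pct in (10, 30, 50, 70, 90):
        out = f"{out_dir}/thumb_{pct}.jpg"
        subprocess.run(
            [
                "ffmpeg", "-y",
                "-ss", str(duration_s * pct / 100),   # seek before -i for fast keyframe seek
                "-i", video_path,
                "-vframes", "1",                      # grab a single frame
                "-vf", "scale=640:360",               # preview size
                out,
            ],
            check=True,
            capture_output=True,
        )
        outputs.append(out)
    return outputs
```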

Database Design

Upload session data is stored in DynamoDB (upload_id as partition key) with fields: user_id, file_name, file_size, chunk_count, chunks_completed (bitmap stored as base64), status (uploading/assembling/processing/complete/failed), created_at, expires_at, error_message. DynamoDB's TTL feature automatically deletes expired sessions. The table has a GSI on user_id for listing a user's uploads.
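
The two access patterns on this table are a point lookup by upload_id and a listing of a user's uploads via the GSI. A minimal sketch; the table and index names are assumptions.

```python
# Session table access sketch: point lookup by partition key, listing via the user_id GSI.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("upload-sessions")   # assumed table name

def get_session(upload_id: str) -> dict | None:
    return table.get_item(Key={"upload_id": upload_id}).get("Item")

def list_user_uploads(user_id: str) -> list[dict]:
    resp = table.query(
        IndexName="user_id-index",                            # assumed GSI name
        KeyConditionExpression=Key("user_id").eq(user_id),
    )
    return resp["Items"]
```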

Video metadata after processing is stored in PostgreSQL: Videos table (video_id, user_id, title, description, duration_ms, width, height, codec, bitrate_kbps, audio_codec, subtitle_tracks JSON, file_size_bytes, s3_raw_path, status, safety_flags JSON, created_at). Thumbnail URLs are stored in a separate Thumbnails table (video_id, timestamp_pct, s3_urls JSON containing all resolution variants). A Kafka topic video-uploads-completed receives an event for each successfully processed video, consumed by the transcoding pipeline.
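
The registration step's event emission might look like the sketch below, using confluent-kafka; the broker address is an assumption and the payload fields simply mirror a subset of the Videos table columns.

```python
# Emit a video-uploads-completed event for the transcoding pipeline to consume.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka:9092"})   # assumed brokers

def emit_upload_completed(video: dict) -> None:
    event = {
        "video_id": video["video_id"],
        "user_id": video["user_id"],
        "s3_raw_path": video["s3_raw_path"],
        "duration_ms": video["duration_ms"],
        "width": video["width"],
        "height": video["height"],
        "codec": video["codec"],
    }
    producer.produce("video-uploads-completed", key=video["video_id"], value=json.dumps(event))
    producer.flush()   # block until delivered before marking the workflow step complete
```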

API Design

  • POST /api/v1/uploads — Initiate an upload; body contains file_name, file_size, content_type; returns upload_id, chunk_size, chunk_count, presigned_urls[]
  • HEAD /api/v1/uploads/{upload_id} — Check upload progress; returns completed chunk bitmap and status
  • POST /api/v1/uploads/{upload_id}/complete — Signal all chunks uploaded; triggers assembly and processing pipeline
  • GET /api/v1/uploads/{upload_id}/status — Poll processing status; returns current stage, progress percentage, and any safety flags
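
A happy-path client sketch against these endpoints: initiate, PUT chunks directly to S3, then signal completion. The base URL and bearer-token auth are assumptions; on reconnect a real client would first call HEAD /uploads/{upload_id} to find missing chunks and re-upload only those.

```python
# Client-side upload flow sketch using the API above (happy path, no resume logic).
import requests

API = "https://api.example.com/api/v1"   # assumed base URL

def upload_file(path: str, size: int, token: str) -> str:
    headers = {"Authorization": f"Bearer {token}"}
    init = requests.post(f"{API}/uploads", headers=headers, json={
        "file_name": path.rsplit("/", 1)[-1],
        "file_size": size,
        "content_type": "video/mp4",
    }).json()

    with open(path, "rb") as f:
        for i, url in enumerate(init["presigned_urls"]):
            f.seek(i * init["chunk_size"])
            requests.put(url, data=f.read(init["chunk_size"])).raise_for_status()

    requests.post(f"{API}/uploads/{init['upload_id']}/complete", headers=headers).raise_for_status()
    return init["upload_id"]
```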

Scaling & Bottlenecks

The primary bottleneck is S3 throughput during chunk uploads. At 6,000 chunk writes/sec, a single key prefix would be throttled (S3 supports roughly 3,500 PUT/sec per prefix), so chunk keys are distributed across multiple prefixes using hash-based key distribution: s3://uploads/{hash(upload_id)[0:2]}/{upload_id}/chunk_{n}. Two hex characters of the hash spread writes across 256 prefixes, each handling only ~24 writes/sec.
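
The key layout is a one-liner; a sketch of it, with SHA-256 chosen arbitrarily as the hash:

```python
# Hash-based prefix layout: two hex characters spread keys across 256 S3 prefixes.
import hashlib

def chunk_key(upload_id: str, chunk_index: int) -> str:
    prefix = hashlib.sha256(upload_id.encode()).hexdigest()[:2]   # 256 possible prefixes
    return f"uploads/{prefix}/{upload_id}/chunk_{chunk_index}"

# e.g. chunk_key("a1b2c3", 7) -> "uploads/<xx>/a1b2c3/chunk_7", where <xx> depends on the hash
```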

The content safety pipeline is the second bottleneck. ClamAV scanning at 30 seconds per file requires 1,740 concurrent workers for 5M files/day. Running ClamAV as a horizontally scaled service (containerized daemons behind an internal load balancer) with auto-scaling based on queue depth keeps scan latency predictable. The NSFW classifier runs on GPU instances (g5.xlarge) and processes 10 frames in ~2 seconds — much faster than virus scanning, so it is not the bottleneck. The entire processing pipeline is designed to be idempotent: if any step fails, the Orchestrator retries from that step without re-running completed steps.
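
One way to drive the queue-depth-based scaling is to compute the desired ClamAV fleet size from the scan backlog; the queue URL, latency target, and metric choice below are illustrative assumptions, not the article's prescription.

```python
# Desired scan-worker count sketch: backlog divided by how many scans one worker
# can clear within the latency target.
import math
import boto3

sqs = boto3.client("sqs")
SCAN_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/scan-jobs"  # assumed
SCAN_SECONDS = 30        # average scan duration (from the estimates above)
TARGET_LATENCY_S = 120   # assumed: a file should start scanning within 2 minutes

def desired_scan_workers() -> int:
    attrs = sqs.get_queue_attributes(
        QueueUrl=SCAN_QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )["Attributes"]
    backlog = int(attrs["ApproximateNumberOfMessages"])
    scans_per_worker = max(TARGET_LATENCY_S // SCAN_SECONDS, 1)   # scans one worker clears in the window
    return max(1, math.ceil(backlog / scans_per_worker))
```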

Key Trade-offs

  • Direct-to-S3 upload vs proxy through application: Direct upload eliminates the application server as a bandwidth bottleneck (saving 2.5PB/day of proxy traffic) but requires pre-signed URLs and client-side retry logic
  • Chunked upload vs single PUT: Chunks enable resumability and parallel upload (critical for mobile on flaky networks) but add complexity in chunk tracking and assembly — the reliability benefit is non-negotiable for large files
  • Synchronous vs async content safety: Running safety checks synchronously blocks the upload from being available but prevents harmful content from ever being served; async allows faster availability but risks brief exposure — the hybrid approach (block on virus, async on NSFW) balances both
  • FFmpeg for thumbnails vs dedicated image extraction library: FFmpeg is slower for seek-and-extract but handles all container formats universally; a specialized library (e.g., GStreamer) could be faster but requires format-specific plugins
