System Design: DDoS Protection System

Requirements

Functional Requirements:

Detect and mitigate volumetric DDoS attacks (UDP flood, ICMP flood, amplification attacks) at the network edge
Defend against protocol attacks (SYN flood, ACK flood, fragmented packet attacks) without blocking legitimate traffic
Detect and block application-layer attacks (HTTP flood, Slowloris, credential stuffing disguised as normal traffic)
Provide an emergency mode: block all traffic from attacking IP ranges within 30 seconds of attack detection
Allow legitimate traffic to pass with less than 5ms additional latency during active mitigation
Provide a self-service rule management interface for customers to define custom block/challenge rules

Non-Functional Requirements:

Absorb volumetric attacks of up to 1 Tbps without infrastructure saturation
Mitigation activation within 10 seconds of attack detection
False positive rate (blocking legitimate users) under 0.01%
99.999% availability for the mitigation plane
Traffic scrubbing adds less than 5ms latency for clean traffic

Scale Estimation

A 1 Tbps UDP flood of 100-byte packets is 1.25 billion packets/second. No single data center can absorb this; traffic must be distributed across a global anycast network with 20+ PoPs (Points of Presence), each absorbing 50 Gbps if the load spreads evenly. Normal (non-attack) traffic is 500 Gbps globally, or 25 Gbps per PoP. Attack detection must analyze flow telemetry (NetFlow/IPFIX) at 10 million flows/second across the entire network.
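The arithmetic behind these figures can be checked with a short sketch. The constants are taken from the paragraph above; the even split across PoPs is an assumption, since real anycast traffic rarely distributes uniformly.

```python
# Back-of-envelope check of the scale figures (even distribution assumed).
ATTACK_BPS = 10**12            # 1 Tbps attack
PACKET_BYTES = 100             # packet size used in the estimate
NORMAL_BPS = 500 * 10**9       # 500 Gbps of normal global traffic
POPS = 20                      # minimum PoP count

packets_per_second = ATTACK_BPS // 8 // PACKET_BYTES
per_pop_attack_gbps = ATTACK_BPS // POPS // 10**9
per_pop_normal_gbps = NORMAL_BPS // POPS // 10**9

print(f"{packets_per_second:,} packets/s")
print(f"{per_pop_attack_gbps} Gbps attack traffic per PoP")
print(f"{per_pop_normal_gbps} Gbps normal traffic per PoP")
```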

High-Level Architecture

DDoS protection operates at three layers: L3/L4 (network/transport), L7 (application), and analytics (attack detection and policy coordination). At L3/L4, anycast routing distributes attack traffic across a global scrubbing network. At L7, a reverse-proxy cluster runs behavioral analysis and challenge-response mechanisms. The analytics layer processes flow telemetry and application logs to detect attack patterns and push mitigation rules.

Anycast routing: every PoP announces the same IP prefix via BGP on behalf of the protected origin servers. Attackers send traffic to this IP; BGP delivers it to the topologically nearest PoP, as determined by the routing and peering of the attacker's ISP. Each PoP performs L3/L4 filtering (block known bad IP ranges, rate-limit per-source IP) and forwards clean traffic via GRE tunnels or direct peering to the origin servers. During a large attack, BGP communities signal upstream ISPs to apply Remotely Triggered Black Hole (RTBH) routing for the most abusive source ranges (source-based RTBH, which requires upstream support).

Application-layer protection runs on the reverse-proxy layer. A traffic scoring engine assigns each request a risk score based on: IP reputation (threat intelligence feeds), header anomalies (missing common browser headers, invalid User-Agent), TLS fingerprint (JA3 hash matching known attack toolkits), request rate per IP, and behavioral signals (request patterns that match attack signatures). High-risk requests receive a challenge (a CAPTCHA or an invisible JavaScript proof-of-work); medium-risk requests are rate limited; low-risk requests pass through normally.
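One way to picture the scoring engine is a weighted sum mapped to three action tiers. The sketch below is illustrative only: the field names, weights, and thresholds are hypothetical and are not specified by the design.

```python
from dataclasses import dataclass


@dataclass
class RequestSignals:
    ip_reputation: float     # 0.0 (clean) to 1.0 (known bad), from a threat feed
    header_anomaly: bool     # e.g. missing common browser headers
    ja3_known_bad: bool      # TLS fingerprint matches a known attack toolkit
    rate_ratio: float        # request rate divided by the baseline rate
    signature_match: bool    # request pattern matches an attack signature


def risk_score(s: RequestSignals) -> float:
    """Combine signals into a 0..1 score using hypothetical weights."""
    rate_component = min(s.rate_ratio / 10.0, 1.0)  # saturates at 10x baseline
    return (0.35 * s.ip_reputation
            + 0.15 * s.header_anomaly
            + 0.20 * s.ja3_known_bad
            + 0.15 * rate_component
            + 0.15 * s.signature_match)


def action(score: float) -> str:
    """Map a score to a tier; the thresholds are assumptions."""
    if score >= 0.7:
        return "challenge"
    if score >= 0.4:
        return "rate_limit"
    return "allow"


browser = RequestSignals(0.05, False, False, 1.0, False)
flood_tool = RequestSignals(0.9, True, True, 50.0, True)
print(action(risk_score(browser)), action(risk_score(flood_tool)))
```

In practice the weights would be tuned against labeled traffic rather than fixed by hand.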

Core Components

NetFlow-Based Attack Detection

All edge routers export NetFlow/sFlow records (sampled 1:1000) to a central collector. An Apache Flink job processes flow records in real time, computing: per-source-IP packet rate, per-destination-IP packet rate, protocol distribution, and packet size distribution. Because the records are sampled, observed byte and packet counts are multiplied by the sampling rate to estimate actual volume. A threshold detector fires when any destination receives more than 10 Gbps from a single source IP or more than 100 Gbps in aggregate. An ML anomaly detector (trained on historical traffic patterns) identifies unusual protocol distributions indicating amplification attacks (DNS, NTP, SSDP reflection).
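The threshold detector can be sketched in plain Python, with a list of tuples standing in for the Flink stream. The record shape (source, destination, sampled bytes) and the function name are assumptions; sampled byte counts are scaled by the 1:1000 sampling rate before comparison with the thresholds.

```python
from collections import defaultdict

SAMPLING_RATE = 1000                 # 1:1000 sampling
PER_SOURCE_BPS = 10 * 10**9          # 10 Gbps from a single source
AGGREGATE_BPS = 100 * 10**9          # 100 Gbps to one destination


def detect(flows, window_seconds):
    """flows: iterable of (src_ip, dst_ip, sampled_bytes) seen in one window."""
    per_pair = defaultdict(int)
    per_dst = defaultdict(int)
    for src, dst, sampled_bytes in flows:
        estimated_bits = sampled_bytes * SAMPLING_RATE * 8  # undo sampling
        per_pair[(src, dst)] += estimated_bits
        per_dst[dst] += estimated_bits

    alerts = []
    for (src, dst), bits in per_pair.items():
        if bits / window_seconds > PER_SOURCE_BPS:
            alerts.append(("single_source", dst, src))
    for dst, bits in per_dst.items():
        if bits / window_seconds > AGGREGATE_BPS:
            alerts.append(("aggregate", dst, None))
    return alerts


# One source sending an estimated 16 Gbps to one destination in a 1 s window.
print(detect([("198.51.100.7", "203.0.113.1", 2_000_000)], window_seconds=1))
```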

L3/L4 Scrubbing Pipeline

Upon attack detection, the mitigation controller pushes mitigation policy to edge routers via NETCONF/gRPC. The policy includes ACLs (Access Control Lists) that drop all traffic from attacking source IPs/CIDRs, rate limits that cap UDP traffic at 10% of its normal level, and SYN cookies for TCP connections (which validate the TCP handshake without allocating state, preventing SYN flood state exhaustion). With SYN cookies, the server's initial sequence number in the SYN-ACK is a keyed cryptographic function of the source/destination IPs and ports plus a coarse time counter; the client's final ACK must echo that number plus one, so the server can verify the ACK without storing half-open connection state.
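The SYN cookie idea can be illustrated with a simplified sketch. It is not the kernel algorithm: production implementations also encode the MSS into the cookie and use a different bit layout. Here an HMAC over the connection 4-tuple and a coarse time counter stands in for the cryptographic function, and the secret and function names are hypothetical.

```python
import hashlib
import hmac

SECRET = b"example-secret"   # hypothetical server-side key, rotated in practice


def syn_cookie(src_ip, src_port, dst_ip, dst_port, counter, secret=SECRET):
    """Derive a 32-bit initial sequence number from the 4-tuple and a counter."""
    msg = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}|{counter}".encode()
    digest = hmac.new(secret, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")


def ack_is_valid(ack_number, src_ip, src_port, dst_ip, dst_port,
                 now_counter, secret=SECRET):
    """Recompute the cookie instead of looking up stored half-open state."""
    claimed_isn = (ack_number - 1) % 2**32   # the ACK acknowledges ISN + 1
    # Accept the current and previous tick so cookies survive a counter rollover.
    return any(
        claimed_isn == syn_cookie(src_ip, src_port, dst_ip, dst_port, c, secret)
        for c in (now_counter, now_counter - 1)
    )


isn = syn_cookie("198.51.100.7", 40000, "203.0.113.1", 443, counter=100)
ack = (isn + 1) % 2**32
print(ack_is_valid(ack, "198.51.100.7", 40000, "203.0.113.1", 443, now_counter=100))
```

The server keeps no per-connection memory between the SYN and the ACK, which is why a flood of spoofed SYNs cannot exhaust its connection table.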

Application-Layer Behavioral Analysis

The behavioral analysis engine maintains per-IP request counters in Redis (sliding windows: 1s, 10s, 60s, 600s). A request scoring function combines: IP reputation score (external threat feed lookup), request rate deviation from baseline (Z-score), TLS fingerprint mismatch (JA3 hash not in legitimate browser list), and request pattern similarity (edit distance from known attack patterns). Challenge mechanisms: with Proof-of-Work, the client must find a nonce whose hash has N leading zero bits, costing 100–500ms of CPU time. That cost is barely noticeable to a human user but makes automated HTTP flood tools expensive to run at high request rates.
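A minimal sketch of the proof-of-work challenge follows, assuming SHA-256 and an 8-byte nonce appended to a server-issued challenge (both are illustrative choices, not details from the design). Solving takes about 2^N hash attempts on average, while verification is a single hash.

```python
import hashlib
import itertools


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits before the first set bit of the digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    """Server side: one hash to check the client's answer."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: bytes, difficulty: int) -> int:
    """Client side: brute-force search for a valid nonce."""
    for nonce in itertools.count():
        if verify(challenge, nonce, difficulty):
            return nonce


challenge = b"server-issued-random-token"   # hypothetical challenge value
nonce = solve(challenge, difficulty=12)     # low difficulty keeps the demo fast
print(verify(challenge, nonce, difficulty=12))
```

The difficulty would be tuned so that a typical browser needs roughly the 100–500ms mentioned above.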

Database Design

Redis Cluster for rate limiting: ratelimit:{ip}:{window} → counter.
Redis for IP reputation cache: iprep:{ip} → risk_score (TTL 1 hour).
PostgreSQL for mitigation rules: rules (rule_id, type ENUM(BLOCK_IP, CHALLENGE, RATE_LIMIT), criteria JSON, action JSON, created_by, expires_at, is_active).
TimescaleDB for attack telemetry: (ts TIMESTAMP, source_ip INET, dest_ip INET, protocol INT, pps BIGINT, bps BIGINT, attack_type VARCHAR), partitioned by hour.
S3 for raw flow archives (compressed IPFIX), retained 90 days.
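The ratelimit:{ip}:{window} key scheme can be illustrated with an in-memory stand-in. The sketch below uses a dict of deques instead of Redis and keeps a timestamp log per key; the class and method names are hypothetical, and a real deployment would choose a Redis data structure to match.

```python
from collections import defaultdict, deque

WINDOWS = (1, 10, 60, 600)   # window lengths in seconds


class SlidingCounters:
    """In-memory stand-in for the per-IP sliding-window counters."""

    def __init__(self):
        self._events = defaultdict(deque)   # key -> request timestamps

    def record(self, ip, now):
        """Record one request and return the count in each window."""
        counts = {}
        for window in WINDOWS:
            key = f"ratelimit:{ip}:{window}"
            events = self._events[key]
            events.append(now)
            # Evict timestamps that have slid out of this window.
            while events and events[0] <= now - window:
                events.popleft()
            counts[window] = len(events)
        return counts


counters = SlidingCounters()
for t in (0.0, 0.5, 1.2):
    latest = counters.record("192.0.2.1", now=t)
print(latest)   # {1: 2, 10: 3, 60: 3, 600: 3}
```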

API Design

POST /rules: Create a custom mitigation rule (block IP CIDR, rate-limit endpoint, challenge country).
GET /attacks/active: Return currently active attacks with type, volume, source distribution, and mitigation status.
POST /attacks/{attack_id}/escalate: Escalate mitigation to emergency mode (RTBH for attacking prefixes).
GET /traffic/live: Real-time traffic dashboard showing clean vs. attack traffic volumes by PoP and protocol.
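Because POST /rules accepts customer-supplied input, the server should validate it before storing a rule. The sketch below shows one possible validation step; the request body shape (type, criteria.cidr, ttl_seconds) is hypothetical, with only the rule types taken from the rules table.

```python
import ipaddress

RULE_TYPES = {"BLOCK_IP", "CHALLENGE", "RATE_LIMIT"}   # from the rules table


def validate_rule(body: dict) -> dict:
    """Validate a hypothetical POST /rules body and return a normalized copy."""
    rule_type = body.get("type")
    if rule_type not in RULE_TYPES:
        raise ValueError(f"unknown rule type: {rule_type!r}")

    criteria = dict(body.get("criteria", {}))
    if rule_type == "BLOCK_IP":
        if "cidr" not in criteria:
            raise ValueError("BLOCK_IP rules need criteria.cidr")
        # strict=True rejects CIDRs with host bits set, e.g. 203.0.113.5/24.
        criteria["cidr"] = str(ipaddress.ip_network(criteria["cidr"], strict=True))

    ttl_seconds = int(body.get("ttl_seconds", 3600))   # assumed default expiry
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")

    return {"type": rule_type, "criteria": criteria, "ttl_seconds": ttl_seconds}


rule = validate_rule({"type": "BLOCK_IP", "criteria": {"cidr": "203.0.113.0/24"}})
print(rule)
```

Rejecting malformed CIDRs up front avoids pushing an ACL that blocks a wider range than the customer intended.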

Scaling & Bottlenecks

The scrubbing network's aggregate capacity (1 Tbps) can be exceeded by nation-state level attacks (10+ Tbps documented). Upstream ISP partnerships for pre-scrubbing (Tier-1 ISP drops attack traffic before it reaches the scrubbing network) extend effective capacity. Anycast distribution limits per-PoP attack volume; BGP traffic engineering can redistribute attack load from saturated PoPs to underloaded ones within 60 seconds.

Application-layer detection latency: analyzing each request's behavioral signals must complete within the 5ms budget. Local Redis lookups (0.5ms), in-process IP reputation cache (0.1ms), and JA3 fingerprint comparison (0.1ms) add up to about 0.7ms, well within budget. Machine learning-based detection (requiring feature assembly and model inference) runs asynchronously: the first request from an IP is allowed through while the risk score is being computed; subsequent requests use the computed score, which is expected to be available within about 100ms.
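The asynchronous scoring path might look like the following asyncio sketch. The class name, the challenge threshold, and the stand-in model are assumptions; the point is that decide() never waits on the model, so the first request from an unseen IP passes while later ones see the cached score.

```python
import asyncio


class AsyncScorer:
    """Serve requests from cached scores; compute missing scores in the background."""

    def __init__(self, model):
        self._model = model      # async callable: ip -> risk score (hypothetical)
        self._scores = {}        # ip -> computed score
        self._pending = set()    # IPs with a scoring task in flight
        self._tasks = set()      # strong references keep tasks alive

    async def _compute(self, ip):
        self._scores[ip] = await self._model(ip)
        self._pending.discard(ip)

    def decide(self, ip, challenge_threshold=0.7):
        """Must be called from inside a running event loop; never blocks."""
        score = self._scores.get(ip)
        if score is None:
            if ip not in self._pending:
                self._pending.add(ip)
                task = asyncio.get_running_loop().create_task(self._compute(ip))
                self._tasks.add(task)
                task.add_done_callback(self._tasks.discard)
            return "allow"       # no score yet, so let the request through
        return "challenge" if score >= challenge_threshold else "allow"


async def demo():
    async def slow_model(ip):
        await asyncio.sleep(0.01)    # stands in for feature assembly + inference
        return 0.9

    scorer = AsyncScorer(slow_model)
    first = scorer.decide("198.51.100.9")
    await asyncio.sleep(0.05)        # the score becomes available meanwhile
    second = scorer.decide("198.51.100.9")
    return first, second


print(asyncio.run(demo()))
```

The trade-off is that an attacker's first few requests from a fresh IP are not scored by the model, so the cheap synchronous checks still have to catch the obvious cases.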

Key Trade-offs

Anycast distribution vs. centralized scrubbing: Anycast distributes attack traffic globally and brings mitigation closer to the attack source; centralized scrubbing simplifies rule management but creates a single absorption bottleneck.
Rate limiting vs. behavioral challenge: Rate limiting is simple and effective against volume attacks but produces false positives when many legitimate users share one IP (NAT, corporate proxies); behavioral challenges (JS, CAPTCHA) distinguish humans from bots more accurately but add latency and friction for legitimate users.
Aggressive vs. conservative blocking: Aggressive blocking (block entire ASNs, country blocks) stops attacks faster but risks blocking legitimate users in those regions; conservative blocking minimizes false positives but allows some attack traffic through during signature learning.
On-premises vs. cloud DDoS protection: Cloud-based DDoS protection (Cloudflare, AWS Shield) provides massive capacity and zero infrastructure management but routes all traffic through a third party; on-premises scrubbing appliances keep traffic in-house but are limited to available bandwidth.
