System Design: Emergency Alert System
Design a government emergency alert system that delivers critical notifications to millions of citizens across multiple channels — SMS, push, broadcast TV, radio, and digital signs — within seconds. Covers reliability, geo-targeting, and channel redundancy.
Requirements
Functional Requirements:
- Authorized officials create and send geo-targeted alerts with severity classification (Extreme, Severe, Moderate, Minor)
- Deliver alerts via: Wireless Emergency Alerts (WEA/cell broadcast), push notifications, SMS, email, digital signage, and broadcast EAS
- Target alerts by geographic polygon, radius, or administrative region
- Support alert updates and cancellations that propagate to all channels
- Maintain a public alert feed (API and web) showing active and historical alerts
- Role-based authorization: only vetted officials can issue Extreme/Severe alerts; local officials can issue Moderate/Minor
Non-Functional Requirements:
- Alert delivery must begin within 10 seconds of issuance for life-safety alerts
- 99.999% availability for the issuance and delivery path
- Deliver to 50M recipients within 5 minutes for nationwide alerts
- System must function during regional internet outages — fallback to satellite/broadcast paths
- All alert issuances logged with official identity and multi-factor authentication record
Scale Estimation
Nationwide alert: 50M mobile devices. WEA cell broadcast reaches every device in a cell simultaneously, so it is not per-device delivery. Push notification fan-out: 50M devices via APNs/FCM. At an assumed 100k pushes/second per vendor pipeline, 50M devices take about 8.3 minutes, which misses the 5-minute target, so fan-out must be parallelized across multiple accounts. SMS is slower but critical for older devices: 10M SMS at 500/second takes about 5.6 hours, so pre-registered SMS-capable devices are prioritized. Normal alert volume: ~500 alerts/day nationwide, mostly local.
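The arithmetic behind these estimates can be checked with a few lines of Python. This is a minimal sketch: the rates are the assumed figures from the text, and the helper names are made up for illustration.

```python
import math


def fanout_seconds(recipients: int, rate_per_second: int) -> float:
    """Time for one pipeline to send `recipients` messages at a fixed rate."""
    return recipients / rate_per_second


def pipelines_needed(recipients: int, rate_per_second: int, deadline_seconds: int) -> int:
    """Smallest number of equal-rate parallel pipelines that meets the deadline."""
    return math.ceil(recipients / (rate_per_second * deadline_seconds))


# Push: 50M devices at an assumed 100k/second per pipeline.
push_minutes = fanout_seconds(50_000_000, 100_000) / 60

# SMS: 10M messages at an assumed 500/second.
sms_hours = fanout_seconds(10_000_000, 500) / 3600

# Pipelines required to finish the push fan-out inside the 5-minute target.
push_pipelines = pipelines_needed(50_000_000, 100_000, 5 * 60)

print(f"push: {push_minutes:.1f} min, sms: {sms_hours:.1f} h, pipelines: {push_pipelines}")
```

Under these assumptions two push pipelines are the bare minimum for the 5-minute target; a real deployment would want headroom for retries and uneven shard sizes.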
High-Level Architecture
The system is architected around a central Alert Management Service and a multi-channel Delivery Bus. The Alert Management Service handles the human issuance workflow: authentication, authorization, geo-polygon definition, alert composition, and the multi-step approval chain for highest-severity alerts. Once an alert is approved, it is published to a high-priority Kafka topic that feeds all delivery channel adapters in parallel.
Channel adapters are independent services: a WEA Adapter interfaces with carrier cell broadcast systems via FEMA's IPAWS gateway; a Push Adapter fans out to APNs and FCM; an SMS Adapter routes through multiple SMS aggregators with failover; a Digital Signage Adapter pushes to IP-connected sign controllers; a Broadcast EAS Adapter transmits CAP (Common Alerting Protocol) messages to broadcast stations. Each adapter is independently monitored and reports delivery telemetry back to a Status Aggregator.
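As a rough illustration of what the Broadcast EAS Adapter might emit, the sketch below builds a minimal CAP 1.2 message with Python's standard library. The helper name, the identifier and sender values, and the fixed category, urgency, and certainty choices are assumptions made for the example, and the output is not validated against the CAP schema.

```python
import xml.etree.ElementTree as ET

CAP_NS = "urn:oasis:names:tc:emergency:cap:1.2"
ET.register_namespace("", CAP_NS)  # serialize CAP as the default namespace


def build_cap_alert(identifier, sender, sent, severity, headline,
                    description, area_desc, polygon):
    """Build a minimal CAP 1.2 alert (hypothetical helper, not schema-validated).

    `polygon` is a list of (lat, lon) vertices; the ring is closed automatically.
    """
    def add(parent, tag, text=None):
        node = ET.SubElement(parent, f"{{{CAP_NS}}}{tag}")
        if text is not None:
            node.text = text
        return node

    alert = ET.Element(f"{{{CAP_NS}}}alert")
    add(alert, "identifier", identifier)
    add(alert, "sender", sender)
    add(alert, "sent", sent)
    add(alert, "status", "Actual")
    add(alert, "msgType", "Alert")
    add(alert, "scope", "Public")

    info = add(alert, "info")
    add(info, "category", "Safety")     # assumed fixed for this sketch
    add(info, "event", headline)
    add(info, "urgency", "Immediate")   # assumed
    add(info, "severity", severity)
    add(info, "certainty", "Observed")  # assumed
    add(info, "headline", headline)
    add(info, "description", description)

    area = add(info, "area")
    add(area, "areaDesc", area_desc)
    ring = list(polygon)
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # CAP polygons repeat the first point at the end
    add(area, "polygon", " ".join(f"{lat},{lon}" for lat, lon in ring))
    return ET.tostring(alert, encoding="unicode")


cap_xml = build_cap_alert(
    identifier="example-alert-0001",
    sender="alerts@example.gov",
    sent="2024-05-01T12:00:00-05:00",
    severity="Extreme",
    headline="Flash Flood Warning",
    description="Move to higher ground immediately.",
    area_desc="Example County",
    polygon=[(30.0, -98.0), (30.0, -97.5), (30.5, -97.5)],
)
```

CAP's severity vocabulary (Extreme, Severe, Moderate, Minor) lines up with the severity classes in the requirements, so the internal value can be passed through unchanged.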
Geographic targeting is computed by the Alert Management Service before publishing to the bus. For WEA/cell broadcast, the geographic polygon is passed directly to carriers. For push notifications, device location comes either from device registration (home location) or, for apps with location permission, from real-time location; the Geo Resolution Service resolves which device tokens fall within the target polygon.
Core Components
Alert Management & Authorization Service
Handles the issuance workflow with a multi-factor authenticated portal for officials. Extreme-severity alerts require two-person authorization (initiator + approver from different accounts). The service validates the official's jurisdiction against the alert's target geography — a county official cannot issue a statewide alert without elevation. Draft alerts auto-expire after 30 minutes if not sent. All drafts, approvals, and issuances are written to an immutable audit log in real time.
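The issuance rules above can be sketched as a single check function. This is an illustrative sketch only: the `Official` and `Draft` types, the slash-separated region strings (for example "us/tx/travis"), and the returned reason strings are all assumptions, not part of the described system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

DRAFT_TTL = timedelta(minutes=30)
HIGH_SEVERITIES = {"Extreme", "Severe"}


@dataclass(frozen=True)
class Official:
    account_id: str
    vetted: bool        # vetted officials may issue Extreme/Severe alerts
    jurisdiction: str   # hypothetical hierarchical region, e.g. "us/tx"


@dataclass(frozen=True)
class Draft:
    severity: str
    target_region: str
    created_at: datetime
    initiator: Official
    approver: Optional[Official] = None


def covers(jurisdiction: str, target_region: str) -> bool:
    """A jurisdiction covers itself and any region nested beneath it."""
    return target_region == jurisdiction or target_region.startswith(jurisdiction + "/")


def check_issuance(draft: Draft, now: datetime) -> str:
    """Return "ok", or the reason the draft may not be issued."""
    if now - draft.created_at > DRAFT_TTL:
        return "draft expired"
    if draft.severity in HIGH_SEVERITIES and not draft.initiator.vetted:
        return "initiator not vetted for this severity"
    if not covers(draft.initiator.jurisdiction, draft.target_region):
        return "target outside initiator jurisdiction"
    if draft.severity == "Extreme":
        if draft.approver is None:
            return "second approver required"
        if draft.approver.account_id == draft.initiator.account_id:
            return "approver must be a different account"
    return "ok"
```

For example, a county official (`jurisdiction="us/tx/travis"`) drafting an alert for `"us/tx"` is rejected with "target outside initiator jurisdiction", which is the elevation case described above.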
Multi-Channel Fan-out Engine
The Kafka-to-delivery pipeline uses a separate topic per severity level, and consumers drain higher-severity topics first, so Extreme alerts are never queued behind a backlog of lower-severity messages. Each channel adapter maintains its own delivery queue with priority routing. The fan-out engine tracks delivery attempts per channel per alert, retries failed deliveries with exponential backoff, and reports per-channel delivery rates to the Status Aggregator. For WEA (cell broadcast), delivery is inherently broadcast, so no per-device tracking is possible or required.
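The priority ordering and retry behaviour can be sketched in-process. The code below is a stand-in rather than a Kafka client: a heap plays the role of the severity-ordered queues, and `send` is a hypothetical callable that returns True on success. The base delay, cap, and attempt limit are assumed values.

```python
import heapq
import itertools
import time

SEVERITY_RANK = {"Extreme": 0, "Severe": 1, "Moderate": 2, "Minor": 3}


class PriorityDeliveryQueue:
    """Pops the most severe alert first; FIFO among alerts of equal severity."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker preserves arrival order

    def put(self, severity: str, alert_id: str) -> None:
        heapq.heappush(self._heap, (SEVERITY_RANK[severity], next(self._seq), alert_id))

    def get(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff: base * 2^attempt seconds, capped."""
    return min(cap, base * 2 ** attempt)


def deliver_with_retry(send, alert_id, max_attempts=5, sleep=time.sleep):
    """Call send(alert_id) until it succeeds; return attempts used, or None on failure."""
    for attempt in range(max_attempts):
        if send(alert_id):
            return attempt + 1
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))
    return None
```

Passing `sleep` as a parameter keeps the retry loop testable without real delays. A production version would likely add jitter to the backoff so that many workers do not retry in lockstep.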
Geo Resolution Service
Maintains a spatial index (PostGIS or in-memory R-tree) of registered device tokens mapped to home location or last known location. When an alert polygon is submitted, the service runs a spatial query to return matching device tokens in batches for the push notification channel. Location data is coarsened to ZIP/postcode level for privacy — the system does not need precise GPS coordinates to determine if a device is in an alert zone. The index is refreshed as users update their location preferences in the citizen app.
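To make the spatial query concrete, here is a minimal pure-Python sketch of resolving device tokens against an alert polygon using a ray-casting point-in-polygon test. It treats coordinates as planar, which is only a reasonable approximation for small regions, and it scans every device instead of using an index; a production system would delegate this to PostGIS or an R-tree as described above. The function names and the in-memory `devices` mapping are assumptions.

```python
from typing import Dict, Iterator, List, Tuple

Point = Tuple[float, float]  # (lon, lat), already coarsened for privacy


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: count polygon edges crossed by a ray going right from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can cross the ray.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def resolve_tokens(devices: Dict[str, Point], polygon: List[Point],
                   batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield matching device tokens in batches for the push channel."""
    batch: List[str] = []
    for token, location in devices.items():
        if point_in_polygon(location, polygon):
            batch.append(token)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch
```

Yielding batches lets the push adapter start sending before the full scan completes, which matters when the first notifications should leave within seconds.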
Database Design
Alert records are stored in PostgreSQL: alert_id UUID, issuer_id, severity ENUM, title, description, target_geometry GEOGRAPHY, issued_at, expires_at, status ENUM, channel_statuses JSONB. The target_geometry column uses a GiST spatial index for fast containment queries. An alert_audit table captures every state change with actor and timestamp.
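The relationship between the alert record's status and the alert_audit table can be sketched in memory. The status values and allowed transitions below are assumptions (the text names a status ENUM without listing its members), and a list stands in for the audit table.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

# Hypothetical status values and transitions.
ALLOWED_TRANSITIONS = {
    "draft": {"active", "expired"},
    "active": {"cancelled", "expired"},
    "cancelled": set(),
    "expired": set(),
}


@dataclass(frozen=True)
class AuditEntry:
    alert_id: str
    actor: str
    from_status: str
    to_status: str
    at: datetime


@dataclass
class Alert:
    alert_id: str
    status: str = "draft"
    audit: List[AuditEntry] = field(default_factory=list)

    def transition(self, to_status: str, actor: str,
                   now: Optional[datetime] = None) -> AuditEntry:
        """Change status and append one audit entry; reject illegal transitions."""
        if to_status not in ALLOWED_TRANSITIONS[self.status]:
            raise ValueError(f"cannot go from {self.status} to {to_status}")
        entry = AuditEntry(self.alert_id, actor, self.status, to_status,
                           now or datetime.now(timezone.utc))
        self.audit.append(entry)
        self.status = to_status
        return entry
```

In a real deployment the status update and the audit insert would happen in one database transaction so that no state change can exist without its audit row.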
Device token registrations for push notifications are stored in a Redis cluster (for fast geo lookups) and PostgreSQL (durable). The Redis index uses geohash-based keys: devices:geohash:{hash} → sorted set of device tokens. Geo queries convert the alert polygon to a set of covering geohashes and union the device token sets. This provides sub-second fan-out list generation for regions up to state-size polygons.
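A simplified version of the geohash lookup is sketched below. It covers the polygon's bounding box instead of the polygon itself, so it returns a superset that still needs an exact containment filter, and it uses a plain dictionary in place of Redis. The helper names and the choice of precision are assumptions; the key format follows the one given above.

```python
import math
from typing import Dict, Set

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int) -> str:
    """Standard geohash: interleaved longitude/latitude bisection bits, 5 bits per character."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bit_count, value = 0, 0
    use_lon = True  # the first bit refines longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = lon >= mid
            if bit:
                lon_lo = mid
            else:
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            if bit:
                lat_lo = mid
            else:
                lat_hi = mid
        value = (value << 1) | int(bit)
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[value])
            bit_count, value = 0, 0
    return "".join(chars)


def cover_bbox(min_lat: float, min_lon: float, max_lat: float, max_lon: float,
               precision: int) -> Set[str]:
    """Geohash cells overlapping a bounding box.

    Does not handle boxes that cross the antimeridian or touch lat 90 / lon 180.
    """
    total_bits = 5 * precision
    width = 360.0 / (1 << ((total_bits + 1) // 2))   # cell width in degrees of longitude
    height = 180.0 / (1 << (total_bits // 2))        # cell height in degrees of latitude
    col_lo = math.floor((min_lon + 180.0) / width)
    col_hi = math.floor((max_lon + 180.0) / width)
    row_lo = math.floor((min_lat + 90.0) / height)
    row_hi = math.floor((max_lat + 90.0) / height)
    cells = set()
    for row in range(row_lo, row_hi + 1):
        for col in range(col_lo, col_hi + 1):
            # Encode each cell's centre point to obtain its geohash.
            cells.add(geohash_encode((row + 0.5) * height - 90.0,
                                     (col + 0.5) * width - 180.0, precision))
    return cells


def tokens_in_cells(index: Dict[str, Set[str]], cells: Set[str]) -> Set[str]:
    """Union the token sets stored under devices:geohash:{hash} for each cell."""
    tokens: Set[str] = set()
    for cell in cells:
        tokens |= index.get(f"devices:geohash:{cell}", set())
    return tokens


# Example with two made-up registrations and an in-memory index.
index: Dict[str, Set[str]] = {}
for token, (lat, lon) in {"tok-austin": (30.27, -97.74), "tok-nyc": (40.71, -74.01)}.items():
    index.setdefault(f"devices:geohash:{geohash_encode(lat, lon, 4)}", set()).add(token)

cells = cover_bbox(30.0, -98.0, 30.5, -97.5, precision=4)
matched = tokens_in_cells(index, cells)
```

Precision trades cell count against over-selection: shorter hashes mean fewer set unions but more devices outside the polygon that must be filtered out afterwards.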
API Design
POST /api/v1/alerts — official portal endpoint to create and issue an alert; requires MFA token in header; body includes severity, title, description, target_polygon, expires_in_minutes.
PUT /api/v1/alerts/{alertId}/cancel — immediately cancels an active alert and triggers cancellation messages on all channels.
GET /api/v1/alerts/active?lat={lat}&lon={lon} — public endpoint returning active alerts affecting a given location.
GET /api/v1/alerts/{alertId}/delivery-status — returns per-channel delivery telemetry for officials monitoring alert propagation.
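For the alert creation endpoint, server-side validation of the request might look like the sketch below. The header name `X-MFA-Token`, the `[lon, lat]` point format, and the error messages are assumptions; the body fields are the ones listed for POST /api/v1/alerts.

```python
from typing import Any, Dict, List

SEVERITIES = {"Extreme", "Severe", "Moderate", "Minor"}
MFA_HEADER = "X-MFA-Token"  # hypothetical header name


def validate_create_alert(headers: Dict[str, str], body: Dict[str, Any]) -> List[str]:
    """Return a list of validation errors; an empty list means the request is acceptable."""
    errors: List[str] = []
    if not headers.get(MFA_HEADER):
        errors.append("missing MFA token")
    if body.get("severity") not in SEVERITIES:
        errors.append("severity must be one of Extreme, Severe, Moderate, Minor")
    for name in ("title", "description"):
        value = body.get(name)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"{name} is required")
    polygon = body.get("target_polygon")
    if not isinstance(polygon, list) or len(polygon) < 3:
        errors.append("target_polygon needs at least 3 [lon, lat] points")
    expires = body.get("expires_in_minutes")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(expires, int) or isinstance(expires, bool) or expires <= 0:
        errors.append("expires_in_minutes must be a positive integer")
    return errors


example_body = {
    "severity": "Severe",
    "title": "Wildfire Evacuation Order",
    "description": "Evacuate zones 3 and 4 immediately.",
    "target_polygon": [[-122.5, 38.2], [-122.3, 38.2], [-122.3, 38.4], [-122.5, 38.4]],
    "expires_in_minutes": 120,
}
example_errors = validate_create_alert({MFA_HEADER: "example-token"}, example_body)
```

Validation of this kind only checks the request shape; role and jurisdiction checks still belong in the Alert Management & Authorization Service.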
Scaling & Bottlenecks
The push notification fan-out to 50M devices is the hardest scaling problem. APNs and FCM throughput is limited per account (assumed here to be ~100k/second). The solution is to use multiple APNs/FCM accounts (requiring Apple/Google relationship management), sharding the device token list and distributing the shards across accounts. HTTP/2 multiplexing lets each APNs connection carry many concurrent requests, which maximizes per-connection throughput. Pre-warming the fan-out worker fleet during quiet periods (based on seasonal risk patterns) reduces cold-start latency.
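One simple way to shard the device token list is to hash each token to a shard index, as in the sketch below. The function names are hypothetical, and the mapping from a shard to a concrete APNs/FCM account or worker pool is left out.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List


def shard_for(token: str, shard_count: int) -> int:
    """Deterministically map a device token to a shard via SHA-256."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def shard_tokens(tokens: Iterable[str], shard_count: int) -> Dict[int, List[str]]:
    """Group tokens by shard so each sending pipeline gets its own list."""
    shards: Dict[int, List[str]] = defaultdict(list)
    for token in tokens:
        shards[shard_for(token, shard_count)].append(token)
    return shards
```

Hash-based assignment keeps shards roughly equal in size without coordination, and the same token always lands on the same shard, which simplifies retry bookkeeping. Changing the shard count reshuffles most tokens, so a fixed count or consistent hashing would be preferable if pipelines are added often.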
System availability during regional disasters is a core requirement — exactly when the system is most needed, infrastructure may be damaged. The Alert Management Service is multi-region active-active. Critical delivery paths (WEA, broadcast EAS) use dedicated out-of-band connectivity (satellite uplinks, dedicated fiber) independent of public internet infrastructure. Cell broadcast via WEA is the most resilient channel — it works even when device internet connectivity is absent.
Key Trade-offs
- Cell broadcast vs. targeted push: WEA cell broadcast is instant and highly reliable, but its targeting is coarse (cell-sector granularity on older handsets, with device-side geofencing available only on newer WEA-capable devices) and it has strict message size limits; push notifications are precise but require internet connectivity and are slower to fan out.
- Real-time location vs. home location: Real-time location targeting is more accurate but requires user consent and active app engagement; home/registration-location targeting works for all registered users but misses travelers in the alert zone.
- Alert fatigue vs. coverage: Sending alerts for minor events increases awareness but causes recipients to disable alerts; tiered severity and geographic precision reduce false-relevance alerts and maintain public trust.
- Two-person authorization vs. speed: Requiring a second approver for extreme alerts prevents false alarms but adds 30-60 seconds of latency in a life-safety context; a configurable bypass for single-person issuance with immediate post-facto audit is a common compromise.