System Design: Smart City Platform
Design a scalable smart city platform that integrates data from traffic sensors, environmental monitors, public transit, utilities, and emergency services into a unified operational dashboard with real-time alerts and predictive analytics.
Requirements
Functional Requirements:
- Ingest real-time data from city sensors: traffic cameras, air quality monitors, noise sensors, smart streetlights, and utility meters
- Unified operational dashboard: real-time map view of city systems, alert summaries, and KPI panels
- Traffic management: real-time signal timing optimization based on traffic flow data
- Emergency response integration: dispatch routing, incident location tracking, resource availability
- Environmental monitoring: air quality alerts, noise pollution mapping, flood sensor alerts
- Predictive analytics: traffic congestion forecasting, energy demand prediction, infrastructure failure prediction
Non-Functional Requirements:
- Ingest data from 5 million city sensors at varying frequencies (1 Hz to 1/minute)
- Critical alerts (flood sensor breach, emergency incident) delivered to operators within 5 seconds
- Geospatial queries ("show all air quality alerts within 500m of location X") return in under 500ms
- Platform serves 50,000 city operators and analysts concurrently
- 99.9% uptime for critical safety systems (flood alerts, emergency dispatch)
Scale Estimation
5M sensors with an average publishing rate of 5 messages/minute = ~417k messages/second. At 250 bytes each, that is ~104 MB/second of ingestion. Traffic camera feeds (1k cameras, JPEG snapshots every 5 seconds for computer vision processing) add 1k × 200 KB / 5s = 40 MB/second. Total ingestion: ~140 MB/second. Geospatial query load: 50k operators each making 2 queries/minute = ~1,667 queries/second. With a PostGIS spatial index, a 500m radius query over 5M sensor locations completes in <10ms, and a single PostGIS node can sustain on the order of 10k such queries/second, comfortably above this load.
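These estimates are straightforward to sanity-check. A back-of-envelope sketch, pure arithmetic on the figures above:

```python
# Back-of-envelope check of the estimates above.
SENSORS = 5_000_000
MSGS_PER_MIN = 5            # average per-sensor publishing rate
MSG_BYTES = 250

msgs_per_sec = SENSORS * MSGS_PER_MIN / 60                 # ~416,667 msg/s
sensor_mb_per_sec = msgs_per_sec * MSG_BYTES / 1e6         # ~104 MB/s

CAMERAS, FRAME_KB, FRAME_INTERVAL_S = 1_000, 200, 5
camera_mb_per_sec = CAMERAS * FRAME_KB / FRAME_INTERVAL_S / 1e3  # 40 MB/s

OPERATORS, QUERIES_PER_MIN = 50_000, 2
geo_qps = OPERATORS * QUERIES_PER_MIN / 60                 # ~1,667 queries/s

print(f"{msgs_per_sec:,.0f} msg/s ingest, "
      f"{sensor_mb_per_sec + camera_mb_per_sec:.0f} MB/s total, "
      f"{geo_qps:,.0f} geospatial queries/s")
```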
High-Level Architecture
The platform is organized around domain-specific data pipelines that converge into a unified operational layer. Domain pipelines: traffic (loop detectors, cameras → traffic management engine), environmental (air/noise/water sensors → environmental monitoring service), utilities (smart meters, grid sensors → utility management service), and emergency services (incident reports, dispatch locations → emergency coordination service). Each domain pipeline ingests from its sensor sources via MQTT/HTTP, processes via a domain-specific Flink job, and publishes processed events to a shared event bus (Kafka). The operational dashboard consumes from the shared event bus and queries PostgreSQL/PostGIS for geospatial lookups.
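As a concrete sketch of the shared-bus contract, the envelope below shows what a domain pipeline might publish after processing; the field names, topic naming, and kafka-python usage are illustrative assumptions, not a fixed schema:

```python
import json
from dataclasses import dataclass, asdict
from kafka import KafkaProducer  # kafka-python

@dataclass
class DomainEvent:
    """Envelope a domain pipeline publishes to the shared bus (illustrative)."""
    domain: str      # "traffic" | "environmental" | "utility" | "emergency"
    sensor_id: str
    event_type: str  # e.g. "reading", "state_change", "alert_candidate"
    lat: float
    lon: float
    payload: dict
    ts_ms: int

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",  # placeholder address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish(event: DomainEvent) -> None:
    # Keying by sensor_id keeps each sensor's events ordered within a partition.
    producer.send(f"{event.domain}-events",
                  key=event.sensor_id.encode(),
                  value=asdict(event))
```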
Critical alert path: sensors classified as "critical" (flood sensors, emergency panic buttons, power outage detectors) bypass the standard pipeline and write directly to a high-priority Kafka topic with dedicated consumer groups. The alert service processes this topic with a dedicated Flink job, applies alert suppression logic (de-duplication within a 30-second window), and delivers alerts via PagerDuty-style escalation: in-app notification → SMS → phone call for unacknowledged critical alerts. Alert delivery must complete in under 5 seconds, so the alert evaluation SLA is set at <2 seconds, leaving 3 seconds for notification delivery.
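A minimal sketch of the 30-second de-duplication rule, assuming suppression is keyed on (sensor, alert type); in the real pipeline this state would live in Flink keyed state rather than process memory:

```python
import time

class AlertSuppressor:
    """De-duplicate critical alerts per (sensor_id, alert_type) within a
    30-second window. Sketch only: real state lives in Flink keyed state."""

    def __init__(self, window_s: float = 30.0):
        self.window_s = window_s
        self._last_fired: dict[tuple[str, str], float] = {}

    def should_fire(self, sensor_id: str, alert_type: str,
                    now: float | None = None) -> bool:
        now = time.time() if now is None else now
        key = (sensor_id, alert_type)
        last = self._last_fired.get(key)
        if last is not None and now - last < self.window_s:
            return False  # duplicate inside the window: suppress
        self._last_fired[key] = now
        return True

s = AlertSuppressor()
assert s.should_fire("flood-042", "water_level_breach", now=0.0)
assert not s.should_fire("flood-042", "water_level_breach", now=15.0)  # suppressed
assert s.should_fire("flood-042", "water_level_breach", now=31.0)      # window passed
```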
Geospatial layer: sensor locations, incidents, and asset locations are stored in PostGIS (PostgreSQL + PostGIS extension). A spatial indexing service maintains a GiST index on all location data. The real-time map is served by a WebSocket-based map update service: when a sensor's state changes, its new state is pushed to all operators viewing the relevant map area via a publish-subscribe system (Redis Geo + pub/sub). Operators subscribe to a geographic bounding box; the map update service fans out sensor state changes only to operators whose viewport contains the sensor's location.
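The core of the fan-out path is a containment test between a sensor's location and each subscribed viewport. A minimal sketch (linear scan for clarity; a production service would index viewports spatially or lean on Redis Geo):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Viewport:
    """An operator's subscribed map bounding box, in degrees."""
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)

def fan_out(update: dict, viewports: dict[str, Viewport]) -> list[str]:
    """Operator IDs whose viewport contains this sensor update."""
    lat, lon = update["lat"], update["lon"]
    return [op for op, vp in viewports.items() if vp.contains(lat, lon)]

viewports = {"op-1": Viewport(40.70, -74.02, 40.75, -73.97),
             "op-2": Viewport(40.60, -74.10, 40.65, -74.00)}
print(fan_out({"sensor_id": "aq-17", "lat": 40.72, "lon": -74.00}, viewports))
# ['op-1']
```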
Core Components
Traffic Management Engine
The traffic engine consumes loop detector counts and turning movement counts from intersections. A traffic signal optimization algorithm (Webster's formula for isolated intersections, or a coordinated adaptive algorithm like SCOOT for arterials) computes optimal green time splits every 60 seconds per intersection. Commands are sent to traffic signal controllers via a field communication gateway (NTCIP 1202 protocol over cellular or fiber). The engine also processes camera feeds: JPEG frames are sent to a computer vision microservice (YOLO-based vehicle counting and queue length detection) running on GPU inference nodes. Counted vehicle data feeds back into the signal optimization as supplementary loop count data.
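For the isolated-intersection case, Webster's method reduces to a closed-form cycle length with green time split by flow ratio. A sketch of the textbook formula (the saturation flow and lost-time defaults are illustrative assumptions):

```python
def webster_signal_plan(critical_flows_vph, saturation_flow_vph=1900.0,
                        lost_time_per_phase_s=4.0):
    """Webster's method for an isolated intersection (textbook form).
    critical_flows_vph: critical lane-group flow for each phase, veh/h.
    Returns (optimal cycle length in s, effective green per phase in s)."""
    y = [q / saturation_flow_vph for q in critical_flows_vph]  # flow ratios
    Y = sum(y)
    if Y >= 0.95:
        raise ValueError("demand near capacity: Webster's formula diverges")
    L = lost_time_per_phase_s * len(critical_flows_vph)  # total lost time, s
    cycle = (1.5 * L + 5.0) / (1.0 - Y)                  # Webster's optimal cycle
    greens = [(cycle - L) * yi / Y for yi in y]          # split by flow ratio
    return cycle, greens

# Two-phase intersection with 600 and 450 veh/h critical flows:
cycle, greens = webster_signal_plan([600.0, 450.0])
print(f"cycle {cycle:.0f}s, greens {[round(g) for g in greens]}")  # cycle 38s
```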
Environmental Monitoring Service
Air quality sensors report PM2.5, PM10, NO2, O3, and CO2 concentrations. The environmental service applies sensor calibration corrections (polynomial drift correction using monthly field calibration data), computes the AQI (Air Quality Index) using the EPA formula, and maps readings onto a spatial grid (1 km × 1 km cells, bilinear interpolation between sensor locations). AQI values above unhealthy thresholds trigger public alert broadcasts (mobile push notifications to citizens subscribed to air quality alerts for their area, via a citizen-facing companion app). AQI interpolation maps are generated every 5 minutes and served as pre-rendered tile layers (Mapbox-compatible raster tiles stored in S3).
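The EPA AQI calculation is a piecewise-linear interpolation between published concentration breakpoints. A sketch for PM2.5, using the pre-2024 EPA breakpoint table (EPA revised the PM2.5 breakpoints in 2024, so a real deployment should load the current table rather than hard-coding one):

```python
# Pre-2024 EPA PM2.5 breakpoints: (conc_lo, conc_hi, aqi_lo, aqi_hi),
# concentrations in ug/m3 (24-hour average).
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 500.4, 301, 500),
]

def pm25_aqi(conc_ug_m3: float) -> int:
    """Piecewise-linear AQI: I = (I_hi - I_lo)/(C_hi - C_lo) * (C - C_lo) + I_lo."""
    c = int(conc_ug_m3 * 10) / 10  # EPA truncates PM2.5 to 0.1 ug/m3
    for c_lo, c_hi, i_lo, i_hi in PM25_BREAKPOINTS:
        if c_lo <= c <= c_hi:
            return round((i_hi - i_lo) / (c_hi - c_lo) * (c - c_lo) + i_lo)
    return 500  # above the top breakpoint

print(pm25_aqi(35.9))  # 102 -> "Unhealthy for Sensitive Groups"
```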
Emergency Coordination Service
Emergency incidents are created by 911 dispatch systems, IoT panic buttons, or automatic detections (fire sensor triggers, car crash detection from accelerometers). Each incident has a location, type, severity, and status. The service: (1) displays the incident on the operator dashboard in real time; (2) suggests available emergency resources near the incident via a spatial query against the resource location store (real-time GPS positions of police, fire, and ambulance units stored in Redis Geo); (3) computes estimated travel time to the incident using live traffic data from the traffic engine; (4) lets dispatchers assign units and track their routes. Incident state is synchronized across all dispatcher workstations in real time via WebSocket.
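Step (2) maps naturally onto a single Redis GEOSEARCH call (Redis 6.2+) against the emergency:units:geo Geo set described in the database design. A sketch, with availability filtering left out:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def nearest_units(lat: float, lon: float, radius_m: float = 5000, count: int = 5):
    """Nearest unit candidates around an incident, closest first, with
    distances in meters. Status/availability filtering happens afterwards."""
    return r.geosearch(
        "emergency:units:geo",      # Geo set of live unit positions
        longitude=lon, latitude=lat,
        radius=radius_m, unit="m",
        sort="ASC", count=count,
        withdist=True,
    )

# e.g. nearest_units(40.7128, -74.0060) -> [['unit-12', 312.4], ['unit-3', 890.1]]
```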
Database Design
- Postgres + PostGIS: sensors (sensor_id, sensor_type, location GEOGRAPHY(POINT), district_id, installation_date, status); incidents (incident_id, type, severity, location GEOGRAPHY(POINT), status, created_at, resolved_at, assigned_unit_ids[]); emergency_units (unit_id, type, current_location GEOGRAPHY(POINT), status, updated_at)
- TimescaleDB: sensor_readings (sensor_id, metric, value, quality, recorded_at) for time-series telemetry, 90-day retention
- ClickHouse: traffic_flow (intersection_id, direction, count, occupancy, speed, recorded_at) and environmental_readings (sensor_id, pm25, pm10, no2, aqi, recorded_at) for analytics queries
- Redis: emergency:units:geo (Geo set of active unit locations), alerts:active (sorted set of active alerts by severity), dashboard:viewport:{operator_id} (operator's current map bounding box)
- Kafka: domain topics traffic-events, environmental-events, utility-events, emergency-events, and critical-alerts
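The 500m proximity query from the requirements runs directly against the sensors table above. A sketch using psycopg2 (connection details are placeholders); ST_DWithin on a GEOGRAPHY column takes meters and is served by the GiST index:

```python
import psycopg2

conn = psycopg2.connect("dbname=smartcity")  # placeholder DSN

def sensors_within(lat: float, lon: float, radius_m: float = 500.0):
    """Index-backed radius search: ST_DWithin on GEOGRAPHY avoids a full scan."""
    sql = """
        SELECT sensor_id, sensor_type, status
        FROM sensors
        WHERE ST_DWithin(
            location,
            ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
            %s)
    """
    with conn.cursor() as cur:
        cur.execute(sql, (lon, lat, radius_m))  # PostGIS points are (lon, lat)
        return cur.fetchall()
```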
API Design
- GET /map/sensors?bbox={minLat,minLon,maxLat,maxLon}&types={t1,t2}: returns sensors and their latest state within the bounding box; PostGIS spatial query with Redis state overlay
- WebSocket /dashboard/realtime?bbox={...}: streams state updates for sensors within the operator's viewport; the server filters events by geographic boundary
- GET /traffic/intersections/{intersection_id}/signal-plan: returns the current signal timing plan and recent flow counts
- POST /incidents: body {type, location, severity, description}; creates the incident, triggers the assignment workflow, and broadcasts to all dispatchers
- GET /analytics/airquality/map?time={ts}: returns AQI interpolation map tiles for a given timestamp from the S3 tile cache
- GET /analytics/traffic/congestion-forecast?from={ts}&duration_hours=3: returns predicted congestion scores per district for the next 3 hours
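A minimal sketch of the first endpoint as a FastAPI route; the storage lookup is stubbed out, and the route shape (parameter parsing, validation) is the point here, not the data access:

```python
from fastapi import FastAPI, HTTPException, Query

app = FastAPI()

def query_sensors_in_bbox(min_lat, min_lon, max_lat, max_lon, type_filter):
    """Stub: the real version runs the PostGIS bounding-box query and
    overlays each sensor's latest state from Redis."""
    return []

@app.get("/map/sensors")
def map_sensors(
    bbox: str = Query(..., description="minLat,minLon,maxLat,maxLon"),
    types: str | None = None,
):
    try:
        min_lat, min_lon, max_lat, max_lon = (float(v) for v in bbox.split(","))
    except ValueError:
        raise HTTPException(400, "bbox must be four comma-separated floats")
    type_filter = types.split(",") if types else None
    return {"sensors": query_sensors_in_bbox(min_lat, min_lon,
                                             max_lat, max_lon, type_filter)}
```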
Scaling & Bottlenecks
Geospatial query load of 1,667 queries/second on PostGIS: spatial indexes (GiST on GEOGRAPHY columns) support ~10k queries/second on a modern server for bounding box queries. A read replica fleet (3 replicas, connection-pooled via PgBouncer) handles this comfortably. The real-time map WebSocket fan-out is the harder problem: 50k operators viewing different viewport areas, each receiving state changes for sensors in their area. A naive approach (send every sensor update to every operator, filter client-side) would require broadcasting 416k updates/second to 50k WebSocket connections = 20 billion messages/second. The geofence-based server-side filtering (only send updates to operators whose viewport contains the sensor) reduces fan-out by 99%+ for city-scale viewports.
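The fan-out arithmetic, made explicit (the viewport coverage fraction is an illustrative assumption; real skew depends on where operators are actually looking):

```python
UPDATES_PER_SEC = 416_667   # sensor state changes (Scale Estimation)
OPERATORS = 50_000

naive = UPDATES_PER_SEC * OPERATORS
print(f"naive broadcast: {naive:,} msg/s")        # ~20.8 billion msg/s

# Assume a typical viewport covers ~0.1% of the city's sensors
# (illustrative figure, not a measurement).
VIEWPORT_FRACTION = 0.001
filtered = int(naive * VIEWPORT_FRACTION)
print(f"geofenced fan-out: {filtered:,} msg/s")   # ~20.8 million msg/s
```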
Critical alert delivery SLA of 5 seconds: the alert pipeline (Kafka → Flink evaluation → notification dispatch → SMS/push) must complete in <5 seconds. Consumer lag on the critical-alerts topic must stay at zero, achieved with dedicated consumers, sufficient parallelism, and guaranteed resource allocation. SMS delivery alone adds 1-3 seconds of carrier-network latency; use push notifications as the primary sub-second channel and SMS as a fallback for alerts unacknowledged after 30 seconds.
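A sketch of the push-first escalation under these budgets; the three integration points are stubs standing in for the push gateway, SMS provider, and acknowledgement stream:

```python
import asyncio

ACK_TIMEOUT_S = 30

# Stub integration points; real versions call the push gateway / SMS provider
# and listen for acknowledgements from the dashboard.
async def send_push(operator_id: str, alert_id: str) -> None:
    print(f"push {alert_id} -> {operator_id}")

async def send_sms(operator_id: str, alert_id: str) -> None:
    print(f"sms {alert_id} -> {operator_id}")

async def wait_for_ack(alert_id: str) -> None:
    await asyncio.Event().wait()  # stub: never acknowledged

async def deliver_critical_alert(alert_id: str, operator_id: str) -> None:
    """Push first (sub-second channel); fall back to SMS if the alert is
    still unacknowledged after 30 seconds, per the escalation policy above."""
    await send_push(operator_id, alert_id)
    try:
        await asyncio.wait_for(wait_for_ack(alert_id), timeout=ACK_TIMEOUT_S)
    except asyncio.TimeoutError:
        await send_sms(operator_id, alert_id)
```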
Key Trade-offs
- Centralized vs. federated smart city data: A centralized platform simplifies cross-domain analytics (correlate traffic and air quality) but creates a single point of failure and a high-value attack target; a federated architecture (each department runs its own system with API integration) improves resilience but complicates unified dashboards.
- Real-time vs. near-real-time for non-critical sensors: Processing all 5M sensors at real-time latency requires expensive stream processing infrastructure; classifying sensors by criticality (critical, operational, analytical) and applying different latency SLAs (5s, 30s, 5min) reduces infrastructure cost by 80% without impacting safety outcomes.
- Vendor-specific vs. open smart city platforms: Proprietary platforms (Cisco Kinetic, IBM Intelligent Operations Center) reduce integration work but create vendor lock-in and high per-device licensing costs; open platforms (FIWARE, open standards) are more flexible but require more internal integration engineering.
- Camera-based traffic detection vs. loop detectors: Computer vision on cameras provides richer data (vehicle type, queue length, turning movements) but requires GPU infrastructure and raises privacy concerns; inductive loop detectors are lower-cost, lower-maintenance, and privacy-preserving but provide only count and occupancy data.