
System Design: Sensor Data Processing Pipeline

Design a high-throughput sensor data processing pipeline that ingests time-series data from distributed sensors, applies real-time transformations and anomaly detection, and serves aggregated data to dashboards and alerting systems.


Requirements

Functional Requirements:

  • Ingest raw sensor readings from distributed sensors at high frequency (1-100 Hz per sensor)
  • Apply real-time transformations: unit conversion, calibration corrections, and rolling window aggregations
  • Detect anomalies: threshold breaches, statistical outliers, and pattern deviations in real time
  • Write processed data to a time-series database for querying and visualization
  • Generate alerts when sensor values cross configurable thresholds
  • Support sensor metadata management: sensor registry, calibration parameters, alert thresholds

Non-Functional Requirements:

  • Ingest 50 million sensor readings per second at peak
  • End-to-end processing latency (sensor reading to alert) under 1 second
  • Zero data loss for critical sensors; best-effort for low-priority sensors
  • Time-series data retained for 5 years with downsampling for older data
  • Support 1 million distinct sensors with independent calibration and alert configurations

Scale Estimation

1M sensors × 50 readings/second average = 50M readings/second. At 100 bytes per reading (sensor_id, value, timestamp, quality_flag), that's 5 GB/second of ingestion. With a 3× replication factor, Kafka must absorb 15 GB/second of writes; ~50 brokers at ~300 MB/second of write load each carry this comfortably (see Scaling & Bottlenecks).

Processing latency budget: ingest → Kafka (100 ms) → stream processor (200 ms) → anomaly check (100 ms) → DB write (300 ms) → alert (200 ms) = 900 ms total, which fits the 1-second SLA.

Time-series storage: 50M readings/second × 86,400 seconds/day ≈ 4.3 trillion readings/day. Compressed to ~10 bytes/reading (Gorilla-style time-series compression), that's ~43 TB/day, or ~78 PB over 5 years; downsampling older data (5-minute averages after 90 days) reduces this to ~5 PB.

High-Level Architecture

The pipeline uses a Lambda-like architecture with a speed layer (stream processing) and a batch layer (historical aggregation and model training). The speed layer handles real-time ingestion, transformation, anomaly detection, and alerting. The batch layer handles historical aggregation for storage efficiency, ML model retraining for adaptive anomaly detection, and compliance reporting.

Ingestion: sensors publish readings to the ingestion gateway (MQTT or HTTP) which writes to Kafka after schema validation and enrichment (appending sensor metadata like location and calibration coefficients from a Redis-cached sensor registry). Kafka topics are partitioned by sensor_id hash, ensuring all readings from one sensor go to the same partition and arrive in order.
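
A minimal sketch of the gateway's publish path, assuming the enriched reading is already Protobuf-encoded; class and config choices are illustrative. Keying the record by sensor_id is what makes Kafka's default partitioner hash all of a sensor's readings to one partition.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class GatewayPublisher {
    private final KafkaProducer<String, byte[]> producer;

    public GatewayPublisher(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        // acks=all + idempotence for the zero-data-loss path on critical sensors.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        this.producer = new KafkaProducer<>(props);
    }

    /** Keyed by sensor_id: the default partitioner hashes the key, so all
     *  readings from one sensor land on the same partition, in order. */
    public void publish(String sensorId, byte[] enrichedReading) {
        producer.send(new ProducerRecord<>("sensor-raw", sensorId, enrichedReading));
    }
}
```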

Stream processing (Apache Flink): operators consume from Kafka, apply calibration corrections (multiply raw ADC value by calibration coefficient from sensor registry), compute rolling window aggregations (60-second average, min, max, std_dev using Flink's sliding window API), run anomaly detection (threshold check + 3σ statistical outlier detection against the 60-second baseline), and write processed readings to the time-series database. Anomaly events are published to an alert Kafka topic consumed by the alerting service.
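
To make the operator chain concrete, here is a sketch of how this job might be wired in Flink's DataStream API. It is a skeleton under assumptions, not a complete job: SensorReading, CalibrationMapper, AnomalyDetector, and the source/sink factory methods are placeholder names, and the aggregation and anomaly operators are sketched in the Core Components sections below.

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SensorPipelineJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // sensorRawSource() stands in for a KafkaSource<SensorReading> over the
        // sensor-raw topic; event time comes from the reading's own timestamp.
        DataStream<SensorReading> readings = env.fromSource(
                sensorRawSource(),
                WatermarkStrategy
                        .<SensorReading>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        .withTimestampAssigner((r, ts) -> r.timestampMillis),
                "sensor-raw");

        // Calibration: raw ADC value -> engineering units via registry coefficients.
        DataStream<SensorReading> calibrated = readings.map(new CalibrationMapper());

        // 1-minute tumbling aggregates -> TimescaleDB (AggregateFunction sketched below).
        calibrated
                .keyBy(r -> r.sensorId)
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                .aggregate(new StatsAggregate())
                .addSink(timescaleSink());

        // Per-sensor anomaly detection -> sensor-anomalies Kafka topic.
        calibrated
                .keyBy(r -> r.sensorId)
                .process(new AnomalyDetector())
                .sinkTo(anomalyTopicSink());

        env.execute("sensor-pipeline");
    }
}
```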

Core Components

Ingestion Gateway

The gateway accepts sensor data via MQTT (persistent, low-overhead protocol for constrained devices) and HTTP (for cloud-native sensors). Validation: each incoming message is validated against the sensor's registered schema (from the sensor registry cache) — data type, value range plausibility, and timestamp recency (reject messages older than 5 minutes to prevent replay attacks). Enrichment: the gateway adds calibration coefficients and quality flags (from sensor registry Redis cache) to each reading before publishing to Kafka. The gateway is stateless and horizontally scaled; authentication uses pre-shared API keys per sensor (stored in Redis for sub-millisecond lookup).
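
A minimal sketch of those checks, assuming the plausible range comes from the sensor's cached registry entry; the 30-second future-skew tolerance is an illustrative choice, not part of the design above.

```java
import java.time.Duration;
import java.time.Instant;

/** Gateway-side validation sketch. Field names and limits are illustrative;
 *  the real per-sensor bounds come from the Redis-cached sensor registry. */
public final class ReadingValidator {
    private static final Duration MAX_AGE = Duration.ofMinutes(5);
    private static final Duration MAX_FUTURE_SKEW = Duration.ofSeconds(30);

    public static boolean isValid(double value, Instant recordedAt,
                                  double minPlausible, double maxPlausible,
                                  Instant now) {
        // Reject stale timestamps to block replayed messages.
        if (recordedAt.isBefore(now.minus(MAX_AGE))) return false;
        // Reject readings from the future beyond a small clock-skew tolerance.
        if (recordedAt.isAfter(now.plus(MAX_FUTURE_SKEW))) return false;
        // Range plausibility against the sensor's registered schema.
        return !Double.isNaN(value) && value >= minPlausible && value <= maxPlausible;
    }
}
```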

Stream Processing Engine

Flink jobs consume from Kafka with operator parallelism matching Kafka partition count (1,000 partitions = 1,000 parallel operators). The transformation operator applies: unit conversion (e.g., raw voltage to Celsius using polynomial calibration), timestamp normalization (convert device local time to UTC using per-sensor timezone), and data quality scoring (flag readings with noise above calibration tolerance). The aggregation operator maintains keyed state (keyed by sensor_id) with two tumbling windows: 1-minute and 1-hour. Per window, it computes (count, sum, min, max, sum_of_squares) — from these, mean and variance are derived without storing all values. Aggregated summaries are written to TimescaleDB as continuous aggregates.
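
A sketch of that accumulator as a Flink AggregateFunction, shown over bare double values for brevity (the real operator would take the reading POJO and carry sensor_id through). The point is that mean and variance fall out of five running numbers via Var(X) = E[X^2] - (E[X])^2, so no raw values are buffered.

```java
import org.apache.flink.api.common.functions.AggregateFunction;

/** Per-window stats from (count, sum, min, max, sum of squares) only. */
public class StatsAggregate
        implements AggregateFunction<Double, StatsAggregate.Acc, StatsAggregate.Stats> {

    public static class Acc {
        long count;
        double sum, sumSq;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
    }

    public static class Stats {
        long count;
        double mean, min, max, stddev;
    }

    @Override public Acc createAccumulator() { return new Acc(); }

    @Override public Acc add(Double v, Acc a) {
        a.count++;
        a.sum += v;
        a.sumSq += v * v;
        a.min = Math.min(a.min, v);
        a.max = Math.max(a.max, v);
        return a;
    }

    @Override public Stats getResult(Acc a) {
        Stats s = new Stats();
        s.count = a.count;
        s.min = a.min;
        s.max = a.max;
        s.mean = a.sum / a.count;
        // Var(X) = E[X^2] - (E[X])^2; clamp tiny negatives from float error.
        double variance = a.sumSq / a.count - s.mean * s.mean;
        s.stddev = Math.sqrt(Math.max(variance, 0.0));
        return s;
    }

    @Override public Acc merge(Acc x, Acc y) {
        x.count += y.count;
        x.sum += y.sum;
        x.sumSq += y.sumSq;
        x.min = Math.min(x.min, y.min);
        x.max = Math.max(x.max, y.max);
        return x;
    }
}
```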

Anomaly Detection Engine

Anomaly detection runs as a Flink operator after transformation. Two detection methods: (1) static threshold — each sensor has configured high/low bounds stored in the sensor registry; a reading outside bounds triggers an immediate alert; (2) dynamic anomaly detection — using a streaming z-score: maintain a running mean and standard deviation using Welford's online algorithm; a reading more than 3σ from the rolling mean is flagged as an anomaly. Dynamic detection adapts to sensors with diurnal patterns (e.g., temperature sensors) — the baseline updates as the reading series evolves. Anomaly events are de-duplicated (same sensor, same type, within 5-minute window = one alert, not thousands) using a Redis-based deduplication set.
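
A sketch of the streaming z-score with Welford's update. The 30-reading warm-up is an illustrative guard, not from the design above; in the Flink job this state would live in per-sensor keyed state rather than a plain field.

```java
/** Streaming 3-sigma detector using Welford's online mean/variance. */
public class WelfordDetector {
    private long n;
    private double mean;
    private double m2;   // running sum of squared deviations from the mean

    /** Tests the reading against the current baseline, then folds it in. */
    public boolean isAnomaly(double x) {
        boolean anomalous = false;
        if (n >= 30) {   // warm-up: no alerts until the baseline settles
            double stddev = Math.sqrt(m2 / (n - 1));
            anomalous = stddev > 0 && Math.abs(x - mean) > 3 * stddev;
        }
        // Welford update: numerically stable, O(1) state per sensor.
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
        return anomalous;
    }
}
```

As written the baseline never forgets; to track diurnal patterns as described above, production code would use a windowed or exponentially decayed variant of the same update.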

Database Design

  • Kafka: sensor-raw (ingested readings, Protobuf-serialized, 7-day retention); sensor-anomalies (anomaly events, 30-day retention)
  • TimescaleDB: sensor_readings (sensor_id, value, unit, quality_score, recorded_at), a hypertable with 7-day chunks, compressed after 7 days with Gorilla-style encoding (delta-of-delta timestamps, XOR-compressed floats); sensor_aggregates_1m (sensor_id, avg, min, max, stddev, count, bucket), a 1-minute continuous aggregate; sensor_aggregates_1h, the same at 1-hour resolution
  • PostgreSQL: sensors (sensor_id, name, location_json, unit, calibration_coefficients_json, alert_config_json, owner_id, status)
  • Redis: sensor:{sensor_id}:meta (calibration and alert thresholds, TTL 1h); anomaly_dedup:{sensor_id}:{type} (deduplication set, TTL 5m)
  • S3: long-term archive (Parquet, 90+ days, partitioned by sensor_id/year/month)

API Design

  • POST /ingest/{sensor_id} — body: {value, timestamp, unit} or batch array, validates, publishes to Kafka; rate-limited per sensor_id (see the rate-limiter sketch after this list)
  • GET /sensors/{sensor_id}/readings?from={ts}&to={ts}&resolution={1m|1h|1d} — returns time-series data from TimescaleDB; resolution maps to continuous aggregate table
  • GET /sensors/{sensor_id}/anomalies?from={ts}&to={ts} — returns anomaly events with context (value at time, threshold, 3σ bound)
  • PUT /sensors/{sensor_id}/config — body: {alert_thresholds, calibration_coefficients}, updates sensor registry in PostgreSQL and invalidates Redis cache
  • POST /alerts/subscribe — body: {sensor_ids[], webhook_url, alert_types[]}, registers webhook for alert delivery
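
The per-sensor rate limit on POST /ingest could be a fixed-window counter in the same Redis that backs the registry cache. A sketch using the Jedis client; the key layout and the one-second window are illustrative choices.

```java
import redis.clients.jedis.Jedis;

/** Fixed-window, per-sensor rate limiter for the ingest endpoint. */
public class IngestRateLimiter {
    private final Jedis redis;
    private final long limitPerSecond;

    public IngestRateLimiter(Jedis redis, long limitPerSecond) {
        this.redis = redis;
        this.limitPerSecond = limitPerSecond;
    }

    public boolean allow(String sensorId) {
        long windowSec = System.currentTimeMillis() / 1000;
        String key = "ratelimit:" + sensorId + ":" + windowSec;
        long count = redis.incr(key);
        if (count == 1) {
            // First hit in this window: expire the key shortly after the window ends.
            redis.expire(key, 2);
        }
        return count <= limitPerSecond;
    }
}
```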

Scaling & Bottlenecks

Kafka ingestion at 5 GB/second is the primary throughput challenge. Kafka's write throughput is limited mainly by disk I/O (sequential log writes) and network. With NVMe SSDs and log compaction disabled on the high-throughput topics, a single broker sustains ~500 MB/second of writes, so 50 brokers absorb the replicated load of 15 GB/second (5 GB/second × 3× replication) at ~300 MB/second each. Flink's state store (RocksDB) holds the 1M keyed sensor states of the aggregation operator in ~1 GB of state in total at 1 KB per sensor. With 1,000 Flink task slots across 100 task managers, each task manager holds ~10k sensor states, roughly 10 MB of state, which is trivial.

TimescaleDB write throughput: 50M rows/second cannot be written row by row. The Flink JDBC sink batches 10,000 rows per write, reducing the load to ~5,000 batched writes/second; spread across a 10-node TimescaleDB cluster, each node receives ~500 batched writes/second, a manageable rate. The continuous aggregate refresh (1-minute aggregates) runs every 30 seconds via TimescaleDB's scheduled background workers, decoupled from the insert path.
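
A sketch of that batching with the Flink JDBC sink's execution options, assuming a ProcessedReading POJO and the sensor_readings table from the database design; connection details are illustrative.

```java
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;

// `processed` is the calibrated DataStream<ProcessedReading> from the Flink job.
processed.addSink(JdbcSink.sink(
        "INSERT INTO sensor_readings (sensor_id, value, unit, quality_score, recorded_at) "
                + "VALUES (?, ?, ?, ?, ?)",
        (stmt, r) -> {   // bind one reading per row
            stmt.setString(1, r.sensorId);
            stmt.setDouble(2, r.value);
            stmt.setString(3, r.unit);
            stmt.setDouble(4, r.qualityScore);
            stmt.setTimestamp(5, r.recordedAt);   // java.sql.Timestamp
        },
        JdbcExecutionOptions.builder()
                .withBatchSize(10_000)       // 50M rows/s -> ~5,000 batched writes/s
                .withBatchIntervalMs(200)    // flush at least every 200 ms for latency
                .withMaxRetries(3)
                .build(),
        new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                .withUrl("jdbc:postgresql://timescale:5432/sensors")
                .withDriverName("org.postgresql.Driver")
                .build()));
```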

Key Trade-offs

  • Static vs. dynamic anomaly detection: Static thresholds are simple to configure and explain to operators but miss context-dependent anomalies (e.g., a temperature of 50°C is normal in a furnace room but anomalous in an office); dynamic 3σ detection adapts but takes time to calibrate for new sensors and generates false positives during sudden legitimate regime changes.
  • Protobuf vs. JSON for sensor payloads: Protobuf reduces payload size by 5-10× compared to JSON (100 bytes vs. 500 bytes per reading) and is significantly faster to serialize/deserialize, but requires schema management (registry) and client library support; JSON is universally supported but wastes bandwidth at 50M messages/second.
  • TimescaleDB vs. InfluxDB vs. Apache Druid: TimescaleDB offers SQL and PostgreSQL compatibility; InfluxDB excels at single-metric time series with a purpose-built storage engine; Druid provides the best query performance for large-scale analytics aggregations. For a mixed workload (operational queries + analytics), TimescaleDB is the best general choice.
  • In-stream vs. out-of-stream anomaly detection: Running anomaly detection in the Flink stream provides sub-second latency but limits model complexity; offloading to a separate ML inference service (calling a REST API per reading) adds ~50ms latency but enables complex models. For 50M readings/second, the REST API approach would require 50M API calls/second — impractical; in-stream Flink operators are the only viable option at this scale.
