
System Design: IoT Data Ingestion Platform

Design a scalable IoT data ingestion platform capable of receiving telemetry from millions of connected devices via MQTT and HTTP, with device authentication, data validation, and routing to downstream storage and processing systems.

Updated Jan 15, 2025
Tags: system-design · iot · mqtt · device-provisioning · time-series

Requirements

Functional Requirements:

  • Receive telemetry data from millions of IoT devices over MQTT (primary), HTTP/REST, and CoAP protocols
  • Device authentication via X.509 certificates and pre-shared keys (PSK)
  • Message routing: route device messages to appropriate downstream consumers based on device type, topic, and rules
  • Device provisioning: new devices self-register and receive credentials without manual intervention
  • Command and control: send commands from cloud to device (bidirectional communication)
  • Data validation: schema validation and outlier detection before data reaches downstream systems

Non-Functional Requirements:

  • Ingest 10 million messages per second at peak
  • Support 100 million connected devices with persistent MQTT sessions
  • Message latency from device to downstream consumer under 500ms at p95
  • Zero message loss for QoS 1 and QoS 2 MQTT messages
  • 99.99% availability (less than 1 hour downtime per year)

Scale Estimation

100M connected devices, with 10% actively publishing at any time = 10M publishing devices. At 1 message/second average publish rate, that's 10M messages/second. With an average message size of 200 bytes, that's 2 GB/second of inbound data. MQTT broker memory for persistent sessions: 100M sessions × 1 KB session state (subscriptions, pending QoS queues) = 100 GB — requires a distributed MQTT broker cluster. Each MQTT broker node handles ~100k concurrent connections (a conservative per-node limit for a tuned broker), so 1,000 broker nodes are needed for 100M devices. With 20% spare capacity, provision 1,200 broker nodes.
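The estimate above is easy to keep honest as a small capacity model. This is a sketch; every input is an assumption restated from the text, not a measured value:

```python
# Back-of-envelope capacity model for the numbers in the text.
DEVICES = 100_000_000
ACTIVE_FRACTION = 0.10          # 10% of devices publishing at any time
MSG_RATE_PER_DEVICE = 1         # messages/second per active device
MSG_SIZE_BYTES = 200
SESSION_STATE_BYTES = 1_024     # ~1 KB of session state per device
CONNS_PER_BROKER = 100_000      # assumed per-node connection limit
SPARE_CAPACITY = 0.20

msgs_per_sec = DEVICES * ACTIVE_FRACTION * MSG_RATE_PER_DEVICE   # 10M msg/s
inbound_gb_per_sec = msgs_per_sec * MSG_SIZE_BYTES / 1e9         # 2 GB/s
session_store_gb = DEVICES * SESSION_STATE_BYTES / 1e9           # ~102 GB
base_brokers = DEVICES // CONNS_PER_BROKER                       # 1,000 nodes
provisioned_brokers = int(base_brokers * (1 + SPARE_CAPACITY))   # 1,200 nodes
```

Changing any single assumption (say, 5 messages/second per device) immediately reprices the whole fleet, which is the point of writing the model down.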

High-Level Architecture

The platform uses a tiered architecture: a device-facing MQTT/HTTP ingestion tier, a message routing and enrichment tier, and a downstream fan-out tier. The ingestion tier consists of a horizontally scaled MQTT broker cluster (VerneMQ, EMQX, or HiveMQ) fronted by a Layer 4 load balancer (NLB) that provides stable long-lived TCP connections. MQTT's persistent session semantics (QoS 1/2 guarantee delivery) require session state to be shared across broker nodes — this is handled by a distributed session store (Redis Cluster) that all broker nodes write to and read from, enabling any broker to serve any device's messages.

The routing tier uses an Apache Kafka cluster as the backbone. All inbound device messages (after authentication and schema validation) are published to Kafka topics organized by device type and region. A rules engine service (consuming from Kafka) evaluates configurable routing rules (JSON-based DSL) against each message and routes copies to appropriate downstream Kafka topics: time-series storage topic, stream processing topic, alert evaluation topic, and raw archive topic. The rules engine scales horizontally by Kafka partition — more partitions = more parallel rule evaluation.

Downstream fan-out: time-series topic is consumed by TimescaleDB or InfluxDB writer services (batching writes for efficiency), archive topic is consumed by an S3 writer (Parquet files, hourly), and stream processing topic is consumed by Flink jobs for real-time anomaly detection and aggregation. The result is a fully decoupled pipeline where ingestion rate is independent of downstream processing speed — Kafka provides the buffer.
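The "batching writes for efficiency" step in the time-series writer is worth making concrete. A minimal sketch of the buffering logic, with hypothetical names — in production this would sit inside a Kafka consumer loop and issue one multi-row INSERT into TimescaleDB per flush:

```python
import time

class BatchWriter:
    """Buffer rows and flush when either a size or an age threshold is hit,
    turning N single-row writes into one batched write per flush."""

    def __init__(self, sink, max_rows=10_000, max_age_s=1.0):
        self.sink = sink              # callable that receives a list of rows
        self.max_rows = max_rows
        self.max_age_s = max_age_s
        self.buffer = []
        self.first_row_at = None

    def add(self, row):
        if not self.buffer:
            self.first_row_at = time.monotonic()
        self.buffer.append(row)
        if (len(self.buffer) >= self.max_rows
                or time.monotonic() - self.first_row_at >= self.max_age_s):
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)    # one batched write instead of N singles
            self.buffer = []
```

The age threshold caps the latency a row can sit in the buffer, so batching does not fight the 500ms p95 end-to-end latency target.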

Core Components

MQTT Broker Cluster

The MQTT broker cluster uses VerneMQ with a Redis-backed distributed session store. Device connections are distributed across brokers via consistent hashing on device_id (ensuring a device always reconnects to the same broker shard for session continuity). When a device publishes a message, the broker authenticates the client certificate against the device registry (Redis cache, populated from PostgreSQL device database), validates the message schema (Avro/JSON schema), and publishes to Kafka. QoS 1 messages are acknowledged to the device only after successful Kafka publication (at-least-once delivery guarantee). QoS 2 (exactly-once) is supported but discouraged for high-frequency telemetry due to 2× round-trip cost.
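The ordering in the QoS 1 path — PUBACK only after the Kafka write — is what turns "zero message loss" into device-side retries. A sketch of that handler, assuming hypothetical stand-ins (`device_registry`, `kafka_produce`, `send_puback`) for the broker's plugin hooks:

```python
def handle_publish(msg, device_registry, kafka_produce, send_puback):
    """Authenticate, validate, publish to Kafka, then (and only then) ack."""
    device = device_registry.get(msg["device_id"])
    if device is None or device.get("revoked"):
        return "rejected"                        # unknown or revoked device
    if "temperature" not in msg["payload"]:      # stand-in for schema check
        return "invalid"
    kafka_produce(topic=f"telemetry.{device['type']}", value=msg["payload"])
    send_puback(msg["packet_id"])                # ack only after durable publish
    return "acked"
```

If the process dies between the Kafka write and the PUBACK, the device redelivers and the message appears twice downstream — the at-least-once guarantee, with deduplication left to consumers.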

Device Provisioning Service

Zero-touch device provisioning uses a two-phase enrollment. Phase 1 (factory provisioning): at manufacture time, devices receive a unique device ID and a bootstrap certificate (a short-lived cert signed by the manufacturer CA). Phase 2 (fleet provisioning): on first boot, the device connects to the provisioning endpoint using its bootstrap certificate and sends its device model, firmware version, and hardware ID. The provisioning service validates the bootstrap cert, looks up the device model in the product registry, assigns the device to the appropriate device group, generates a long-lived operational X.509 certificate signed by the IoT platform CA, and returns it to the device along with its assigned MQTT broker endpoint. The device stores the operational cert and uses it for all subsequent connections. Provisioning state is tracked in PostgreSQL.
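The phase-2 exchange can be sketched as a single request handler. CA signing and the mTLS handshake are elided; `PRODUCT_REGISTRY`, `issue_operational_cert`, and the endpoint value are illustrative assumptions, not the platform's actual names:

```python
import datetime

# Hypothetical product registry: model -> device group and broker assignment.
PRODUCT_REGISTRY = {
    "thermo-v2": {"group": "temperature_sensors",
                  "broker": "mqtt-us-east.example.com:8883"},
}

def provision(request, bootstrap_cert_valid, issue_operational_cert, db):
    """Validate the bootstrap cert, look up the model, issue an operational
    cert, record provisioning state, and return the broker assignment."""
    if not bootstrap_cert_valid:
        return {"error": "invalid bootstrap certificate"}
    product = PRODUCT_REGISTRY.get(request["model"])
    if product is None:
        return {"error": "unknown device model"}
    cert_pem = issue_operational_cert(request["device_id"])
    db[request["device_id"]] = {          # provisioning state (PostgreSQL row)
        "group": product["group"],
        "provisioned_at": datetime.datetime.now(datetime.timezone.utc),
    }
    return {"operational_cert_pem": cert_pem,
            "broker_endpoint": product["broker"]}
```

The bootstrap cert being short-lived matters here: a leaked factory cert is only useful until its expiry, while the operational cert it exchanges for is bound to one device ID.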

Rules Engine

The rules engine evaluates routing rules per message. Rules are defined in a simple DSL: {if: {device_type: "temperature_sensor", topic: "telemetry"}, then: [{route_to: "timeseries_ingest"}, {alert_if: "payload.temperature > 80"}]}. Rules are stored in PostgreSQL and cached in the rules engine worker memory. Hot reload: when rules change (new rule added by an operator), a Kafka rules-update message is published; all rule engine workers consume the update and reload their rule cache within 1 second. The engine evaluates rules against each message using a compiled condition tree (not interpreted JSON) — compiled on cache load using a JIT expression evaluator.
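A much-simplified stand-in for the compile step: turning the rule's `if` clause into a closure once at cache-load time, so per-message evaluation is dictionary lookups rather than re-interpreting JSON. (The real engine described above additionally JIT-compiles expressions like `payload.temperature > 80`, which this sketch omits.)

```python
def compile_condition(cond):
    """Compile an equality-match condition dict into a predicate closure."""
    items = list(cond.items())           # snapshot: {"device_type": ..., ...}
    def predicate(msg):
        return all(msg.get(k) == v for k, v in items)
    return predicate

# The rule from the text, compiled once when the rule cache loads.
rule = {"if": {"device_type": "temperature_sensor", "topic": "telemetry"},
        "then": [{"route_to": "timeseries_ingest"},
                 {"alert_if": "payload.temperature > 80"}]}
matches = compile_condition(rule["if"])
```

On a rules-update message, workers simply rebuild this list of compiled predicates, which is why the reload fits in the 1-second budget.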

Database Design

PostgreSQL: devices (device_id, fleet_id, model, firmware_version, status, provisioned_at, last_seen_at, attributes_json), device_certificates (cert_id, device_id, cert_serial, issued_at, expires_at, revoked_at), fleets (fleet_id, org_id, name, routing_rules_json), routing_rules (rule_id, fleet_id, condition_json, actions_json, priority, enabled). Redis Cluster: device:{device_id}:session (MQTT session state), device:{device_id}:auth (cached auth result, TTL 1h), device:{device_id}:shadow (device shadow/digital twin current state). TimescaleDB: device_telemetry (device_id, metric_name, value, unit, recorded_at) — hypertable partitioned by time, compressed after 7 days. S3: raw message archive in Parquet, partitioned by fleet/date.

API Design

  • MQTT {fleet_id}/devices/{device_id}/telemetry — devices publish telemetry; broker validates cert, publishes to Kafka
  • MQTT {fleet_id}/devices/{device_id}/commands — cloud publishes commands; broker delivers to device; QoS 1 for delivery guarantee
  • POST /provision (HTTPS, bootstrap cert auth) — body: {device_id, model, firmware_version}, returns {operational_cert_pem, broker_endpoint, device_shadow_url}
  • GET /devices/{device_id}/shadow — returns current device shadow (last known state + desired state delta)
  • PATCH /devices/{device_id}/shadow/desired — body: desired state JSON, publishes command to device via MQTT, updates shadow
  • POST /devices/{device_id}/certificates/revoke — revokes device cert, adds serial to CRL, device cannot reconnect
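The "desired state delta" returned by the shadow endpoints is just the set of keys where the cloud's desired state diverges from what the device last reported. A minimal sketch, with illustrative field names:

```python
def shadow_delta(reported, desired):
    """Return the desired-state keys the device has not yet converged to."""
    return {k: v for k, v in desired.items() if reported.get(k) != v}
```

When the device applies a command and reports back, the delta shrinks to `{}` — the signal that reported and desired state have converged.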

Scaling & Bottlenecks

MQTT broker connection capacity is the primary scaling limit: 100M persistent sessions require 1,000 broker nodes. Use VerneMQ's clustering mode with Redis for distributed session state, or use a modern broker like EMQX (which stores sessions in an internal Mnesia/distributed Erlang store). Network bandwidth per broker: 100k devices × 1 message/second × 200 bytes = 20 MB/second per broker, well within a 10 Gbps NIC. The connection establishment rate during a mass reboot event (e.g., a firmware update that restarts all devices) is the real spike: 100M devices reconnecting within 5 minutes = 333k connections/second — require exponential backoff with jitter in device firmware to spread reconnection over 30 minutes.
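The firmware-side backoff can be sketched as "full jitter" exponential backoff (the AWS-popularized variant): each retry picks a uniform delay up to an exponentially growing cap, so a fleet-wide reconnect storm spreads out instead of synchronizing. Parameter values are illustrative:

```python
import random

def reconnect_delay(attempt, base=1.0, cap=1800.0):
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)].
    cap=1800 s spreads reconnection over up to 30 minutes."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

The jitter is the load-shedding mechanism here: plain exponential backoff without jitter leaves devices retrying in lockstep waves, which is exactly the 333k connections/second spike the brokers cannot absorb.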

Kafka throughput at 10M messages/second with 200-byte messages = 2 GB/second. A 100-broker Kafka cluster with 3× replication and 1,000 partitions handles this comfortably. The time-series database write path is the downstream bottleneck: if the full 10M rows/second lands in the time-series path, 50 TimescaleDB nodes must each absorb ~200k rows/second of batched writes. Compression helps storage, not write throughput: at roughly 10× compression for repetitive sensor data, raw volume drops from ~170 TB/day (2 GB/second × 86,400 seconds) to ~17 TB/day for the full pipeline.

Key Trade-offs

  • MQTT vs. HTTP for IoT: MQTT's persistent connections and publish-subscribe model are more efficient for high-frequency, low-payload telemetry (avoids TCP connection overhead per message); HTTP is simpler for infrequent, high-payload messages and works through all proxies. Offering both and routing to the same Kafka backend provides flexibility.
  • QoS 0 vs. QoS 1: QoS 0 (fire-and-forget) avoids the PUBACK round trip and can sustain roughly 3× the throughput of QoS 1 (acknowledged delivery), but accepts message loss; for non-critical telemetry (temperature sensors updating every second), QoS 0 is acceptable — a missed reading is trivially interpolated. For critical events (alarms, commands), QoS 1 is mandatory.
  • X.509 certs vs. JWT tokens: X.509 provides mutual TLS (device-to-cloud and cloud-to-device authentication) at the transport layer without application-level token management; JWTs are simpler to implement but require application-layer validation and token rotation. X.509 with hardware-backed private keys (TPM) is the industry standard for production IoT.
  • TimescaleDB vs. InfluxDB: TimescaleDB extends PostgreSQL (familiar SQL, strong consistency, replication via Postgres tools); InfluxDB is purpose-built for time series with a more compact storage engine. For teams already operating PostgreSQL, TimescaleDB reduces operational overhead; InfluxDB wins on raw time-series query performance.
