System Design: IoT Analytics Platform
Design a scalable IoT analytics platform that processes telemetry from millions of devices, generates operational insights, supports ad-hoc queries over historical data, and powers real-time dashboards for device fleet operators.
Requirements
Functional Requirements:
- Aggregate and analyze telemetry from millions of IoT devices over arbitrary time windows
- Support ad-hoc queries: operators can query device data with custom filters (device group, region, metric type, time range)
- Pre-built dashboards: fleet health overview, device uptime SLAs, resource utilization trends
- Alerting: configurable alerts on aggregated metrics (e.g., "average temperature across fleet > 80°C for 10 minutes")
- Predictive analytics: trend extrapolation and anomaly forecasting for proactive maintenance
- Data export: CSV/Parquet export of query results for external tools
Non-Functional Requirements:
- Ad-hoc queries over 1 year of data for 10 million devices must return in under 10 seconds
- Dashboard refresh latency under 5 seconds for pre-built views
- Support 1,000 concurrent analysts running queries simultaneously
- Data ingested from IoT pipeline available for analytics within 5 minutes of device measurement (near-real-time)
- Retain raw data for 90 days; aggregated data for 3 years
Scale Estimation
10M devices × 10 readings/second = 100M readings/second entering the IoT pipeline. The analytics platform ingests the pipeline's processed stream (validated, transformed readings, not the raw feed) at 1 reading/second per device after aggregation, i.e. 10M rows/second. Daily volume: 10M × 86,400 = 864B rows/day. Compressed in Parquet at ~50 bytes/row: ~43 TB/day, or ~3.9 PB over the 90-day raw retention window. One year of a single metric for one device at that rate is ~31.5M rows; a fleet-wide query across all 10M devices, even at 1-minute aggregate resolution, touches ~5.3T rows. This is why columnar storage with aggressive partitioning and pruning is non-negotiable.
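A quick sanity check of this arithmetic (a throwaway script; the constants simply mirror the estimates above):

```python
# Back-of-envelope check of the scale numbers above (pure arithmetic, no I/O).
DEVICES = 10_000_000
POST_AGG_HZ = 1                      # readings/second/device after pipeline aggregation
BYTES_PER_ROW = 50                   # assumed compressed Parquet footprint

rows_per_sec = DEVICES * POST_AGG_HZ                  # 10M rows/s
rows_per_day = rows_per_sec * 86_400                  # 864B rows/day
tb_per_day = rows_per_day * BYTES_PER_ROW / 1e12      # ~43 TB/day
pb_raw_90d = tb_per_day * 90 / 1000                   # ~3.9 PB

rows_per_device_year = POST_AGG_HZ * 86_400 * 365     # ~31.5M rows
fleet_year_1m_agg = DEVICES * 60 * 24 * 365           # ~5.3T rows at 1-min resolution

print(f"{rows_per_day:.3e} rows/day, {tb_per_day:.1f} TB/day, {pb_raw_90d:.1f} PB/90d")
print(f"{rows_per_device_year:.2e} rows/device-year, {fleet_year_1m_agg:.2e} rows fleet-year")
```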
High-Level Architecture
The analytics platform is built on a Lambda architecture with a hot tier (real-time, last 7 days) and a cold tier (historical, 7 days to 3 years). The hot tier uses a columnar time-series database (Apache Druid or ClickHouse) for sub-second queries on recent data. The cold tier uses S3-backed Parquet files queryable via Apache Spark or Presto/Trino for historical ad-hoc analysis. Pre-built dashboards query the hot tier; ad-hoc historical queries go to the cold tier. A query router service directs queries to the appropriate tier based on the time range requested.
Data flows from the IoT pipeline's Kafka output topic into the analytics platform via two paths: (1) ClickHouse consumes the topic and writes device readings into a distributed ReplicatedMergeTree table, partitioned by month and ordered by device and time, making them available for queries within ~5 seconds of ingestion; (2) an Apache Flink job consumes the same topic and writes hourly Parquet files to S3, partitioned by date/device_group/metric. ClickHouse retains 7 days; S3 retains 3 years. Queries over data older than 7 days are automatically routed to the Presto query engine over S3.
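For illustration, a minimal stand-in for the archiver path using pyarrow rather than Flink; the bucket path and record shape are assumptions:

```python
# Minimal sketch of the S3 archiver path using pyarrow instead of Flink
# (illustrative only; the root path and record fields are placeholders).
import pyarrow as pa
import pyarrow.parquet as pq

def archive_batch(readings: list[dict], root_path: str = "analytics-bucket") -> None:
    """Write one hourly batch of readings as Parquet, partitioned by
    date/device_group/metric_name so Presto can prune by path."""
    table = pa.Table.from_pylist(readings)
    pq.write_to_dataset(table, root_path=root_path,
                        partition_cols=["date", "device_group", "metric_name"])

archive_batch([
    {"date": "2024-05-01", "device_group": "fleet-eu", "metric_name": "temperature",
     "device_id": "dev-123", "value": 71.4, "recorded_at": 1714521600},
])
```

write_to_dataset lays files out in Hive-style key=value directories; the production layout in the Database Design section is equivalent in spirit, and either can be registered in the Glue catalog for partition pruning.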
Pre-built dashboards use a materialized view layer: ClickHouse materialized views maintain per-fleet, per-region, and per-device-group aggregates in 1-minute buckets. These views are populated incrementally on insert (ClickHouse materialized views are insert-triggered rather than refreshed on a schedule), so dashboard queries read from them instead of scanning raw data and get sub-second response times regardless of fleet size.
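As a concrete illustration, a hedged sketch of the fleet-level aggregate view in ClickHouse SQL, executed via the clickhouse-driver Python client. The device_fleet_dict dictionary mapping devices to fleets is an assumption not specified above; the device_readings source table is defined under Core Components below.

```python
# Sketch of the fleet_aggregates_1m materialized view (ClickHouse SQL via
# clickhouse-driver). The device_fleet_dict dictionary is an assumption.
from clickhouse_driver import Client

DDL = """
CREATE MATERIALIZED VIEW IF NOT EXISTS fleet_aggregates_1m
ENGINE = AggregatingMergeTree()
ORDER BY (fleet_id, metric_name, bucket)
AS SELECT
    dictGet('device_fleet_dict', 'fleet_id', device_id) AS fleet_id,
    metric_name,
    toStartOfMinute(recorded_at) AS bucket,
    avgState(value)              AS avg_value,
    minState(value)              AS min_value,
    maxState(value)              AS max_value,
    uniqState(device_id)         AS device_count
FROM device_readings
GROUP BY fleet_id, metric_name, bucket
"""

Client("localhost").execute(DDL)
```

Because the view stores aggregate function states, dashboard queries finalize them with the -Merge combinators, e.g. avgMerge(avg_value) and uniqMerge(device_count) grouped by bucket.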
Core Components
ClickHouse Ingestion Layer
ClickHouse's native Kafka engine table reads directly from Kafka topics (no separate consumer process needed); a materialized view forwards each consumed block into a ReplicatedMergeTree table partitioned by toYYYYMM(recorded_at) and ordered by (device_id, recorded_at). Partitioning directly by device_id would be an anti-pattern here (10M devices would create millions of tiny partitions), so per-device locality comes from the sort key instead. ClickHouse's columnar storage with LZ4 compression achieves roughly 10:1 compression for sensor time-series data. The ingestion rate of 10M rows/second is handled by a cluster of 20 shards (each shard 3-way replicated), with the Kafka engine consuming in large blocks so MergeTree receives big batches rather than a stream of small inserts, avoiding excessive write amplification.
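A minimal sketch of that three-part pattern (Kafka engine table → materialized view → ReplicatedMergeTree target), assuming a processed-readings topic with JSONEachRow messages; broker, topic, and column types are illustrative:

```python
# Sketch of the Kafka-engine ingestion pipeline (ClickHouse SQL via
# clickhouse-driver); topic, cluster paths, and types are assumptions.
from clickhouse_driver import Client

client = Client("localhost")
for ddl in [
    # 1. Kafka engine table: a streaming cursor over the topic, not storage.
    """CREATE TABLE IF NOT EXISTS readings_queue (
           device_id String, metric_name String, value Float64,
           unit String, quality UInt8, recorded_at DateTime)
       ENGINE = Kafka
       SETTINGS kafka_broker_list = 'kafka:9092',
                kafka_topic_list = 'processed-readings',
                kafka_group_name = 'clickhouse-ingest',
                kafka_format = 'JSONEachRow'""",
    # 2. Target table: month partitions, sorted for per-device range scans.
    """CREATE TABLE IF NOT EXISTS device_readings (
           device_id String, metric_name String, value Float64,
           unit String, quality UInt8, recorded_at DateTime)
       ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/device_readings', '{replica}')
       PARTITION BY toYYYYMM(recorded_at)
       ORDER BY (device_id, recorded_at)""",
    # 3. Materialized view: moves each consumed block from queue to storage.
    """CREATE MATERIALIZED VIEW IF NOT EXISTS readings_mv
       TO device_readings AS SELECT * FROM readings_queue""",
]:
    client.execute(ddl)
```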
Query Router and Optimizer
The query router accepts SQL-like queries from the dashboard API and analyst tools. It parses the time range predicate to determine if the query can be fully served from ClickHouse (<7 days), must go to S3 (>7 days), or spans both tiers. For cross-tier queries, the router executes both sub-queries in parallel, merges the results (ClickHouse for recent, Presto for historical), and returns the unified result set. The router also applies automatic partition pruning hints: for queries filtered by device_group or region, it adds the corresponding Parquet partition path prefix to the Presto query, reducing data scanned by orders of magnitude.
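A simplified version of the routing decision, with stub executor functions standing in for the real ClickHouse and Presto clients:

```python
# Simplified tier-routing logic for the query router (illustrative;
# the executor functions are stubs for real ClickHouse/Presto clients).
from datetime import datetime, timedelta, timezone

HOT_TIER_WINDOW = timedelta(days=7)

def route_query(sql: str, start: datetime, end: datetime) -> list:
    """Split a query's time range across the hot (ClickHouse) and
    cold (Presto-on-S3) tiers, then merge the partial results."""
    boundary = datetime.now(timezone.utc) - HOT_TIER_WINDOW
    if start >= boundary:                  # entirely recent -> hot tier
        return run_clickhouse(sql, start, end)
    if end <= boundary:                    # entirely historical -> cold tier
        return run_presto(sql, start, end)
    # Spans both tiers: issue both sub-queries (in parallel in the real
    # system) and concatenate the row sets.
    return run_presto(sql, start, boundary) + run_clickhouse(sql, boundary, end)

def run_clickhouse(sql, start, end):       # stub
    return [("clickhouse", start.isoformat(), end.isoformat())]

def run_presto(sql, start, end):           # stub
    return [("presto", start.isoformat(), end.isoformat())]

now = datetime.now(timezone.utc)
print(route_query("SELECT avg(value) FROM device_readings",
                  start=now - timedelta(days=30), end=now))
```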
Alerting Engine
The alerting engine evaluates fleet-level alert rules (e.g., "avg(temperature) across fleet_id=X > 80 for 10 minutes") by querying ClickHouse every 60 seconds. Alert rules are stored in PostgreSQL with evaluation frequency, threshold, aggregation function, and notification targets. The engine maintains alert state (firing/resolved) in Redis to implement alert hysteresis: an alert fires only when the condition holds for the full duration and resolves only after 2 consecutive normal evaluations, preventing flapping. Notifications are sent via PagerDuty, Slack webhooks, or email through the notification gateway service.
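A sketch of one evaluation tick of the hysteresis logic using redis-py; the key layout follows the alert:state:{rule_id} convention below, while the field names and counter scheme are assumptions:

```python
# One evaluation tick of the alerting engine's hysteresis state machine
# (redis-py; field names and counter scheme are illustrative).
import redis

r = redis.Redis()

def evaluate_rule(rule_id: str, condition_met: bool,
                  duration_evals: int, resolve_evals: int = 2) -> str | None:
    """Return "fire"/"resolve" on a state transition, else None.
    Fires only after `duration_evals` consecutive breaches; resolves only
    after `resolve_evals` consecutive normal evaluations (anti-flapping)."""
    key = f"alert:state:{rule_id}"
    state = r.hgetall(key)
    firing = state.get(b"firing", b"0") == b"1"
    streak = int(state.get(b"streak", 0))

    if condition_met:
        streak = streak + 1 if not firing else 0   # breaches count toward firing
        if not firing and streak >= duration_evals:
            r.hset(key, mapping={"firing": 1, "streak": 0})
            return "fire"
    else:
        streak = streak + 1 if firing else 0       # normals count toward resolve
        if firing and streak >= resolve_evals:
            r.hset(key, mapping={"firing": 0, "streak": 0})
            return "resolve"
    r.hset(key, mapping={"streak": streak})
    return None
```

With the 60-second evaluation cadence, a 10-minute duration corresponds to duration_evals=10.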
Database Design
ClickHouse: device_readings (device_id, metric_name, value, unit, quality, recorded_at), a ReplicatedMergeTree partitioned by month (toYYYYMM(recorded_at)) with ORDER BY (device_id, recorded_at) for per-device range scans; fleet_aggregates_1m (fleet_id, metric_name, avg_value, min_value, max_value, device_count, bucket), an insert-populated materialized view aggregating into 1-minute buckets. S3: Parquet files at s3://analytics-bucket/{year}/{month}/{day}/{device_group}/{metric_name}/data.parquet, registered as external tables in the AWS Glue catalog for Presto/Athena queries. PostgreSQL: alert_rules (rule_id, fleet_id, condition_json, duration_minutes, notification_config_json, status), alert_events (event_id, rule_id, fired_at, resolved_at, peak_value), query_history (query_id, analyst_id, query_text, duration_ms, rows_scanned, executed_at). Redis: alert:state:{rule_id} (current alert state and consecutive-evaluation counter).
API Design
- POST /query: body {sql, time_range, output_format}; routes to ClickHouse or Presto and returns {columns, rows, duration_ms, rows_scanned}; long-running queries return a job_id for async polling
- GET /dashboards/{dashboard_id}: returns pre-built dashboard data from ClickHouse materialized views; cached in Redis for 60 seconds
- GET /devices/{device_id}/metrics?metric={m}&from={ts}&to={ts}&resolution={auto|1m|1h|1d}: returns the time-series for a specific metric, auto-selecting resolution based on the time range
- POST /alerts/rules: body {fleet_id, condition, duration_minutes, notification_channels[]}; creates an alert rule
- POST /export: body {query, format: parquet|csv}; triggers an async export to S3 and returns a download URL when complete
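An example client flow for the async query path using the requests library. The base URL and the job-polling endpoint are placeholders; the design above specifies only that long-running queries return a job_id.

```python
# Example client flow for POST /query with async polling (requests;
# base URL and the /query/jobs/{id} polling endpoint are assumptions).
import time
import requests

BASE = "https://analytics.example.com"

resp = requests.post(f"{BASE}/query", json={
    "sql": "SELECT avg(value) FROM device_readings WHERE metric_name = 'temperature'",
    "time_range": {"from": "2024-01-01T00:00:00Z", "to": "2024-03-31T23:59:59Z"},
    "output_format": "json",
}).json()

job_id = resp.get("job_id")
if job_id:                                # long-running: poll until complete
    while True:
        resp = requests.get(f"{BASE}/query/jobs/{job_id}").json()
        if resp.get("status") != "running":
            break
        time.sleep(2)

print(resp["columns"], resp["duration_ms"], resp["rows_scanned"])
```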
Scaling & Bottlenecks
ClickHouse write throughput at 10M rows/second: ClickHouse can ingest roughly 1M rows/second per shard with large batched inserts. 20 shards × 1M rows/second = 20M rows/second capacity, giving 2× headroom. The bottleneck then shifts to Kafka consumer throughput (the ClickHouse Kafka engine itself). With 200 Kafka partitions and 20 shards each consuming 10 partitions, each shard processes 500k rows/second from Kafka, well within limits.
Ad-hoc queries on S3 via Presto: a fleet-wide query scanning 1 year of 1-minute aggregates touches ~5.3T rows. Presto with 500 workers, each scanning an optimistic ~1B rows/second from columnar Parquet (column pruning and predicate pushdown on a few numeric columns), completes a full scan in roughly 10 seconds, right at the SLA boundary. Partitioning by device_group is what makes typical queries fast: most ad-hoc queries filter by fleet or device group (1,000 groups), so they scan ~1/1,000th of the data, about 5.3B rows, finishing in around a second once planning and S3 listing overhead are included.
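The arithmetic behind both bottleneck estimates, as a worked script (the per-shard ingest rate and per-worker scan rate are the assumptions stated above):

```python
# Arithmetic behind the two bottleneck estimates above (pure calculation).
SHARDS, ROWS_PER_SHARD_SEC = 20, 1_000_000
ingest_capacity = SHARDS * ROWS_PER_SHARD_SEC          # 20M rows/s vs 10M needed
headroom = ingest_capacity / 10_000_000                # 2.0x

FLEET_YEAR_ROWS = 10_000_000 * 60 * 24 * 365           # ~5.3e12 (1-min aggregates)
WORKERS, ROWS_PER_WORKER_SEC = 500, 1_000_000_000      # optimistic scan rate
full_scan_secs = FLEET_YEAR_ROWS / (WORKERS * ROWS_PER_WORKER_SEC)   # ~10.5 s
pruned_scan_secs = full_scan_secs / 1_000              # ~0.01 s; fixed overhead dominates

print(f"ingest headroom {headroom:.1f}x, full scan {full_scan_secs:.1f}s, "
      f"pruned scan {pruned_scan_secs * 1000:.0f}ms + planning overhead")
```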
Key Trade-offs
- ClickHouse vs. Apache Druid: Both are columnar OLAP databases. Druid excels at time-series rollups with pre-aggregation (fast dashboard queries) but less flexible for ad-hoc SQL; ClickHouse handles both use cases well with better SQL compatibility and simpler operations. For pure time-series analytics, either works.
- Near-real-time ingestion vs. true streaming: sub-second end-to-end analytics would require something like Flink → Druid real-time ingestion, adding operational complexity; the ClickHouse Kafka engine path lands data within seconds to minutes, is operationally simpler, and sits comfortably inside the 5-minute freshness requirement for fleet health dashboards.
- Materialized views vs. on-demand aggregation: Pre-computed materialized views (fleet-level aggregates) make dashboard queries sub-second but consume storage and refresh compute; on-demand aggregation is always current but too slow for fleet-wide queries.
- Presto on S3 vs. Redshift for historical: Presto/Athena on S3 uses pay-per-query pricing (no always-on cluster cost) and scales elastically with S3; Redshift requires a provisioned cluster with reserved-capacity pricing but provides faster queries for repeated access patterns. For intermittent historical queries, Athena is more cost-effective.