System Design: Industrial IoT (IIoT) Platform

Design a scalable Industrial IoT platform for manufacturing environments that collects machine telemetry, detects equipment anomalies, supports predictive maintenance workflows, and integrates with SCADA and MES systems.

16 min read · Updated Jan 15, 2025
system-design · iot · industrial-iot · predictive-maintenance · scada

Requirements

Functional Requirements:

  • Collect high-frequency telemetry (up to 1 kHz) from industrial machines: vibration, temperature, pressure, current, and flow sensors
  • Real-time anomaly detection: detect equipment anomalies before failure using statistical and ML models
  • Predictive maintenance: generate maintenance work orders when failure probability exceeds threshold
  • Integration with SCADA, MES, and ERP systems via OPC-UA and REST APIs
  • Digital twin: maintain a virtual model of each machine reflecting real-time sensor state
  • Historical analysis: process historians, downtime analysis, and OEE (Overall Equipment Effectiveness) calculation

Non-Functional Requirements:

  • High-frequency telemetry ingestion at up to 1 kHz per sensor; 100k sensors per deployment
  • Anomaly detection latency under 100ms from sensor reading to alert on production line
  • 99.999% availability for edge ingestion (factory cannot stop for cloud connectivity loss)
  • OPC-UA compatibility for legacy SCADA integration without protocol translation gateways
  • Data sovereignty: all raw sensor data can be processed on-premises; cloud is opt-in for ML training

Scale Estimation

100k sensors × 100 Hz average = 10M readings/second per deployment. At 16 bytes/reading (float value + timestamp + sensor ID, compact binary), that is 160 MB/second at the factory edge. For a large enterprise with 100 factories: 100 × 10M = 1B readings/second enterprise-wide, or 16 GB/second. Edge aggregation from 100 Hz to 1 Hz reduces the on-premises time-series write load to 100k readings/second per factory, about 1.6 MB/second at 16 bytes/reading. Annual volume per factory at 1 Hz: 100k sensors × 86,400 seconds × 365 days = 3.15T readings/year. Compressed to 8 bytes/reading, that is about 25 TB/year/factory, manageable on on-premises NAS.
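The arithmetic above can be re-derived with a short script. This is only a back-of-envelope sketch: the constant names are made up for illustration, and the byte sizes are the ones assumed in the text (16 bytes raw, 8 bytes compressed). At 16 bytes per reading the 1 Hz stream comes to about 1.6 MB/second per factory.

```python
# Back-of-envelope check of the scale figures in this section.
SENSORS_PER_FACTORY = 100_000
AVG_SAMPLE_RATE_HZ = 100
BYTES_PER_RAW_READING = 16       # float value + timestamp + sensor ID
BYTES_PER_STORED_READING = 8     # compressed size assumed in the text
FACTORIES = 100
SECONDS_PER_YEAR = 86_400 * 365

raw_readings_per_s = SENSORS_PER_FACTORY * AVG_SAMPLE_RATE_HZ
raw_bytes_per_s = raw_readings_per_s * BYTES_PER_RAW_READING
enterprise_readings_per_s = raw_readings_per_s * FACTORIES
enterprise_bytes_per_s = raw_bytes_per_s * FACTORIES

# After edge aggregation every sensor contributes one reading per second.
agg_readings_per_s = SENSORS_PER_FACTORY
agg_bytes_per_s = agg_readings_per_s * BYTES_PER_RAW_READING
readings_per_year = agg_readings_per_s * SECONDS_PER_YEAR
stored_bytes_per_year = readings_per_year * BYTES_PER_STORED_READING

print(f"raw ingest per factory: {raw_readings_per_s / 1e6:.0f}M readings/s, "
      f"{raw_bytes_per_s / 1e6:.0f} MB/s")
print(f"enterprise-wide:        {enterprise_readings_per_s / 1e9:.0f}B readings/s, "
      f"{enterprise_bytes_per_s / 1e9:.0f} GB/s")
print(f"1 Hz historian stream:  {agg_readings_per_s / 1e3:.0f}k readings/s, "
      f"{agg_bytes_per_s / 1e6:.1f} MB/s")
print(f"annual volume:          {readings_per_year / 1e12:.2f}T readings, "
      f"{stored_bytes_per_year / 1e12:.1f} TB compressed")
```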

High-Level Architecture

The IIoT platform uses a hierarchical architecture: Edge → Edge Gateway → Cloud. This hierarchy is essential for industrial environments where network connectivity is unreliable and latency requirements (<100ms) are incompatible with cloud round-trips.

Edge level: sensors connect to edge gateways via industrial protocols (OPC-UA, Modbus, Profinet, EtherNet/IP). The edge gateway (an industrial PC or ruggedized server running the IIoT edge agent) collects data from all connected sensors, applies real-time anomaly detection (streaming statistical model on-device), and stores a local time-series buffer (72 hours of high-resolution data). Critical alerts from on-edge anomaly detection are sent directly to the plant floor SCADA system (OPC-UA server) without cloud involvement.

Edge Gateway level: the gateway aggregates data from its roughly 1,000 connected sensors, applies compression (delta encoding + LZ4), and forwards to the plant-level data historian (time-series DB on-premises: OSIsoft PI or InfluxDB). The historian stores 1-Hz aggregated data for 3 years and 100-Hz raw data for 7 days. The gateway also exposes an OPC-UA server endpoint to existing SCADA systems, providing real-time sensor values via the standard industrial protocol without changing existing SCADA configurations.
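The compression step can be illustrated with a minimal sketch. Two things here are assumptions, not the platform's actual format: sensor values are taken to be already quantized to integers, and `zlib` from the standard library stands in for LZ4 so the example needs no third-party package. The function names are hypothetical.

```python
import struct
import zlib


def delta_encode(samples: list) -> list:
    """Keep the first sample, then store each sample as the change from its predecessor."""
    if not samples:
        return []
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]


def delta_decode(deltas: list) -> list:
    """Rebuild the original samples with a running sum over the deltas."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def pack_block(samples: list) -> bytes:
    """Delta-encode, serialize as little-endian int32, then compress."""
    deltas = delta_encode(samples)
    raw = struct.pack(f"<{len(deltas)}i", *deltas)
    return zlib.compress(raw)  # stand-in for an LZ4 compressor


def unpack_block(blob: bytes) -> list:
    """Invert pack_block."""
    raw = zlib.decompress(blob)
    deltas = list(struct.unpack(f"<{len(raw) // 4}i", raw))
    return delta_decode(deltas)


# A slowly varying temperature signal in millidegrees (synthetic data).
samples = [20_000 + (i % 7) for i in range(1000)]
blob = pack_block(samples)
restored = unpack_block(blob)
print(len(samples) * 4, "bytes raw ->", len(blob), "bytes packed")
```

Delta encoding helps because consecutive readings from a physical sensor usually differ by small amounts, which gives the general-purpose compressor long runs of similar values to work with.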

Cloud level: the cloud platform (optional, for multi-site analytics and ML training) receives data from factory historians via a secure data pipeline (TLS-encrypted, site VPN). The cloud hosts the ML model training service (training predictive maintenance models on historical failure data), the multi-factory analytics dashboard, and the maintenance work order system (ERP integration).

Core Components

Edge Anomaly Detection Engine

The edge anomaly engine runs on the edge gateway with hard real-time constraints. It uses two detection methods: (1) statistical process control (SPC) — maintain control charts (X-bar, R-chart) per sensor; a reading outside 3σ control limits raises an alarm; (2) multivariate anomaly detection — a PCA-based model identifies unusual correlations between related sensors (e.g., elevated vibration + elevated current on the same motor predicts bearing failure). The PCA model is trained in the cloud on historical data and deployed to the edge as a compact model binary (coefficients matrix). Model inference at 100 Hz for 1,000 sensors = 100k inferences/second, achievable on an 8-core industrial PC using SIMD-optimized matrix operations (Intel MKL). Alarms are published to the local OPC-UA server and to a Redis stream on the gateway for plant system consumption.
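The SPC half of the engine can be sketched as follows. This is a minimal illustration, assuming readings are grouped into subgroups of five. A2, D3, and D4 are the standard Shewhart chart constants for that subgroup size, and the A2-based limits are 3σ limits on the subgroup mean, not on individual readings. Function names, alarm labels, and the vibration values are invented for the example.

```python
from statistics import mean

# Standard Shewhart control-chart constants for subgroups of size 5.
A2, D3, D4 = 0.577, 0.0, 2.114


def control_limits(baseline_subgroups):
    """Derive X-bar and R chart limits from in-control baseline subgroups."""
    xbars = [mean(g) for g in baseline_subgroups]
    ranges = [max(g) - min(g) for g in baseline_subgroups]
    grand_mean = mean(xbars)
    mean_range = mean(ranges)
    return {
        "xbar_lcl": grand_mean - A2 * mean_range,
        "xbar_ucl": grand_mean + A2 * mean_range,
        "r_lcl": D3 * mean_range,
        "r_ucl": D4 * mean_range,
    }


def check_subgroup(group, limits):
    """Return the list of alarms raised by one new subgroup of readings."""
    alarms = []
    xbar = mean(group)
    spread = max(group) - min(group)
    if not limits["xbar_lcl"] <= xbar <= limits["xbar_ucl"]:
        alarms.append("xbar_out_of_control")
    if not limits["r_lcl"] <= spread <= limits["r_ucl"]:
        alarms.append("range_out_of_control")
    return alarms


# Synthetic vibration RMS values (mm/s) recorded while the machine was healthy.
baseline = [
    [2.0, 2.1, 1.9, 2.0, 2.2],
    [2.1, 2.0, 2.0, 1.9, 2.1],
    [1.9, 2.0, 2.1, 2.2, 2.0],
    [2.0, 2.0, 2.1, 1.9, 2.0],
]
limits = control_limits(baseline)

print(check_subgroup([2.0, 2.1, 2.0, 1.9, 2.0], limits))  # in control
print(check_subgroup([2.6, 2.7, 2.5, 2.6, 2.8], limits))  # mean has shifted
print(check_subgroup([2.0, 2.9, 1.3, 2.0, 2.0], limits))  # spread has grown
```

The X-bar chart catches a sustained shift in level, while the R chart catches growing scatter at an unchanged mean, which is why the two are usually run together.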

Digital Twin Service

Each machine has a digital twin: a virtual model reflecting the machine's real-time state, configuration parameters, and predicted health. The twin is defined using a model schema: machine type, component hierarchy (motor, bearing, gearbox), sensor mappings (which sensor measures which component), and normal operating envelopes. The digital twin state is maintained in Redis (real-time sensor values, anomaly flags, predicted remaining useful life) and persisted to PostgreSQL for history. The twin's RUL (Remaining Useful Life) estimate is updated every 5 minutes by a Flink job consuming sensor aggregates from the historian and running a regression model (trained per machine class). Maintenance engineers view the digital twin dashboard to see component health scores and upcoming maintenance recommendations.
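One possible shape for the twin schema is sketched below using plain dataclasses. All class, field, and sensor names are hypothetical. The in-memory dictionaries stand in for the state that the text keeps in Redis, and the RUL field is left unset because the regression model is outside the scope of the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SensorMapping:
    """Which component a sensor measures, plus its normal operating envelope."""
    sensor_id: str
    component: str
    unit: str
    normal_min: float
    normal_max: float


@dataclass
class MachineTwin:
    machine_id: str
    machine_type: str
    components: List[str]
    sensors: Dict[str, SensorMapping]
    latest_values: Dict[str, float] = field(default_factory=dict)
    anomaly_flags: Dict[str, bool] = field(default_factory=dict)
    rul_hours: Optional[float] = None  # filled in by the periodic RUL job

    def apply_reading(self, sensor_id: str, value: float) -> None:
        """Record the newest value and flag it if it leaves the envelope."""
        mapping = self.sensors[sensor_id]
        self.latest_values[sensor_id] = value
        self.anomaly_flags[sensor_id] = not (
            mapping.normal_min <= value <= mapping.normal_max
        )

    def component_flags(self) -> Dict[str, bool]:
        """Roll sensor-level flags up to the component hierarchy."""
        flags = {component: False for component in self.components}
        for sensor_id, flagged in self.anomaly_flags.items():
            if flagged:
                flags[self.sensors[sensor_id].component] = True
        return flags


twin = MachineTwin(
    machine_id="press-07",
    machine_type="hydraulic-press",
    components=["motor", "bearing", "gearbox"],
    sensors={
        "vib-1": SensorMapping("vib-1", "bearing", "mm/s", 0.0, 4.5),
        "temp-1": SensorMapping("temp-1", "motor", "degC", 10.0, 85.0),
    },
)
twin.apply_reading("vib-1", 6.2)    # outside the bearing envelope
twin.apply_reading("temp-1", 61.0)  # normal
print(twin.component_flags())
```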

OPC-UA Integration Layer

OPC-UA (OPC Unified Architecture) is the most widely adopted interoperability standard on the factory floor, supported by most major SCADA, PLC, and historian vendors. The platform includes an OPC-UA server (exposing sensor data and device commands) and an OPC-UA client (reading data from legacy PLC/SCADA systems). The server exposes sensor values as OPC-UA nodes organized in the machine hierarchy namespace. SCADA systems receive value changes through OPC-UA subscriptions: each subscribed node is a monitored item with a configurable deadband, so changes smaller than the deadband are filtered out as noise. The OPC-UA client connects to existing PLC OPC-UA servers and bridges their data into the platform's Kafka bus, enabling unified analytics without replacing legacy infrastructure.
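The deadband behaviour can be shown without any OPC-UA library. The sketch below models only the absolute-deadband rule, where a change is reported when it differs from the last reported value by more than the deadband. The class name is hypothetical and this is not the API of any particular OPC-UA stack.

```python
class DeadbandFilter:
    """Report a value only when it moves more than `deadband` from the last reported one."""

    def __init__(self, deadband: float):
        self.deadband = deadband
        self.last_reported = None

    def should_report(self, value: float) -> bool:
        if self.last_reported is None or abs(value - self.last_reported) > self.deadband:
            self.last_reported = value  # the comparison baseline moves only on a report
            return True
        return False


# Synthetic pressure readings (bar) with small jitter and two real changes.
readings = [50.0, 50.2, 50.4, 51.2, 51.3, 49.9]
pressure_filter = DeadbandFilter(deadband=0.5)
reported = [r for r in readings if pressure_filter.should_report(r)]
print(reported)
```

Because the baseline is the last reported value and not the previous sample, a slow drift still triggers a notification once its cumulative change exceeds the deadband.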

Database Design

Edge (on-premises per factory):

  • InfluxDB or OSIsoft PI for time-series data (100-Hz retention for 7 days, 1-Hz retention for 3 years)
  • Redis for real-time sensor values (digital twin state) and anomaly deduplication

PostgreSQL:

  • machines (machine_id, factory_id, model, components_json, installed_at)
  • sensors (sensor_id, machine_id, type, unit, calibration_coefficients, normal_range_json)
  • maintenance_events (event_id, machine_id, component, maintenance_type, occurred_at, technician_id, notes)

Cloud:

  • ClickHouse for multi-factory analytics (replicated subsets of factory historians, 1-Hz data)
  • Redshift for operational reporting and ML training dataset management
  • S3 for raw data archives and model artifacts
  • Kafka (cloud): factory-telemetry-{factory_id} (aggregated 1-Hz data stream from each factory)
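The three PostgreSQL tables listed above can be tried out locally with the sketch below. It uses an in-memory SQLite database as a stand-in for PostgreSQL. The column types, constraints, and sample rows are assumptions, since the text gives only column names.

```python
import sqlite3

# Column names come from the text; types and constraints are illustrative guesses.
SCHEMA = """
CREATE TABLE machines (
    machine_id      TEXT PRIMARY KEY,
    factory_id      TEXT NOT NULL,
    model           TEXT NOT NULL,
    components_json TEXT NOT NULL,
    installed_at    TEXT NOT NULL
);
CREATE TABLE sensors (
    sensor_id                TEXT PRIMARY KEY,
    machine_id               TEXT NOT NULL REFERENCES machines(machine_id),
    type                     TEXT NOT NULL,
    unit                     TEXT NOT NULL,
    calibration_coefficients TEXT,
    normal_range_json        TEXT
);
CREATE TABLE maintenance_events (
    event_id         TEXT PRIMARY KEY,
    machine_id       TEXT NOT NULL REFERENCES machines(machine_id),
    component        TEXT NOT NULL,
    maintenance_type TEXT NOT NULL,
    occurred_at      TEXT NOT NULL,
    technician_id    TEXT,
    notes            TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO machines VALUES (?, ?, ?, ?, ?)",
    ("m-001", "factory-01", "press-x200", '["motor", "bearing", "gearbox"]', "2021-03-01"),
)
conn.execute(
    "INSERT INTO sensors VALUES (?, ?, ?, ?, ?, ?)",
    ("s-001", "m-001", "vibration", "mm/s", "[1.0, 0.0]", '{"min": 0.0, "max": 4.5}'),
)
conn.execute(
    "INSERT INTO maintenance_events VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("e-001", "m-001", "bearing", "replacement", "2024-11-02T08:30:00Z", "tech-17", "worn race"),
)

# Maintenance history for one factory, joined to machine metadata.
rows = conn.execute(
    """
    SELECT m.model, e.component, e.maintenance_type
    FROM maintenance_events AS e
    JOIN machines AS m USING (machine_id)
    WHERE m.factory_id = ?
    """,
    ("factory-01",),
).fetchall()
print(rows)
```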

API Design

  • OPC-UA Server /machines/{machine_id}/* — standard OPC-UA node browse and subscription; used by SCADA systems
  • GET /machines/{machine_id}/twin — returns digital twin state: current sensor values, component health scores, predicted RUL, active alarms
  • GET /machines/{machine_id}/history?sensor={s}&from={ts}&to={ts}&resolution={raw|1m|1h} — time-series query from historian
  • POST /maintenance/work-orders — body: {machine_id, component, recommended_action, predicted_failure_date}, creates work order in connected ERP system via REST integration
  • GET /analytics/oee?factory_id={f}&from={date}&to={date} — returns Overall Equipment Effectiveness metrics (availability, performance, quality) calculated from historian data
  • GET /analytics/failures/prediction?factory_id={f}&lookahead_days=30 — returns a list of machines predicted to require maintenance in the next 30 days, with confidence scores
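As an illustration of what the OEE endpoint computes, the sketch below applies the conventional definition: OEE is the product of availability, performance, and quality. The function name, the input fields, and the shift figures are hypothetical, and a real implementation would derive the inputs from historian data.

```python
def compute_oee(planned_minutes, downtime_minutes, ideal_cycle_minutes,
                total_units, good_units):
    """Return availability, performance, quality, and their product (OEE)."""
    run_minutes = planned_minutes - downtime_minutes
    availability = run_minutes / planned_minutes
    performance = (ideal_cycle_minutes * total_units) / run_minutes
    quality = good_units / total_units
    return {
        "availability": availability,
        "performance": performance,
        "quality": quality,
        "oee": availability * performance * quality,
    }


# One 8-hour shift: 48 minutes of downtime, 1-minute ideal cycle,
# 360 units produced, 342 of them good.
metrics = compute_oee(
    planned_minutes=480,
    downtime_minutes=48,
    ideal_cycle_minutes=1.0,
    total_units=360,
    good_units=342,
)
print({name: round(value, 4) for name, value in metrics.items()})
```

For this shift availability is 0.9, performance is about 0.833, and quality is 0.95, giving an OEE of roughly 71%.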

Scaling & Bottlenecks

Edge throughput at 10M readings/second per factory is the primary constraint. With 1,000 sensors per gateway sampled at 100 Hz, each gateway sees 100k readings/second, well below the roughly 10M operations/second an 8-core edge gateway can sustain for simple threshold detection. Multivariate anomaly detection (PCA inference) at 100k inferences/second per gateway requires about 400ms of CPU time per second on a single core using optimized matrix operations (roughly 4 µs per inference), which fits comfortably in a 2-core allocation within the 8-core gateway. The historian write path (after 100:1 aggregation from 100 Hz to 1 Hz) at 100k writes/second is within InfluxDB's capacity on a 4-node cluster.
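To make the per-inference cost concrete: 400 ms of CPU time spread over 100k inferences is about 4 µs each. PCA inference itself is a pair of small matrix-vector products, as the sketch below shows in pure Python. The two-sensor model, its single component, and the alarm threshold are invented for illustration. A production engine would use an optimized linear algebra library. The score used here is the squared prediction error (SPE), one common PCA anomaly score, which flags readings that break the learned correlation between sensors.

```python
import math

CPU_BUDGET_US_PER_SECOND = 400_000   # 400 ms of CPU time, from the text
INFERENCES_PER_SECOND = 100_000
budget_us_per_inference = CPU_BUDGET_US_PER_SECOND / INFERENCES_PER_SECOND


def squared_prediction_error(x, mean, components):
    """Residual energy of x after projecting onto the PCA subspace.

    `components` is a list of orthonormal row vectors (the deployed coefficient matrix).
    """
    centered = [xi - mi for xi, mi in zip(x, mean)]
    # Project onto each retained component ...
    scores = [sum(c * v for c, v in zip(comp, centered)) for comp in components]
    # ... reconstruct from the scores, and measure what the model could not explain.
    reconstructed = [
        sum(s * comp[j] for s, comp in zip(scores, components))
        for j in range(len(centered))
    ]
    return sum((v - r) ** 2 for v, r in zip(centered, reconstructed))


# Toy model: normalized vibration and current that normally move together.
mean = [0.0, 0.0]
components = [[1 / math.sqrt(2), 1 / math.sqrt(2)]]
SPE_THRESHOLD = 0.5   # hypothetical alarm threshold

correlated = squared_prediction_error([1.0, 1.0], mean, components)
decoupled = squared_prediction_error([1.0, -1.0], mean, components)
print(budget_us_per_inference, correlated < SPE_THRESHOLD, decoupled > SPE_THRESHOLD)
```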

Cloud ML training at multi-factory scale: training a vibration-based bearing failure model on 3 years of hourly data from 10k machines requires processing ~263B rows, assuming the full enterprise sensor population (100 factories × 100k sensors = 10M sensors, each contributing 26,280 hourly samples). As a rough estimate, Apache Spark on a 200-node cluster completes this in ~2 hours, which makes weekly retraining practical. Model deployment to the edge gateways in 100 factories: push the new model binary (typically 10-100 MB) via the secure data pipeline; edge gateways download and hot-reload it without service interruption, using a blue-green model swap in which the new model is loaded in parallel and the old model keeps serving until the swap signal is received.
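The blue-green swap can be sketched as a slot holding an active model and a staged one. This is a simplified illustration in which models are plain callables. The class, method names, and smoke-test step are assumptions, and the two toy models are not real predictors.

```python
import threading


class ModelSlot:
    """Serve predictions from the active model while a replacement is staged."""

    def __init__(self, model):
        self._active = model
        self._staged = None
        self._lock = threading.Lock()

    def predict(self, features):
        model = self._active  # take one reference so a concurrent swap cannot change it mid-call
        return model(features)

    def stage(self, new_model, smoke_input):
        """Load the new model alongside the old one and verify that it runs."""
        new_model(smoke_input)  # raises here, before any swap, if the model is broken
        with self._lock:
            self._staged = new_model

    def swap(self):
        """Promote the staged model; returns False if nothing was staged."""
        with self._lock:
            if self._staged is None:
                return False
            self._active, self._staged = self._staged, None
            return True


def model_v1(features):
    return sum(features) > 10.0


def model_v2(features):
    return sum(features) > 8.0


slot = ModelSlot(model_v1)
before = slot.predict([4.0, 5.0])        # v1 is serving
slot.stage(model_v2, smoke_input=[0.0, 0.0])
during = slot.predict([4.0, 5.0])        # still v1 until the swap signal
swapped = slot.swap()
after = slot.predict([4.0, 5.0])         # v2 is serving
print(before, during, swapped, after)
```

Running the smoke test during staging means a corrupt download fails before it can replace the model that is currently serving.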

Key Trade-offs

  • Edge processing vs. cloud processing: Processing at the edge provides <100ms latency and survives network outages (essential for safety), but limits model complexity and requires model deployment to constrained hardware; cloud processing supports complex ML models but cannot meet real-time SLAs for production-line safety alerts.
  • OPC-UA vs. MQTT for IIoT: OPC-UA provides a rich information model, security (encrypted/authenticated), and native support for industrial hierarchies — it is the correct choice for factory floor integration; MQTT is simpler and higher throughput but lacks OPC-UA's semantic data model, making it appropriate for cloud transport rather than PLC/SCADA integration.
  • Digital twin granularity: Component-level twins (bearing, motor, gearbox) provide more actionable maintenance insights than machine-level twins but require more sensor infrastructure and more complex models; starting with machine-level twins and adding component-level detail incrementally reduces initial deployment complexity.
  • Open-source historian vs. OSIsoft PI: OSIsoft PI (now AVEVA PI System) is the industry-standard historian with the best SCADA/PLC integration ecosystem but is expensive ($100k+/site license); InfluxDB or TimescaleDB are viable open-source alternatives for new deployments at significantly lower cost, with growing ecosystem support.
