
System Design: Fleet Management System

Design a comprehensive fleet management platform covering vehicle tracking, maintenance scheduling, driver behavior analytics, and compliance reporting at scale.

Updated Jan 15, 2025
Tags: system-design, fleet-management, iot, telematics, maintenance, compliance

Requirements

Functional Requirements:

  • Real-time tracking of all vehicles in the fleet on a live map
  • Automated maintenance scheduling based on mileage, engine hours, and OBD-II data
  • Driver behavior scoring: harsh braking, harsh acceleration, speeding, phone-use detection
  • Compliance reporting: hours-of-service (HOS), IFTA fuel tax, DVIRs
  • Fuel monitoring: consumption per vehicle, anomaly detection for fuel theft
  • Alerts: geofence violations, speeding, harsh events, vehicle health warnings

Non-Functional Requirements:

  • Track 100,000 vehicles with 10-second update frequency
  • Maintenance alerts delivered within 5 minutes of threshold breach
  • Compliance reports generated within 60 seconds for any date range
  • 99.9% uptime; fleet operators use this for legal compliance
  • Data retained 7 years for DOT audit requirements

Scale Estimation

100,000 vehicles × 10-second updates = 10,000 events/second. Each telemetry event: ~500 bytes (lat, lng, speed, heading, RPM, fuel level, odometer, OBD codes, timestamp) = 5 MB/second. Daily volume: ~432 GB/day raw; with LZ4 compression ~86 GB/day. 7-year retention: ~220 TB — necessitates tiered storage (hot: NVMe SSD for last 30 days, warm: HDD for last 2 years, cold: S3 Glacier for 2–7 years). Driver behavior events (harsh events) ~5% of telemetry = 500 events/second requiring near-real-time scoring.
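A quick back-of-envelope check of these figures (a sketch in Python; the 5:1 LZ4 ratio is the assumption stated above):

    # Sanity-check the scale estimates above.
    VEHICLES = 100_000
    UPDATE_INTERVAL_S = 10
    EVENT_BYTES = 500
    LZ4_RATIO = 5                                      # assumed ~5:1 for telemetry rows

    events_per_s = VEHICLES / UPDATE_INTERVAL_S        # 10,000 events/s
    ingest_mb_s = events_per_s * EVENT_BYTES / 1e6     # 5 MB/s
    raw_gb_day = ingest_mb_s * 86_400 / 1e3            # ~432 GB/day
    compressed_gb_day = raw_gb_day / LZ4_RATIO         # ~86 GB/day
    retention_tb = compressed_gb_day * 365 * 7 / 1e3   # ~221 TB over 7 years

    print(f"{events_per_s:,.0f} ev/s | {raw_gb_day:.0f} GB/day raw | "
          f"{compressed_gb_day:.0f} GB/day compressed | {retention_tb:.0f} TB / 7y")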

High-Level Architecture

The fleet management platform ingests telemetry from in-vehicle OBD-II/ELD devices via cellular (4G LTE) using MQTT or a proprietary TCP protocol. Data flows through an ingestion pipeline into specialized processors for tracking, behavior scoring, maintenance, and compliance.

OBD-II/ELD devices connect to a Device Gateway cluster (an EMQX MQTT broker cluster, chosen for connection scale), which forwards messages to Kafka. A Telemetry Processing Service consumes from Kafka and fans data out to multiple downstream consumers: the Live Tracking Service (updates Redis + pushes to operator dashboards), the Behavior Scoring Engine (detects and scores harsh events), the Maintenance Monitor (evaluates maintenance thresholds), and the Compliance Service (maintains HOS logs and fuel records).

Operator dashboards connect via WebSocket for live updates. Reports are generated asynchronously by a Reporting Service that queries TimescaleDB (time-series telemetry) and PostgreSQL (compliance records) and caches results in S3.

Core Components

Telemetry Ingestion & Processing

ELD/OBD devices publish to MQTT topics organized by fleet_id and vehicle_id. The Device Gateway authenticates devices via mutual TLS (device certificates issued at provisioning time). Kafka topics are partitioned by vehicle_id, so all messages for a given vehicle land on the same partition and are processed in order. The Telemetry Processor enriches each event: reverse geocoding (lat/lng → address) against a local geocoder cache (self-hosted OpenStreetMap Nominatim), odometer gap detection (a jump of more than 100 km between consecutive readings is flagged as a sensor error), and speed validation against posted speed limits.
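A minimal sketch of the enrichment step, assuming a per-vehicle last-odometer value and a geocoder cache exposing a lookup(lat, lng) method (both are illustrative names, not a real library API):

    from dataclasses import dataclass

    ODOMETER_GAP_KM = 100   # a larger jump between consecutive readings is implausible

    @dataclass
    class TelemetryEvent:
        vehicle_id: str
        lat: float
        lng: float
        speed_kph: float
        odometer_km: float
        address: str | None = None
        odometer_error: bool = False
        speeding: bool = False

    def enrich(event: TelemetryEvent, last_odometer_km: float | None,
               geocode_cache, speed_limit_kph: float) -> TelemetryEvent:
        # Reverse geocode from the local cache (cache misses fall through
        # to the self-hosted Nominatim instance).
        event.address = geocode_cache.lookup(event.lat, event.lng)
        # Odometer gap detection: flag a > 100 km jump at a 10-second cadence.
        if last_odometer_km is not None:
            event.odometer_error = (event.odometer_km - last_odometer_km) > ODOMETER_GAP_KM
        # Speed validation against the posted limit for this road segment.
        event.speeding = event.speed_kph > speed_limit_kph
        return event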

Driver Behavior Scoring Engine

The scoring engine uses a sliding-window stream processor (Flink) that evaluates each telemetry event for harsh events. Harsh braking: longitudinal deceleration exceeding 0.4g (~3.9 m/s², i.e. a ~14 km/h speed drop over one second). Harsh acceleration: the same 0.4g threshold in the positive direction. Speeding: GPS speed more than 5 mph over the posted limit for more than 10 seconds. Phone-use proxy: irregular steering micro-corrections correlated with swerving patterns (requires gyroscope data). Each event updates the driver's weekly score (0–100) in Redis; if the score crosses an alerting threshold, a Kafka event is emitted for the alerting pipeline. Scores are persisted to PostgreSQL for trend reporting.
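The detection rules distill to a few lines; the sketch below shows them standalone in Python (in production this logic would run as a Flink keyed stream per vehicle_id; the thresholds are the ones from the text):

    G = 9.81                    # m/s^2
    HARSH_G = 0.4               # braking/acceleration threshold, in g
    SPEEDING_MARGIN_MPH = 5
    SPEEDING_MIN_S = 10

    def detect_harsh(prev: tuple[float, float], curr: tuple[float, float]) -> str | None:
        """prev/curr are (timestamp_s, speed_mps) samples; returns an event name or None."""
        dt = curr[0] - prev[0]
        if dt <= 0:
            return None
        accel = (curr[1] - prev[1]) / dt   # signed, m/s^2
        if accel <= -HARSH_G * G:
            return "harsh_braking"
        if accel >= HARSH_G * G:
            return "harsh_acceleration"
        return None

    def update_speeding(seconds_over: float, dt: float,
                        speed_mph: float, limit_mph: float) -> tuple[float, bool]:
        """Accumulates continuous time above limit + margin; fires after 10 s."""
        if speed_mph > limit_mph + SPEEDING_MARGIN_MPH:
            seconds_over += dt
            return seconds_over, seconds_over >= SPEEDING_MIN_S
        return 0.0, False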

Maintenance Monitor

The Maintenance Monitor maintains per-vehicle maintenance schedules in PostgreSQL: (vehicle_id, service_type, last_service_odometer, service_interval_miles, last_service_date, service_interval_days). After each telemetry event updates the odometer reading in Redis, the monitor checks whether (current_odometer - last_service_odometer) >= service_interval_miles OR (today - last_service_date) >= service_interval_days, and creates a maintenance alert if either threshold is met. DTCs (Diagnostic Trouble Codes) in the OBD-II data immediately trigger high-priority alerts routed to the fleet manager's notification queue.
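The threshold check itself is small; a sketch over the schema fields above, assuming the current odometer value comes from the Redis cache:

    from datetime import date

    def maintenance_due(current_odometer: float, last_service_odometer: float,
                        service_interval_miles: float, last_service_date: date,
                        service_interval_days: int, today: date | None = None) -> bool:
        # OR semantics: service is due when either the mileage or the
        # calendar interval is exhausted, whichever happens first.
        today = today or date.today()
        miles_due = (current_odometer - last_service_odometer) >= service_interval_miles
        days_due = (today - last_service_date).days >= service_interval_days
        return miles_due or days_due

DTCs bypass this check entirely and go straight to the high-priority alert path.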

Database Design

Telemetry lives in a TimescaleDB hypertable, time-partitioned into 1-day chunks and space-partitioned by vehicle_id. Continuous aggregates pre-compute hourly summaries (average speed, total distance, idle time, fuel consumption). Chunks older than 30 days are compressed (roughly 20:1 for GPS data). Chunks older than 2 years are detached and archived to Parquet on S3 by a scheduled archival job. Compliance records (HOS logs, DVIR inspections, fuel logs) live in PostgreSQL with strict ACID semantics and 7-year retention. Driver behavior scores live in PostgreSQL with a time-partitioned index for weekly/monthly trend queries.
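These TimescaleDB pieces map to a handful of DDL statements; a sketch issued from Python via psycopg2, with illustrative table/column names and an assumed space-partition count of 8:

    import psycopg2

    STATEMENTS = [
        "SELECT create_hypertable('telemetry', 'time', chunk_time_interval => INTERVAL '1 day')",
        "SELECT add_dimension('telemetry', 'vehicle_id', number_partitions => 8)",
        """CREATE MATERIALIZED VIEW telemetry_hourly
           WITH (timescaledb.continuous) AS
           SELECT vehicle_id,
                  time_bucket('1 hour', time)          AS bucket,
                  avg(speed_kph)                       AS avg_speed,
                  max(odometer_km) - min(odometer_km)  AS distance_km
           FROM telemetry
           GROUP BY vehicle_id, bucket""",
        "ALTER TABLE telemetry SET (timescaledb.compress, timescaledb.compress_segmentby = 'vehicle_id')",
        "SELECT add_compression_policy('telemetry', INTERVAL '30 days')",
    ]

    conn = psycopg2.connect("dbname=fleet")
    conn.autocommit = True  # continuous aggregates cannot be created inside a transaction
    with conn.cursor() as cur:
        for stmt in STATEMENTS:
            cur.execute(stmt)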

API Design

  • MQTT topic: fleet/{fleet_id}/vehicle/{vehicle_id}/telemetry — Device publishes telemetry; gateway validates and forwards to Kafka
  • GET /v1/fleet/{fleet_id}/live — WebSocket; pushes differential position updates for all fleet vehicles every 10 seconds
  • GET /v1/vehicles/{vehicle_id}/hos — Returns current hours-of-service status for ELD compliance; must respond in <500ms for roadside inspection
  • POST /v1/reports/generate — Async report generation; accepts report_type, vehicle_ids, date_range; returns report_id for polling
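A client-side sketch of the async report flow (the generate endpoint is from the list above; the polling endpoint, payload field names, and response shape are assumptions for illustration):

    import time
    import requests

    BASE = "https://api.example.com/v1"   # placeholder host

    resp = requests.post(f"{BASE}/reports/generate", json={
        "report_type": "ifta_fuel_tax",
        "vehicle_ids": ["veh-001", "veh-002"],
        "date_range": {"from": "2025-01-01", "to": "2025-03-31"},
    })
    report_id = resp.json()["report_id"]

    # Poll until the Reporting Service finishes; repeated requests for the
    # same report are served from the S3 cache.
    while True:
        status = requests.get(f"{BASE}/reports/{report_id}").json()
        if status["state"] == "ready":
            print(status["download_url"])
            break
        time.sleep(2)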

Scaling & Bottlenecks

The MQTT broker is the entry-point bottleneck: 100,000 persistent connections plus their keep-alive traffic. EMQX handles on the order of a million concurrent connections per node and clusters horizontally, with connection state replicated through its built-in Mnesia distributed database. Kafka absorbs the 10,000 events/second comfortably on a 10-partition topic with 3 replicas (~5 MB/second raw, ~15 MB/second including replication). TimescaleDB sustains 10,000 inserts/second per node with batching enabled (batches of 1,000 rows per flush); a two-node setup provides sufficient throughput, and read replicas serve dashboard queries.

Compliance reporting is the query bottleneck: HOS reports must scan all telemetry for a vehicle over weeks. Pre-computed daily summaries (total driving time, on-duty time, off-duty time) in a separate PostgreSQL compliance_daily_summary table make these queries O(days) rather than O(telemetry_events). The Reporting Service generates reports asynchronously and caches in S3 with a 1-hour TTL, returning cached versions for repeated queries.
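A sketch of the nightly roll-up that feeds compliance_daily_summary (column names and the unique constraint on (vehicle_id, day) are assumptions; real HOS duty status would come from the ELD event stream rather than being inferred from speed):

    # Each telemetry row covers a 10-second interval, so durations are
    # approximated as 10 * row counts. Executed nightly per fleet.
    ROLLUP_SQL = """
    INSERT INTO compliance_daily_summary (vehicle_id, day, driving_s, idle_s)
    SELECT vehicle_id,
           time_bucket('1 day', time)::date            AS day,
           10 * count(*) FILTER (WHERE speed_kph > 0)  AS driving_s,
           10 * count(*) FILTER (WHERE speed_kph = 0)  AS idle_s
    FROM telemetry
    WHERE time >= now() - INTERVAL '1 day'
    GROUP BY vehicle_id, day
    ON CONFLICT (vehicle_id, day) DO UPDATE
        SET driving_s = EXCLUDED.driving_s,
            idle_s    = EXCLUDED.idle_s;
    """

A report over N days then reads N summary rows per vehicle instead of ~8,640 × N raw telemetry rows.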

Key Trade-offs

  • Cellular vs. satellite connectivity — satellite (Globalstar, Iridium) provides coverage in remote areas but has 10–30 second latency and high cost; most fleet systems use 4G LTE with satellite as fallback only for off-road operations
  • On-device processing vs. cloud processing — processing driver behavior on-device reduces bandwidth but requires OTA firmware updates for algorithm changes; cloud processing is more flexible but requires raw data upload
  • ELD certification requirements — DOT-certified ELDs must meet specific tamper-proofing and data-integrity requirements (49 CFR Part 395 and the FMCSA ELD technical specifications); this constrains architecture choices for the compliance data pipeline
  • 7-year retention cost — S3 Glacier Deep Archive at $0.00099/GB/month makes 220 TB cost ~$218/month — inexpensive enough to retain everything rather than sample
