
System Design: Real-Time GPS Tracking

Learn how to build a real-time GPS tracking system for fleets, deliveries, or assets — covering high-frequency location ingestion, live map updates, geofencing, and time-series storage.

14 min read · Updated Jan 15, 2025
Tags: system-design, gps-tracking, real-time, geofencing, time-series, iot

Requirements

Functional Requirements:

  • Track location of vehicles/assets with updates every 5 seconds
  • Display live map with all tracked assets for fleet operators
  • Geofencing: trigger alerts when assets enter or exit defined zones
  • Store full location history for replay and analytics
  • Support both mobile (driver app) and IoT hardware GPS devices
  • Generate reports: mileage, idle time, speed violations, route deviations

Non-Functional Requirements:

  • Location updates reflected on operator dashboard within 3 seconds
  • Support 500,000 simultaneously tracked assets
  • Location history retained for 2 years for compliance
  • System handles intermittent connectivity — devices buffer and replay missed updates
  • 99.9% availability; fleet operators depend on this for logistics operations

Scale Estimation

500,000 assets updating every 5 seconds = 100,000 location events/second. Each event: ~200 bytes (device_id, lat, lng, speed, heading, timestamp, battery) = ~20 MB/second inbound. Daily volume: ~1.7 TB/day raw, ~350 GB/day after 5:1 compression. Two-year retention: ~1.3 PB raw. A time-series database (InfluxDB or TimescaleDB) with columnar compression achieves roughly 10:1 on GPS telemetry, bringing storage to ~125 TB. Operator dashboards: a fleet of 500 vehicles refreshed every 2 seconds means ~250 asset positions pushed per second per dashboard session, served by fan-out from the aggregation layer rather than per-asset queries.
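A quick back-of-the-envelope check of these numbers (the constants simply restate the assumptions above):

```python
# Back-of-the-envelope sizing for the figures above.
ASSETS = 500_000
UPDATE_INTERVAL_S = 5
EVENT_BYTES = 200                 # device_id, lat, lng, speed, heading, timestamp, battery
RETENTION_DAYS = 2 * 365

events_per_s = ASSETS / UPDATE_INTERVAL_S               # 100,000 events/s
inbound_mb_s = events_per_s * EVENT_BYTES / 1e6          # ~20 MB/s
raw_per_day_tb = inbound_mb_s * 86_400 / 1e6             # ~1.7 TB/day
two_year_raw_tb = raw_per_day_tb * RETENTION_DAYS        # ~1,260 TB raw
two_year_at_10x_tb = two_year_raw_tb / 10                # ~126 TB at 10:1 compression

print(f"{events_per_s:,.0f} events/s, {inbound_mb_s:.0f} MB/s inbound")
print(f"{raw_per_day_tb:.2f} TB/day raw, {two_year_at_10x_tb:.0f} TB for 2 years at 10:1")
```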

High-Level Architecture

The system splits into three planes: Ingestion (receiving and validating location events), Storage (persisting raw and aggregated data), and Presentation (serving live maps and historical queries to operators).

Location events arrive via an MQTT broker (EMQ X or AWS IoT Core) for IoT hardware devices, and via HTTPS REST or WebSocket for mobile apps. Both paths land in a Kafka topic partitioned by device_id. A Kafka Streams processing layer performs validation, deduplication, and geofence evaluation in real time. Processed events are fanned out to three sinks: a time-series database for history, a Redis live-state store for current position, and a WebSocket broadcast service for the operator dashboard.
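As a rough sketch of the Kafka side of ingestion, assuming the confluent-kafka Python client and an illustrative topic name (location-events), with each event keyed by device_id so a device's updates stay ordered within one partition:

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka:9092"})  # assumed broker address

def publish_location(event: dict) -> None:
    """Publish one validated location event, keyed by device_id.

    Keying by device_id pins each device's events to a single partition,
    so downstream consumers see that device's updates in order.
    """
    producer.produce(
        "location-events",                    # assumed topic name
        key=event["device_id"],
        value=json.dumps(event).encode(),
    )

# Example event; field names follow the estimation section above.
publish_location({
    "device_id": "veh-1042", "lat": 37.7749, "lng": -122.4194,
    "speed": 12.4, "heading": 87, "ts": 1736899200, "battery": 0.83,
})
producer.flush()
```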

Operators connect to the dashboard via WebSocket. A Dashboard Service subscribes to the processed events Kafka topic filtered by the operator's fleet and pushes updates to connected clients. For large fleets (>1,000 assets), updates are coalesced into batch diffs every 2 seconds rather than per-asset pushes to reduce client rendering load.
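A minimal sketch of that coalescing step, assuming an asyncio-based dashboard service; the broadcast callable and message shape are placeholders for whatever the WebSocket layer actually uses:

```python
import asyncio
import json

class DiffCoalescer:
    """Collects per-asset updates and flushes one batch diff per interval.

    Illustrative only: `broadcast` stands in for whatever pushes a frame
    to every WebSocket client watching this fleet.
    """
    def __init__(self, broadcast, interval_s: float = 2.0):
        self._broadcast = broadcast
        self._interval = interval_s
        self._pending: dict[str, dict] = {}   # device_id -> latest position

    def record(self, event: dict) -> None:
        # Later updates for the same asset overwrite earlier ones,
        # so each flush carries at most one entry per asset.
        self._pending[event["device_id"]] = event

    async def run(self) -> None:
        while True:
            await asyncio.sleep(self._interval)
            if not self._pending:
                continue
            diff, self._pending = self._pending, {}
            await self._broadcast(json.dumps({"type": "diff", "assets": diff}))
```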

Core Components

MQTT / Event Ingestion Layer

IoT GPS devices use MQTT over TLS (port 8883) — a lightweight pub/sub protocol designed for constrained devices and unreliable networks. The MQTT broker cluster (EMQ X, horizontally scaled) handles 500,000 concurrent device connections with QoS level 1 (at-least-once delivery). A bridge plugin forwards all messages from the broker to Kafka. When a device loses connectivity, it buffers messages locally and replays them in order once it reconnects; per-device sequence numbers let the pipeline deduplicate the replayed updates.
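A sketch of what the device-side publisher might look like with the paho-mqtt client; the broker host, device id, and payload fields are illustrative, and the topic follows the API section below:

```python
import json
import paho.mqtt.client as mqtt

DEVICE_ID = "veh-1042"                       # illustrative device id

# paho-mqtt 1.x constructor; 2.x additionally takes a CallbackAPIVersion argument.
client = mqtt.Client(client_id=DEVICE_ID)
client.tls_set()                             # MQTT over TLS (port 8883)
client.connect("broker.example.com", 8883)   # assumed broker host
client.loop_start()                          # background network loop handles QoS 1 retries

seq = 0

def publish_fix(lat, lng, speed, heading, ts, battery):
    """Publish one GPS fix with QoS 1 and a per-device sequence number."""
    global seq
    seq += 1
    payload = {"seq": seq, "lat": lat, "lng": lng, "speed": speed,
               "heading": heading, "ts": ts, "battery": battery}
    # QoS 1: the broker acknowledges receipt; the client retries until it does.
    client.publish(f"devices/{DEVICE_ID}/location", json.dumps(payload), qos=1)
```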

Geofence Evaluation Engine

Geofences are polygons or circles stored in PostGIS. On each location event, the engine checks if the asset's new position intersects any geofence associated with that asset. A naive approach (PostGIS query per event) doesn't scale at 100,000 events/second. The solution: pre-load all geofences into memory as an R-tree spatial index (using the Python rtree library or a custom Go implementation). The in-memory check is O(log n) and sub-millisecond. Geofence updates (new zones, deletions) propagate to all engine instances via a Redis pub/sub channel.
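A sketch of that in-memory check using the rtree and shapely libraries; loading the geofence polygons from PostGIS is elided, and the GeoJSON input format is an assumption:

```python
from rtree import index
from shapely.geometry import Point, shape

class GeofenceIndex:
    """In-memory R-tree over geofence bounding boxes.

    The R-tree narrows candidates by bounding box in O(log n); the exact
    point-in-polygon test then runs only on those few candidates.
    """
    def __init__(self, geofences):
        # geofences: iterable of (fence_id, geojson_polygon) pairs
        self._polygons = {}
        self._idx = index.Index()
        for i, (fence_id, geojson) in enumerate(geofences):
            poly = shape(geojson)
            self._polygons[i] = (fence_id, poly)
            self._idx.insert(i, poly.bounds)   # (minx, miny, maxx, maxy)

    def containing(self, lng: float, lat: float) -> list:
        """Return the ids of all geofences that contain the given point."""
        pt = Point(lng, lat)
        hits = self._idx.intersection((lng, lat, lng, lat))
        return [self._polygons[i][0] for i in hits
                if self._polygons[i][1].contains(pt)]
```

Enter/exit detection then compares the returned set against the asset's previous geofence_state held in the Redis live-state store.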

Time-Series Storage

TimescaleDB (a PostgreSQL extension) is used for location history. The hypertable is partitioned by time, with space partitioning on device_id, and chunks older than 7 days are automatically compressed (achieving roughly 20:1 on GPS data). Continuous aggregates pre-compute hourly rollups (min/max/avg speed, distance traveled) so reporting queries run on the pre-computed rollups rather than raw data. Chunks older than 2 years are automatically dropped by a retention policy.
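A sketch of the corresponding schema setup, assuming psycopg2 and a hypertable named locations whose columns follow the Database Design section (the time column is simply named time here):

```python
import psycopg2

SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS locations (
    device_id   TEXT        NOT NULL,
    time        TIMESTAMPTZ NOT NULL,
    lat         DOUBLE PRECISION,
    lng         DOUBLE PRECISION,
    speed REAL, heading REAL, altitude REAL, battery_pct REAL
);
SELECT create_hypertable('locations', 'time', if_not_exists => TRUE);
CREATE INDEX IF NOT EXISTS ix_device_time ON locations (device_id, time DESC);

-- Compress chunks older than 7 days, segmented by device for better ratios.
ALTER TABLE locations SET (timescaledb.compress,
                           timescaledb.compress_segmentby = 'device_id');
SELECT add_compression_policy('locations', INTERVAL '7 days');

-- Drop chunks past the 2-year compliance window.
SELECT add_retention_policy('locations', INTERVAL '2 years');
"""

HOURLY_ROLLUP_SQL = """
CREATE MATERIALIZED VIEW locations_hourly
WITH (timescaledb.continuous) AS
SELECT device_id,
       time_bucket('1 hour', time) AS bucket,
       min(speed) AS min_speed, max(speed) AS max_speed, avg(speed) AS avg_speed
FROM locations
GROUP BY device_id, bucket;
"""

conn = psycopg2.connect("dbname=tracking")   # assumed DSN
conn.autocommit = True                       # continuous aggregates can't be created in a transaction
with conn.cursor() as cur:
    cur.execute(SCHEMA_SQL)
    cur.execute(HOURLY_ROLLUP_SQL)
conn.close()
```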

Database Design

Live state: Redis hash per device (device_id → {lat, lng, speed, heading, last_seen, geofence_state}) with no TTL — fleet management requires a persistent current state. Time-series: TimescaleDB hypertable with columns (device_id, timestamp, lat, lng, speed, heading, altitude, battery_pct) and an index on (device_id, timestamp DESC) for range queries. Geofences are stored in PostGIS (polygon geometry) with a GiST spatial index. Reports are computed asynchronously by the reporting service and materialized to S3 as Parquet files.
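A minimal sketch of the live-state write path with redis-py; the live:{device_id} key naming is an assumption:

```python
import redis

r = redis.Redis(host="redis", port=6379)     # assumed Redis endpoint

def update_live_state(event: dict) -> None:
    """Overwrite the device's current position; no TTL, state is persistent."""
    r.hset(f"live:{event['device_id']}", mapping={
        "lat": event["lat"], "lng": event["lng"],
        "speed": event["speed"], "heading": event["heading"],
        "last_seen": event["ts"],
        "geofence_state": ",".join(event.get("geofences", [])),
    })
```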

API Design

  • MQTT topic: devices/{device_id}/location — Device publishes location payload; broker ingests to Kafka
  • GET /v1/fleet/{fleet_id}/live — WebSocket endpoint; server pushes differential location updates every 2 seconds for all fleet assets
  • GET /v1/devices/{device_id}/history?start={}&end={} — Returns paginated location history for a time range from TimescaleDB
  • POST /v1/geofences — Creates a geofence (polygon + associated device_ids + alert configuration); propagates to in-memory R-tree
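As an illustration of the geofence creation endpoint, a hypothetical request (field names and host are assumptions, with the polygon given as GeoJSON):

```python
import requests

geofence = {
    "name": "Warehouse 7 yard",
    "geometry": {                          # GeoJSON polygon, lng/lat order
        "type": "Polygon",
        "coordinates": [[[-122.42, 37.77], [-122.41, 37.77],
                         [-122.41, 37.78], [-122.42, 37.78],
                         [-122.42, 37.77]]],
    },
    "device_ids": ["veh-1042", "veh-1043"],
    "alerts": {"on_enter": True, "on_exit": True, "notify": ["ops@example.com"]},
}

resp = requests.post("https://api.example.com/v1/geofences", json=geofence,
                     headers={"Authorization": "Bearer <token>"})
resp.raise_for_status()
```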

Scaling & Bottlenecks

The Kafka ingestion layer scales by adding partitions — at 100,000 events/second with ~200-byte messages, ~20 MB/second fits comfortably on a 24-partition topic across 3 brokers. The stream processing layer (Kafka Streams) scales by adding consumer instances, up to one per partition. The geofence engine scales by replicating the in-memory R-tree across all Kafka Streams instances — each instance runs geofence checks independently, so no coordination is needed.

The time-series database is the storage bottleneck at 100,000 inserts/second. TimescaleDB handles ~50,000 inserts/second per node with batching; sharding by device_id across two nodes reaches the target. Alternatively, InfluxDB's edge data replication model can batch writes further. The dashboard WebSocket service scales via Redis pub/sub fan-out: one Kafka consumer writes to a Redis channel per fleet, and N dashboard pods subscribe to their respective channels.
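A sketch of the batched insert path, assuming psycopg2's execute_values helper and the locations hypertable sketched earlier:

```python
from datetime import datetime, timezone
from psycopg2.extras import execute_values

def write_batch(cur, events: list) -> None:
    """Insert one consumer batch into the hypertable in a single round trip.

    Batching thousands of rows per INSERT is what gets a single
    TimescaleDB node into the tens-of-thousands-of-rows/second range.
    """
    rows = [
        (e["device_id"],
         datetime.fromtimestamp(e["ts"], tz=timezone.utc),
         e["lat"], e["lng"], e["speed"], e["heading"],
         e.get("altitude"), e.get("battery"))
        for e in events
    ]
    execute_values(
        cur,
        "INSERT INTO locations "
        "(device_id, time, lat, lng, speed, heading, altitude, battery_pct) VALUES %s",
        rows,
        page_size=5000,   # rows per generated statement; tune for throughput
    )
```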

Key Trade-offs

  • Update frequency vs. cost — 5-second updates cost 12× more than 60-second updates; for low-value assets (trailers in a yard), 60-second is sufficient; mobile workers need 5-second updates for accuracy
  • MQTT vs. HTTP for IoT — MQTT is 20–40% more bandwidth-efficient and handles reconnects gracefully; HTTP is simpler to debug and works on more networks
  • In-memory geofence index vs. database query — loading all geofences into memory enables sub-millisecond checks but limits total geofence count to available RAM (~1 million polygons at 1 KB each = 1 GB)
  • Hot path vs. cold path separation — current location (hot, Redis) and history (cold, TimescaleDB) are separate stores; joining them for reports requires a fan-out query but keeps each store optimized for its access pattern
