System Design: Delivery ETA Prediction
Design an ML-based delivery ETA prediction system using real-time traffic data, restaurant prep time modeling, and driver behavior analysis to provide accurate time estimates.
Requirements
Functional Requirements:
- Predict total delivery time from order placement to food arrival at customer's door
- Break down ETA into components: restaurant prep time, driver pickup wait, drive time to customer
- Update ETA in real-time as the delivery progresses (driver assigned, food picked up, en route)
- Provide ETA estimates at browse time (before order placement) for restaurant cards
- Track historical accuracy and monitor model performance over time
- Support different transport modes (car, bicycle, scooter, walking)
Non-Functional Requirements:
- ETA prediction latency under 50ms per request
- Accuracy target: 80% of deliveries arrive within 5 minutes of predicted ETA
- Process 100K ETA requests/sec during peak (browse-time estimates for every restaurant card)
- Model retraining daily; real-time feature updates within 30 seconds of signal generation
- Graceful degradation to heuristic-based estimates if ML model is unavailable
Scale Estimation
Browse-time ETA: 50M DAU × 20 restaurant cards viewed per session × 3 sessions/day = 3B ETA requests/day = 34,700 requests/sec average, peaking at 100K/sec during dinner rush. Active delivery ETA updates: 2M concurrent deliveries × 1 update every 30 seconds = 67K updates/sec. Feature computation: each ETA prediction requires 50+ features (restaurant-specific, driver-specific, route-specific, temporal, weather) computed from multiple data sources. Model inference: a gradient-boosted tree ensemble (LightGBM) processes 50 features per prediction in ~2ms on CPU, well within the 50ms latency budget.
High-Level Architecture
The ETA system has three operational modes: browse-time prediction (before order), order-time prediction (at placement), and live prediction (during delivery). Each mode uses different feature sets and models optimized for that stage. Browse-time predictions use aggregate features (average prep time for the restaurant, estimated drive time from restaurant to user, current zone congestion level) and are served from a pre-computed cache. Order-time predictions add order-specific features (item count, item complexity, current restaurant queue depth). Live predictions incorporate real-time driver location and speed.
The ML pipeline follows a standard offline-online architecture. Offline: daily Spark jobs process historical delivery data to extract training features, train the LightGBM model, evaluate against a holdout set, and promote to production if accuracy improves. Online: a Feature Store (Redis for real-time features, Feast for batch features) serves features to the Model Serving layer (a custom HTTP service wrapping LightGBM inference). A Flink streaming job continuously updates real-time features: current restaurant queue depth (how many active orders at the restaurant right now), zone-level traffic speed (aggregated from driver GPS data), and weather conditions (fetched from a weather API every 15 minutes).
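To make the online path concrete, the sketch below assembles features from the real-time layer and runs a single LightGBM inference. It is a minimal sketch under stated assumptions: the model file name, the feature subset, and the default values for missing keys are illustrative, and a production service would also validate feature freshness and batch requests.

```python
import lightgbm as lgb
import numpy as np
import redis

r = redis.Redis(decode_responses=True)
booster = lgb.Booster(model_file="eta_model.txt")  # latest promoted model from the daily training job

def predict_eta_minutes(restaurant_id: str, zone_id: str, item_count: int, hour_of_day: int) -> float:
    """Assemble real-time and request features, then run one LightGBM inference (~2ms on CPU)."""
    rest = r.hgetall(f"rt_features:{restaurant_id}")  # real-time restaurant features (see Database Design)
    zone = r.hgetall(f"rt_features:{zone_id}")        # real-time zone/traffic features

    # Feature order must match the training pipeline; defaults cover missing or expired keys.
    row = np.array([[
        float(rest.get("avg_recent_prep_time", 18.0)),
        float(rest.get("active_order_count", 0)),
        float(zone.get("avg_traffic_speed", 30.0)),
        float(zone.get("congestion_level", 1)),
        float(hour_of_day),
        float(item_count),
    ]])
    return float(booster.predict(row)[0])
```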
A critical component is the feedback loop: after every delivery completes, the actual times (prep time, pickup wait, drive time, total time) are computed from the order event log and compared against predictions. Prediction errors are logged to a monitoring dashboard and fed back into the training dataset. A drift detection system alerts if the model's mean absolute error exceeds the target threshold for any restaurant segment or time window.
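A minimal sketch of the segment-level drift check follows, assuming rows drawn from the delivery_actuals feedback table (described under Database Design) with a restaurant_id attached; the 5-minute alert threshold is an illustrative choice, not the production value.

```python
import numpy as np

MAE_THRESHOLD_MINUTES = 5.0  # illustrative alert threshold per restaurant segment

def drift_alerts(actuals: list[dict]) -> list[str]:
    """Compute per-restaurant mean absolute error from completed deliveries
    and return alert messages for segments above the threshold."""
    errors: dict[str, list[float]] = {}
    for row in actuals:
        err = abs(row["predicted_total_eta"] - row["actual_total_time"])
        errors.setdefault(row["restaurant_id"], []).append(err)

    alerts = []
    for restaurant_id, errs in errors.items():
        mae = float(np.mean(errs))
        if mae > MAE_THRESHOLD_MINUTES:
            alerts.append(f"ETA drift: restaurant {restaurant_id} MAE {mae:.1f} min over window")
    return alerts
```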
Core Components
Restaurant Prep Time Model
Prep time is the largest source of ETA error — it varies by restaurant, order complexity, current queue depth, and time of day. The prep time model is a separate LightGBM model with features: restaurant_id embedding (learned from historical data), number of items in order, item complexity score (computed from modifier count and prep category — e.g., grilled items take longer than cold items), current active orders at the restaurant (from the Order Service), time_of_day, day_of_week, and a restaurant-specific mean_prep_time (rolling 7-day average). The model outputs a point estimate and a confidence interval (10th to 90th percentile). For new restaurants with insufficient history (<50 orders), a fallback uses cuisine-category averages (e.g., pizza restaurants average 18 minutes, sushi restaurants average 25 minutes).
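A minimal sketch of the prep-time prediction path with the cold-start fallback is shown below; the model file name, feature field names, and the default for unmapped cuisines are illustrative assumptions.

```python
import numpy as np
import lightgbm as lgb

prep_model = lgb.Booster(model_file="prep_time_model.txt")

# Cuisine-category fallback for restaurants with fewer than 50 historical orders.
CUISINE_FALLBACK_MINUTES = {"pizza": 18.0, "sushi": 25.0}
DEFAULT_FALLBACK_MINUTES = 20.0  # illustrative default for unmapped cuisines

def predict_prep_minutes(features: dict, order_history_count: int, cuisine: str) -> float:
    """Point estimate of restaurant prep time; cold-start restaurants use cuisine averages."""
    if order_history_count < 50:
        return CUISINE_FALLBACK_MINUTES.get(cuisine, DEFAULT_FALLBACK_MINUTES)

    row = np.array([[
        features["item_count"],
        features["item_complexity_score"],   # derived from modifier count and prep category
        features["active_orders"],           # current queue depth from the Order Service
        features["hour_of_day"],
        features["day_of_week"],
        features["mean_prep_time_7d"],       # rolling 7-day restaurant average
    ]])
    return float(prep_model.predict(row)[0])
```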
Route Time Estimation Engine
Drive time prediction uses a graph-based routing engine (OSRM — Open Source Routing Machine) enhanced with real-time traffic overlays. The base route is computed by OSRM using the OpenStreetMap road network. Travel times for each road segment are then adjusted using real-time speed data derived from the driver fleet: aggregated GPS speeds from all drivers on each road segment in the last 15 minutes, stored in a Redis hash traffic:{road_segment_id} → current_speed_kmh. Historical speed profiles (average speed by road segment, hour-of-day, and day-of-week) fill in gaps where no recent driver data exists. For bicycle and scooter deliveries, separate speed profiles account for bike lanes, one-way restrictions that don't apply to bicycles, and elevation (hills significantly impact bicycle speed). The routing engine pre-computes travel time matrices between popular origin-destination pairs (restaurant H3 cells to customer H3 cells) and caches results in Redis with a 10-minute TTL.
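The per-segment adjustment can be sketched as below, assuming the traffic:{road_segment_id} hash exposes a current_speed_kmh field and a historical (segment, hour, day-of-week) speed profile is available in memory; the default speed and clamp values are illustrative.

```python
import redis

r = redis.Redis(decode_responses=True)

def segment_travel_seconds(segment_id: str, length_m: float, hour: int, dow: int,
                           historical_speed_kmh: dict) -> float:
    """Travel time for one road segment: prefer fleet-derived speed from the last
    15 minutes, otherwise fall back to the historical profile for (segment, hour, dow)."""
    speed = r.hget(f"traffic:{segment_id}", "current_speed_kmh")
    if speed is None:
        speed = historical_speed_kmh.get((segment_id, hour, dow), 25.0)  # illustrative default
    speed_kmh = max(float(speed), 5.0)  # clamp to avoid divide-by-near-zero on stalled segments
    return (length_m / 1000.0) / speed_kmh * 3600.0

def route_travel_seconds(osrm_segments: list[dict], hour: int, dow: int,
                         historical_speed_kmh: dict) -> float:
    """Sum the adjusted per-segment times along the OSRM base route."""
    return sum(segment_travel_seconds(s["id"], s["length_m"], hour, dow, historical_speed_kmh)
               for s in osrm_segments)
```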
Live ETA Update Pipeline
Once a delivery is active, the ETA is continuously refined as new information arrives. The pipeline operates as a state machine with three phases: (1) Pre-pickup: ETA = remaining_prep_time + driver_travel_to_restaurant + drive_time_to_customer. As driver location updates arrive, driver_travel_to_restaurant is recalculated by snapping the driver's position to the route and computing remaining distance. (2) At restaurant: ETA = remaining_prep_time (if food not ready) + drive_time_to_customer. The system detects the driver's arrival via geofence and switches to this phase. (3) En route: ETA = remaining_drive_time, calculated from the driver's current position on the route. A Flink job processes the driver location stream (keyed by delivery_id), maintains the current phase state, and emits updated ETAs to a Kafka topic consumed by the consumer-facing tracking service.
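A minimal sketch of the phase-dependent combination follows; in the real pipeline this state lives in the Flink job keyed by delivery_id, and each component is recomputed whenever a driver location update arrives.

```python
from enum import Enum

class Phase(Enum):
    PRE_PICKUP = "preparing"
    AT_RESTAURANT = "at_restaurant"
    EN_ROUTE = "en_route"

def live_eta_minutes(phase: Phase, remaining_prep_min: float,
                     driver_to_restaurant_min: float, drive_to_customer_min: float) -> float:
    """Combine ETA components according to the current delivery phase."""
    if phase is Phase.PRE_PICKUP:
        return remaining_prep_min + driver_to_restaurant_min + drive_to_customer_min
    if phase is Phase.AT_RESTAURANT:
        # remaining_prep_min is zero once the food is ready and waiting for the driver.
        return remaining_prep_min + drive_to_customer_min
    return drive_to_customer_min  # EN_ROUTE: only the remaining drive matters
```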
Database Design
The Feature Store uses a dual-layer architecture. The real-time layer in Redis stores features that change frequently: rt_features:{restaurant_id} → hash containing active_order_count, avg_recent_prep_time (rolling 1-hour), current_queue_items, and last_updated. rt_features:{zone_id} → hash containing avg_traffic_speed, congestion_level, weather_code, and surge_multiplier. Driver-specific features: rt_features:{driver_id} → hash with current_speed, heading, and active_delivery_count. Each key has a 5-minute TTL with refresh on update.
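A minimal sketch of the real-time layer's write path, assuming a redis-py client; in practice the writer is the Flink feature job described earlier, and every write refreshes the 5-minute TTL.

```python
import time
import redis

r = redis.Redis(decode_responses=True)
FEATURE_TTL_SECONDS = 300  # 5-minute TTL, refreshed on every update

def update_restaurant_features(restaurant_id: str, active_order_count: int,
                               avg_recent_prep_time: float, current_queue_items: int) -> None:
    """Write the rt_features:{restaurant_id} hash and refresh its TTL."""
    key = f"rt_features:{restaurant_id}"
    r.hset(key, mapping={
        "active_order_count": active_order_count,
        "avg_recent_prep_time": avg_recent_prep_time,
        "current_queue_items": current_queue_items,
        "last_updated": int(time.time()),
    })
    r.expire(key, FEATURE_TTL_SECONDS)
```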
The batch layer stores historical features in PostgreSQL: restaurant_features table with restaurant_id, cuisine_type, avg_prep_time_7d, std_prep_time_7d, avg_prep_time_by_hour (JSONB array of 24 values), order_volume_by_hour, and updated_at. A delivery_actuals table logs ground truth: delivery_id, predicted_total_eta, actual_total_time, predicted_prep_time, actual_prep_time, predicted_drive_time, actual_drive_time, prediction_error, model_version, and feature_snapshot (JSONB blob of all features used at prediction time for debugging). This table is the primary training data source, growing at 15M rows/day.
API Design
- GET /api/v1/eta/browse?restaurant_id={id}&customer_lat={lat}&customer_lng={lng} — Browse-time ETA estimate; returns estimated_minutes, confidence_range (min, max), breakdown (prep_minutes, drive_minutes)
- POST /api/v1/eta/order — Order-time ETA with full feature set; body contains order_id, restaurant_id, items, customer_location; returns detailed prediction with component breakdown and confidence interval
- GET /api/v1/eta/live/{delivery_id} — Current live ETA for an active delivery; returns eta_minutes, phase (preparing/at_restaurant/en_route), driver_location, last_updated
- GET /api/v1/eta/accuracy?restaurant_id={id}&window=7d — ETA accuracy metrics for a restaurant; returns mean_absolute_error, within_5min_percentage, bias (consistently early/late)
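A hedged usage example for the browse endpoint; the host, parameter values, and response numbers are illustrative, while the field names follow the contract above.

```python
import requests

resp = requests.get(
    "https://api.example.com/api/v1/eta/browse",
    params={"restaurant_id": "r_123", "customer_lat": 37.7749, "customer_lng": -122.4194},
    timeout=0.2,  # browse path has a tight budget; fail fast and fall back to a heuristic
)
eta = resp.json()
# Expected shape (values illustrative):
# {"estimated_minutes": 32,
#  "confidence_range": {"min": 27, "max": 38},
#  "breakdown": {"prep_minutes": 14, "drive_minutes": 18}}
print(eta["estimated_minutes"], eta["breakdown"])
```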
Scaling & Bottlenecks
The browse-time ETA endpoint is the highest throughput path at 100K requests/sec. Each request requires a feature lookup (Redis) and model inference (LightGBM, ~2ms CPU). A fleet of 50 inference servers (each handling 2K requests/sec) handles this load. To further reduce latency, browse-time ETAs for popular restaurant-zone pairs are pre-computed every 5 minutes by a batch job and stored in Redis: eta_cache:{restaurant_id}:{customer_h3} → estimated_minutes. Cache hit rate is ~70%, reducing inference load by 70K requests/sec.
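The cache read and the 5-minute warming job can be sketched as follows; the 10-minute safety TTL on the cache entry is an illustrative assumption rather than a stated parameter.

```python
import redis

r = redis.Redis(decode_responses=True)

def cached_browse_eta(restaurant_id: str, customer_h3: str) -> float | None:
    """Return the pre-computed browse ETA if present; None means the caller
    should fall through to live inference on the model-serving fleet."""
    cached = r.get(f"eta_cache:{restaurant_id}:{customer_h3}")
    return float(cached) if cached is not None else None  # ~70% hit rate on this lookup

def warm_browse_eta(restaurant_id: str, customer_h3: str, estimated_minutes: float) -> None:
    """Written by the 5-minute batch job for popular restaurant-zone pairs."""
    r.set(f"eta_cache:{restaurant_id}:{customer_h3}", estimated_minutes, ex=600)  # illustrative TTL
```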
The OSRM routing engine is CPU-intensive for route computation. Pre-computing travel time matrices for all restaurant-to-customer-zone pairs is infeasible (900K restaurants × 100K H3 cells = 90B pairs). Instead, the system pre-computes the top 500 most frequent pairs per restaurant (covering ~85% of orders) and falls back to live OSRM queries for uncached pairs. The OSRM cluster (10 instances) handles the ~15K fallback queries/sec with p99 latency under 30ms. Real-time traffic overlay updates (aggregating driver speeds per road segment) run on a separate Flink pipeline that processes 500K location events/sec and updates 2M road segment speeds in Redis.
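In production the traffic overlay is a Flink job; the sketch below shows the same 15-minute windowed aggregation in plain Python, assuming GPS pings have already been map-matched to road segments.

```python
import time
from collections import defaultdict

import redis

r = redis.Redis(decode_responses=True)
WINDOW_SECONDS = 15 * 60  # aggregate fleet speeds over the last 15 minutes

# segment_id -> list of (timestamp, speed_kmh) observations from driver GPS pings
_observations: dict[str, list[tuple[float, float]]] = defaultdict(list)

def record_ping(segment_id: str, speed_kmh: float) -> None:
    """Accumulate one map-matched driver speed observation for a road segment."""
    _observations[segment_id].append((time.time(), speed_kmh))

def publish_segment_speeds() -> None:
    """Drop stale observations and publish the windowed average speed per segment."""
    cutoff = time.time() - WINDOW_SECONDS
    for segment_id, obs in _observations.items():
        recent = [(t, s) for t, s in obs if t >= cutoff]
        _observations[segment_id] = recent
        if recent:
            avg_speed = sum(s for _, s in recent) / len(recent)
            r.hset(f"traffic:{segment_id}", "current_speed_kmh", round(avg_speed, 1))
```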
Key Trade-offs
- LightGBM over deep neural networks for ETA prediction: Gradient-boosted trees provide interpretable feature importances (critical for debugging why a specific restaurant's ETA is consistently wrong), fast CPU inference (~2ms vs ~20ms for a neural network), and robust performance with tabular features — neural networks could capture more complex interactions but the accuracy gap is marginal (<2% MAE improvement) for significantly higher serving cost
- Separate prep time and drive time models over a single end-to-end model: Decomposing the prediction into components allows independent monitoring and debugging ("is prep time or drive time causing inaccuracy?"), but introduces error compounding where individual model errors accumulate — mitigated by a thin calibration layer that adjusts the sum based on historical total-time data (see the sketch after this list)
- Pre-computed ETA cache for browse-time over live inference: Caching reduces inference load by 70% and ensures sub-10ms response times for browse-time estimates, but cached ETAs can be stale by up to 5 minutes — acceptable for browse-time where ±3 minute accuracy is sufficient
- Real-time traffic from fleet GPS over third-party traffic APIs: Using the driver fleet as a traffic sensor provides road-segment-level granularity specific to delivery routes (side streets, restaurant areas), while third-party APIs provide highway-focused data — the fleet data is free but sparse in low-driver-density areas, where the system falls back to historical speed profiles
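As referenced in the second trade-off, a thin calibration layer can be as simple as a linear fit on historical (component-sum, actual total) pairs; this is a minimal sketch under that assumption, not the production calibrator.

```python
import numpy as np

def fit_calibration(predicted_sums: np.ndarray, actual_totals: np.ndarray) -> tuple[float, float]:
    """Fit actual_total ~= a * component_sum + b on historical deliveries."""
    a, b = np.polyfit(predicted_sums, actual_totals, deg=1)
    return float(a), float(b)

def calibrated_eta_minutes(prep_min: float, pickup_wait_min: float, drive_min: float,
                           a: float, b: float) -> float:
    """Apply the calibration to the sum of the component models' outputs."""
    return a * (prep_min + pickup_wait_min + drive_min) + b
```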