System Design: Surge Pricing System

Design a real-time surge pricing system that dynamically adjusts delivery fees based on demand-supply imbalances, price elasticity modeling, and geographic demand patterns.

Requirements

Functional Requirements:

  • Calculate and apply dynamic delivery fees based on real-time demand and driver supply
  • Define geographic pricing zones with independent surge multipliers
  • Display current surge level to users before order placement
  • Implement price caps and regulatory constraints per market
  • Support time-based pricing rules (happy hour discounts, late-night surcharges)
  • A/B test different pricing strategies across user segments

Non-Functional Requirements:

  • Surge multiplier recalculation every 60 seconds per zone
  • Price lookup latency under 20ms for any zone
  • 99.99% availability — pricing service failure blocks all order placement
  • Consistent pricing within a user session (no price change between viewing fee and placing order)
  • Audit trail for all pricing decisions for regulatory compliance

Scale Estimation

With 10,000 pricing zones globally (cities subdivided using H3 hexagons at resolution 5, roughly 250 km² per hex), each recalculated every 60 seconds, that is about 167 zone recalculations/sec. Each recalculation aggregates demand (order requests in the last 5 minutes) and supply (available drivers in the zone) by querying a real-time stream processor. On the read side, every restaurant search and order placement requires a surge check, which works out to roughly 50K lookups/sec during peak. The pricing decision for each lookup must consider zone-level surge, time-of-day rules, user segment, A/B test assignment, and regulatory caps, all within the 20ms latency budget.

High-Level Architecture

The surge pricing system has three layers: Signal Collection, Price Computation, and Price Serving. The Signal Collection layer continuously aggregates demand and supply signals from Kafka streams. Demand signals include order placement events, search events (users browsing restaurants in a zone), and cart creation events. Supply signals include driver availability updates, driver location changes, and shift-end predictions. These signals are aggregated by a Flink streaming job into per-zone counters with 1-minute tumbling windows.

The Price Computation layer runs a pricing engine every 60 seconds per zone. It consumes the aggregated demand/supply metrics and applies the pricing model: a base delivery fee multiplied by a surge multiplier derived from the demand-to-supply ratio. The model incorporates price elasticity curves (how demand changes in response to price increases) calibrated per market from historical data. The computed surge multiplier is bounded by regulatory caps (e.g., max 3x in some markets) and business rules (new user price protection, loyalty tier discounts). Results are written to a Price Store (Redis) for fast serving.

The Price Serving layer is a lightweight read service that resolves prices for a given location. When a user views a restaurant, the UI Service calls the Price Serving layer with the restaurant's coordinates. The service determines the H3 zone, looks up the current surge multiplier from Redis, applies user-specific adjustments (subscription discounts, A/B test variants), and returns the final delivery fee. A session-based price lock mechanism caches the quoted price for 10 minutes, ensuring the user pays the price they saw at browse time.
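As a concrete illustration, here is a minimal Python sketch of that serving path, assuming the h3 (v4 API) and redis packages; the function name lookup_delivery_fee and the stale-record fallback are illustrative assumptions, not part of the design above.

```python
import json
import time

import h3      # assumes h3-py v4 (h3.latlng_to_cell)
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

H3_RESOLUTION = 5  # matches the zone resolution used in this design


def lookup_delivery_fee(lat: float, lng: float, user_discount: float = 0.0) -> dict:
    """Resolve the current delivery fee for a coordinate pair."""
    zone_id = h3.latlng_to_cell(lat, lng, H3_RESOLUTION)
    raw = r.get(f"surge:{zone_id}")
    if raw is None:
        # Fallback (an assumption): if the price record has expired,
        # degrade to no surge rather than failing the request.
        return {"zone_id": zone_id, "surge_multiplier": 1.0, "stale": True}
    record = json.loads(raw)
    fee = record["effective_delivery_fee"] * (1.0 - user_discount)
    return {
        "zone_id": zone_id,
        "surge_multiplier": record["multiplier"],
        "delivery_fee": round(fee, 2),
        "quoted_at": int(time.time()),
    }
```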

Core Components

Real-Time Demand/Supply Aggregator

A Flink streaming application consumes three Kafka topics: order-events (demand), driver-availability (supply), and search-events (a leading demand indicator). For each H3 pricing zone, Flink maintains a tumbling-window demand counter (weighted: order_placed=1.0, cart_created=0.5, search=0.2) and a supply count (available drivers currently in the zone). Every 60 seconds, the window closes and emits a zone-metrics event containing zone_id, demand_score, supply_count, demand_supply_ratio, and trend_direction (increasing/decreasing/stable, based on comparison with the previous window). The stream is keyed by zone_id, so the 10,000 zones are partitioned as keyed state across the job's parallel subtasks.
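The windowing logic can be illustrated without a Flink cluster. Below is a simplified, single-process Python sketch of what one 60-second window computes per zone (not actual Flink code); the 10% band used for trend_direction and the event field names are assumptions for illustration.

```python
from collections import defaultdict

# Event weights from the design: orders count fully, carts half, searches 0.2.
DEMAND_WEIGHTS = {"order_placed": 1.0, "cart_created": 0.5, "search": 0.2}


def aggregate_window(events, prev_ratios):
    """Collapse one 60-second window of events into per-zone metrics.

    `events` is an iterable of dicts with zone_id and event_type;
    `prev_ratios` holds the prior window's demand/supply ratio per zone,
    used to derive trend_direction. Supply is simplified here to a count
    of driver-availability events within the window.
    """
    demand = defaultdict(float)
    supply = defaultdict(int)
    for e in events:
        if e["event_type"] in DEMAND_WEIGHTS:
            demand[e["zone_id"]] += DEMAND_WEIGHTS[e["event_type"]]
        elif e["event_type"] == "driver_available":
            supply[e["zone_id"]] += 1

    metrics = []
    for zone_id, score in demand.items():
        ratio = score / max(supply[zone_id], 1)  # avoid divide-by-zero
        prev = prev_ratios.get(zone_id, ratio)
        # A +/-10% band for "stable" is an assumption, not from the design.
        trend = ("increasing" if ratio > prev * 1.1
                 else "decreasing" if ratio < prev * 0.9 else "stable")
        metrics.append({"zone_id": zone_id, "demand_score": score,
                        "supply_count": supply[zone_id],
                        "demand_supply_ratio": ratio,
                        "trend_direction": trend})
    return metrics
```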

Pricing Model Engine

The pricing engine applies a piecewise linear surge function: when the demand/supply ratio is below 1.0, the surge multiplier is 1.0 (no surge); between 1.0 and 2.0, the multiplier scales linearly from 1.0x to 1.5x; between 2.0 and 4.0, it scales from 1.5x to 2.5x; above 4.0, it continues along the same slope until it is clamped at the 3.0x regulatory maximum. This function is calibrated per market using price elasticity data: markets where demand is highly elastic (price-sensitive users) get gentler slopes. The engine also applies temporal smoothing: the surge multiplier can increase by at most 0.5x per cycle (preventing jarring price jumps), and decreases are smoothed with exponential decay (half-life of 5 minutes). A/B test variants override specific function parameters, for example testing a more aggressive slope in a treatment group to measure revenue impact.
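A Python sketch of the surge curve and smoothing follows, under the assumption that the curve continues at the same 0.5x-per-unit slope past a ratio of 4.0 until the cap; per-market calibration would replace these hard-coded breakpoints.

```python
def raw_surge_multiplier(ratio: float, cap: float = 3.0) -> float:
    """Piecewise linear surge curve from the text."""
    if ratio < 1.0:
        return 1.0
    if ratio <= 2.0:
        return 1.0 + 0.5 * (ratio - 1.0)        # 1.0x -> 1.5x
    if ratio <= 4.0:
        return 1.5 + 0.5 * (ratio - 2.0)        # 1.5x -> 2.5x
    # Assumption: same slope continues above 4.0 until the regulatory cap.
    return min(2.5 + 0.5 * (ratio - 4.0), cap)


HALF_LIFE_SECONDS = 300  # 5-minute half-life for downward smoothing
MAX_STEP_UP = 0.5        # at most +0.5x per 60-second recalculation cycle


def smoothed_multiplier(previous: float, target: float, dt: float = 60.0) -> float:
    """Apply the temporal smoothing rules to a newly computed multiplier."""
    if target > previous:
        # Increases are capped per cycle to avoid jarring price jumps.
        return min(target, previous + MAX_STEP_UP)
    # Decreases decay exponentially toward the target (5-minute half-life).
    decay = 0.5 ** (dt / HALF_LIFE_SECONDS)
    return target + (previous - target) * decay
```

For example, if the previous multiplier was 1.0x and the new ratio implies 2.5x, the engine serves 1.5x this cycle and keeps stepping up on subsequent cycles; on the way down, a 2.5x multiplier targeting 1.0x decays to roughly 2.3x after one 60-second cycle.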

Price Lock & Session Manager

To prevent a frustrating user experience where the delivery fee changes between browsing and checkout, the system implements price locking. When a user first queries the price for a restaurant, the Price Serving layer generates a price_lock_token containing the zone_id, surge_multiplier, delivery_fee, timestamp, and user_id, signed with HMAC-SHA256. This token is returned to the client and must be submitted with the order. The Order Service validates the token signature and checks that the timestamp is within the 10-minute lock window. If the lock has expired, a new price is fetched and the user is shown the updated fee before confirming. The token is stateless (validated by signature verification, not a database lookup) to avoid adding latency or state management overhead.
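A minimal sketch of stateless lock issuance and validation using Python's standard hmac module; the token layout (JSON payload plus hex signature) and constant names are illustrative, and a production system would manage the secret externally.

```python
import hashlib
import hmac
import json
import time

SECRET = b"rotate-me"    # illustrative; use a managed, rotated secret in practice
LOCK_TTL_SECONDS = 600   # 10-minute lock window


def issue_price_lock(zone_id: str, multiplier: float, fee: float, user_id: str) -> str:
    payload = json.dumps({"zone_id": zone_id, "surge_multiplier": multiplier,
                          "delivery_fee": fee, "user_id": user_id,
                          "ts": int(time.time())}, sort_keys=True)
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{sig}"


def validate_price_lock(token: str, user_id: str):
    """Return the token's claims if valid, else None. No database lookup."""
    # The hex signature contains no dots, so splitting on the last dot
    # recovers the payload unambiguously.
    payload, _, sig = token.rpartition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered or malformed token
    claims = json.loads(payload)
    if claims["user_id"] != user_id:
        return None  # token issued to a different user
    if time.time() - claims["ts"] > LOCK_TTL_SECONDS:
        return None  # expired; caller should fetch a fresh quote
    return claims
```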

Database Design

The Price Store uses Redis with a simple key-value structure: surge:{zone_id} maps to a JSON object containing multiplier, base_fee, effective_delivery_fee, demand_score, supply_count, updated_at, and expires_at (with a 120-second TTL as a safety net). Zone configuration (base fees, regulatory caps, elasticity parameters) is stored in PostgreSQL in a pricing_zones table with columns zone_id (H3 index), market_id, base_delivery_fee, max_multiplier, elasticity_curve_params (JSONB), active_experiments (JSONB array), and updated_at. This configuration is loaded into the Pricing Engine's memory at startup and refreshed every 5 minutes by polling.
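The write path from the pricing engine might look like the following sketch, assuming the redis-py client; publish_zone_price is an illustrative name. The 120-second TTL means a stalled engine degrades to a cache miss rather than serving stale surge values.

```python
import json
import time

import redis

r = redis.Redis(decode_responses=True)


def publish_zone_price(zone_id: str, multiplier: float, base_fee: float,
                       demand_score: float, supply_count: int) -> None:
    """Write one recalculated price record with the safety-net TTL."""
    now = int(time.time())
    record = {
        "multiplier": multiplier,
        "base_fee": base_fee,
        "effective_delivery_fee": round(base_fee * multiplier, 2),
        "demand_score": demand_score,
        "supply_count": supply_count,
        "updated_at": now,
        "expires_at": now + 120,
    }
    # ex=120 sets the Redis TTL so readers never see a record older
    # than two recalculation cycles.
    r.set(f"surge:{zone_id}", json.dumps(record), ex=120)
```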

For audit and compliance, every pricing decision is logged to a ClickHouse table: zone_id, timestamp, demand_score, supply_count, ratio, raw_multiplier (before caps), final_multiplier, regulatory_cap_applied (boolean), experiment_variant, and config_version. This append-only log supports regulatory inquiries ("why was the fee $12 in Zone X at 7:15 PM on March 3rd?") and provides training data for elasticity model calibration. ClickHouse ingests these events via Kafka Connect at 167 events/sec with efficient columnar compression.

API Design

  • GET /api/v1/pricing/delivery-fee?lat={lat}&lng={lng}&restaurant_id={id} — Get current delivery fee for a location; returns delivery_fee, surge_multiplier, price_lock_token, expires_at (see the example call after this list)
  • POST /api/v1/pricing/validate-lock — Validate a price lock token; body contains price_lock_token; returns valid (boolean), delivery_fee, and remaining_ttl_seconds
  • GET /api/v1/pricing/zones/{zone_id}/history?start={ts}&end={ts} — Retrieve surge history for a zone (internal/admin); returns time-series of multiplier values
  • PUT /api/v1/pricing/zones/{zone_id}/config — Update zone pricing configuration (admin); body contains base_fee, max_multiplier, elasticity_params
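A hypothetical client interaction with the first two endpoints, assuming the requests package; the host, coordinates, and response values are illustrative.

```python
import requests

BASE = "https://pricing.internal.example.com"  # illustrative host

# Quote a delivery fee for a location; the response carries the lock token.
resp = requests.get(f"{BASE}/api/v1/pricing/delivery-fee",
                    params={"lat": 40.7128, "lng": -74.0060,
                            "restaurant_id": "r-123"})
quote = resp.json()
# e.g. {"delivery_fee": 7.49, "surge_multiplier": 1.5,
#       "price_lock_token": "...", "expires_at": 1736950000}

# Later, at checkout, confirm the lock is still valid.
check = requests.post(f"{BASE}/api/v1/pricing/validate-lock",
                      json={"price_lock_token": quote["price_lock_token"]})
# e.g. {"valid": true, "delivery_fee": 7.49, "remaining_ttl_seconds": 540}
```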

Scaling & Bottlenecks

The pricing system is read-heavy: 50K price lookups/sec vs. 167 price computations/sec. The Redis Price Store handles this comfortably on a 3-node cluster with read replicas. The real bottleneck is the Flink aggregation job during extreme demand events (e.g., Super Bowl Sunday, New Year's Eve) when order event volume spikes 10x. Under such spikes, the Flink job's Kafka consumer can fall behind, causing stale demand metrics and delayed surge responses. This is mitigated by pre-scaling the Flink job before known events (based on a calendar of predicted high-demand dates) and using Kafka consumer group lag monitoring to trigger auto-scaling.
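Consumer lag can be measured directly from the broker. A sketch using the kafka-python client is below; the 100,000-event threshold and trigger_flink_rescale hook are assumptions standing in for a real autoscaler integration.

```python
from kafka import KafkaConsumer, TopicPartition


def consumer_group_lag(bootstrap: list, group_id: str, topic: str) -> int:
    """Total lag (latest offset minus committed offset) across partitions."""
    consumer = KafkaConsumer(bootstrap_servers=bootstrap, group_id=group_id,
                             enable_auto_commit=False)
    partitions = [TopicPartition(topic, p)
                  for p in consumer.partitions_for_topic(topic)]
    end_offsets = consumer.end_offsets(partitions)
    lag = 0
    for tp in partitions:
        committed = consumer.committed(tp) or 0
        lag += end_offsets[tp] - committed
    consumer.close()
    return lag


# Illustrative trigger: scale out when lag exceeds roughly one window of events.
# if consumer_group_lag(["kafka:9092"], "surge-aggregator", "order-events") > 100_000:
#     trigger_flink_rescale()  # hypothetical hook into your autoscaler
```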

The price lock token system is intentionally stateless (HMAC-signed, not stored in a database) to avoid a state management bottleneck. The downside is that revoked locks (e.g., when a market-wide price correction occurs) cannot be invalidated without checking a revocation list. The compromise is a lightweight Redis set, revoked-locks, checked during order placement; it contains only lock tokens from the last 10 minutes that need to be invalidated, keeping the set small (typically under 1,000 entries).
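A sketch of that revocation set with redis-py follows; note that this simple version expires the whole set rather than individual entries, which is acceptable here because tokens older than the 10-minute lock window are already invalid on their own.

```python
import redis

r = redis.Redis(decode_responses=True)
REVOKED_SET = "revoked-locks"


def revoke_lock(token_sig: str) -> None:
    """Add a token's signature to the revocation set.

    The set only needs to outlive the 10-minute lock window, so we refresh
    a 600-second TTL on the whole set instead of tracking per-entry expiry.
    """
    r.sadd(REVOKED_SET, token_sig)
    r.expire(REVOKED_SET, 600)


def is_revoked(token_sig: str) -> bool:
    """Checked during order placement, alongside HMAC validation."""
    return bool(r.sismember(REVOKED_SET, token_sig))
```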

Key Trade-offs

  • 60-second recalculation interval over real-time per-event pricing: Batch pricing every 60 seconds provides stable, predictable pricing and limits computational cost, but may lag behind rapid demand shifts during flash events — the smoothing algorithm helps prevent jarring jumps when the system catches up
  • Piecewise linear function over ML-based dynamic pricing: A simple parameterized function is interpretable, auditable (critical for regulatory compliance), and fast to compute — ML models could capture more nuanced demand patterns but are black boxes that regulators and users distrust
  • Stateless price lock tokens over database-stored quotes: HMAC-signed tokens scale infinitely with zero database overhead, but cannot be revoked individually — the revocation list in Redis is a pragmatic compromise for the rare cases where mass invalidation is needed
  • H3 hexagons over arbitrary polygon zones: H3 provides uniform zone sizes, hierarchical aggregation (zoom in/out by changing resolution), and pre-computed neighbor lists, but zone boundaries don't align with natural neighborhood boundaries — the resolution 5 granularity (~250 km²) is a balance between pricing precision and operational manageability
