
System Design: Stock Trading Platform

Design a stock trading platform with a high-performance order matching engine, real-time market data distribution, FIX protocol connectivity, and regulatory compliance for handling millions of trades per day.

19 min read · Updated Jan 15, 2025
Tags: system-design · stock-trading · fintech · matching-engine

Requirements

Functional Requirements:

  • Users can place market, limit, stop, and stop-limit orders for equities
  • Real-time order matching engine executing trades by price-time priority
  • Live market data streaming (Level 1 quotes, Level 2 order book depth)
  • Portfolio management with real-time P&L calculation and margin tracking
  • Integration with exchanges and market makers via FIX protocol (Financial Information eXchange)
  • Post-trade processing: clearing, settlement (T+1), and regulatory reporting

Non-Functional Requirements:

  • Order matching latency under 50 microseconds for co-located engines
  • Support 500,000 orders/sec during market open volatility spikes
  • Market data distribution to 2M concurrent subscribers with under 10ms latency
  • 99.999% availability during market hours (9:30 AM - 4:00 PM ET)
  • Full order audit trail with nanosecond timestamps for SEC/FINRA compliance

Scale Estimation

US equity markets process ~12B shares/day across ~100M orders. Key numbers:

  • Order flow: a retail brokerage handling 5% of retail flow processes 2M orders/day — about 23 orders/sec on average, but the opening bell (9:30-9:35 AM) concentrates 20x or more normal volume, driving the 500,000 orders/sec peak design target
  • Market data: 8,000 listed symbols × ~100 quote updates/sec each = 800K messages/sec of raw market data; unfiltered fanout to 2M subscribers would be 800K × 2M = 1.6T messages/sec, which is why symbol filtering at the edge is essential
  • Portfolio valuation: 10M user accounts × 15 holdings average = 150M open positions to revalue as prices tick
  • Order book: each symbol maintains a price-ordered book with an average depth of ~500 price levels on each side
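The back-of-envelope arithmetic above can be checked directly (all inputs are the article's assumed figures, not measurements):

```python
# Scale estimation from the article's assumptions.
ORDERS_PER_DAY = 2_000_000
SECONDS_PER_DAY = 86_400
avg_orders_per_sec = ORDERS_PER_DAY / SECONDS_PER_DAY   # ~23/sec average

SYMBOLS = 8_000
QUOTES_PER_SYMBOL_PER_SEC = 100
raw_feed_msgs_per_sec = SYMBOLS * QUOTES_PER_SYMBOL_PER_SEC   # 800K/sec

ACCOUNTS = 10_000_000
HOLDINGS_PER_ACCOUNT = 15
total_positions = ACCOUNTS * HOLDINGS_PER_ACCOUNT   # 150M open positions

print(round(avg_orders_per_sec), raw_feed_msgs_per_sec, total_positions)
```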

High-Level Architecture

The trading platform is split into three latency tiers. The ultra-low-latency tier contains the Order Matching Engine — a single-threaded, lock-free C++ application running on bare metal with kernel bypass networking (DPDK) and huge pages. It receives orders via a binary protocol over TCP and matches them against the order book using price-time priority (FIFO at each price level). Matched trades emit execution reports to a sequencer that assigns globally ordered sequence numbers.

The low-latency tier handles order routing, risk checks, and market data distribution. The Order Management System (OMS) receives client orders via REST/WebSocket or FIX protocol, performs pre-trade risk checks (buying power, position limits, restricted securities), and routes orders to the matching engine or external exchanges. The Market Data Service subscribes to exchange feeds (SIP, direct feeds) via multicast UDP, normalizes the data, and fans out to subscribers via a pub/sub layer (Aeron for internal distribution, WebSocket for retail clients).

The batch tier handles post-trade processing. After market close, the Clearing Service matches executed trades with the NSCC (National Securities Clearing Corporation) for T+1 settlement. The Reporting Service generates CAT (Consolidated Audit Trail) reports required by the SEC and FINRA — CAT replaced the older OATS (Order Audit Trail System), which FINRA retired in 2021. A reconciliation engine compares the platform's trade records with exchange confirmations and custodian reports.

Core Components

Order Matching Engine

The matching engine is the heart of the platform. It maintains an in-memory order book per symbol implemented as two price-sorted structures — bids ordered highest price first, asks lowest first — with a FIFO queue of resting orders at each price level (a plain heap is insufficient: price-time priority requires time ordering within a level, and cancels need to locate orders quickly). When a new order arrives, the engine checks whether it can be matched against resting orders on the opposite side. A market buy order matches against the lowest ask price; a limit buy matches if its limit price >= the lowest ask. Partial fills are supported — a large order may match against multiple resting orders at different price levels. The engine is single-threaded to avoid lock contention (inspired by the LMAX Disruptor pattern), processing orders from a ring buffer. State is replicated to a hot standby via a binary journal (write-ahead log) for failover within 50ms.
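The matching logic described above can be sketched in a few dozen lines. This is an illustrative Python model of price-time priority with partial fills — not the production C++ engine — and the method names and fill-tuple shape are assumptions:

```python
from collections import deque

class OrderBook:
    """Minimal price-time-priority book for one symbol (sketch).
    Prices are integer ticks; quantities are shares."""

    def __init__(self):
        self.bids = {}  # price -> deque of [order_id, remaining_qty] (FIFO)
        self.asks = {}

    def add_limit(self, order_id, side, price, qty):
        """Match against the opposite side, then rest any remainder.
        Returns a list of (resting_order_id, fill_price, fill_qty)."""
        fills = []
        opposite = self.asks if side == "BUY" else self.bids
        crosses = (lambda p: p <= price) if side == "BUY" else (lambda p: p >= price)
        while qty > 0 and opposite:
            best = min(opposite) if side == "BUY" else max(opposite)
            if not crosses(best):
                break
            queue = opposite[best]
            while qty > 0 and queue:
                resting = queue[0]                  # oldest order first (time priority)
                traded = min(qty, resting[1])
                fills.append((resting[0], best, traded))
                resting[1] -= traded
                qty -= traded
                if resting[1] == 0:
                    queue.popleft()
            if not queue:
                del opposite[best]                  # price level exhausted
        if qty > 0:                                 # rest the unfilled remainder
            book = self.bids if side == "BUY" else self.asks
            book.setdefault(price, deque()).append([order_id, qty])
        return fills

book = OrderBook()
book.add_limit(1, "SELL", 101, 100)   # rests at 101
book.add_limit(2, "SELL", 100, 50)    # rests at 100 (better price)
fills = book.add_limit(3, "BUY", 101, 120)
# The buy sweeps the cheaper ask level first, then partially fills 101.
```

Note how the incoming buy consumes the 100 level entirely before touching 101, and the 101 resting order keeps its unfilled remainder — exactly the price-then-time ordering the engine enforces.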

Market Data Distribution

Market data flows through a three-tier distribution tree. Tier 1: the Market Data Handler receives raw feeds from exchanges via multicast UDP and decodes proprietary binary protocols (ITCH for Nasdaq, PITCH for CBOE). It normalizes data into an internal format and publishes to an Aeron media driver (shared memory IPC). Tier 2: Aggregation Servers consume from Aeron and compute derived data (NBBO — National Best Bid/Offer — by comparing quotes across all exchanges for the same symbol). Tier 3: Edge Servers maintain WebSocket connections to retail clients, filtering the full market data stream to deliver only symbols in each client's watchlist. Edge servers use consistent hashing by symbol to partition the data load.
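The NBBO computation in Tier 2 is simply a max over bids and a min over asks across venues. A sketch with made-up quote values:

```python
# Per-exchange quotes for one symbol (illustrative values).
quotes = {
    "NASDAQ": {"bid": 187.42, "ask": 187.45},
    "NYSE":   {"bid": 187.43, "ask": 187.46},
    "CBOE":   {"bid": 187.41, "ask": 187.44},
}

# National Best Bid = highest bid; National Best Offer = lowest ask.
best_bid_venue, best_bid = max(
    ((v, q["bid"]) for v, q in quotes.items()), key=lambda x: x[1])
best_ask_venue, best_ask = min(
    ((v, q["ask"]) for v, q in quotes.items()), key=lambda x: x[1])

print(f"NBBO: {best_bid} ({best_bid_venue}) x {best_ask} ({best_ask_venue})")
```

In production this recomputes on every quote update for the affected symbol, so keeping per-venue bests in a small fixed-size array per symbol matters more than the reduction itself.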

Risk & Margin Engine

The Risk Engine performs pre-trade and real-time risk calculations. Pre-trade checks run synchronously before order routing: buying power validation (cash + margin capacity - pending orders >= order value), position concentration limits (no single position >30% of portfolio), and restricted securities list check. Real-time risk runs continuously during market hours: portfolio margin is recalculated on every price tick using TIMS (Theoretical Intermarket Margin System) methodology, computing the worst-case portfolio loss across 16 market scenarios. Margin calls are triggered when account equity drops below maintenance margin (25% for long, 30% for short). The engine uses vectorized SIMD instructions for scenario calculations across thousands of positions.
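The two halves of the risk engine can be sketched as follows. The function names, argument shapes, and uniform-shock scenarios are assumptions — real TIMS moves asset classes independently and applies offsets — but the buying-power and concentration formulas mirror the text:

```python
def pre_trade_check(cash, margin_capacity, pending_value, order_value,
                    position_value, portfolio_value, restricted, symbol):
    """Synchronous pre-trade checks: restricted list, buying power,
    30% single-position concentration limit."""
    if symbol in restricted:
        return False, "restricted security"
    if cash + margin_capacity - pending_value < order_value:
        return False, "insufficient buying power"
    if (position_value + order_value) / (portfolio_value + order_value) > 0.30:
        return False, "concentration limit exceeded"
    return True, "accepted"

def scenario_margin(positions, moves):
    """TIMS-style worst-case portfolio loss across price-move scenarios.
    positions: {symbol: (signed_qty, price)}; moves: fractional shocks."""
    worst = min(
        sum(qty * price * m for qty, price in positions.values())
        for m in moves)
    return max(0.0, -worst)   # margin requirement = worst-case loss

ok, reason = pre_trade_check(10_000, 10_000, 0, 5_000,
                             1_000, 50_000, set(), "AAPL")
margin = scenario_margin({"AAPL": (100, 190.0), "TSLA": (-50, 250.0)},
                         [-0.15, 0.0, 0.15])
```

The short TSLA leg partially hedges the long AAPL leg, so the worst-case loss (and hence the margin requirement) is smaller than the gross exposure would suggest — the main point of portfolio margining over position-by-position margin.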

Database Design

The Order Book is maintained entirely in memory within the matching engine — it is not stored in a traditional database. The binary journal (write-ahead log) persists every order and execution for recovery and audit. Journal entries are written to NVMe SSDs with O_DIRECT to bypass OS page cache, achieving 1M writes/sec. Post-trade, orders and executions are written to TimescaleDB partitioned by trade_date for historical queries and regulatory reporting.

User accounts and portfolios use PostgreSQL: accounts (account_id, user_id, account_type CASH/MARGIN, buying_power, equity, margin_requirement), positions (account_id, symbol, quantity, average_cost, market_value, unrealized_pnl), orders (order_id, account_id, symbol, side BUY/SELL, order_type, quantity, price, filled_quantity, status, timestamps). A Redis cache stores real-time position snapshots updated on every fill for fast portfolio page rendering.
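The per-fill position update that feeds the Redis snapshot is small but easy to get wrong. A sketch for the long/BUY case, using field names that mirror the positions table above (short positions and realized P&L on sells are omitted):

```python
def apply_buy_fill(position, fill_qty, fill_price):
    """Blend a BUY fill into the position's average cost."""
    total_cost = (position["quantity"] * position["average_cost"]
                  + fill_qty * fill_price)
    position["quantity"] += fill_qty
    position["average_cost"] = total_cost / position["quantity"]
    return position

def unrealized_pnl(position, last_price):
    """Mark the position to the latest trade price."""
    return (last_price - position["average_cost"]) * position["quantity"]

pos = {"quantity": 100, "average_cost": 50.0}
apply_buy_fill(pos, 100, 60.0)          # average cost blends to 55.0
pnl = unrealized_pnl(pos, 58.0)
```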

API Design

  • POST /v1/orders — Place an order; body contains symbol, side (BUY/SELL), order_type (MARKET/LIMIT/STOP/STOP_LIMIT), quantity, price (for limit orders), time_in_force (DAY/GTC/IOC/FOK); returns order_id, status
  • DELETE /v1/orders/{order_id} — Cancel a pending order; returns cancellation status
  • GET /v1/marketdata/quotes/{symbol} — REST endpoint for current quote (bid, ask, last, volume); for streaming use WebSocket
  • WS /v1/marketdata/stream — WebSocket endpoint; client sends subscribe/unsubscribe messages for symbols; server streams real-time quotes and trades
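A hypothetical request body for POST /v1/orders, built from the fields listed above — the exact field names and validation rules are assumptions, since the article only enumerates the fields:

```python
import json

order = {
    "symbol": "AAPL",
    "side": "BUY",
    "order_type": "LIMIT",
    "quantity": 100,
    "price": 187.50,          # required for LIMIT / STOP_LIMIT orders
    "time_in_force": "DAY",
}

def validate(order):
    """Minimal client-side validation of the documented enums."""
    assert order["side"] in {"BUY", "SELL"}
    assert order["order_type"] in {"MARKET", "LIMIT", "STOP", "STOP_LIMIT"}
    assert order["time_in_force"] in {"DAY", "GTC", "IOC", "FOK"}
    if order["order_type"] in {"LIMIT", "STOP_LIMIT"}:
        assert "price" in order, "limit-style orders need a price"
    return True

validate(order)
print(json.dumps(order))
```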

Scaling & Bottlenecks

The matching engine is intentionally single-threaded per symbol — scaling is achieved by sharding symbols across multiple engine instances (e.g., A-F on engine 1, G-M on engine 2). Each engine handles ~1,000 symbols. The opening bell surge is the primary stress point: pre-market accumulated orders flood in simultaneously. The engine uses a call auction mechanism for the opening — accumulating orders for 30 seconds and then computing a single opening price that maximizes matched volume, rather than matching continuously.
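The opening call auction reduces to a one-dimensional search: for each candidate price, demand is the quantity of buys willing to pay at least that price, supply is the quantity of sells willing to accept at most that price, and the clearing price maximizes the matched volume. A sketch (tie-breaking rules between equal-volume prices are omitted):

```python
def opening_auction_price(buys, sells):
    """Single clearing price maximizing matched volume.
    buys/sells: lists of (limit_price, qty)."""
    prices = sorted({p for p, _ in buys} | {p for p, _ in sells})
    best_price, best_volume = None, 0
    for price in prices:
        demand = sum(q for p, q in buys if p >= price)   # buyers at or above
        supply = sum(q for p, q in sells if p <= price)  # sellers at or below
        volume = min(demand, supply)
        if volume > best_volume:
            best_price, best_volume = price, volume
    return best_price, best_volume

price, volume = opening_auction_price(
    buys=[(101, 100), (100, 200)],
    sells=[(99, 150), (100, 100)])
# Clearing at 100 matches 250 shares — more than at 99 or 101.
```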

Market data distribution is the bandwidth bottleneck. 800K messages/sec at 100 bytes each = 80MB/sec of raw data, but unfiltered fanout to 2M clients would require 160TB/sec. Symbol-based filtering at the edge tier cuts actual fanout by more than 99% (the average user watches 20 of 8,000 symbols). Edge servers are scaled horizontally with consistent hashing — adding a server requires rebalancing subscriptions, handled by a subscription migration protocol.
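The consistent-hashing scheme for mapping symbols to edge servers can be sketched as a ring with virtual nodes (the class name and vnode count are illustrative):

```python
import hashlib
from bisect import bisect

class ConsistentHashRing:
    """Ring mapping symbols to edge servers. Virtual nodes smooth the
    distribution; adding a server only moves the symbols that land in
    its new ring segments, keeping rebalancing incremental."""

    def __init__(self, servers, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{s}#{i}"), s)
            for s in servers for i in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, symbol):
        # First vnode clockwise from the symbol's hash owns it.
        idx = bisect(self.keys, self._hash(symbol)) % len(self.keys)
        return self.ring[idx][1]

ring = ConsistentHashRing(["edge-1", "edge-2", "edge-3"])
owner = ring.server_for("AAPL")   # deterministic across calls and hosts
```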

Key Trade-offs

  • Single-threaded matching engine over multi-threaded: Eliminates lock contention and achieves deterministic microsecond latency, but limits per-symbol throughput — acceptable because even the most active symbols rarely exceed 50K orders/sec
  • In-memory order book with journal over database-backed: Memory provides microsecond access but risks data loss — the binary journal to NVMe with hot standby replication provides durability with 50ms failover
  • FIX protocol for institutional connectivity over REST: FIX is the industry standard with guaranteed delivery semantics, but its tag-value format is verbose and complex to implement — a FIX-to-internal protocol gateway abstracts this from the core platform
  • T+1 batch settlement over real-time settlement: Batch clearing via NSCC is the regulatory standard and allows netting (reducing settlement volume by 98%), but creates counterparty risk during the settlement window — mitigated by NSCC's guarantee fund
