System Design: Amazon E-Commerce Platform
Deep dive into designing Amazon's e-commerce platform at scale, covering product catalog, checkout, fulfillment, and personalization systems serving 300 million active customers.
Requirements
Functional Requirements:
- Users can search and browse a catalog of 350+ million products
- Users can add items to cart, apply coupons, and complete checkout
- Sellers can list products, manage inventory, and fulfill orders
- Personalized product recommendations on every page
- Real-time order tracking from placement to delivery
- Customer reviews and ratings with verified purchase badges
Non-Functional Requirements:
- 300M active customers, 50M DAU; peak 50K orders/sec during Prime Day
- Product page load under 200ms (p99); checkout flow under 500ms end-to-end
- 99.999% availability for the checkout path — downtime costs $220K/minute
- Strong consistency for inventory and payments; eventual consistency for reviews and recommendations
- Support 20+ regional marketplaces with localized pricing and tax rules
Scale Estimation
With 50M DAU browsing an average of 30 pages each, the system serves 1.5 billion page views/day, or ~17,400 requests/sec on average. Search: 20M of those daily users search, at 5 queries each, for 100M searches/day, or ~1,160 QPS. Orders: 1.5M orders/day on average (~17 orders/sec), spiking to 50K/sec during Prime Day. Product catalog: 350M products × 2KB of metadata = 700GB of base data; images, at 5 per product and 200KB each, add 350TB of assets. The recommendation engine processes 500M click events/day for model training.
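The arithmetic above can be reproduced in a few lines. This is only a back-of-the-envelope sketch: the inputs are the figures quoted in the text, the variable names are illustrative, and storage uses decimal units (1KB = 1,000 bytes).

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(events_per_day: float) -> float:
    """Convert a daily total into an average per-second rate."""
    return events_per_day / SECONDS_PER_DAY


# Traffic inputs taken from the text.
page_views_per_day = 50_000_000 * 30   # 50M DAU x 30 pages
searches_per_day = 20_000_000 * 5      # 20M searchers x 5 queries
orders_per_day = 1_500_000

page_rps = per_second(page_views_per_day)        # about 17,361
search_qps = per_second(searches_per_day)        # about 1,157
avg_orders_per_sec = per_second(orders_per_day)  # about 17.4

# Storage inputs taken from the text (decimal units).
catalog_bytes = 350_000_000 * 2_000          # 700 GB of metadata
image_bytes = 350_000_000 * 5 * 200_000      # 350 TB of images

# How far the Prime Day order peak sits above the daily average.
peak_to_avg_orders = 50_000 / avg_orders_per_sec  # about 2,880x
```

The last figure is the one that drives capacity planning: the quoted order peak is roughly three orders of magnitude above the average order rate, so the checkout path has to be sized for the peak, not the mean.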
High-Level Architecture
Amazon's architecture is the canonical example of microservices at scale — reportedly over 1,000 services in production. The customer-facing path flows: CloudFront CDN → Application Load Balancer → API Gateway (authentication, rate limiting, routing) → individual microservices. The Product Catalog Service reads from a distributed document store (DynamoDB) with an Elasticsearch cluster for search. The Cart Service uses an in-memory store (ElastiCache Redis) for active carts with DynamoDB as a durable backing store. The Order Service orchestrates checkout via a saga pattern across Inventory, Payment, and Fulfillment services.
The seller-facing path uses a separate API Gateway routing to Seller Central services. Inventory updates from sellers flow through an SQS queue to the Inventory Service, which maintains stock counts in a DynamoDB table with conditional writes to prevent overselling. Price changes propagate through an SNS topic to downstream consumers including the Search Index, Recommendation Service, and the Buy Box algorithm.
A separate analytics plane powered by Kinesis Data Streams ingests all clickstream data into S3 data lakes, feeding Spark-based ML pipelines for recommendation model training. The trained models are deployed to SageMaker endpoints serving real-time inference for the Recommendation Service.
Core Components
Product Catalog Service
The catalog stores 350M+ products in DynamoDB with product_id as the partition key. Each item contains title, description, bullet points, category path, seller_id, and a map-typed attributes field for category-specific attributes (e.g., screen size for electronics). DynamoDB has no JSONB type; nested maps and lists fill that role. The catalog supports multi-tenant access: Amazon retail and third-party sellers write to the same store. An Elasticsearch cluster (100+ nodes) indexes product data for full-text search with field boosting on title (3x), brand (2x), and description (1x). Search results are re-ranked by a buy-box-aware algorithm factoring in price, seller rating, and fulfillment method.
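As an illustration of the field boosting described above, the following sketch builds an Elasticsearch query body using the standard `multi_match` query and the `field^boost` syntax. The function name, the `category_id` filter field, and the default page size are assumptions made for the example; the buy-box re-ranking step is not shown.

```python
from typing import Optional


def build_search_query(text: str, category_id: Optional[str] = None,
                       size: int = 20) -> dict:
    """Build a query body with title boosted 3x and brand 2x."""
    text_query = {
        "multi_match": {
            "query": text,
            # "field^N" multiplies that field's score contribution by N.
            # An unboosted field uses the default boost of 1.
            "fields": ["title^3", "brand^2", "description"],
        }
    }
    body = {"size": size, "query": {"bool": {"must": [text_query]}}}
    if category_id is not None:
        # Filters narrow the result set without affecting relevance scores.
        # The category_id field name is hypothetical.
        body["query"]["bool"]["filter"] = [
            {"term": {"category_id": category_id}}
        ]
    return body
```

The returned dictionary would be sent as the body of a search request; keeping the category restriction in `filter` rather than `must` lets Elasticsearch cache it and keeps it out of scoring.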
Checkout & Payment Service
Checkout is implemented as a distributed saga spanning 6 services: (1) Cart Validation: verify all items are still available and prices haven't changed; (2) Address Service: validate the shipping address and calculate shipping options; (3) Tax Service: compute taxes based on nexus rules; (4) Inventory Reservation: place a soft hold using DynamoDB conditional writes with a 10-minute TTL; (5) Payment Authorization: tokenized card authorization via a PCI-compliant payment gateway; (6) Order Creation: write the confirmed order. If any step fails, compensating transactions undo the completed steps in reverse order. Because DynamoDB TTL deletion runs asynchronously and can lag the expiry time, readers should also treat any hold past its expiry timestamp as released. The saga coordinator uses a Step Functions-style state machine persisted in DynamoDB.
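The compensation behavior can be sketched with a small in-memory coordinator. This is a simplified illustration, not the production design: the `Step`, `SagaFailed`, `run_saga`, and `demo` names are hypothetical, only three of the six steps are modeled, and state lives in memory where the text describes a state machine persisted in DynamoDB.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]      # forward operation
    compensate: Callable[[dict], None]  # undo for a completed action


class SagaFailed(Exception):
    """Raised after all completed steps have been compensated."""


def run_saga(steps: List[Step], ctx: dict) -> None:
    completed: List[Step] = []
    for step in steps:
        try:
            step.action(ctx)
            completed.append(step)
        except Exception as exc:
            # Undo completed steps in reverse order; the failed step
            # itself never completed, so it is not compensated.
            for done in reversed(completed):
                done.compensate(ctx)
            raise SagaFailed(step.name) from exc


def demo(payment_ok: bool) -> List[str]:
    log: List[str] = []

    def authorize(ctx: dict) -> None:
        if not payment_ok:
            raise RuntimeError("card declined")
        log.append("authorize_payment")

    steps = [
        Step("reserve_inventory",
             lambda c: log.append("reserve_inventory"),
             lambda c: log.append("release_inventory")),
        Step("authorize_payment", authorize,
             lambda c: log.append("void_authorization")),
        Step("create_order",
             lambda c: log.append("create_order"),
             lambda c: log.append("cancel_order")),
    ]
    try:
        run_saga(steps, {})
    except SagaFailed:
        log.append("saga_failed")
    return log
```

Calling `demo(False)` shows the inventory hold being released when payment fails. In a real deployment each action and compensation would also need to be idempotent, since a coordinator that crashes mid-saga will retry steps after recovery.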
Recommendation Engine
Amazon's recommendation engine uses a hybrid approach combining collaborative filtering (item-to-item via matrix factorization) and content-based signals (product attribute embeddings). The offline pipeline runs on Spark, processing 500M daily click events to retrain models every 4 hours. Online serving uses a two-stage pipeline: a fast retrieval layer using ANN search over item embeddings (FAISS on GPU instances) retrieves 500 candidates, then a ranking model (gradient-boosted trees on features like purchase history, browsing context, price sensitivity) selects the top 20. Results are cached per user in Redis with a 15-minute TTL.
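The two-stage shape of the online pipeline can be sketched as follows. Note the substitutions: exact cosine similarity over a dictionary stands in for approximate nearest-neighbor search (FAISS in the text), and an arbitrary scoring callable stands in for the gradient-boosted ranking model. All function names here are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(user_vec: Sequence[float],
             item_vecs: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Stage 1: cheap similarity search narrows the catalog to k candidates.

    Brute force is used here for clarity; an ANN index would replace it.
    """
    ordered = sorted(item_vecs,
                     key=lambda item: cosine(user_vec, item_vecs[item]),
                     reverse=True)
    return ordered[:k]


def rank(candidates: List[str], score_fn: Callable[[str], float],
         n: int) -> List[str]:
    """Stage 2: a more expensive model scores only the candidates."""
    return sorted(candidates, key=score_fn, reverse=True)[:n]


def recommend(user_vec: Sequence[float],
              item_vecs: Dict[str, Sequence[float]],
              score_fn: Callable[[str], float],
              k: int = 500, n: int = 20) -> List[str]:
    return rank(retrieve(user_vec, item_vecs, k), score_fn, n)
```

The split matters because the ranking model is too costly to run over the full catalog, while the retrieval stage is cheap but coarse; the defaults of 500 candidates and 20 results mirror the numbers in the text.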
Database Design
The product catalog uses DynamoDB with a GSI on category_id for browse-tree navigation. The Orders table uses order_id as the partition key with a GSI on customer_id + created_at for order history queries. Inventory is stored in a DynamoDB table with product_id + fulfillment_center_id as the composite key; stock counts use atomic counters with conditional updates (decrement only if count >= requested quantity).
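The conditional decrement can be sketched in two parts: a function that builds the parameters for a DynamoDB `update_item` call, and a small in-memory stand-in that mimics the same check-then-decrement semantics so the logic can be exercised without AWS. The attribute name `stock` and the class and function names are assumptions for illustration.

```python
import threading
from typing import Dict, Tuple


def build_decrement_request(product_id: str, fc_id: str,
                            quantity: int) -> dict:
    """Parameters for a boto3 Table.update_item call (not executed here).

    If the condition is false, DynamoDB rejects the write with
    ConditionalCheckFailedException and the item is left unchanged.
    """
    return {
        "Key": {"product_id": product_id, "fulfillment_center_id": fc_id},
        "UpdateExpression": "SET #s = #s - :q",
        "ConditionExpression": "#s >= :q",
        # "#s" is a placeholder for the hypothetical "stock" attribute.
        "ExpressionAttributeNames": {"#s": "stock"},
        "ExpressionAttributeValues": {":q": quantity},
    }


class InMemoryInventory:
    """Stand-in that applies the same condition atomically under a lock."""

    def __init__(self) -> None:
        self._stock: Dict[Tuple[str, str], int] = {}
        self._lock = threading.Lock()

    def set_stock(self, product_id: str, fc_id: str, count: int) -> None:
        with self._lock:
            self._stock[(product_id, fc_id)] = count

    def get_stock(self, product_id: str, fc_id: str) -> int:
        with self._lock:
            return self._stock.get((product_id, fc_id), 0)

    def reserve(self, product_id: str, fc_id: str, quantity: int) -> bool:
        """Decrement only if enough stock remains; never goes negative."""
        key = (product_id, fc_id)
        with self._lock:
            current = self._stock.get(key, 0)
            if current < quantity:
                return False
            self._stock[key] = current - quantity
            return True
```

The key property is that the check and the decrement happen as one atomic operation, so two concurrent buyers cannot both claim the last unit.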
For the search index, Elasticsearch stores a denormalized product document including seller info, pricing, and availability. The index is updated via a CDC pipeline from DynamoDB Streams → Lambda → Elasticsearch, with average indexing lag under 5 seconds. Customer data (profiles, addresses, payment methods) resides in a separate encrypted RDS PostgreSQL cluster with field-level encryption for PII columns.
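The transformation step in the CDC pipeline might look like the sketch below, which turns a DynamoDB Streams event into the action and document lines expected by the Elasticsearch Bulk API. It assumes the stream is configured to include new item images, handles only a subset of DynamoDB's typed-JSON tags, and uses a hypothetical index name `products`; sending the request and handling errors are omitted.

```python
import json
from typing import Any, Dict, List


def from_dynamodb_json(value: Dict[str, Any]) -> Any:
    """Unwrap one DynamoDB typed value, e.g. {"S": "abc"} -> "abc"."""
    (tag, raw), = value.items()
    if tag == "S":
        return raw
    if tag == "N":  # numbers arrive as strings
        try:
            return int(raw)
        except ValueError:
            return float(raw)
    if tag == "BOOL":
        return raw
    if tag == "NULL":
        return None
    if tag == "M":
        return {k: from_dynamodb_json(v) for k, v in raw.items()}
    if tag == "L":
        return [from_dynamodb_json(v) for v in raw]
    raise ValueError(f"unsupported DynamoDB type tag: {tag}")


def to_bulk_lines(event: Dict[str, Any], index: str = "products") -> List[str]:
    """Map stream records to Bulk API lines (one JSON object per line)."""
    lines: List[str] = []
    for record in event.get("Records", []):
        change = record["dynamodb"]
        doc_id = from_dynamodb_json(change["Keys"]["product_id"])
        if record["eventName"] == "REMOVE":
            lines.append(json.dumps({"delete": {"_index": index, "_id": doc_id}}))
        else:  # INSERT or MODIFY: reindex the full new image
            doc = {k: from_dynamodb_json(v)
                   for k, v in change["NewImage"].items()}
            lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
            lines.append(json.dumps(doc))
    return lines
```

The request body would be `"\n".join(lines) + "\n"`, since the Bulk API requires newline-delimited JSON with a trailing newline. Using the product_id as the document `_id` makes replays of the same stream record idempotent.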
API Design
- GET /api/v1/products/search?q={query}&category={id}&sort=relevance&page=1&size=20: Search products with faceted filtering; returns ranked results with Buy Box winner
- POST /api/v1/cart/items: Add item to cart; body contains product_id, quantity, seller_id; returns updated cart with price breakdown
- POST /api/v1/orders/checkout: Initiate checkout saga; body contains cart_id, shipping_address_id, payment_method_id; returns order_id and saga status
- GET /api/v1/orders/{order_id}/tracking: Real-time order tracking with fulfillment status and estimated delivery
Scaling & Bottlenecks
Prime Day scaling is Amazon's defining challenge — traffic increases 10x over baseline. The system uses pre-warming: auto-scaling groups are scaled up 2 hours before the event, DynamoDB tables switch to on-demand capacity mode, and CDN cache is pre-populated with deal page assets. The checkout path uses a separate reserved capacity pool isolated from browse traffic to ensure orders complete even under extreme load.
The inventory hot-partition problem occurs when a viral product concentrates all writes on a single DynamoDB partition key. Amazon mitigates this with write sharding: the inventory count for hot items is split across N shard keys (e.g., product_123_shard_0 through product_123_shard_9), with reads aggregating across all shards. This spreads write throughput across partitions at the cost of more expensive reads.
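A minimal in-memory sketch of write sharding follows. A plain dictionary stands in for the DynamoDB table, the shard key format matches the example in the text, and the class and method names are hypothetical. Each reservation applies the conditional decrement to one randomly chosen shard and falls back to the others if that shard lacks stock.

```python
import random
from typing import Dict, Optional


class ShardedInventory:
    def __init__(self, num_shards: int = 10,
                 rng: Optional[random.Random] = None) -> None:
        self.num_shards = num_shards
        self._rng = rng or random.Random()
        self._table: Dict[str, int] = {}  # shard key -> count

    def _shard_key(self, product_id: str, shard: int) -> str:
        return f"{product_id}_shard_{shard}"

    def load(self, product_id: str, total: int) -> None:
        """Spread the initial stock as evenly as possible across shards."""
        base, extra = divmod(total, self.num_shards)
        for shard in range(self.num_shards):
            count = base + (1 if shard < extra else 0)
            self._table[self._shard_key(product_id, shard)] = count

    def reserve(self, product_id: str, quantity: int) -> bool:
        """Try shards in random order; decrement the first with enough stock."""
        shards = list(range(self.num_shards))
        self._rng.shuffle(shards)
        for shard in shards:
            key = self._shard_key(product_id, shard)
            if self._table.get(key, 0) >= quantity:
                self._table[key] -= quantity
                return True
        return False

    def total(self, product_id: str) -> int:
        """Reads must aggregate every shard, which is the added read cost."""
        return sum(self._table.get(self._shard_key(product_id, s), 0)
                   for s in range(self.num_shards))
```

One caveat of this simple scheme: a request for a quantity larger than any single shard's remaining stock fails even when the total across shards would cover it, so a fuller implementation would need to split the reservation across shards or rebalance them.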
Key Trade-offs
- DynamoDB over relational DB for catalog: Schemaless design handles 350M products with heterogeneous attributes, but sacrifices complex query capability — Elasticsearch fills that gap
- Saga over 2PC for checkout: Saga with compensating transactions allows each service to scale independently and tolerate partial failures, but requires idempotent operations and careful failure handling
- Write sharding for hot inventory items: Distributing counts across shards prevents hot partition throttling during flash sales, but aggregation on reads adds latency
- 4-hour recommendation model refresh: Balances freshness with compute cost; real-time signals (current session clicks) are injected at serving time to compensate for model staleness