System Design: Shopify (Multi-tenant E-Commerce)
System design of Shopify's multi-tenant e-commerce platform, covering tenant isolation, storefront rendering, checkout at scale, and the unique challenges of serving 2 million merchants.
Requirements
Functional Requirements:
- Merchants create customizable online storefronts with themes and apps
- Multi-channel selling: online store, POS, social media, marketplaces
- Product catalog management, inventory tracking across locations
- Checkout with support for 100+ payment gateways
- App ecosystem: third-party developers extend functionality via APIs
- Admin dashboard with analytics, order management, and customer data
Non-Functional Requirements:
- 2M+ active merchants, 500M+ unique shoppers visiting Shopify stores
- Checkout must handle 40K+ checkouts/minute during flash sales (BFCM peak)
- 99.99% availability — merchant revenue depends on uptime
- Tenant isolation: one merchant's traffic spike must not affect other merchants
- Sub-500ms storefront page load; checkout completion under 3 seconds
Scale Estimation
With 2M merchants averaging 250 daily visitors each: 500M unique visitors/day. Page views: 500M × 4 pages = 2B page views/day ≈ 23,148 views/sec. Checkouts: 5M orders/day on average, spiking to 40K/min during BFCM = 667 checkouts/sec at peak. Product catalog: 2M merchants × 500 products average = 1B products. Storage: product metadata at 2KB × 1B products = 2TB; product images at 10 per product × 200KB × 1B products = 2PB.
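These numbers are simple enough to sanity-check mechanically; a throwaway Go sketch, with every constant taken from the paragraph above:

```go
package main

import "fmt"

func main() {
	const (
		merchants  = 2_000_000
		visitors   = merchants * 250         // 500M unique visitors/day
		pageViews  = visitors * 4            // 2B page views/day
		secPerDay  = 86_400
		ordersDay  = 5_000_000
		bfcmPerMin = 40_000
		products   = merchants * 500         // 1B products
		metaBytes  = products * 2_000        // 2KB metadata per product
		imageBytes = products * 10 * 200_000 // 10 images at 200KB each
	)
	fmt.Printf("page views/sec (avg): %d\n", pageViews/secPerDay) // ~23,148
	fmt.Printf("checkouts/sec (avg):  %d\n", ordersDay/secPerDay) // ~57
	fmt.Printf("checkouts/sec (BFCM): %d\n", bfcmPerMin/60)       // ~667
	fmt.Printf("product metadata: %.1f TB\n", float64(metaBytes)/1e12)  // 2.0
	fmt.Printf("product images:   %.1f PB\n", float64(imageBytes)/1e15) // 2.0
}
```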
High-Level Architecture
Shopify's architecture uses a pod-based multi-tenancy model. Merchants are assigned to pods — each pod is a self-contained deployment containing application servers (Ruby on Rails), a MySQL database cluster (sharded by shop_id), Redis caches, and job workers. This pod architecture provides blast radius containment: if a pod goes down, only the merchants on that pod are affected, not the entire platform. Pods are deployed across multiple regions with active-active replication.
Storefront rendering uses a two-tier approach. Liquid (Shopify's templating language) templates are rendered server-side by the Storefront Renderer, a Go-based service that replaced the original Rails renderer with a roughly 10x performance improvement. Static assets and rendered pages are cached at the CDN edge (Cloudflare) with cache keys incorporating the shop domain, template version, and product inventory hash. Cache invalidation is event-driven: product updates, inventory changes, and theme edits emit events to a Kafka topic consumed by a CDN Purge Service.
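A minimal sketch of the purge path, with a Go channel standing in for the Kafka topic; the event shape, key scheme, and purge call are assumptions of this sketch, not Shopify's actual internals:

```go
package main

import (
	"fmt"
	"strings"
)

// InvalidationEvent is a hypothetical shape for the events emitted on
// product updates, inventory changes, and theme edits.
type InvalidationEvent struct {
	ShopDomain string // e.g. "example.myshopify.com"
	Kind       string // "product.updated", "inventory.changed", "theme.edited"
	ProductID  int64  // zero for theme-wide events
}

// cacheKeysFor maps an event to the CDN cache keys to purge. Theme edits
// invalidate the whole shop; product events invalidate only the pages
// that can contain the product.
func cacheKeysFor(ev InvalidationEvent) []string {
	if strings.HasPrefix(ev.Kind, "theme.") {
		return []string{ev.ShopDomain + "/*"}
	}
	return []string{
		fmt.Sprintf("%s/products/%d", ev.ShopDomain, ev.ProductID),
		ev.ShopDomain + "/collections/*", // listing pages may embed the product
	}
}

func main() {
	events := make(chan InvalidationEvent, 16) // stand-in for the Kafka topic
	events <- InvalidationEvent{ShopDomain: "example.myshopify.com", Kind: "product.updated", ProductID: 42}
	close(events)

	for ev := range events {
		for _, key := range cacheKeysFor(ev) {
			fmt.Println("PURGE", key) // the real service would call the CDN purge API here
		}
	}
}
```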
Checkout is the most critical path and runs on dedicated infrastructure (Checkout Service) isolated from storefront traffic. The Checkout Service orchestrates: cart validation → shipping rate calculation → tax computation → payment processing (via Shopify Payments or 100+ third-party gateways) → order creation. Each step is idempotent, allowing safe retries on failure.
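The idempotency contract is the important part: each step records completion under the checkout token, so a retry resumes where the last attempt stopped instead of double-charging. A hedged Go sketch, where step names follow the pipeline above but the completion store and step bodies are invented for illustration:

```go
package main

import "fmt"

// step is one stage of the checkout pipeline. Names and shapes here are
// illustrative, not Shopify's actual internals.
type step struct {
	name string
	run  func(token string) error
}

// completed records which steps have finished per checkout token; in
// production this would live in durable storage, keyed by an idempotency key.
var completed = map[string]bool{}

// runCheckout executes each step at most once per token, so a retry after
// a crash or timeout resumes rather than repeating earlier steps.
func runCheckout(token string, steps []step) error {
	for _, s := range steps {
		key := token + ":" + s.name
		if completed[key] {
			continue // already done on a previous attempt; safe to skip
		}
		if err := s.run(token); err != nil {
			return fmt.Errorf("%s failed: %w", s.name, err)
		}
		completed[key] = true
	}
	return nil
}

func main() {
	steps := []step{
		{"validate_cart", func(string) error { return nil }},
		{"shipping_rates", func(string) error { return nil }},
		{"compute_tax", func(string) error { return nil }},
		{"charge_payment", func(string) error { return nil }}, // gateway call must also accept an idempotency key
		{"create_order", func(string) error { return nil }},
	}
	_ = runCheckout("chk_123", steps) // a second call with the same token is a no-op
	fmt.Println("checkout complete")
}
```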
Core Components
Pod-Based Multi-Tenancy
Each pod hosts approximately 10,000 merchants on a dedicated MySQL cluster (primary + 2 read replicas). The routing layer (a global load balancer) maps shop domains to pods using a configuration stored in a globally-replicated key-value store (Consul). When a pod reaches capacity, a live migration tool moves merchants to a new pod by replicating their MySQL data, switching the routing entry, and draining old connections — all with zero downtime. Noisy neighbor detection monitors per-merchant query rates and automatically rate-limits abusive traffic.
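The routing cut-over during migration is just an atomic swap of one KV entry. A sketch with an in-memory map standing in for the Consul-backed store (type and method names are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// podRouter maps shop domains to pod IDs. An in-memory map stands in for
// the globally-replicated key-value store described above; atomically
// swapping one entry is what makes live-migration cut-over instant.
type podRouter struct {
	mu     sync.RWMutex
	byShop map[string]string // domain -> pod ID, e.g. "pod-17"
}

func (r *podRouter) lookup(domain string) (string, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	pod, ok := r.byShop[domain]
	return pod, ok
}

// reassign flips the routing entry after a migration has finished
// replicating the shop's MySQL data to the target pod.
func (r *podRouter) reassign(domain, newPod string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.byShop[domain] = newPod
}

func main() {
	r := &podRouter{byShop: map[string]string{"example.myshopify.com": "pod-17"}}
	pod, _ := r.lookup("example.myshopify.com")
	fmt.Println("route to", pod) // pod-17
	r.reassign("example.myshopify.com", "pod-42")
	pod, _ = r.lookup("example.myshopify.com")
	fmt.Println("route to", pod) // pod-42
}
```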
Storefront Renderer
The Storefront Renderer compiles Liquid templates into an intermediate representation cached in Redis. At request time, the renderer fetches the compiled template and product data (from MySQL via a read-through Redis cache), executes the template, and returns HTML. For high-traffic stores, entire page renders are cached at the CDN with stale-while-revalidate semantics: the CDN serves a slightly stale page while asynchronously fetching a fresh render in the background, ensuring sub-100ms response times.
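Stale-while-revalidate decouples response time from render time. A hand-rolled Go sketch of the semantics (in reality this is CDN configuration, not application code; names and the 30-second max age are made up):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	html       string
	renderedAt time.Time
	refreshing bool
}

// swrCache serves cached renders immediately; once an entry is older than
// maxAge it kicks off a single background re-render instead of making the
// shopper wait for a fresh page.
type swrCache struct {
	mu     sync.Mutex
	pages  map[string]*entry
	maxAge time.Duration
	render func(key string) string // calls the Storefront Renderer
}

func (c *swrCache) Get(key string) string {
	c.mu.Lock()
	e, ok := c.pages[key]
	if ok && time.Since(e.renderedAt) > c.maxAge && !e.refreshing {
		e.refreshing = true
		go c.refresh(key) // revalidate asynchronously; caller gets the stale HTML now
	}
	c.mu.Unlock()
	if ok {
		return e.html
	}
	return c.refreshSync(key) // cold miss: render inline
}

func (c *swrCache) refresh(key string) {
	html := c.render(key)
	c.mu.Lock()
	c.pages[key] = &entry{html: html, renderedAt: time.Now()}
	c.mu.Unlock()
}

func (c *swrCache) refreshSync(key string) string {
	c.refresh(key)
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.pages[key].html
}

func main() {
	c := &swrCache{
		pages:  map[string]*entry{},
		maxAge: 30 * time.Second,
		render: func(key string) string { return "<html>" + key + "</html>" },
	}
	fmt.Println(c.Get("example.myshopify.com/products/42"))
}
```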
App Extension Platform
Shopify's app ecosystem uses a sandboxed execution model. Apps register webhooks for events (orders/create, products/update) and receive HTTP callbacks. For storefront extensions, apps inject UI via Script Tags or the newer App Blocks (rendered in iframes with postMessage communication). The App Proxy allows apps to serve custom pages under the merchant's domain. Rate limiting per app per store (40 requests/sec for REST, 1000 cost points/sec for GraphQL) prevents runaway apps from degrading store performance.
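A leaky-bucket limiter captures the GraphQL cost-point model: each query drains its computed cost, and points restore continuously. The capacity below comes from the numbers above; the restore policy is an assumption of this sketch:

```go
package main

import (
	"fmt"
	"time"
)

// costBucket is a leaky-bucket limiter for GraphQL cost points, scoped
// per app per store.
type costBucket struct {
	capacity   float64 // e.g. 1000 cost points
	restore    float64 // points restored per second
	available  float64
	lastUpdate time.Time
}

func newCostBucket(capacity, restorePerSec float64) *costBucket {
	return &costBucket{capacity: capacity, restore: restorePerSec,
		available: capacity, lastUpdate: time.Now()}
}

// Allow charges the query's computed cost; it returns false (throttled)
// when the bucket cannot cover the cost.
func (b *costBucket) Allow(queryCost float64) bool {
	now := time.Now()
	b.available += now.Sub(b.lastUpdate).Seconds() * b.restore
	if b.available > b.capacity {
		b.available = b.capacity
	}
	b.lastUpdate = now
	if queryCost > b.available {
		return false
	}
	b.available -= queryCost
	return true
}

func main() {
	bucket := newCostBucket(1000, 1000) // 1000-point bucket restoring 1000 points/sec
	fmt.Println(bucket.Allow(600)) // true
	fmt.Println(bucket.Allow(600)) // false: only ~400 points left this instant
}
```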
Database Design
Each pod's MySQL cluster uses a shared-schema model: all merchants share the same tables and are isolated by a shop_id column on every table. Core tables: shops (shop_id, domain, plan, settings), products (product_id, shop_id, title, body_html, vendor, product_type), variants (variant_id, product_id, price, sku, inventory_quantity), orders (order_id, shop_id, customer_id, total_price, financial_status, fulfillment_status, created_at). Indexes are composite with shop_id as the leading column, so every tenant-scoped query seeks directly to its own shop's rows.
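In practice the tenant-isolation rule is mechanical: every query filters on shop_id first, matching a composite index such as a hypothetical (shop_id, product_id). A Go sketch using database/sql (connection string, DSN host, and index are placeholders; table and column names follow the schema above):

```go
package main

import (
	"context"
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql" // MySQL driver
)

// productsForShop always scopes the query by shop_id, the leading column
// of the composite index, so MySQL touches only this tenant's rows.
func productsForShop(ctx context.Context, db *sql.DB, shopID int64, limit int) ([]string, error) {
	rows, err := db.QueryContext(ctx,
		`SELECT title FROM products WHERE shop_id = ? ORDER BY product_id LIMIT ?`,
		shopID, limit)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var titles []string
	for rows.Next() {
		var t string
		if err := rows.Scan(&t); err != nil {
			return nil, err
		}
		titles = append(titles, t)
	}
	return titles, rows.Err()
}

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(pod-17-mysql:3306)/shopify")
	if err != nil {
		panic(err)
	}
	defer db.Close()
	titles, _ := productsForShop(context.Background(), db, 42, 50)
	fmt.Println(titles)
}
```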
A separate analytics data warehouse (ClickHouse) stores aggregated merchant analytics: sales, traffic, conversion rates. Raw event data flows from Kafka into the warehouse with a 5-minute delay. Merchant-facing analytics dashboards query ClickHouse via a caching layer (Redis with 1-minute TTL for dashboard tiles).
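The dashboard tile cache is a plain read-through cache with a short TTL; unlike the storefront's stale-while-revalidate path, an expired tile is recomputed inline, since dashboard readers tolerate an occasional slow load. An in-memory Go sketch standing in for the Redis layer over ClickHouse:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type tile struct {
	json    string
	expires time.Time
}

// tileCache is a read-through cache for dashboard tiles with a 1-minute TTL.
type tileCache struct {
	mu    sync.Mutex
	tiles map[string]tile
	query func(key string) string // runs the ClickHouse aggregation
}

func (c *tileCache) Get(key string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if t, ok := c.tiles[key]; ok && time.Now().Before(t.expires) {
		return t.json // fresh hit
	}
	json := c.query(key) // miss or expired: query the warehouse inline
	c.tiles[key] = tile{json: json, expires: time.Now().Add(time.Minute)}
	return json
}

func main() {
	c := &tileCache{
		tiles: map[string]tile{},
		query: func(key string) string { return `{"sales_today": 1234}` },
	}
	fmt.Println(c.Get("shop:42:tile:sales"))
}
```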
API Design
- GET /admin/api/2024-01/products.json?limit=50&since_id={id} — Fetch merchant's products with cursor pagination (see the pagination sketch below)
- POST /admin/api/2024-01/orders.json — Create an order programmatically (draft orders, POS); body contains line_items, customer, shipping
- POST /api/2024-01/checkouts.json — Create a checkout; returns checkout token and available shipping rates
- POST /admin/api/2024-01/webhooks.json — Register a webhook; body contains topic (e.g., orders/create) and callback URL
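The since_id parameter on the first endpoint drives cursor pagination: each page asks for IDs strictly greater than the last one seen, so a changing catalog never causes skipped or duplicated rows. A hedged Go sketch of a client-side pagination loop (shop domain and token are placeholders; the access-token header follows the Admin API convention):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type product struct {
	ID    int64  `json:"id"`
	Title string `json:"title"`
}

// fetchAllProducts pages through /products.json using since_id as the
// cursor, stopping when a page comes back empty.
func fetchAllProducts(shop, token string) ([]product, error) {
	var all []product
	sinceID := int64(0)
	for {
		url := fmt.Sprintf("https://%s/admin/api/2024-01/products.json?limit=50&since_id=%d", shop, sinceID)
		req, err := http.NewRequest("GET", url, nil)
		if err != nil {
			return nil, err
		}
		req.Header.Set("X-Shopify-Access-Token", token)
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return nil, err
		}
		var page struct {
			Products []product `json:"products"`
		}
		err = json.NewDecoder(resp.Body).Decode(&page)
		resp.Body.Close()
		if err != nil {
			return nil, err
		}
		if len(page.Products) == 0 {
			return all, nil // past the last page
		}
		all = append(all, page.Products...)
		sinceID = page.Products[len(page.Products)-1].ID // advance the cursor
	}
}

func main() {
	products, err := fetchAllProducts("example.myshopify.com", "shpat_...")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("fetched", len(products), "products")
}
```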
Scaling & Bottlenecks
Black Friday / Cyber Monday (BFCM) is Shopify's biggest scaling event — 40K+ checkouts/minute across all merchants. Preparation starts months ahead: capacity planning based on merchant growth projections, pre-provisioning additional pods, and load testing with synthetic traffic at 2x projected peak. The checkout path uses a separate auto-scaling group with pre-warmed instances. Database read replicas are added to pods hosting merchants with known flash sale events.
The multi-tenant noisy neighbor problem is mitigated at multiple layers: per-merchant request rate limiting at the load balancer, per-merchant query quotas at the database level (enforced by a MySQL proxy), and per-merchant background job quotas in the Sidekiq worker fleet. A merchant running a flash sale that generates 10x normal traffic is automatically routed through a high-traffic handling path that raises their rate limits and attaches extra monitoring.
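Detection itself reduces to tracking per-merchant request rates over a short window and flagging sustained overruns rather than one-second blips. A minimal Go sketch, with the threshold and window size invented for illustration:

```go
package main

import (
	"fmt"
	"time"
)

// windowCounter tracks one merchant's request counts in one-second buckets.
// A merchant sustaining more than `limit` requests/sec across the whole
// window is flagged for rate limiting or the high-traffic path.
type windowCounter struct {
	buckets [10]int // ring buffer: the last 10 seconds
	lastSec int64
	limit   int // allowed requests/sec
}

func (w *windowCounter) record(now time.Time) (flagged bool) {
	sec := now.Unix()
	if sec != w.lastSec {
		// zero out any buckets skipped since the previous request
		for s := w.lastSec + 1; s <= sec && s-w.lastSec <= int64(len(w.buckets)); s++ {
			w.buckets[s%int64(len(w.buckets))] = 0
		}
		w.lastSec = sec
	}
	w.buckets[sec%int64(len(w.buckets))]++

	total := 0
	for _, n := range w.buckets {
		total += n
	}
	return total > w.limit*len(w.buckets) // sustained overrun, not a blip
}

func main() {
	w := &windowCounter{limit: 100, lastSec: time.Now().Unix()}
	flagged := false
	for i := 0; i < 1200; i++ { // simulate a burst far above 100 req/sec
		flagged = w.record(time.Now())
	}
	fmt.Println("flagged:", flagged) // true
}
```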
Key Trade-offs
- Pod architecture over single shared cluster: Pod isolation limits blast radius and enables independent scaling, but increases operational complexity — each pod is a mini-infrastructure to manage
- Liquid over JavaScript for templating: Liquid's sandboxed execution prevents malicious or buggy themes from crashing the renderer, but limits the expressiveness available to theme developers
- Shared schema with shop_id column over database-per-tenant: Shared tables simplify migrations and reduce connection overhead, but require careful query planning to prevent cross-tenant data leaks
- CDN caching with event-driven invalidation: Aggressive caching achieves sub-100ms page loads, but complex invalidation logic (product update → purge all pages containing that product) is error-prone and adds operational overhead