System Design: Seller Analytics Dashboard
System design of a seller analytics dashboard providing real-time sales metrics, traffic analysis, and inventory forecasting for e-commerce marketplace sellers.
Requirements
Functional Requirements:
- Real-time sales dashboard: revenue, orders, units sold, average order value
- Traffic analytics: page views, conversion rate, traffic sources, bounce rate
- Product performance: top sellers, slow movers, listing quality scores
- Inventory forecasting: predicted stockout dates based on sales velocity
- Comparison periods: today vs. yesterday, this month vs. last month
- Exportable reports: CSV/PDF download of analytics data
Non-Functional Requirements:
- Support 500K active sellers each with their own dashboard
- Real-time metrics: sales data visible within 30 seconds of order placement
- Dashboard page load under 2 seconds including all chart data
- Historical data retention: 2 years at daily granularity, 90 days at hourly
- 99.9% availability; stale data is acceptable over unavailability
- Cost-efficient: analytics infrastructure cost under $0.10/seller/month
Scale Estimation
- Sellers and orders: 500K sellers generating 5M orders/day = 10 orders/seller/day average
- Event ingestion: each order generates ~10 analytics events (page view, add to cart, checkout start, payment, confirmation, etc.) = 50M events/day = 579 events/sec
- Dashboard queries: 100K sellers viewing dashboards daily × 5 page loads = 500K dashboard queries/day = 5.8 QPS, with each query fetching 8-12 metric tiles
- Historical data: 5M orders/day × 365 days × 2 years = 3.65B order records at 500 bytes = 1.8TB
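These back-of-the-envelope figures can be reproduced directly; a minimal sketch, using only the assumptions stated above:

```python
# Reproduces the scale estimates above; all inputs are the section's stated assumptions.
SELLERS = 500_000
ORDERS_PER_DAY = 5_000_000
EVENTS_PER_ORDER = 10            # page view, add to cart, checkout start, payment, ...
ORDER_RECORD_BYTES = 500
SECONDS_PER_DAY = 86_400

orders_per_seller = ORDERS_PER_DAY / SELLERS                              # 10
events_per_sec = ORDERS_PER_DAY * EVENTS_PER_ORDER / SECONDS_PER_DAY      # ~579
dashboard_qps = 100_000 * 5 / SECONDS_PER_DAY                             # ~5.8
raw_order_tb = ORDERS_PER_DAY * 365 * 2 * ORDER_RECORD_BYTES / 1e12       # ~1.8

print(f"{orders_per_seller:.0f} orders/seller/day, {events_per_sec:.0f} events/sec, "
      f"{dashboard_qps:.1f} dashboard QPS, {raw_order_tb:.2f} TB of raw orders")
```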
High-Level Architecture
The analytics platform follows a lambda architecture with a real-time (speed) layer and a batch layer. The speed layer: order and clickstream events flow from application services into Kafka → a Flink streaming job computes real-time aggregations (running current-day totals by seller_id) → results are written to Redis hashes keyed by metrics:{seller_id}:{date}. The dashboard reads from Redis for today's metrics with sub-second read latency.
The batch layer: a nightly Spark job reads the full event log from S3 (raw events are archived from Kafka to S3 via a Kafka Connect S3 sink) → computes daily and monthly aggregations → writes to ClickHouse (columnar OLAP database) for historical queries. ClickHouse powers the historical comparison features (this month vs. last month) and detailed drill-down queries.
The serving layer: the Dashboard API reads from Redis for real-time metrics and ClickHouse for historical data. Results are cached per seller in a Redis cache layer with a 60-second TTL for dashboard tiles. The frontend uses a React dashboard with WebSocket updates for live sales tickers.
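A rough sketch of that serving-layer read path, assuming the redis-py client (the metrics:{seller_id}:{date} hash and the 60-second tile cache come from this design; the cache key, host name, and the query_clickhouse helper are illustrative):

```python
import json
from datetime import date

import redis  # assumed client; deployment specifics (hosts, auth) are placeholders

r = redis.Redis(host="redis-serving", decode_responses=True)
TILE_CACHE_TTL_SECONDS = 60  # the 60-second dashboard tile cache from this design


def dashboard_tiles(seller_id, query_clickhouse):
    """Assemble dashboard tiles: today's metrics from Redis, history from ClickHouse.

    query_clickhouse is a caller-supplied function (illustrative) that runs the
    historical comparison query against seller_daily_metrics.
    """
    cache_key = f"tiles:{seller_id}"                      # hypothetical cache key
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)

    today = date.today().isoformat()
    realtime = r.hgetall(f"metrics:{seller_id}:{today}")  # speed-layer hash
    history = query_clickhouse(seller_id)                 # batch-layer rollups

    tiles = {"realtime": realtime, "history": history}
    r.setex(cache_key, TILE_CACHE_TTL_SECONDS, json.dumps(tiles))
    return tiles
```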
Core Components
Real-Time Aggregation Pipeline (Flink)
A Flink streaming job consumes from the order-events and clickstream-events Kafka topics. It maintains keyed state (keyed by seller_id) with tumbling windows of 1 minute. Aggregations computed: order_count, revenue_sum, units_sold, unique_visitors (HyperLogLog), page_views, add_to_cart_count, checkout_starts, conversion_rate (orders / unique_visitors). Results are written to Redis every minute. The Flink job uses the RocksDB state backend for fault tolerance, with checkpoints to S3 every 30 seconds. Checkpointing gives exactly-once semantics for Flink's internal state; because the Redis sink applies increments, those writes must be made idempotent or deduplicated to avoid double-counting on replay.
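The production job would be written against Flink's DataStream API in Java or Scala; the sketch below is a plain-Python illustration of what the per-seller window accumulator tracks, not Flink code. Field names mirror the aggregations listed above; the event schema and the exact set used in place of a HyperLogLog are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class SellerWindowAccumulator:
    """Conceptual per-seller state for one 1-minute tumbling window.

    In the real job this lives in Flink keyed state on RocksDB, and unique
    visitors would use a HyperLogLog sketch rather than an exact set.
    """
    order_count: int = 0
    revenue_sum: float = 0.0
    units_sold: int = 0
    page_views: int = 0
    add_to_cart_count: int = 0
    checkout_starts: int = 0
    visitor_ids: set = field(default_factory=set)   # stand-in for HyperLogLog

    def add(self, event: dict) -> None:
        kind = event["type"]                          # assumed event schema
        if kind == "page_view":
            self.page_views += 1
            self.visitor_ids.add(event["visitor_id"])
        elif kind == "add_to_cart":
            self.add_to_cart_count += 1
        elif kind == "checkout_start":
            self.checkout_starts += 1
        elif kind == "order_confirmed":
            self.order_count += 1
            self.revenue_sum += event["revenue"]
            self.units_sold += event["units"]

    def result(self) -> dict:
        visitors = len(self.visitor_ids)
        return {
            "orders": self.order_count,
            "revenue": round(self.revenue_sum, 2),
            "units": self.units_sold,
            "page_views": self.page_views,
            "visitors": visitors,
            "conversion_rate": self.order_count / visitors if visitors else 0.0,
        }
```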
ClickHouse Analytical Store
ClickHouse stores pre-aggregated daily metrics in a MergeTree table: seller_daily_metrics (seller_id, date, revenue, order_count, units_sold, avg_order_value, unique_visitors, page_views, conversion_rate, top_products Array(Tuple(product_id, revenue)), traffic_sources Map(String, Int)). Partitioned by month, ordered by (seller_id, date) for fast seller-specific range queries. A materialized view auto-computes monthly rollups. Query latency for a single seller's 30-day trend: <50ms on ClickHouse.
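A single-seller 30-day trend query might look like the following, assuming the clickhouse-connect Python client (only the table and column names come from the schema in this design; the host and parameter values are placeholders):

```python
import clickhouse_connect  # assumed driver; any ClickHouse client would do

client = clickhouse_connect.get_client(host="clickhouse-analytics")  # placeholder host

# Single-seller 30-day trend; stays within one (seller_id, date) range thanks to
# ORDER BY (seller_id, date) in the MergeTree table.
TREND_SQL = """
SELECT date, revenue, order_count, unique_visitors, conversion_rate
FROM seller_daily_metrics
WHERE seller_id = {seller_id:UInt64}
  AND date BETWEEN {start:Date} AND {end:Date}
ORDER BY date
"""

rows = client.query(
    TREND_SQL,
    parameters={"seller_id": 42, "start": "2025-01-01", "end": "2025-01-30"},
).result_rows
```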
Inventory Forecasting Engine
The forecasting engine predicts stockout dates using sales velocity analysis. For each product, it computes the trailing 14-day average daily sales rate, applies a seasonality multiplier (derived from the same period last year), and projects days_until_stockout = current_stock / adjusted_daily_rate. The forecast runs as a nightly Spark job and stores predictions in PostgreSQL: forecasts table (product_id, seller_id, current_stock, daily_rate, predicted_stockout_date, confidence_interval). Sellers see a 'Low Stock Alert' on their dashboard when the predicted stockout date is less than 14 days away.
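A minimal sketch of the projection itself (the function name, signature, and example values are illustrative; the formula is the one stated above):

```python
from datetime import date, timedelta


def predict_stockout(current_stock, trailing_14d_sales, seasonality_multiplier, today=None):
    """Project days until stockout from trailing 14-day sales velocity.

    seasonality_multiplier is derived from the same period last year
    (e.g. 1.3 means demand ran 30% above baseline in this window).
    """
    today = today or date.today()
    daily_rate = sum(trailing_14d_sales) / len(trailing_14d_sales)
    adjusted_daily_rate = daily_rate * seasonality_multiplier
    if adjusted_daily_rate <= 0:
        return float("inf"), None                 # no velocity, no predicted stockout
    days_until_stockout = current_stock / adjusted_daily_rate
    return days_until_stockout, today + timedelta(days=int(days_until_stockout))


# Example: 120 units on hand, ~9 units/day over the last two weeks, 30% seasonal lift
days, stockout_date = predict_stockout(120, [9] * 14, 1.3)
low_stock_alert = days < 14                       # triggers the dashboard 'Low Stock Alert'
```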
Database Design
Redis data model for real-time metrics: Hash metrics:{seller_id}:{date} → {revenue: 1234.56, orders: 47, units: 89, visitors: 1203, page_views: 4521, conversion_rate: 0.039}. Each field is updated atomically by the Flink pipeline using HINCRBYFLOAT for numeric counters and PFADD for the HyperLogLog visitor count (stored in a separate key hll:{seller_id}:{date}).
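A minimal sketch of those writes using redis-py (the key names follow the data model above; the helper function and the delta payload shape are illustrative, and in the real pipeline this logic would live inside the Flink sink):

```python
import redis

r = redis.Redis(host="redis-metrics")  # host name is a placeholder


def apply_window_result(seller_id, day, delta, visitor_ids):
    """Apply one window's deltas to the per-day metrics hash and visitor HyperLogLog."""
    metrics_key = f"metrics:{seller_id}:{day}"
    hll_key = f"hll:{seller_id}:{day}"

    pipe = r.pipeline()
    pipe.hincrbyfloat(metrics_key, "revenue", delta["revenue"])
    pipe.hincrby(metrics_key, "orders", delta["orders"])
    pipe.hincrby(metrics_key, "units", delta["units"])
    pipe.hincrby(metrics_key, "page_views", delta["page_views"])
    pipe.hset(metrics_key, "conversion_rate", delta["conversion_rate"])  # derived, overwritten
    if visitor_ids:
        pipe.pfadd(hll_key, *visitor_ids)         # approximate unique-visitor count
    pipe.execute()
```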
ClickHouse schema: seller_daily_metrics (seller_id UInt64, date Date, revenue Decimal(12,2), order_count UInt32, units_sold UInt32, avg_order_value Decimal(10,2), unique_visitors UInt32, page_views UInt32, conversion_rate Float32, top_products Array(Tuple(UInt64, Decimal(10,2))), traffic_sources Map(String, UInt32)) ENGINE = MergeTree() PARTITION BY toYYYYMM(date) ORDER BY (seller_id, date). Retention: raw daily data for 2 years (24 monthly partitions covering ~730 days), monthly aggregates kept indefinitely.
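The monthly rollup materialized view mentioned earlier could be defined roughly as follows; the view name, SummingMergeTree engine, and column subset are assumptions rather than part of the stated schema, and queries against it should still GROUP BY and sum() because SummingMergeTree collapses duplicate keys only at merge time:

```python
# Hypothetical monthly rollup DDL; executed once via any ClickHouse client (e.g. client.command()).
MONTHLY_ROLLUP_DDL = """
CREATE MATERIALIZED VIEW IF NOT EXISTS seller_monthly_metrics
ENGINE = SummingMergeTree()
PARTITION BY toYear(month)
ORDER BY (seller_id, month)
AS
SELECT
    seller_id,
    toStartOfMonth(date) AS month,
    sum(revenue)     AS revenue,
    sum(order_count) AS order_count,
    sum(units_sold)  AS units_sold,
    sum(page_views)  AS page_views
FROM seller_daily_metrics
GROUP BY seller_id, month
"""
```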
API Design
- GET /api/v1/sellers/{id}/metrics/realtime — Current day's real-time metrics from Redis; includes revenue, orders, visitors, conversion rate
- GET /api/v1/sellers/{id}/metrics/history?start=2025-01-01&end=2025-01-31&granularity=daily — Historical metrics from ClickHouse with comparison period
- GET /api/v1/sellers/{id}/products/performance?sort=revenue&period=30d&limit=20 — Top/bottom product performance with trends
- GET /api/v1/sellers/{id}/inventory/forecast — Stockout predictions for all active products
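As an illustration of the realtime endpoint, a client call might look like this (the host, seller id, auth header, and exact response shape are assumptions; the fields mirror the Redis hash from the database design):

```python
import requests

resp = requests.get(
    "https://api.example.com/api/v1/sellers/8821/metrics/realtime",  # placeholder host/id
    headers={"Authorization": "Bearer <seller-scoped-token>"},        # assumed auth scheme
    timeout=5,
)
resp.raise_for_status()
metrics = resp.json()
# Assumed response shape, mirroring the real-time Redis hash:
# {"revenue": 1234.56, "orders": 47, "units": 89, "visitors": 1203,
#  "page_views": 4521, "conversion_rate": 0.039}
```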
Scaling & Bottlenecks
The Flink streaming job must handle 579 events/sec with keyed state for 500K sellers. Flink parallelism is set to 16 task slots, each handling ~36 events/sec — well within Flink's capacity. The RocksDB state backend stores 500K seller aggregation states at ~1KB each = 500MB — fits comfortably in local SSD. Checkpointing to S3 every 30 seconds adds ~5% throughput overhead.
ClickHouse query scaling: 5.8 dashboard QPS with each query touching a single seller's data is trivial for ClickHouse (single-node can handle 100+ QPS for point queries). The challenge is the nightly Spark aggregation job that writes 500K rows to ClickHouse — batch inserts of 10K rows at a time complete in under 10 minutes. For cost efficiency, a single ClickHouse node with 16 cores and 64GB RAM handles the entire workload, with a replica for high availability.
Key Trade-offs
- Lambda architecture over pure streaming: Flink handles real-time, Spark handles batch corrections — more complex than pure streaming (Kappa architecture) but provides guaranteed accuracy in historical data through batch reprocessing
- ClickHouse over PostgreSQL for analytics: 10-100x faster for analytical queries (columnar storage, vectorized execution), but adds another database technology to operate
- 60-second dashboard cache TTL: Reduces ClickHouse query load by 90% during heavy dashboard usage, but sellers see slightly stale data — real-time Redis metrics compensate for today's numbers
- HyperLogLog for unique visitors: 0.81% standard error is acceptable for analytics, and uses 12KB per counter instead of storing individual visitor IDs — 500K sellers × 12KB = 6GB vs. potentially hundreds of GB for exact counts