System Design: Model Monitoring & Drift Detection
Design a comprehensive ML model monitoring system that tracks prediction quality, detects data drift and concept drift, and alerts teams before model degradation impacts business outcomes. Covers statistical drift tests, labeling pipelines, and automated retraining triggers.
Requirements
Functional Requirements:
- Monitor prediction distribution drift for every deployed model in near real-time
- Detect input feature drift: statistical changes in the feature distributions seen at serving time vs. training time
- Track labeled performance metrics (accuracy, AUC, RMSE) as ground-truth labels become available
- Alert model owners when drift or performance degradation crosses configurable thresholds
- Provide root cause tools: which features drifted? which data segments degraded?
- Trigger automated retraining pipelines when drift or performance thresholds are breached
Non-Functional Requirements:
- Drift metrics updated every 15 minutes for high-stakes models, hourly for others
- Alert latency under 5 minutes from drift detection to notification delivery
- Support 1,000 deployed models with 500 features each = 500,000 monitored feature distributions
- Store 90 days of monitoring data for trend analysis and incident investigation
- Monitoring overhead on the inference service: less than 1% CPU increase
Scale Estimation
1,000 models * 500 features = 500,000 distribution profiles. Each profile requires computing PSI/KL divergence over the current window (15 minutes of predictions) vs. the reference distribution. At 100,000 requests/second aggregate, 15 minutes = 90 million predictions; sampling 1% = 900,000 samples per 15-minute window across all models. Per-model, that ranges from roughly 900 samples per window (for a 100 RPS model) up to 900,000 samples (for a model carrying the full 100,000 RPS). Drift computation time: O(feature_count * bin_count) = negligible.
High-Level Architecture
The monitoring system operates in three layers: Data Collection, Drift Analysis, and Alerting & Action. Data Collection captures a sample of inference inputs, outputs, and (when available) ground-truth labels from the inference service via Kafka. The inference service logs predictions to a prediction_logs Kafka topic; a sampling agent applies reservoir sampling to forward a 1% sample to the monitoring system without adding latency to the inference path.
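To make the sampling agent concrete, here is a minimal Python sketch of per-model reservoir sampling over one 15-minute window, assuming the kafka-python client and JSON-encoded prediction events; the broker address, the monitoring_samples topic name, and the reservoir size are illustrative assumptions, not part of the design above.

    import json
    import random
    from collections import defaultdict
    from kafka import KafkaConsumer, KafkaProducer  # assumption: kafka-python client

    RESERVOIR_SIZE = 900  # illustrative: ~1% of a 100 RPS model's 15-minute traffic

    consumer = KafkaConsumer(
        "prediction_logs",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda b: json.loads(b),
    )
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda d: json.dumps(d).encode(),
    )

    reservoirs = defaultdict(list)  # model_id -> sampled events, current window
    seen = defaultdict(int)         # model_id -> events observed this window

    def add_to_reservoir(model_id, event):
        # Algorithm R: each event in the window lands in the final sample
        # with equal probability RESERVOIR_SIZE / seen[model_id].
        seen[model_id] += 1
        r = reservoirs[model_id]
        if len(r) < RESERVOIR_SIZE:
            r.append(event)
        else:
            j = random.randrange(seen[model_id])
            if j < RESERVOIR_SIZE:
                r[j] = event

    def flush_window():
        # Called by the 15-minute scheduler: forward samples, reset state.
        for sample in reservoirs.values():
            for event in sample:
                producer.send("monitoring_samples", event)
        reservoirs.clear()
        seen.clear()

    for msg in consumer:
        add_to_reservoir(msg.value["model_id"], msg.value)

Because the reservoir caps per-model samples at a fixed size, a 100,000 RPS model cannot flood the monitoring pipeline, while a 100 RPS model still contributes enough samples for stable statistics.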
Drift Analysis runs on a schedule: every 15 minutes, a Spark or Python job reads the latest prediction samples, computes drift statistics (Population Stability Index for categorical features, Kolmogorov-Smirnov test for continuous features) against the training reference distribution, and updates a metrics store. For labeled performance, a separate join job matches predictions with ground-truth labels (when available, e.g., from chargeback events, user feedback, labeling queues) and computes accuracy, AUC, and other task-specific metrics.
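The statistics themselves are only a few lines of NumPy/SciPy. A sketch of the per-feature computation, assuming the training-time bin edges are stored with the reference profile so that PSI bins align across windows:

    import numpy as np
    from scipy import stats

    def psi(expected_pct: np.ndarray, actual_pct: np.ndarray, eps: float = 1e-6) -> float:
        # Population Stability Index over aligned histogram bins; eps guards
        # against empty bins, which would blow up the log term.
        e = np.clip(expected_pct, eps, None)
        a = np.clip(actual_pct, eps, None)
        return float(np.sum((a - e) * np.log(a / e)))

    def drift_stats(reference: np.ndarray, current: np.ndarray,
                    bin_edges: np.ndarray) -> dict:
        # PSI on the shared (training-time) bin edges plus a two-sample KS test.
        ref_pct = np.histogram(reference, bins=bin_edges)[0] / len(reference)
        cur_pct = np.histogram(current, bins=bin_edges)[0] / len(current)
        ks_stat, ks_pvalue = stats.ks_2samp(reference, current)
        return {"psi": psi(ref_pct, cur_pct),
                "ks_statistic": float(ks_stat),
                "ks_pvalue": float(ks_pvalue)}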
Alerts are evaluated after each drift computation run. Alert rules specify: metric name, threshold, comparison direction, and consecutive window count (to prevent single-outlier false positives). Multi-window alerting (drift must exceed threshold in 3 consecutive 15-minute windows) reduces false positives by 80%. Automated actions include triggering a retraining pipeline, routing traffic to a backup model, and creating a JIRA incident ticket.
Core Components
Reference Distribution Store
At model training time, the monitoring system computes reference statistics for every feature: mean, standard deviation, percentiles (p1, p5, p25, p50, p75, p95, p99), and a 50-bin histogram. These are stored in PostgreSQL as the reference profile. At serving time, the same statistics are computed over the monitoring window. A Population Stability Index (PSI = Σ (actual% - expected%) * ln(actual%/expected%)) between 0.1 and 0.2 indicates moderate drift; above 0.2 indicates significant drift.
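A sketch of building one feature's reference profile at training time; the stats_json/histogram_json field names anticipate the schema under Database Design below:

    import json
    import numpy as np

    def build_reference_profile(values: np.ndarray, n_bins: int = 50) -> dict:
        # Reference statistics for one feature: moments, percentiles, histogram.
        pcts = np.percentile(values, [1, 5, 25, 50, 75, 95, 99])
        counts, edges = np.histogram(values, bins=n_bins)
        return {
            "stats_json": json.dumps({
                "mean": float(values.mean()),
                "std": float(values.std()),
                "percentiles": dict(zip(
                    ["p1", "p5", "p25", "p50", "p75", "p95", "p99"],
                    map(float, pcts))),
            }),
            "histogram_json": json.dumps({
                "bin_edges": edges.tolist(),
                "counts": counts.tolist(),
            }),
        }

Persisting the bin edges is the important detail: serving-time windows must be histogrammed against the same edges for the PSI comparison to be meaningful.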
Label Joining Service
For supervised models, monitoring effectiveness depends on label availability. Labels arrive with widely varying delays: fraud labels after roughly 30 days, content quality labels after about 7 days, explicit feedback (thumbs up/down) within seconds. The label joining service maintains an open join window per model: for each prediction event, it waits up to max_label_delay for a matching label keyed by request_id or entity_id+timestamp. Matched prediction-label pairs are written to a labeled_predictions table consumed by the performance metrics job.
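A minimal in-memory sketch of the join window; a production implementation would keep this state in a durable streaming join (e.g., Flink or Spark state), since max_label_delay can be 30 days. The emit_labeled_pair helper is a hypothetical stand-in for the write to labeled_predictions.

    import time
    from collections import OrderedDict

    def emit_labeled_pair(prediction: dict, label: dict):
        # Stub: in practice, insert into the labeled_predictions table.
        print(prediction["request_id"], label)

    class LabelJoiner:
        def __init__(self, max_label_delay_s: float):
            self.max_delay = max_label_delay_s
            self.pending = OrderedDict()  # request_id -> (arrival time, prediction)

        def on_prediction(self, event: dict):
            self.pending[event["request_id"]] = (time.time(), event)

        def on_label(self, label: dict):
            entry = self.pending.pop(label["request_id"], None)
            if entry is not None:
                _, prediction = entry
                emit_labeled_pair(prediction, label)

        def expire(self):
            # Drop predictions whose join window closed without a label;
            # OrderedDict preserves insertion (i.e., arrival-time) order.
            cutoff = time.time() - self.max_delay
            while self.pending:
                request_id, (ts, _) = next(iter(self.pending.items()))
                if ts >= cutoff:
                    break
                del self.pending[request_id]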
Drift Alerting Engine
The alerting engine evaluates drift rules after every computation cycle. Rules are stored in a rules table with: model_id, feature_name (or __prediction__ for output drift), metric_type (PSI, KS, performance_auc), threshold, consecutive_windows, severity, and action list. Consecutive window tracking prevents single-outlier false positives. When an alert fires, the engine publishes to PagerDuty (CRITICAL), Slack (HIGH), or JIRA (MEDIUM), and optionally triggers the retraining pipeline API.
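A sketch of the evaluation loop with consecutive-window tracking; the rule fields mirror the rules table above, and fire() is a stub where the PagerDuty/Slack/JIRA routing and the retraining pipeline API call would go.

    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class AlertRule:
        model_id: str
        feature_name: str        # "__prediction__" for output drift
        metric_type: str         # "PSI", "KS", "performance_auc"
        threshold: float
        consecutive_windows: int
        severity: str            # CRITICAL / HIGH / MEDIUM
        above: bool = True       # comparison direction

    class AlertEngine:
        def __init__(self, rules: list[AlertRule]):
            self.rules = rules
            self.streaks = defaultdict(int)  # rule -> consecutive breached windows

        def evaluate(self, model_id: str, feature_name: str,
                     metric_type: str, value: float):
            # Called once per drift computation cycle per fresh metric value.
            for rule in self.rules:
                if (rule.model_id, rule.feature_name, rule.metric_type) != \
                        (model_id, feature_name, metric_type):
                    continue
                breached = value > rule.threshold if rule.above else value < rule.threshold
                self.streaks[rule] = self.streaks[rule] + 1 if breached else 0
                if self.streaks[rule] >= rule.consecutive_windows:
                    self.fire(rule, value)
                    self.streaks[rule] = 0  # re-arm after firing

        def fire(self, rule: AlertRule, value: float):
            # Stub: route by severity, optionally call the retraining pipeline.
            print(f"[{rule.severity}] {rule.model_id}/{rule.feature_name}: "
                  f"{rule.metric_type}={value:.3f} breached {rule.threshold}")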
Database Design
TimescaleDB for drift metrics (time-series): (model_id, feature_name, window_start TIMESTAMP, psi_score FLOAT, ks_statistic FLOAT, ks_pvalue FLOAT, sample_count INT) with automatic partitioning by month and compression after 30 days. PostgreSQL for reference distributions: model_reference_profiles (model_id, model_version, feature_name, stats_json JSONB, histogram_json JSONB, computed_at). PostgreSQL for alert history: drift_alerts (alert_id, model_id, feature_name, metric_value, threshold, severity, fired_at, resolved_at, action_taken).
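A sketch of the TimescaleDB setup, assuming the psycopg2 driver and an illustrative connection string; the monthly partitioning and 30-day compression described above map directly onto a hypertable with a compression policy.

    import psycopg2  # assumption: psycopg2 against a TimescaleDB instance

    DDL = """
    CREATE TABLE IF NOT EXISTS drift_metrics (
        model_id     TEXT        NOT NULL,
        feature_name TEXT        NOT NULL,
        window_start TIMESTAMPTZ NOT NULL,
        psi_score    DOUBLE PRECISION,
        ks_statistic DOUBLE PRECISION,
        ks_pvalue    DOUBLE PRECISION,
        sample_count INTEGER
    );
    -- Monthly chunks give the automatic partitioning by month.
    SELECT create_hypertable('drift_metrics', 'window_start',
                             chunk_time_interval => INTERVAL '1 month',
                             if_not_exists => TRUE);
    ALTER TABLE drift_metrics SET (timescaledb.compress,
                                   timescaledb.compress_segmentby = 'model_id');
    -- Compress chunks once they are older than 30 days.
    SELECT add_compression_policy('drift_metrics', INTERVAL '30 days',
                                  if_not_exists => TRUE);
    """

    with psycopg2.connect("dbname=monitoring") as conn, conn.cursor() as cur:
        cur.execute(DDL)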
API Design
GET /models/{model_id}/drift?from={ts}&to={ts} — Return drift metrics time series for all features of a model over the specified window.
GET /models/{model_id}/performance?metric=auc&from={ts}&to={ts} — Return labeled performance metrics time series.
POST /models/{model_id}/alerts — Create a drift alert rule with threshold, metric, and action configuration.
GET /models/{model_id}/drift/summary — Return a summary of current drift status: features with high PSI, prediction distribution shift, and current performance vs. training-time baseline.
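An illustrative client call against the drift endpoint; the base URL, model id, and response field names in the comments are assumptions rather than a documented contract.

    import requests  # plain HTTP client

    BASE = "https://monitoring.internal/api/v1"  # illustrative base URL

    resp = requests.get(
        f"{BASE}/models/fraud-v3/drift",  # "fraud-v3" is a hypothetical model id
        params={"from": "2024-06-01T00:00:00Z", "to": "2024-06-02T00:00:00Z"},
    )
    resp.raise_for_status()
    for point in resp.json()["series"]:  # assumed shape: one row per (feature, window)
        print(point["feature_name"], point["window_start"], point["psi_score"])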
Scaling & Bottlenecks
Drift computation for 500,000 features across 1,000 models every 15 minutes requires: 500,000 PSI computations / 900 seconds = 556 computations/second. Each PSI computation on 50 bins takes 50 microseconds; 556/second is trivially handled by a single CPU. The bottleneck is data aggregation: computing histograms over millions of prediction samples requires efficient streaming aggregation. Using Apache DataSketches (KLL sketch for quantiles, CPC sketch for distinct counts) reduces memory from O(N) to O(log N) while maintaining accuracy within 1%.
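A sketch of the sketch-based aggregation, assuming the Apache DataSketches Python bindings (the datasketches package): two KLL sketches summarize the reference and current streams in O(log N) memory, and their PMFs over shared split points feed the same PSI formula as before.

    import numpy as np
    from datasketches import kll_floats_sketch  # assumption: datasketches bindings

    def sketch_stream(values, k: int = 200) -> kll_floats_sketch:
        # O(log N)-memory summary of a feature stream instead of raw samples.
        sk = kll_floats_sketch(k)
        for v in values:
            sk.update(float(v))
        return sk

    ref = sketch_stream(np.random.normal(0.0, 1.0, 1_000_000))
    cur = sketch_stream(np.random.normal(0.3, 1.0, 1_000_000))  # shifted stream

    # 49 split points from the reference quantiles -> 50 bins, matching the
    # 50-bin reference histograms used elsewhere in the design.
    edges = [ref.get_quantile(q) for q in np.linspace(0.02, 0.98, 49)]
    ref_pct = np.clip(ref.get_pmf(edges), 1e-6, None)
    cur_pct = np.clip(cur.get_pmf(edges), 1e-6, None)
    print("psi:", float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct))))

Because KLL sketches merge associatively, per-partition sketches can be combined into a per-model sketch without reprocessing raw samples, which is what makes the streaming aggregation cheap.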
Label delay is the primary challenge for performance monitoring: models with long label delays (fraud: 30 days) cannot detect concept drift in near real-time. Proxy metrics (prediction confidence distribution, prediction class ratios) serve as early warning signals before true labels arrive. When proxy metrics drift, the team is alerted to investigate immediately; labeled performance confirmation arrives 30 days later.
Key Trade-offs
- Full logging vs. reservoir sampling: Logging every prediction provides complete accuracy for drift computation but creates enormous data volumes; a 1% reservoir sample provides statistically equivalent drift detection for models with >1,000 daily predictions.
- Statistical significance vs. alert sensitivity: Strict statistical thresholds (KS p-value < 0.001) reduce false positives but miss gradual drift; looser thresholds (PSI > 0.1) catch gradual drift earlier but produce more false positives.
- Automated retraining vs. human-in-loop: Automated retraining on drift events accelerates recovery but risks training on a temporarily anomalous distribution; human review before retraining prevents false triggers at the cost of recovery time.
- Feature-level vs. model-level drift monitoring: Model-level output drift is easy to compute but doesn't explain the cause; feature-level drift identifies root causes but requires 500x more monitoring infrastructure.