System Design: Model Monitoring & Drift Detection
Design a comprehensive ML model monitoring system that tracks prediction quality, detects data drift and concept drift, and alerts teams before model degradation impacts business outcomes. Covers statistical drift tests, labeling pipelines, and automated retraining triggers.
Requirements
Functional Requirements:
- Monitor prediction distribution drift for every deployed model in near real-time
- Detect input feature drift: statistical changes in the feature distributions seen at serving time vs. training time
- Track labeled performance metrics (accuracy, AUC, RMSE) as ground-truth labels become available
- Alert model owners when drift or performance degradation crosses configurable thresholds
- Provide root cause tools: which features drifted? which data segments degraded?
- Trigger automated retraining pipelines when drift or performance thresholds are breached
Non-Functional Requirements:
- Drift metrics updated every 15 minutes for high-stakes models, hourly for others
- Alert latency under 5 minutes from drift detection to notification delivery
- Support 1,000 deployed models with 500 features each = 500,000 monitored feature distributions
- Store 90 days of monitoring data for trend analysis and incident investigation
- Monitoring overhead on the inference service: less than 1% CPU increase
Scale Estimation
1,000 models * 500 features = 500,000 distribution profiles. Each profile requires computing PSI/KL divergence over the current window (15 minutes of predictions) vs. the reference distribution. At 100,000 requests/second aggregate, 15 minutes = 90 million predictions; sampling 1% = 900,000 samples per 15-minute window across all models. Per-model, that ranges from roughly 900 samples per window (for a 100 RPS model) up to 900,000 samples (for a model carrying the full 100,000 RPS). Drift computation time: O(feature_count * bin_count) = negligible.
High-Level Architecture
The monitoring system operates in three layers: Data Collection, Drift Analysis, and Alerting & Action. Data Collection captures a sample of inference inputs, outputs, and (when available) ground-truth labels from the inference service via Kafka. The inference service logs predictions to a prediction_logs Kafka topic; a sampling agent applies reservoir sampling to forward a 1% sample to the monitoring system without adding latency to the inference path.
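To make the sampling agent concrete, here is a minimal Python sketch of per-model reservoir sampling over one 15-minute window, assuming the kafka-python client and JSON-encoded prediction events; the broker address, the monitoring_samples topic name, and the reservoir size are illustrative assumptions, not part of the design above.

    import json
    import random
    from collections import defaultdict
    from kafka import KafkaConsumer, KafkaProducer  # assumption: kafka-python client

    RESERVOIR_SIZE = 900  # illustrative: ~1% of a 100 RPS model's 15-minute traffic

    consumer = KafkaConsumer(
        "prediction_logs",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda b: json.loads(b),
    )
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda d: json.dumps(d).encode(),
    )

    reservoirs = defaultdict(list)  # model_id -> sampled events, current window
    seen = defaultdict(int)         # model_id -> events observed this window

    def add_to_reservoir(model_id, event):
        # Algorithm R: each event in the window lands in the final sample
        # with equal probability RESERVOIR_SIZE / seen[model_id].
        seen[model_id] += 1
        r = reservoirs[model_id]
        if len(r) < RESERVOIR_SIZE:
            r.append(event)
        else:
            j = random.randrange(seen[model_id])
            if j < RESERVOIR_SIZE:
                r[j] = event

    def flush_window():
        # Called by the 15-minute scheduler: forward samples, reset state.
        for sample in reservoirs.values():
            for event in sample:
                producer.send("monitoring_samples", event)
        reservoirs.clear()
        seen.clear()

    for msg in consumer:
        add_to_reservoir(msg.value["model_id"], msg.value)

Because the reservoir caps per-model samples at a fixed size, a 100,000 RPS model cannot flood the monitoring pipeline, while a 100 RPS model still contributes enough samples for stable statistics.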
Drift Analysis runs on a schedule: every 15 minutes, a Spark or Python job reads the latest prediction samples, computes drift statistics (Population Stability Index for categorical features, Kolmogorov-Smirnov test for continuous features) against the training reference distribution, and updates a metrics store. For labeled performance, a separate join job matches predictions with ground-truth labels (when available, e.g., from chargeback events, user feedback, labeling queues) and computes accuracy, AUC, and other task-specific metrics.
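The statistics themselves are only a few lines of NumPy/SciPy. A sketch of the per-feature computation, assuming the training-time bin edges are stored with the reference profile so that PSI bins align across windows:

    import numpy as np
    from scipy import stats

    def psi(expected_pct: np.ndarray, actual_pct: np.ndarray, eps: float = 1e-6) -> float:
        # Population Stability Index over aligned histogram bins; eps guards
        # against empty bins, which would blow up the log term.
        e = np.clip(expected_pct, eps, None)
        a = np.clip(actual_pct, eps, None)
        return float(np.sum((a - e) * np.log(a / e)))

    def drift_stats(reference: np.ndarray, current: np.ndarray,
                    bin_edges: np.ndarray) -> dict:
        # PSI on the shared (training-time) bin edges plus a two-sample KS test.
        ref_pct = np.histogram(reference, bins=bin_edges)[0] / len(reference)
        cur_pct = np.histogram(current, bins=bin_edges)[0] / len(current)
        ks_stat, ks_pvalue = stats.ks_2samp(reference, current)
        return {"psi": psi(ref_pct, cur_pct),
                "ks_statistic": float(ks_stat),
                "ks_pvalue": float(ks_pvalue)}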
Alerts are evaluated after each drift computation run. Alert rules specify: metric name, threshold, comparison direction, and consecutive window count (to prevent single-outlier false positives). Multi-window alerting (drift must exceed threshold in 3 consecutive 15-minute windows) reduces false positives by 80%. Automated actions include triggering a retraining pipeline, routing traffic to a backup model, and creating a JIRA incident ticket.
Core Components
Reference Distribution Store
At model training time, the monitoring system computes reference statistics for every feature: mean, standard deviation, percentiles (p1, p5, p25, p50, p75, p95, p99), and a 50-bin histogram. These are stored in PostgreSQL as the reference profile. At serving time, the same statistics are computed over the monitoring window. A Population Stability Index (PSI = Σ (actual% - expected%) * ln(actual%/expected%)) between 0.1 and 0.2 indicates moderate drift; above 0.2 indicates significant drift.
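A sketch of building one feature's reference profile at training time; the stats_json/histogram_json field names anticipate the schema under Database Design below:

    import json
    import numpy as np

    def build_reference_profile(values: np.ndarray, n_bins: int = 50) -> dict:
        # Reference statistics for one feature: moments, percentiles, histogram.
        pcts = np.percentile(values, [1, 5, 25, 50, 75, 95, 99])
        counts, edges = np.histogram(values, bins=n_bins)
        return {
            "stats_json": json.dumps({
                "mean": float(values.mean()),
                "std": float(values.std()),
                "percentiles": dict(zip(
                    ["p1", "p5", "p25", "p50", "p75", "p95", "p99"],
                    map(float, pcts))),
            }),
            "histogram_json": json.dumps({
                "bin_edges": edges.tolist(),
                "counts": counts.tolist(),
            }),
        }

Persisting the bin edges is the important detail: serving-time windows must be histogrammed against the same edges for the PSI comparison to be meaningful.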
Label Joining Service
For supervised models, monitoring effectiveness depends on label availability. Labels arrive with widely varying delays: fraud labels after roughly 30 days, content quality labels after about 7 days, explicit feedback (thumbs up/down) within seconds. The label joining service maintains an open join window per model: for each prediction event, it waits up to max_label_delay for a matching label keyed by request_id or entity_id+timestamp. Matched prediction-label pairs are written to a labeled_predictions table consumed by the performance metrics job.
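A minimal in-memory sketch of the join window; a production implementation would keep this state in a durable streaming join (e.g., Flink or Spark state), since max_label_delay can be 30 days. The emit_labeled_pair helper is a hypothetical stand-in for the write to labeled_predictions.

    import time
    from collections import OrderedDict

    def emit_labeled_pair(prediction: dict, label: dict):
        # Stub: in practice, insert into the labeled_predictions table.
        print(prediction["request_id"], label)

    class LabelJoiner:
        def __init__(self, max_label_delay_s: float):
            self.max_delay = max_label_delay_s
            self.pending = OrderedDict()  # request_id -> (arrival time, prediction)

        def on_prediction(self, event: dict):
            self.pending[event["request_id"]] = (time.time(), event)

        def on_label(self, label: dict):
            entry = self.pending.pop(label["request_id"], None)
            if entry is not None:
                _, prediction = entry
                emit_labeled_pair(prediction, label)

        def expire(self):
            # Drop predictions whose join window closed without a label;
            # OrderedDict preserves insertion (i.e., arrival-time) order.
            cutoff = time.time() - self.max_delay
            while self.pending:
                request_id, (ts, _) = next(iter(self.pending.items()))
                if ts >= cutoff:
                    break
                del self.pending[request_id]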
Drift Alerting Engine
The alerting engine evaluates drift rules after every computation cycle. Rules are stored in a rules table with: model_id, feature_name (or __prediction__ for output drift), metric_type (PSI, KS, performance_auc), threshold, consecutive_windows, severity, and action list. Consecutive window tracking prevents single-outlier false positives. When an alert fires, the engine publishes to PagerDuty (CRITICAL), Slack (HIGH), or JIRA (MEDIUM), and optionally triggers the retraining pipeline API.
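A sketch of the evaluation loop with consecutive-window tracking; the rule fields mirror the rules table above, and fire() is a stub where the PagerDuty/Slack/JIRA routing and the retraining pipeline API call would go.

    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class AlertRule:
        model_id: str
        feature_name: str        # "__prediction__" for output drift
        metric_type: str         # "PSI", "KS", "performance_auc"
        threshold: float
        consecutive_windows: int
        severity: str            # CRITICAL / HIGH / MEDIUM
        above: bool = True       # comparison direction

    class AlertEngine:
        def __init__(self, rules: list[AlertRule]):
            self.rules = rules
            self.streaks = defaultdict(int)  # rule -> consecutive breached windows

        def evaluate(self, model_id: str, feature_name: str,
                     metric_type: str, value: float):
            # Called once per drift computation cycle per fresh metric value.
            for rule in self.rules:
                if (rule.model_id, rule.feature_name, rule.metric_type) != \
                        (model_id, feature_name, metric_type):
                    continue
                breached = value > rule.threshold if rule.above else value < rule.threshold
                self.streaks[rule] = self.streaks[rule] + 1 if breached else 0
                if self.streaks[rule] >= rule.consecutive_windows:
                    self.fire(rule, value)
                    self.streaks[rule] = 0  # re-arm after firing

        def fire(self, rule: AlertRule, value: float):
            # Stub: route by severity, optionally call the retraining pipeline.
            print(f"[{rule.severity}] {rule.model_id}/{rule.feature_name}: "
                  f"{rule.metric_type}={value:.3f} breached {rule.threshold}")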
Database Design
TimescaleDB for drift metrics (time-series): (model_id, feature_name, window_start TIMESTAMP, psi_score FLOAT, ks_statistic FLOAT, ks_pvalue FLOAT, sample_count INT) with automatic partitioning by month and compression after 30 days. PostgreSQL for reference distributions: model_reference_profiles (model_id, model_version, feature_name, stats_json JSONB, histogram_json JSONB, computed_at). PostgreSQL for alert history: drift_alerts (alert_id, model_id, feature_name, metric_value, threshold, severity, fired_at, resolved_at, action_taken).
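A sketch of the TimescaleDB setup, assuming the psycopg2 driver and an illustrative connection string; the monthly partitioning and 30-day compression described above map directly onto a hypertable with a compression policy.

    import psycopg2  # assumption: psycopg2 against a TimescaleDB instance

    DDL = """
    CREATE TABLE IF NOT EXISTS drift_metrics (
        model_id     TEXT        NOT NULL,
        feature_name TEXT        NOT NULL,
        window_start TIMESTAMPTZ NOT NULL,
        psi_score    DOUBLE PRECISION,
        ks_statistic DOUBLE PRECISION,
        ks_pvalue    DOUBLE PRECISION,
        sample_count INTEGER
    );
    -- Monthly chunks give the automatic partitioning by month.
    SELECT create_hypertable('drift_metrics', 'window_start',
                             chunk_time_interval => INTERVAL '1 month',
                             if_not_exists => TRUE);
    ALTER TABLE drift_metrics SET (timescaledb.compress,
                                   timescaledb.compress_segmentby = 'model_id');
    -- Compress chunks once they are older than 30 days.
    SELECT add_compression_policy('drift_metrics', INTERVAL '30 days',
                                  if_not_exists => TRUE);
    """

    with psycopg2.connect("dbname=monitoring") as conn, conn.cursor() as cur:
        cur.execute(DDL)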
API Design
GET /models/{model_id}/drift?from={ts}&to={ts} — Return drift metrics time series for all features of a model over the specified window.
GET /models/{model_id}/performance?metric=auc&from={ts}&to={ts} — Return labeled performance metrics time series.
POST /models/{model_id}/alerts — Create a drift alert rule with threshold, metric, and action configuration.
GET /models/{model_id}/drift/summary — Return a summary of current drift status: features with high PSI, prediction distribution shift, and current performance vs. training-time baseline.
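An illustrative client call against the drift endpoint; the base URL, model id, and response field names in the comments are assumptions rather than a documented contract.

    import requests  # plain HTTP client

    BASE = "https://monitoring.internal/api/v1"  # illustrative base URL

    resp = requests.get(
        f"{BASE}/models/fraud-v3/drift",  # "fraud-v3" is a hypothetical model id
        params={"from": "2024-06-01T00:00:00Z", "to": "2024-06-02T00:00:00Z"},
    )
    resp.raise_for_status()
    for point in resp.json()["series"]:  # assumed shape: one row per (feature, window)
        print(point["feature_name"], point["window_start"], point["psi_score"])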
Scaling & Bottlenecks
Drift computation for 500,000 features across 1,000 models every 15 minutes requires: 500,000 PSI computations / 900 seconds = 556 computations/second. Each PSI computation on 50 bins takes 50 microseconds; 556/second is trivially handled by a single CPU. The bottleneck is data aggregation: computing histograms over millions of prediction samples requires efficient streaming aggregation. Using Apache DataSketches (KLL sketch for quantiles, CPC sketch for distinct counts) reduces memory from O(N) to O(log N) while maintaining accuracy within 1%.
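A sketch of the sketch-based aggregation, assuming the Apache DataSketches Python bindings (the datasketches package): two KLL sketches summarize the reference and current streams in O(log N) memory, and their PMFs over shared split points feed the same PSI formula as before.

    import numpy as np
    from datasketches import kll_floats_sketch  # assumption: datasketches bindings

    def sketch_stream(values, k: int = 200) -> kll_floats_sketch:
        # O(log N)-memory summary of a feature stream instead of raw samples.
        sk = kll_floats_sketch(k)
        for v in values:
            sk.update(float(v))
        return sk

    ref = sketch_stream(np.random.normal(0.0, 1.0, 1_000_000))
    cur = sketch_stream(np.random.normal(0.3, 1.0, 1_000_000))  # shifted stream

    # 49 split points from the reference quantiles -> 50 bins, matching the
    # 50-bin reference histograms used elsewhere in the design.
    edges = [ref.get_quantile(q) for q in np.linspace(0.02, 0.98, 49)]
    ref_pct = np.clip(ref.get_pmf(edges), 1e-6, None)
    cur_pct = np.clip(cur.get_pmf(edges), 1e-6, None)
    print("psi:", float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct))))

Because KLL sketches merge associatively, per-partition sketches can be combined into a per-model sketch without reprocessing raw samples, which is what makes the streaming aggregation cheap.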
Label delay is the primary challenge for performance monitoring: models with long label delays (fraud: 30 days) cannot detect concept drift in near real-time. Proxy metrics (prediction confidence distribution, prediction class ratios) serve as early warning signals before true labels arrive. When proxy metrics drift, the team is alerted to investigate immediately; labeled performance confirmation arrives 30 days later.
Key Trade-offs
- Full logging vs. reservoir sampling: Logging every prediction provides complete accuracy for drift computation but creates enormous data volumes; a 1% reservoir sample provides statistically equivalent drift detection for models with >1,000 daily predictions.
- Statistical significance vs. alert sensitivity: Strict statistical thresholds (KS p-value < 0.001) reduce false positives but miss gradual drift; looser thresholds (PSI > 0.1) catch gradual drift earlier but produce more false positives.
- Automated retraining vs. human-in-loop: Automated retraining on drift events accelerates recovery but risks training on a temporarily anomalous distribution; human review before retraining prevents false triggers at the cost of recovery time.
- Feature-level vs. model-level drift monitoring: Model-level output drift is easy to compute but doesn't explain the cause; feature-level drift identifies root causes but requires 500x more monitoring infrastructure.