System Design: Data Quality Platform
Design a data quality platform that continuously monitors datasets for completeness, accuracy, consistency, and timeliness. The design covers rule-based and ML-based anomaly detection, alerting pipelines, and quality scorecards for data governance.
Requirements
Functional Requirements:
- Define quality rules: schema validation, null rate thresholds, value range checks, referential integrity, and statistical distribution tests
- Automatically detect anomalies in metrics (row count, null rate, mean, stddev) using ML-based thresholds
- Run quality checks as part of ETL pipelines (blocking: fail the pipeline on violation) or as background monitors (non-blocking: alert only)
- Track quality scores per dataset over time with trend visualization
- Generate quality reports for data governance teams and produce SLA compliance dashboards
- Integrate with Slack, PagerDuty, and JIRA for alerting and ticket creation
Non-Functional Requirements:
- Quality checks on a 100 GB table must complete within 5 minutes
- Support 10,000 monitored datasets with checks running on daily or hourly cadences
- Alert latency under 2 minutes from anomaly detection to notification delivery
- Quality score history retained for 2 years for trend analysis
- Platform must handle 1,000 concurrent check executions
Scale Estimation
With 10,000 monitored datasets and an average of 10 checks per dataset, running hourly: 100,000 check executions per hour = 1,667 checks per minute. Each check computes 5–20 metrics from the dataset, resulting in 1 million metric data points per hour. After 2 years: 17.5 billion metric points. Using a time-series database with compression (10 bytes/point): ~175 GB for the metrics store.
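The figures can be sanity-checked with a few lines of arithmetic (10 metrics per check is taken as the midpoint of the 5–20 range above):

datasets = 10_000
checks_per_dataset = 10
checks_per_hour = datasets * checks_per_dataset            # 100,000
checks_per_minute = checks_per_hour / 60                   # ~1,667
metrics_per_check = 10                                     # midpoint of the 5-20 range
points_per_hour = checks_per_hour * metrics_per_check      # 1,000,000
points_after_two_years = points_per_hour * 24 * 365 * 2    # ~17.5 billion
metrics_store_gb = points_after_two_years * 10 / 1e9       # ~175 GB at 10 bytes/point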
High-Level Architecture
The platform has three subsystems: Check Execution, Anomaly Detection, and Alerting. Check Execution runs SQL-based quality checks (pushed down to the data warehouse or Spark cluster for scalability) and collects metric results. Anomaly Detection applies rule-based thresholds and ML-based models to metric time series to classify each metric as PASS, WARN, or FAIL. Alerting routes failures through notification channels and creates incident tickets.
Checks are defined as YAML configurations attached to each dataset in the catalog. A Check Compiler translates YAML definitions into optimized SQL queries (single-pass scans computing multiple metrics simultaneously) submitted to the warehouse. Results are written to the metrics store and compared against expected ranges. For custom checks, a Python SDK allows data engineers to write Great Expectations suites or Pandera schemas that run as Spark or pandas jobs.
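A custom check written against the Python SDK could look like the following Pandera sketch; the table, column names, and thresholds are illustrative, not part of the platform's actual SDK:

import pandas as pd
import pandera as pa

# Hypothetical orders table standing in for the real dataset.
orders_df = pd.DataFrame({
    "order_id": [1, 2],
    "email": ["a@example.com", "b@example.com"],
    "amount": [19.99, 250.0],
})

orders_schema = pa.DataFrameSchema(
    {
        "order_id": pa.Column(int, nullable=False),
        "email": pa.Column(str, nullable=False),
        "amount": pa.Column(float, pa.Check.in_range(0, 100_000)),
    },
    strict=False,  # tolerate columns not listed in the schema
)

# Runs as an ordinary pandas (or Spark) job; raises SchemaError on violation,
# which the platform would record as a FAIL for this check run.
validated = orders_schema.validate(orders_df)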
An ML-based anomaly detection layer continuously trains univariate time-series models (Prophet, STL decomposition) per metric per dataset. Models learn seasonality patterns (weekday vs. weekend row counts, end-of-month spikes) and automatically adjust expected ranges. Alerts fire when a metric deviates more than 3 standard deviations from the model's prediction, reducing false positives by 80% compared to static thresholds.
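A minimal sketch of the classification step, assuming the model exposes a point forecast and a residual standard deviation; the 3-sigma FAIL band comes from the rule above, while the 2-sigma WARN band is an assumed intermediate threshold:

def classify_metric(actual: float, predicted: float, sigma: float,
                    warn_z: float = 2.0, fail_z: float = 3.0) -> str:
    """Classify a metric value against the model's prediction.

    fail_z=3.0 matches the 3-standard-deviation rule; warn_z=2.0 is an
    illustrative intermediate band, not specified in the design.
    """
    deviation = abs(actual - predicted)
    if deviation > fail_z * sigma:
        return "FAIL"
    if deviation > warn_z * sigma:
        return "WARN"
    return "PASS"

# Example: the daily row count came in well below the forecast.
status = classify_metric(actual=820_000, predicted=1_000_000, sigma=40_000)  # -> "FAIL"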
Core Components
Rule Engine
The rule engine parses YAML check definitions and generates SQL. A completeness check (null_rate < 0.01 for column email) compiles to: SELECT COUNT(*) FILTER (WHERE email IS NULL) * 1.0 / COUNT(*) AS null_rate FROM table. Multiple checks are combined into a single SQL query with multiple aggregations, minimizing warehouse scans. Results are compared against thresholds; a check passes if all metrics are within bounds.
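One way to fold several checks into a single scan is to emit one SELECT with one aggregate expression per metric. A hedged sketch, where the check dict shape is illustrative rather than the platform's actual YAML schema:

def compile_checks(table: str, checks: list[dict]) -> str:
    """Build one single-pass query covering every metric the checks need.

    Each check dict is assumed to look like
    {"metric": "null_rate", "column": "email", "max": 0.01}.
    """
    exprs = []
    for c in checks:
        col, metric = c.get("column"), c["metric"]
        if metric == "null_rate":
            exprs.append(
                f"COUNT(*) FILTER (WHERE {col} IS NULL) * 1.0 / COUNT(*) AS {col}_null_rate"
            )
        elif metric == "row_count":
            exprs.append("COUNT(*) AS row_count")
        elif metric == "mean":
            exprs.append(f"AVG({col}) AS {col}_mean")
    return f"SELECT {', '.join(exprs)} FROM {table}"

sql = compile_checks("orders", [
    {"metric": "null_rate", "column": "email", "max": 0.01},
    {"metric": "row_count", "min": 1},
    {"metric": "mean", "column": "amount"},
])
# -> SELECT COUNT(*) FILTER (WHERE email IS NULL) * 1.0 / COUNT(*) AS email_null_rate,
#    COUNT(*) AS row_count, AVG(amount) AS amount_mean FROM orders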
Anomaly Detection Service
For each monitored metric, a Prophet model is trained weekly on 90 days of historical data. The model predicts the expected range (lower_bound, upper_bound) for the current time window. Actual values outside this range trigger anomaly events. Models are stored in MLflow and versioned per dataset-metric combination. For new datasets with less than 14 days of history, the system falls back to static rule-based thresholds until enough data accumulates.
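A sketch of the per-metric prediction path, assuming the last 90 days of the metric series are already loaded into a DataFrame with Prophet's expected ds/y columns; the fallback band and the hourly-point assumption are illustrative:

import pandas as pd
from prophet import Prophet

def expected_range(history: pd.DataFrame, horizon_hours: int = 1):
    """Return (lower, upper) bounds for the next window.

    `history` holds up to 90 days of one metric for one dataset, with
    columns ds (timestamp) and y (value) as Prophet expects.
    """
    if history["ds"].nunique() < 14 * 24:  # fewer than 14 days of hourly points
        # Not enough history: fall back to a static band around the recent mean.
        recent = history["y"].tail(7 * 24)
        return recent.mean() * 0.8, recent.mean() * 1.2  # illustrative +/-20% band
    model = Prophet(interval_width=0.99)  # wide uncertainty interval standing in for the 3-sigma rule
    model.fit(history)
    future = model.make_future_dataframe(periods=horizon_hours, freq="h")
    forecast = model.predict(future).iloc[-1]
    return forecast["yhat_lower"], forecast["yhat_upper"]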
Alerting & Incident Management
A routing engine maps dataset-level severity (CRITICAL, HIGH, MEDIUM, LOW) to notification channels. CRITICAL failures (affecting production dashboards or SLA-governed tables) page the on-call data engineer via PagerDuty. HIGH failures create JIRA tickets with check details, affected pipeline names (from lineage), and suggested remediation. MEDIUM/LOW send Slack messages to the owning team's channel. Deduplication windows prevent alert storms: only one alert per dataset per 30-minute window.
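A hedged sketch of the routing and 30-minute deduplication logic; the channel names and in-memory dedup store are placeholders, and a real deployment would likely back the window state with Redis or similar:

import time

DEDUP_WINDOW_S = 30 * 60
_last_alert_at: dict[str, float] = {}  # dataset_id -> last alert timestamp (in-memory placeholder)

SEVERITY_ROUTES = {  # illustrative channel mapping
    "CRITICAL": "pagerduty",
    "HIGH": "jira",
    "MEDIUM": "slack",
    "LOW": "slack",
}

def route_alert(dataset_id: str, severity: str, send) -> bool:
    """Send at most one alert per dataset per 30-minute window.

    `send(channel, dataset_id, severity)` is an assumed notifier callback.
    Returns True if the alert was delivered, False if suppressed.
    """
    now = time.time()
    if now - _last_alert_at.get(dataset_id, 0.0) < DEDUP_WINDOW_S:
        return False  # suppressed: still inside the dedup window
    _last_alert_at[dataset_id] = now
    send(SEVERITY_ROUTES[severity], dataset_id, severity)
    return True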
Database Design
PostgreSQL stores check definitions: quality_rules (rule_id, dataset_id, check_type, config_yaml, severity, blocking_flag), check_runs (run_id, rule_id, executed_at, status, duration_ms, metrics_json), and quality_scores (dataset_id, scored_at, completeness_score, validity_score, timeliness_score, overall_score). A TimescaleDB hypertable stores metric time series: (dataset_id, metric_name, measured_at, value, expected_lower, expected_upper, status) with automatic partitioning by month and compression after 7 days.
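The hypertable portion might be created along these lines; the columns follow the schema above, while the table name, connection string, and segment-by choice are assumptions:

import psycopg2

# DDL for the metric time-series store; connection details are placeholders.
conn = psycopg2.connect("dbname=quality host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS metric_points (
            dataset_id     TEXT        NOT NULL,
            metric_name    TEXT        NOT NULL,
            measured_at    TIMESTAMPTZ NOT NULL,
            value          DOUBLE PRECISION,
            expected_lower DOUBLE PRECISION,
            expected_upper DOUBLE PRECISION,
            status         TEXT
        );
    """)
    # Monthly partitioning via TimescaleDB chunks.
    cur.execute(
        "SELECT create_hypertable('metric_points', 'measured_at', "
        "chunk_time_interval => INTERVAL '1 month', if_not_exists => TRUE);"
    )
    # Compress chunks older than 7 days.
    cur.execute(
        "ALTER TABLE metric_points SET (timescaledb.compress, "
        "timescaledb.compress_segmentby = 'dataset_id, metric_name');"
    )
    cur.execute("SELECT add_compression_policy('metric_points', INTERVAL '7 days');")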
API Design
POST /datasets/{dataset_id}/rules — Define a new quality rule with check type, column target, threshold, and severity.
POST /datasets/{dataset_id}/checks/run — Trigger an immediate quality check run for a dataset.
GET /datasets/{dataset_id}/quality-score — Return the current quality score breakdown and 30-day trend.
GET /anomalies?dataset_id={id}&from={ts}&to={ts} — Return anomaly events within a time range with metric values and expected ranges.
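For example, defining a null-rate rule through the first endpoint might look like the following; the host and payload field names are assumptions rather than a documented contract:

import requests

resp = requests.post(
    "https://quality.internal/api/datasets/orders_daily/rules",
    json={
        "check_type": "null_rate",   # illustrative field names
        "column": "email",
        "threshold": 0.01,
        "severity": "HIGH",
        "blocking": False,
    },
    timeout=10,
)
resp.raise_for_status()
rule_id = resp.json()["rule_id"]  # assumed response field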
Scaling & Bottlenecks
Warehouse query concurrency is the primary bottleneck: 1,000 concurrent check queries can overwhelm the warehouse query queue. A rate limiter caps check query submission at 200 concurrent queries, with the remainder queued by priority. Lightweight statistical checks (row count, null rate) use sampling (1% random sample) for quick results on large tables, with full-scan checks reserved for daily deep-quality runs.
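A minimal sketch of the submission gate, assuming an asyncio-based scheduler; the 200-slot cap and priority ordering come from the description above, everything else is illustrative:

import asyncio

MAX_CONCURRENT_QUERIES = 200

_slots = asyncio.Semaphore(MAX_CONCURRENT_QUERIES)
_queue: asyncio.PriorityQueue = asyncio.PriorityQueue()  # (priority, check_sql)

async def submit_check(priority: int, check_sql: str) -> None:
    """Queue a check; a lower priority number runs sooner."""
    await _queue.put((priority, check_sql))

async def worker(run_query) -> None:
    """Drain the queue while never exceeding the warehouse concurrency cap.

    `run_query(sql)` is an assumed async warehouse client call.
    """
    while True:
        priority, check_sql = await _queue.get()
        async with _slots:
            await run_query(check_sql)
        _queue.task_done()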
Anomaly detection model training at scale (10,000 models weekly) can bottleneck on compute. Training is parallelized across a Spark cluster; each Prophet model trains in under 10 seconds, so 10,000 models train in under 10 minutes across 200 parallel workers. Incremental retraining (adding new data points to existing models rather than full retraining) reduces weekly retraining time by 80%.
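Distributing the weekly retraining over Spark can be as simple as mapping a training function over the dataset/metric pairs; in this sketch, train_and_log is a placeholder for the Prophet fit plus MLflow logging:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("weekly-model-training").getOrCreate()

def train_and_log(key):
    """Placeholder: load 90 days of history for (dataset_id, metric_name),
    fit a Prophet model, and log it to MLflow. Returns the key on success."""
    dataset_id, metric_name = key
    # ... model fitting and MLflow logging would go here ...
    return key

# metric_keys: list of (dataset_id, metric_name) pairs, ~10,000 entries in practice.
metric_keys = [("orders_daily", "row_count"), ("orders_daily", "email_null_rate")]
trained = (
    spark.sparkContext
    .parallelize(metric_keys, numSlices=200)  # ~200 parallel workers
    .map(train_and_log)
    .collect()
)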
Key Trade-offs
- Blocking vs. non-blocking checks: Blocking checks prevent bad data from reaching consumers but halt pipelines and delay downstream jobs; non-blocking monitoring allows data to flow with alerting, requiring downstream consumers to handle quality issues gracefully.
- Rule-based vs. ML-based thresholds: Rule-based thresholds are interpretable and fast to configure but require manual tuning per dataset; ML-based anomaly detection adapts automatically to seasonal patterns but requires training data and is harder to debug.
- Full-scan vs. sampling: Full scans are accurate but expensive for TB-scale tables; statistical sampling with 95% confidence intervals provides 99% detection accuracy at 1% of the cost.
- Centralized vs. embedded quality checks: Centralized platform provides unified visibility and governance; embedded checks (Great Expectations suites committed alongside pipeline code) enable dev-time testing but fragment quality visibility across teams.