
System Design: ETL Pipeline

Design a robust, scalable ETL pipeline that extracts data from heterogeneous sources, transforms it with schema validation and business logic, and loads it reliably into target systems. Covers orchestration, error handling, idempotency, and monitoring.

13 min read · Updated Jan 15, 2025
Tags: system-design · etl · data-pipeline · orchestration · data-engineering

Requirements

Functional Requirements:

  • Extract data from relational databases (CDC and full-load), REST APIs, flat files (CSV, JSON, Parquet), and event streams
  • Apply transformations: type casting, deduplication, null handling, business rule application, and joins across sources
  • Load into target systems: data warehouse, data lake, Elasticsearch, and operational databases
  • Support both scheduled batch runs and trigger-based execution
  • Provide idempotent re-runs: re-executing a failed job produces the same result without duplicates
  • Emit lineage metadata to a catalog after each successful load

Non-Functional Requirements:

  • Process 500 GB per scheduled run within a 2-hour SLA window
  • Pipeline failures must alert on-call within 5 minutes; failed jobs auto-retry 3 times with exponential backoff
  • 99.5% pipeline success rate measured weekly
  • Exactly-once semantics for financial data loads; at-least-once acceptable for analytics
  • Horizontal scalability: adding workers must increase throughput near-linearly
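The retry requirement above (3 attempts with exponential backoff) can be sketched in plain Python. The delay base and jitter values here are illustrative assumptions, not figures from the requirements; in production the orchestrator's own retry policy would handle this.

```python
import random
import time


def run_with_retries(task, max_retries=3, base_delay=1.0, jitter=0.5):
    """Run `task` and retry on failure with exponential backoff.

    The delay before retry n is base_delay * 2**n plus random jitter,
    so three retries wait roughly 1s, 2s, and 4s before giving up.
    """
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure for alerting
            delay = base_delay * (2 ** attempt) + random.uniform(0, jitter)
            time.sleep(delay)
```

Jitter spreads out retries from concurrently failing tasks so they do not hammer a recovering source in lockstep.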

Scale Estimation

With 200 source tables averaging 2.5 GB each, a nightly full-load processes 500 GB. Incremental CDC pipelines run every 15 minutes and process ~500 MB per cycle across all tables. Peak parallelism requires 50 concurrent transformation tasks. The orchestrator must track ~5,000 task instances per day and store 90 days of execution history (~450,000 records).
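The estimates above follow from straightforward arithmetic, which a quick sketch makes explicit:

```python
# Back-of-envelope check of the scale estimates above.
tables = 200
avg_table_gb = 2.5
nightly_full_load_gb = tables * avg_table_gb             # 500 GB per run

cdc_cycles_per_day = 24 * 60 // 15                       # 96 incremental cycles
cdc_gb_per_day = cdc_cycles_per_day * 0.5                # ~48 GB of CDC per day

task_instances_per_day = 5_000
history_days = 90
history_records = task_instances_per_day * history_days  # 450,000 records
```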

High-Level Architecture

The pipeline architecture separates concerns into four layers: Extraction, Transformation, Loading, and Orchestration. The orchestrator (Apache Airflow or Dagster) defines DAGs describing task dependencies, schedules, retry policies, and SLA triggers. Each DAG node corresponds to a Python/Spark operator that encapsulates one logical unit of work.

Extraction agents run at the data source boundary. For OLTP databases, a Debezium CDC agent publishes change events to Kafka topics partitioned by primary key, ensuring ordered delivery per row. For APIs, a connector service polls endpoints, handles pagination and rate limiting, and writes raw responses to the Landing zone in S3. File-based sources are detected via S3 event notifications.
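The pagination-and-rate-limit loop in the API connector can be sketched as follows. The `fetch_page` contract (cursor in, records and next cursor out) is an assumption for illustration; real APIs vary in how they expose cursors.

```python
import time


def extract_paginated(fetch_page, page_size=100, min_interval=0.2):
    """Drain a cursor-paginated API into a list of raw records.

    `fetch_page(cursor, page_size)` is a caller-supplied function that
    returns (records, next_cursor); next_cursor=None signals the last page.
    `min_interval` spaces out requests as a crude client-side rate limit.
    """
    records, cursor = [], None
    while True:
        batch, cursor = fetch_page(cursor, page_size)
        records.extend(batch)
        if cursor is None:
            return records
        time.sleep(min_interval)  # stay under the source's rate limit
```

In the real connector, `records` would be streamed to the S3 Landing zone per page rather than accumulated in memory.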

Transformation jobs run on Apache Spark deployed on Kubernetes (or AWS EMR Serverless). Each job reads from the Landing/Bronze zone, applies schema validation using Great Expectations, executes transformation logic, and writes output to Silver/Gold zones using Delta Lake MERGE for idempotent upserts. Job metrics (records processed, error counts, bytes written) are emitted to Prometheus and visualized in Grafana.

Core Components

Orchestration Engine

Apache Airflow manages DAG scheduling, task dependency resolution, and retry logic. Each pipeline is defined as a DAG with tasks connected by directed edges. Sensors wait for upstream data availability (S3KeySensor, KafkaTopicSensor) before triggering downstream tasks. The Airflow metadata database (PostgreSQL) stores task instance state. CeleryExecutor distributes tasks across a worker pool; KubernetesExecutor launches ephemeral pods for resource isolation.
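A minimal DAG definition combining these pieces might look like the sketch below. The `dag_id`, bucket names, schedule, and placeholder callable are all hypothetical; the sensor-then-transform dependency and the retry settings mirror what the text describes.

```python
# Hypothetical nightly pipeline DAG: names and schedule are illustrative.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

default_args = {
    "retries": 3,                          # auto-retry failed tasks 3 times
    "retry_delay": timedelta(minutes=1),
    "retry_exponential_backoff": True,     # back off between attempts
}

with DAG(
    dag_id="nightly_full_load",
    start_date=datetime(2025, 1, 1),
    schedule="0 2 * * *",                  # nightly at 02:00
    catchup=False,
    default_args=default_args,
) as dag:
    # Sensor: wait for the upstream extract to land before transforming.
    wait_for_landing = S3KeySensor(
        task_id="wait_for_landing",
        bucket_name="landing-zone",
        bucket_key="raw/{{ ds }}/_SUCCESS",
        timeout=60 * 60,
    )

    transform = PythonOperator(
        task_id="transform",
        python_callable=lambda: None,      # placeholder for the Spark submit
    )

    wait_for_landing >> transform
```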

Transformation Engine

Spark jobs handle large-scale transformations. Transformations are defined as composable functions in a PySpark or dbt model. Schema validation runs as a pre-transform step: Great Expectations checks null rates, value ranges, and referential integrity, failing the job with a descriptive error report if thresholds are breached. The transformation layer applies business rules (e.g., currency normalization, PII tokenization) using broadcast joins for lookup tables under 1 GB.
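The logic of the validation gate can be illustrated in plain Python (this mimics the threshold checks, not the actual Great Expectations API, and the threshold values are illustrative):

```python
def validate_batch(rows, max_null_rate=0.01, value_ranges=None):
    """Fail fast when a batch breaches data-quality thresholds.

    Checks per-column null rates and numeric value ranges, raising with
    a descriptive report if any threshold is breached, so downstream
    transformation never sees a bad batch.
    """
    if not rows:
        raise ValueError("empty batch")
    value_ranges = value_ranges or {}
    failures = []
    for col in rows[0].keys():
        values = [r.get(col) for r in rows]
        null_rate = sum(v is None for v in values) / len(values)
        if null_rate > max_null_rate:
            failures.append(f"{col}: null rate {null_rate:.1%} > {max_null_rate:.1%}")
        if col in value_ranges:
            lo, hi = value_ranges[col]
            bad = [v for v in values if v is not None and not lo <= v <= hi]
            if bad:
                failures.append(f"{col}: {len(bad)} values outside [{lo}, {hi}]")
    if failures:
        raise ValueError("validation failed: " + "; ".join(failures))
    return True
```

Failing with one aggregated report, rather than on the first bad column, gives on-call a complete picture from a single run.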

Load & Idempotency Layer

Delta Lake MERGE (upsert) statements ensure idempotent loads: each batch is keyed by a natural key or surrogate hash, and the MERGE updates existing rows or inserts new ones. For append-only fact tables, deduplication is enforced by inserting only rows whose _row_hash does not exist in the target. A load manifest file written to S3 after each successful batch records the batch ID, row count, and checksum, enabling the orchestrator to skip already-completed batches on retry.
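The MERGE semantics that make re-runs safe can be modeled in plain Python with a key-to-row map (a sketch of the idempotency property, not Delta Lake's actual implementation):

```python
def merge_batch(target, batch, key="id"):
    """Idempotently upsert `batch` into `target`, keyed by `key`.

    Models MERGE semantics: matched keys are updated in place,
    unmatched keys are inserted, so re-running the same batch after
    a partial failure leaves the target unchanged.
    """
    for row in batch:
        target[row[key]] = dict(row)  # update if matched, insert if not
    return target
```

Because the operation is keyed, applying the same batch twice is a no-op, which is exactly what lets the orchestrator blindly retry a failed load.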

Database Design

A pipeline metadata database (PostgreSQL) tracks three tables:

  • pipeline_runs (run_id, dag_id, start_time, end_time, status, records_extracted, records_loaded, error_message)
  • source_tables (source_id, connection_string_ref, last_extracted_watermark, extraction_mode)
  • load_manifests (manifest_id, run_id, target_table, batch_start, batch_end, row_count, checksum, loaded_at)

Watermark columns store the last successfully processed updated_at timestamp per source table for incremental extraction.
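The watermark mechanism for incremental extraction can be sketched as follows; `rows` stands in for a source table with an updated_at column, and in practice the filter would be pushed down as a SQL predicate rather than evaluated in Python:

```python
def extract_incremental(rows, watermark):
    """Pull only rows modified after the stored watermark.

    Returns the new batch and the advanced watermark, which is
    persisted back to the metadata database only after the batch
    loads successfully, so a failed run re-reads the same rows.
    """
    batch = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in batch), default=watermark)
    return batch, new_watermark
```

Advancing the watermark only after a successful load pairs with the idempotent MERGE: a retried cycle re-extracts the same window and upserts it harmlessly.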

API Design

  • POST /pipelines/{dag_id}/runs — Trigger a manual pipeline run with optional config overrides (date range, full-load flag).
  • GET /pipelines/{dag_id}/runs/{run_id} — Return run status, task-level progress, error details, and metrics.
  • POST /pipelines/{dag_id}/backfill — Schedule a backfill run for a historical date range.
  • GET /sources/{source_id}/schema — Introspect the current schema of a source table with column types and sample values.

Scaling & Bottlenecks

The primary bottleneck is transformation throughput: Spark jobs are limited by cluster size. Auto-scaling EMR clusters (or Spark on Kubernetes with KEDA) add executors when task queue depth exceeds a threshold, scaling from 10 to 100 workers in under 2 minutes. Shuffle operations (wide transformations like GROUP BY and JOIN) are the most expensive; repartitioning data by join key before joining and using broadcast joins for small lookup tables can cut shuffle volume sharply, with reductions on the order of 80% achievable for lookup-heavy workloads.

The orchestrator metadata database becomes a bottleneck above 10,000 concurrent task instances. PostgreSQL with connection pooling (PgBouncer) handles up to 5,000 connections; sharding the metadata DB or migrating to a horizontally scalable store is required at larger scale. Airflow's scheduler CPU usage grows with DAG complexity; deploying multiple schedulers in HA mode and using DAG serialization reduces scheduler load.

Key Trade-offs

  • CDC vs. full-load extraction: CDC provides low latency and minimal source load but adds infrastructure complexity (Kafka, Debezium, connector maintenance); full-load is simpler but strains source databases and misses deletes.
  • Spark vs. dbt: Spark handles multi-terabyte transformations and custom Python logic; dbt is simpler for SQL-centric transformations and provides better lineage/documentation but cannot scale beyond what the warehouse engine supports.
  • Exactly-once vs. at-least-once: Exactly-once semantics require transactional writes and two-phase commits, adding latency and complexity; at-least-once with idempotent upserts achieves the same correctness for most workloads with simpler infrastructure.
  • Monolithic DAG vs. modular pipelines: A single monolithic DAG is easy to visualize but creates blast radius for failures; breaking into smaller domain DAGs with cross-DAG dependencies improves isolation at the cost of visibility.
