System Design: Data Lake
Design a petabyte-scale data lake that ingests structured, semi-structured, and unstructured data from hundreds of sources while supporting diverse consumers including BI, ML, and ad hoc exploration. Covers storage zones, metadata management, and governance.
Requirements
Functional Requirements:
- Ingest data from relational databases, event streams, APIs, logs, images, and documents
- Store data in its native format without upfront schema enforcement (schema-on-read)
- Provide a unified catalog so consumers can discover and understand available datasets
- Support multiple query engines (Spark, Presto, Athena) over the same underlying storage
- Enforce data retention policies, PII masking, and access control at dataset and column level
- Enable ML teams to read labeled datasets and write model artifacts back to the lake
Non-Functional Requirements:
- Peak ingest throughput of 10 GB/s across all sources (the sustained daily average is far lower; see Scale Estimation)
- Query SLA: interactive queries under 30 seconds, batch jobs complete within SLA windows
- Eleven-nines (99.999999999%) durability for stored objects (achieved via multi-AZ replication), with 99.99% availability
- Compliance with GDPR right-to-erasure: purge a user's data across all datasets within 30 days
- Support 1,000 concurrent read clients across all query engines
Scale Estimation
With 500 data sources each averaging 20 GB/day, total daily ingestion is 10 TB. After 3 years: ~11 PB raw. At 3x compression for structured data (but none for images/video), and assuming roughly two-thirds of the volume is structured, effective storage is ~6 PB. The metadata catalog must track ~50 million data files. The discovery service must handle 10,000 search queries per day from data engineers and analysts.
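These figures check out with simple arithmetic; a minimal sketch, where the two-thirds structured share is an assumption made explicit:

```python
# Back-of-the-envelope check of the scale estimates above.
sources = 500
gb_per_source_per_day = 20
days = 3 * 365

daily_tb = sources * gb_per_source_per_day / 1000     # 10 TB/day
raw_pb = daily_tb * days / 1000                       # ~10.95 PB after 3 years

structured = 2 / 3                                    # assumed share of structured data
effective_pb = raw_pb * (structured / 3 + (1 - structured))   # 3x vs 1x compression
print(f"daily={daily_tb:.0f} TB, raw={raw_pb:.2f} PB, effective={effective_pb:.1f} PB")
# daily=10 TB, raw=10.95 PB, effective=6.1 PB
```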
High-Level Architecture
The lake is organized into four storage zones: Landing (raw bytes, immutable, 7-day TTL), Bronze (partitioned by source and date, schema detected, indefinite retention), Silver (cleansed, deduplicated, typed), and Gold (aggregated, business-ready). Object storage (S3 or Azure ADLS Gen2) underpins all zones with lifecycle policies automating tier transitions (S3 Standard → Intelligent-Tiering → Glacier).
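As a sketch of how the tier transitions and the Landing TTL can be automated with boto3 (bucket name, prefixes, and day thresholds are illustrative assumptions, not part of the design):

```python
import boto3

s3 = boto3.client("s3")

# Sketch: lifecycle rules drive Standard -> Intelligent-Tiering -> Glacier.
s3.put_bucket_lifecycle_configuration(
    Bucket="lake",  # assumed bucket name
    LifecycleConfiguration={
        "Rules": [
            {   # Landing zone: raw objects expire after the 7-day TTL.
                "ID": "landing-ttl",
                "Filter": {"Prefix": "landing/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            },
            {   # Bronze zone: tier down as access frequency drops.
                "ID": "bronze-tiering",
                "Filter": {"Prefix": "bronze/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            },
        ]
    },
)
```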
An ingestion layer handles diverse source types: Kafka Connect for CDC streams, Airbyte or Fivetran for SaaS APIs, AWS DataSync for on-prem file transfers, and custom agents for log shipping. All ingested objects are registered in the catalog immediately after write via S3 event notifications triggering a Lambda/Cloud Function that upserts catalog entries.
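A minimal sketch of that registration hook as a Lambda handler, assuming a DynamoDB-backed catalog table (the table name `catalog_entries` and its key layout are assumptions):

```python
import urllib.parse
import boto3

catalog = boto3.resource("dynamodb").Table("catalog_entries")  # assumed store

def handler(event, context):
    """S3 event notification -> upsert one catalog entry per new object."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Key layout from the design: {zone}/{source}/{year}/{month}/{day}/{hour}/...
        zone, source = key.split("/", 2)[:2]
        catalog.put_item(Item={                    # put_item overwrites by key: an upsert
            "dataset": f"{zone}.{source}",         # partition key (assumed)
            "object_key": f"s3://{bucket}/{key}",  # sort key (assumed)
            "size_bytes": record["s3"]["object"]["size"],
            "event_time": record["eventTime"],
        })
```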
Governance is centralized through Apache Atlas or AWS Lake Formation. A policy engine evaluates every query request against column-level sensitivity tags, user group memberships, and data residency constraints before granting access. Dynamic data masking replaces PII columns with deterministic pseudonyms for non-privileged users at query time.
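Deterministic pseudonyms are commonly produced with a keyed hash: equal inputs yield equal tokens, so joins and aggregations on masked columns still work, while the mapping stays non-reversible for consumers. A minimal sketch, assuming the key is fetched from a KMS or secrets manager in production:

```python
import hmac
import hashlib

MASKING_KEY = b"load-from-kms-in-production"  # assumption: key lives in KMS/secrets store

def pseudonymize(value: str) -> str:
    """Keyed hash: same input -> same token, but not reversible without the key."""
    digest = hmac.new(MASKING_KEY, value.encode("utf-8"), hashlib.sha256)
    return "pii_" + digest.hexdigest()[:16]

# Equal inputs mask to equal tokens, so GROUP BY and joins still hold:
assert pseudonymize("alice@example.com") == pseudonymize("alice@example.com")
```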
Core Components
Object Storage Layer
S3 stores all data objects. Prefixes follow the pattern s3://lake/{zone}/{source}/{year}/{month}/{day}/{hour}/. Multipart upload is used for objects above 100 MB to sustain throughput. S3 Object Lock (WORM) is enabled on the Landing zone for regulatory immutability. Cross-region replication to a secondary region provides disaster recovery with an RPO of 15 minutes.
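With boto3, the 100 MB multipart threshold translates into a transfer config like the following (chunk size and concurrency are illustrative; the key follows the prefix pattern above):

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Objects above 100 MB are split into parts and uploaded in parallel.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,  # 100 MB, per the design
    multipart_chunksize=64 * 1024 * 1024,   # illustrative
    max_concurrency=8,                      # illustrative
)
s3.upload_file(
    "events.parquet",
    "lake",
    "bronze/orders_db/2024/06/01/12/events.parquet",  # {zone}/{source}/{y}/{m}/{d}/{h}/
    Config=config,
)
```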
Metadata Catalog
Apache Hive Metastore (or AWS Glue Data Catalog) stores table definitions, partition locations, and schema versions. On top of it, Apache Atlas or OpenMetadata adds business metadata: dataset owners, data classification tags (PII, PHI, financial), lineage edges, and quality scores. Consumers search the catalog via a web UI or REST API before accessing data, reducing the chance that analysts query the wrong or stale tables.
Ingestion & Schema Registry
For streaming sources, a Schema Registry (Confluent or AWS Glue Schema Registry) enforces Avro or Protobuf schemas at publish time, preventing malformed events from corrupting downstream consumers. CDC pipelines use Debezium to capture row-level changes from PostgreSQL/MySQL and publish them to Kafka topics; a Spark job applies these changes to Silver Iceberg tables via MERGE every 5 minutes.
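The apply step might look like this in PySpark, assuming Iceberg's Spark extensions are configured, `updates` is a staged view of the latest deduplicated Debezium changes, and `silver.orders` is the target table (all three are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-apply").getOrCreate()

# Assumed staging view `updates`: the last 5 minutes of Debezium changes,
# deduplicated to one row per key, with op in {'c', 'u', 'd'}.
spark.sql("""
    MERGE INTO silver.orders t
    USING updates s
    ON t.order_id = s.order_id
    WHEN MATCHED AND s.op = 'd' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED AND s.op != 'd' THEN INSERT *
""")
```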
Database Design
Data lake tables use Apache Iceberg as the table format, providing ACID transactions, partition evolution, and schema evolution without full table rewrites. Each Silver table has mandatory metadata columns: _source_system VARCHAR, _ingested_at TIMESTAMP, _event_time TIMESTAMP, _row_hash VARCHAR(32). Partitioning is by _event_time truncated to day, with hidden partitioning available for high-cardinality columns.
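A Silver table definition in Spark SQL might look like this; the business columns are illustrative, while the metadata columns and day-level partitioning follow the text (Spark DDL uses STRING where the text says VARCHAR):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE silver.orders (
        order_id        BIGINT,
        customer_id     BIGINT,
        amount          DECIMAL(12, 2),
        _source_system  STRING,      -- mandatory metadata columns
        _ingested_at    TIMESTAMP,
        _event_time     TIMESTAMP,
        _row_hash       STRING
    )
    USING iceberg
    PARTITIONED BY (
        days(_event_time),           -- hidden partitioning: day truncation
        bucket(16, customer_id)      -- hidden partitioning for a high-cardinality column
    )
""")
```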
For the GDPR right-to-erasure requirement, a separate pii_index table maps each user_id to the data files and row positions containing that user's data. A deletion job uses Iceberg's row-level delete files (position deletes) to logically erase records without rewriting entire partitions; the affected files are physically rewritten during the next compaction cycle.
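On a merge-on-read Iceberg table, the erasure job reduces to a row-level DELETE per affected table, which emits position-delete files instead of rewriting data files. A sketch, assuming the pii_index schema described in the comments:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gdpr-erasure").getOrCreate()

def erase_user(user_id: int) -> None:
    # pii_index (assumed schema): user_id -> table_name for every table
    # holding that user's rows.
    tables = [row.table_name for row in spark.sql(
        f"SELECT DISTINCT table_name FROM lake.pii_index WHERE user_id = {user_id}"
    ).collect()]
    for table in tables:
        # Writes position-delete files only; the next compaction cycle
        # physically removes the rows.
        spark.sql(f"DELETE FROM {table} WHERE user_id = {user_id}")
```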
API Design
POST /catalog/datasets — Register a new dataset with schema, owner, tags, and storage location.
GET /catalog/search?q={term}&tags={tag}&zone={zone} — Full-text search across dataset names, descriptions, and tags.
POST /ingest/jobs — Submit a batch ingestion job specifying source config, target zone, and schedule.
GET /lineage/{dataset_id}?direction=upstream|downstream&depth=3 — Return lineage graph as a DAG of dataset nodes and transform edges.
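A consumer-side sketch against these endpoints (base URL, auth scheme, and response shapes are placeholders, not specified above):

```python
import requests

BASE = "https://lake.example.com"               # placeholder base URL
headers = {"Authorization": "Bearer <token>"}   # placeholder auth

# Find Silver datasets tagged PII that mention "orders".
hits = requests.get(
    f"{BASE}/catalog/search",
    params={"q": "orders", "tags": "PII", "zone": "silver"},
    headers=headers,
).json()

# Walk three hops upstream from the first hit to audit its inputs.
lineage = requests.get(
    f"{BASE}/lineage/{hits[0]['id']}",
    params={"direction": "upstream", "depth": 3},
    headers=headers,
).json()
```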
Scaling & Bottlenecks
Small-file proliferation is the most common bottleneck: streaming ingestion creates thousands of tiny files per hour per table, degrading list and metadata operations. An automated compaction service (Spark job triggered by file-count threshold) merges files to 256 MB target size every hour, reducing file count by 90%. Iceberg's rewrite_data_files procedure handles this transactionally.
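The compaction service's call to that procedure might look like this (catalog and table names are assumed; the 256 MB target is from the text):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compaction").getOrCreate()

spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'silver.orders',
        options => map(
            'target-file-size-bytes', '268435456',  -- 256 MB target
            'min-input-files', '32'                 -- skip tables without enough small files
        )
    )
""")
```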
Metadata catalog scalability becomes an issue above 100 million partitions. Hive Metastore struggles at this scale; migrating to an Iceberg REST Catalog backed by a horizontally scalable database (DynamoDB or CockroachDB) removes this bottleneck. Query engine cold starts (spinning up Spark clusters) add 2–3 minutes of latency; a warm pool of pre-initialized executors reduces this to under 10 seconds for interactive workloads.
Key Trade-offs
- Schema-on-read vs. schema-on-write: Schema-on-read maximizes ingestion flexibility but pushes validation cost to query time and allows garbage data to accumulate; adding schema validation at the Silver layer boundary catches errors earlier.
- Open table formats (Iceberg/Delta) vs. proprietary warehouse: Open formats give vendor flexibility and multi-engine access but require managing compaction, vacuum, and catalog infrastructure that managed warehouses handle automatically.
- Centralized vs. federated governance: A central policy engine is easier to audit but creates a bottleneck for policy updates; federated, per-domain Lake Formation grants let teams move faster at the cost of policy inconsistency.
- Hot vs. cold storage tiering: Intelligent-Tiering reduces cost for infrequently accessed data by 60% but adds 10–20ms retrieval overhead; latency-sensitive ML training jobs should always pin datasets to S3 Standard.