System Design: Data Catalog

Design an enterprise-grade data catalog that enables data discovery, lineage tracking, and governance across thousands of datasets in a large organization. The design covers metadata ingestion, search, classification, and integration with data lake and warehouse systems.

Requirements

Functional Requirements:

  • Auto-discover and register datasets from data lake, warehouse, operational databases, and BI tools
  • Enable full-text and faceted search across dataset names, descriptions, column names, and tags
  • Display column-level data lineage: which upstream datasets and transformations produced this column
  • Support business glossary: map technical column names to business terms
  • Allow data stewards to annotate datasets with ownership, sensitivity classification, and quality scores
  • Enforce access request workflows: users can request access to restricted datasets via the catalog

Non-Functional Requirements:

  • Catalog must index 10 million metadata objects (tables, columns, dashboards, pipelines)
  • Search queries return results in under 200ms at the 95th percentile
  • Metadata freshness: new tables and schema changes reflected within 5 minutes of creation
  • 99.9% availability; catalog reads must not be blocked by metadata ingestion writes
  • Support 5,000 daily active users with a peak of 500 concurrent search sessions

Scale Estimation

With 10 million metadata objects averaging 2 KB each, the catalog metadata store holds 20 GB of structured metadata. The search index (Elasticsearch) with full-text content for all columns and descriptions is approximately 100 GB. Lineage graphs with 10 million nodes and 50 million edges (columns and their upstream/downstream dependencies) require a graph database capable of multi-hop traversals in under 100ms.
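
A quick back-of-envelope check of these figures (the 200-byte average edge size is an assumption added for illustration; the other numbers come from the text above):

    # Back-of-envelope sizing for the metadata and lineage stores.
    objects = 10_000_000            # tables, columns, dashboards, pipelines
    avg_object_bytes = 2_000        # ~2 KB of structured metadata per object
    print(objects * avg_object_bytes / 1e9, "GB metadata store")    # ~20 GB

    lineage_edges = 50_000_000
    avg_edge_bytes = 200            # assumed: node ids plus a pointer to the transform SQL
    print(lineage_edges * avg_edge_bytes / 1e9, "GB lineage edges")  # ~10 GB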

High-Level Architecture

The catalog architecture has three planes: Ingestion, Storage, and Serving. The Ingestion plane runs connector agents that periodically crawl metadata from source systems: a Hive Metastore crawler, a Glue catalog sync, a dbt manifest parser, a Tableau workbook extractor, and a Kafka Schema Registry sync. Connectors emit metadata events to a Kafka topic consumed by the Metadata Processing Service.
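
A minimal sketch of the metadata-change event a connector might publish; the field names, the topic name, and the use of the kafka-python client are illustrative assumptions:

    # Illustrative metadata event published to the Kafka topic consumed by the
    # Metadata Processing Service. Field names and topic name are assumptions.
    import json
    from dataclasses import dataclass, asdict
    from kafka import KafkaProducer  # kafka-python client, assumed available

    @dataclass
    class MetadataEvent:
        event_type: str      # e.g. "CREATE_TABLE", "ALTER_COLUMN", "DROP_TABLE"
        source_system: str   # e.g. "hive_metastore", "glue", "dbt"
        entity_urn: str      # stable identifier for the dataset or column
        payload: dict        # schema snapshot, owner, tags, and so on

    producer = KafkaProducer(
        bootstrap_servers="kafka:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    event = MetadataEvent("CREATE_TABLE", "hive_metastore",
                          "urn:dataset:hive.sales.orders",
                          {"columns": ["id", "amount", "created_at"]})
    producer.send("metadata-events", asdict(event))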

The Storage plane maintains three specialized stores: a relational database (PostgreSQL) for structured metadata (table definitions, ownership, tags), an Elasticsearch cluster for full-text search, and a graph database (Neo4j or Amazon Neptune) for lineage relationships. A write-through cache (Redis) sits in front of the relational store for hot metadata lookups (e.g., table schema for frequently queried tables).
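
A write-through lookup for hot table schemas could look like the sketch below, assuming redis-py and psycopg2; the key layout, TTL, and table names are illustrative:

    # Write-through cache for hot schema lookups (key layout and TTL are assumptions).
    import json
    import redis
    import psycopg2
    from psycopg2.extras import Json

    cache = redis.Redis(host="redis", port=6379)
    pg = psycopg2.connect("dbname=catalog user=catalog")

    def get_schema(dataset_id: int) -> dict:
        cached = cache.get(f"schema:{dataset_id}")
        if cached:
            return json.loads(cached)
        with pg.cursor() as cur:
            cur.execute("SELECT schema_json FROM datasets WHERE id = %s", (dataset_id,))
            row = cur.fetchone()
        schema = (row[0] or {}) if row else {}   # JSONB column comes back as a dict
        cache.setex(f"schema:{dataset_id}", 3600, json.dumps(schema))
        return schema

    def update_schema(dataset_id: int, schema: dict) -> None:
        # Write-through: update PostgreSQL first, then refresh the cache entry.
        with pg.cursor() as cur:
            cur.execute("UPDATE datasets SET schema_json = %s, updated_at = now() WHERE id = %s",
                        (Json(schema), dataset_id))
        pg.commit()
        cache.setex(f"schema:{dataset_id}", 3600, json.dumps(schema))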

The Serving plane exposes a GraphQL API for the web UI and REST endpoints for programmatic access. The search service executes Elasticsearch queries with boosting on dataset popularity (query frequency), recency of last update, and quality score. Auto-complete suggestions use Elasticsearch's completion suggester with a prefix query on a dedicated suggest index updated every 5 minutes.
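
The auto-complete path might be wired up as in the sketch below; the index name, field name, and document contents are assumptions:

    # Completion-suggester sketch for dataset-name auto-complete
    # (index and field names are assumptions).
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://elasticsearch:9200")

    # Dedicated suggest index with a completion-typed field, rebuilt every 5 minutes.
    es.indices.create(index="catalog-suggest", mappings={
        "properties": {"name_suggest": {"type": "completion"}}
    })
    es.index(index="catalog-suggest", document={"name_suggest": "sales.orders_daily"})

    # Prefix query against the completion field.
    resp = es.search(index="catalog-suggest", suggest={
        "dataset-suggest": {
            "prefix": "sales.ord",
            "completion": {"field": "name_suggest", "size": 10},
        }
    })
    for option in resp["suggest"]["dataset-suggest"][0]["options"]:
        print(option["text"])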

Core Components

Metadata Ingestion Connectors

Each connector implements a standard interface: crawl() returns a stream of metadata events (CREATE_TABLE, ALTER_COLUMN, DROP_TABLE). Connectors run as Kubernetes CronJobs on configurable schedules (every 5 minutes for high-change sources, hourly for stable ones). A fingerprint hash of each metadata object detects changes, so connectors emit only delta events, reducing catalog write volume by 95% for stable sources.
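
A minimal sketch of the connector interface and the fingerprint-based delta detection; the class and helper names are illustrative, not an existing framework:

    # Connector interface with fingerprint-based change detection (names are illustrative).
    import hashlib
    import json
    from abc import ABC, abstractmethod
    from typing import Iterator

    class Connector(ABC):
        @abstractmethod
        def crawl(self) -> Iterator[dict]:
            """Yield metadata events such as CREATE_TABLE / ALTER_COLUMN / DROP_TABLE."""

    def fingerprint(payload: dict) -> str:
        # Stable hash of the metadata object; an unchanged hash means no event is emitted.
        return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

    def emit_deltas(connector: Connector, seen: dict[str, str]) -> Iterator[dict]:
        """Filter a crawl down to objects whose fingerprint changed since the last run."""
        for event in connector.crawl():
            urn, fp = event["entity_urn"], fingerprint(event["payload"])
            if seen.get(urn) != fp:
                seen[urn] = fp      # 'seen' would be persisted between CronJob runs
                yield event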

Search & Discovery Engine

Elasticsearch stores a document per catalog entity with fields for name, description, column names, tags, owner, source system, and popularity score. A custom scoring function ranks results by: exact name match > description keyword match > column match > tag match, multiplied by a freshness decay factor and a usage signal (query frequency from the warehouse query log). Faceted search supports filtering by source system, data classification, owner team, and last-updated recency.
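
One way to express this ranking is an Elasticsearch function_score query; the field names, boost factors, and decay parameters below are assumptions, not tuned values:

    # Boosted search query sketch (field names and boost values are assumptions).
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://elasticsearch:9200")
    query = {
        "function_score": {
            "query": {
                "multi_match": {
                    "query": "daily orders",
                    # name match outranks description, column, and tag matches
                    "fields": ["name^8", "description^4", "column_names^2", "tags"],
                }
            },
            "functions": [
                # usage signal: query frequency mined from the warehouse query log
                {"field_value_factor": {"field": "popularity_score",
                                        "modifier": "log1p", "missing": 1}},
                # freshness decay: stale datasets rank lower
                {"gauss": {"updated_at": {"origin": "now", "scale": "90d", "decay": 0.5}}},
            ],
            "score_mode": "multiply",
            "boost_mode": "multiply",
        }
    }
    resp = es.search(index="catalog-datasets", query=query, size=20)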

Lineage Graph Engine

Column-level lineage is extracted from SQL query ASTs using a parser (sqlglot or Apache Calcite) that identifies the source columns for each output column. dbt manifest.json files provide transformation-level lineage automatically. The graph database stores nodes (datasets and columns) and edges (DERIVED_FROM relationships with the transform SQL as edge metadata). Multi-hop lineage queries (find all downstream consumers of a column) execute as Cypher or Gremlin graph traversals.
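
A simplified illustration of column-level extraction with sqlglot; the query and the alias handling are deliberately minimal, and a production parser would also resolve CTEs, subqueries, and SELECT *:

    # Map each output column to the source columns referenced in its expression.
    import sqlglot
    from sqlglot import exp

    sql = """
        SELECT o.customer_id, SUM(o.amount) AS total_spend
        FROM orders o
        GROUP BY o.customer_id
    """
    parsed = sqlglot.parse_one(sql)

    for projection in parsed.selects:
        output_column = projection.alias_or_name
        source_columns = {col.sql() for col in projection.find_all(exp.Column)}
        print(output_column, "<-", source_columns)
    # customer_id <- {'o.customer_id'}
    # total_spend <- {'o.amount'}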

Database Design

PostgreSQL schema:

  • datasets (id, name, source_system, schema_json, owner_id, sensitivity_class, quality_score, created_at, updated_at)
  • columns (id, dataset_id, name, type, description, business_term_id, pii_flag)
  • business_terms (id, name, definition, steward_id)
  • access_requests (id, requester_id, dataset_id, justification, status, decided_at)

The Elasticsearch index mirrors the dataset and column tables with denormalized fields for search performance.
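
A DDL sketch for the two central tables based on the column lists above; type, key, and constraint choices are assumptions:

    # Create the core catalog tables (types and constraints are assumptions).
    import psycopg2

    DDL = """
    CREATE TABLE datasets (
        id                BIGSERIAL PRIMARY KEY,
        name              TEXT NOT NULL,
        source_system     TEXT NOT NULL,
        schema_json       JSONB,
        owner_id          BIGINT,
        sensitivity_class TEXT,
        quality_score     NUMERIC,
        created_at        TIMESTAMPTZ DEFAULT now(),
        updated_at        TIMESTAMPTZ DEFAULT now(),
        UNIQUE (source_system, name)
    );
    CREATE TABLE columns (
        id               BIGSERIAL PRIMARY KEY,
        dataset_id       BIGINT REFERENCES datasets(id),
        name             TEXT NOT NULL,
        type             TEXT,
        description      TEXT,
        business_term_id BIGINT,
        pii_flag         BOOLEAN DEFAULT FALSE
    );
    """
    with psycopg2.connect("dbname=catalog user=catalog") as conn, conn.cursor() as cur:
        cur.execute(DDL)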

API Design

  • GET /search?q={query}&filters={json}&page=1 — Full-text search with faceted filters; returns a ranked list of dataset summaries.
  • GET /datasets/{dataset_id}/lineage?direction=upstream&depth=5 — Return the lineage DAG as nodes and edges in JSON graph format.
  • POST /datasets/{dataset_id}/tags — Add classification tags (PII, Financial, Internal) to a dataset.
  • POST /access-requests — Submit an access request for a restricted dataset with a justification.
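
A client call to the lineage endpoint might look like the sketch below; the host name, dataset id, and response field names are assumptions based on the description above:

    # Fetch the upstream lineage DAG for a dataset (host and response shape assumed).
    import requests

    resp = requests.get(
        "https://catalog.internal/datasets/1234/lineage",
        params={"direction": "upstream", "depth": 5},
        timeout=10,
    )
    resp.raise_for_status()
    graph = resp.json()                      # {"nodes": [...], "edges": [...]}
    for edge in graph["edges"]:
        print(edge["from"], "->", edge["to"])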

Scaling & Bottlenecks

Elasticsearch indexing throughput can bottleneck during mass re-indexing (e.g., after a schema change to the search document). Using the bulk indexing API with 10,000-document batches and 3 indexing threads per Elasticsearch node sustains 50,000 documents/second. Index aliases allow zero-downtime re-indexing by building a new index in the background and atomically swapping the alias when complete.
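
A sketch of the alias swap with the Elasticsearch Python client; the index and alias names, and the load_catalog_documents() helper, are hypothetical:

    # Zero-downtime re-index: bulk-build the new index, then swap the alias atomically.
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch("http://elasticsearch:9200")

    def doc_stream():
        for doc in load_catalog_documents():    # hypothetical source of search documents
            yield {"_index": "catalog-datasets-v2", "_source": doc}

    # Build the new index in the background with 10,000-document bulk batches.
    bulk(es, doc_stream(), chunk_size=10_000)

    # Atomically move the serving alias from the old index to the new one.
    es.indices.update_aliases(actions=[
        {"remove": {"index": "catalog-datasets-v1", "alias": "catalog-datasets"}},
        {"add": {"index": "catalog-datasets-v2", "alias": "catalog-datasets"}},
    ])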

The lineage graph database becomes a bottleneck for deep traversals (depth > 10) on large graphs. Materialized lineage summaries (pre-computed direct upstream/downstream sets per dataset) stored in Redis handle 90% of queries (which only need depth 1–2). Deep traversal queries are served asynchronously: the API accepts the request, runs the graph query in the background, and notifies the user via webhook when complete.
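
The depth-1 summaries could be kept as Redis sets, as in this sketch; the key layout and URN format are assumptions:

    # Pre-computed direct upstream/downstream sets per dataset (key layout assumed).
    import redis

    r = redis.Redis(host="redis", port=6379, decode_responses=True)

    def materialize_summary(urn: str, upstream: set[str], downstream: set[str]) -> None:
        # Rewritten by the lineage pipeline whenever edges touching this dataset change.
        pipe = r.pipeline()
        pipe.delete(f"lineage:up:{urn}", f"lineage:down:{urn}")
        if upstream:
            pipe.sadd(f"lineage:up:{urn}", *upstream)
        if downstream:
            pipe.sadd(f"lineage:down:{urn}", *downstream)
        pipe.execute()

    def direct_upstream(urn: str) -> set[str]:
        # Serves the ~90% of lineage queries that only need depth 1.
        return r.smembers(f"lineage:up:{urn}")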

Key Trade-offs

  • Auto-discovery vs. manual curation: Auto-crawling maximizes coverage and freshness but captures noisy, undocumented datasets; a hybrid model auto-discovers but requires human curation before a dataset becomes publicly searchable.
  • Column-level vs. table-level lineage: Column lineage is more precise but requires SQL parsing which fails for complex or templated queries; table lineage is universally available and simpler to maintain.
  • Centralized vs. federated catalog: A single catalog gives unified search but creates a bottleneck for large organizations; a federated mesh catalog (one catalog per data domain) with cross-domain federation enables domain autonomy.
  • Graph DB vs. relational for lineage: Graph databases excel at multi-hop traversals; at smaller scale (< 10M edges), storing lineage as adjacency lists in PostgreSQL with recursive CTEs is sufficient and avoids operational complexity (a sketch follows this list).
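
A minimal sketch of the recursive-CTE approach mentioned in the last trade-off; the lineage_edges table and its column names are assumptions:

    # Depth-bounded upstream traversal over an adjacency-list lineage table.
    import psycopg2

    UPSTREAM_SQL = """
    WITH RECURSIVE upstream AS (
        SELECT source_column_id, target_column_id, 1 AS depth
        FROM lineage_edges
        WHERE target_column_id = %(column_id)s
        UNION ALL
        SELECT e.source_column_id, e.target_column_id, u.depth + 1
        FROM lineage_edges e
        JOIN upstream u ON e.target_column_id = u.source_column_id
        WHERE u.depth < %(max_depth)s
    )
    SELECT DISTINCT source_column_id, depth FROM upstream ORDER BY depth;
    """

    with psycopg2.connect("dbname=catalog user=catalog") as conn, conn.cursor() as cur:
        cur.execute(UPSTREAM_SQL, {"column_id": 42, "max_depth": 5})
        for source_column_id, depth in cur.fetchall():
            print(depth, source_column_id)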
