
System Design: Model Registry & Versioning

Design a model registry that centralizes the storage, versioning, and lifecycle management of machine learning models. Covers artifact storage, metadata tracking, stage transitions, access control, and integration with CI/CD pipelines for automated model promotion.

13 min read · Updated Jan 15, 2025

Tags: system-design · model-registry · mlops · model-versioning · ci-cd

Requirements

Functional Requirements:

  • Register trained model artifacts with metadata: training metrics, data lineage, framework version, and feature dependencies
  • Manage model lifecycle stages: CANDIDATE → STAGING → PRODUCTION → ARCHIVED
  • Support multiple model versions simultaneously; route inference traffic to specified versions
  • Enforce promotion gates: automated evaluation tests and optional human approval before PRODUCTION promotion
  • Provide model comparison: side-by-side metric comparison across versions
  • Enable rollback: one-click rollback to a previous production version
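The lifecycle stages above form a small state machine. A minimal sketch, with an assumed transition table consistent with the stage flow (here a superseded PRODUCTION version moves to ARCHIVED; rollback re-promotes a previously promoted version rather than demoting through stages):

```python
from enum import Enum

class Stage(Enum):
    CANDIDATE = "CANDIDATE"
    STAGING = "STAGING"
    PRODUCTION = "PRODUCTION"
    ARCHIVED = "ARCHIVED"

# Allowed forward transitions (assumption based on the lifecycle above).
ALLOWED = {
    Stage.CANDIDATE: {Stage.STAGING, Stage.ARCHIVED},
    Stage.STAGING: {Stage.PRODUCTION, Stage.ARCHIVED},
    Stage.PRODUCTION: {Stage.ARCHIVED},
    Stage.ARCHIVED: set(),
}

def can_transition(from_stage: Stage, to_stage: Stage) -> bool:
    """Reject any stage transition not explicitly allowed by the lifecycle."""
    return to_stage in ALLOWED[from_stage]
```

Encoding the transitions explicitly lets the registry reject invalid requests (e.g. promoting an ARCHIVED version straight to PRODUCTION) before any gates run.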

Non-Functional Requirements:

  • Artifact upload/download throughput: 1 GB/s for large deep learning model files
  • Metadata API latency under 50ms for model version queries
  • Model artifacts stored with 11-nines durability (S3 standard)
  • Audit log of all stage transitions retained for 7 years for compliance
  • Support 10,000 registered models with 100 versions each = 1 million version records

Scale Estimation

1,000 training jobs/day each producing a model artifact averaging 500 MB = 500 GB of new artifacts daily. After 3 years: ~550 TB of model artifacts (including old versions). With lifecycle archiving (moving ARCHIVED models to S3 Glacier after 90 days): active storage ~50 TB, cold storage ~500 TB. Metadata database: 1 million version records at 2 KB each = 2 GB — negligible.
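The back-of-envelope numbers above can be checked with a quick arithmetic sketch:

```python
# Daily artifact volume and long-term storage, per the estimates above.
jobs_per_day = 1_000
avg_artifact_mb = 500

daily_gb = jobs_per_day * avg_artifact_mb / 1_000   # 500 GB of new artifacts per day
three_year_tb = daily_gb * 365 * 3 / 1_000          # ~547.5 TB, rounded to ~550 TB

# Metadata footprint: 10,000 models x 100 versions at ~2 KB per record.
version_records = 10_000 * 100                      # 1 million version records
metadata_gb = version_records * 2 / 1_000_000       # 2 GB -- negligible
```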

High-Level Architecture

The Model Registry has three planes: Artifact Storage, Metadata Service, and Lifecycle Management. Artifact Storage uses S3 with versioned bucket configuration; each artifact is addressed by a content-hash URI ensuring immutability. The Metadata Service (PostgreSQL-backed REST API) stores model definitions, version metadata, metrics, tags, and stage history. Lifecycle Management handles stage transitions, promotion gate evaluation, and deployment coordination.

CI/CD integration is a first-class feature. Every training job in the ML platform automatically registers its output with the registry via a post-training hook. A promotion pipeline (GitHub Actions or GitLab CI) triggers evaluation gates: run offline evaluation on the holdout test set, compare AUC/RMSE against the current production model, run inference latency benchmarks, and check feature drift between training data and production traffic distribution. If all gates pass, the pipeline promotes the model to STAGING; human approval (via Slack approval button or JIRA) promotes to PRODUCTION.

The registry integrates with the serving platform: when a model is promoted to PRODUCTION, the registry publishes an event to Kafka. The serving platform's deployment controller consumes this event, pulls the new artifact from S3, loads it into the model server, and gradually shifts traffic (5% → 25% → 100% over 30 minutes) while monitoring error rate and latency.
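The gradual traffic shift can be sketched as a step schedule with an error-rate guard. The 5% → 25% → 100% steps come from the plan above; the `error_budget` threshold and function names are illustrative assumptions:

```python
ROLLOUT_STEPS = [5, 25, 100]  # traffic percentages from the rollout plan above

def next_traffic_step(current_percent: int, error_rate: float,
                      error_budget: float = 0.01) -> int:
    """Return the next traffic percentage for the new version, or 0 to abort
    the rollout and shift all traffic back to the previous version."""
    if error_rate > error_budget:
        return 0  # rollback: observed errors exceed the budget
    higher = [step for step in ROLLOUT_STEPS if step > current_percent]
    return higher[0] if higher else 100
```

The deployment controller would call this between monitoring windows (roughly 30 minutes end to end), holding or aborting whenever error rate or latency regresses.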

Core Components

Artifact Store

Models are stored in S3 as versioned artifacts. The storage path follows the convention: s3://model-registry/{model_name}/{version_id}/. Each version directory contains: model/ (framework-specific weights), metadata.json (training config, metrics, feature list), requirements.txt (Python dependencies), and schema.json (input/output tensor shapes). Multipart upload handles artifacts >100 MB with parallel part uploads achieving 1 GB/s throughput. S3 Object Lock prevents deletion of PRODUCTION artifacts.
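A content-hash version identifier, as mentioned in the architecture overview, can be derived as below; the 12-character digest prefix and `model.bin` filename are illustrative assumptions:

```python
import hashlib

def artifact_s3_key(model_name: str, artifact_bytes: bytes) -> str:
    """Derive an immutable, content-addressed S3 key: identical bytes always
    map to the same key, so an existing artifact is never silently overwritten."""
    version_id = hashlib.sha256(artifact_bytes).hexdigest()[:12]
    return f"model-registry/{model_name}/{version_id}/model.bin"
```

Because the key is a pure function of the bytes, re-registering the same artifact is idempotent, and any change to the weights produces a new path.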

Metadata Service

The metadata service is a REST API backed by PostgreSQL. It stores model cards (structured documentation: intended use, training data characteristics, performance across demographic segments, known limitations) alongside technical metadata. Full-text search on model names and descriptions uses PostgreSQL's tsvector search. A change data capture (CDC) stream publishes all metadata mutations to Kafka, enabling downstream systems to react to registry changes without polling.

Promotion Gate Engine

Promotion gates are defined as YAML configurations: {gate_name: champion_challenger, type: metric_comparison, metric: auc_roc, threshold: 0.001, direction: higher_is_better}. The gate engine runs each gate sequentially; all gates must pass for promotion to proceed. Gate results are stored with the version record. Custom gates can invoke external services (e.g., bias and fairness evaluation, adversarial robustness testing) via webhook. Failed gates send a report to the model owner with specific metric comparisons.
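A minimal sketch of the gate engine evaluating metric-comparison gates of the YAML shape above (field names follow the example config; the helper names are assumptions):

```python
def evaluate_gate(gate: dict, challenger: dict, champion: dict) -> bool:
    """Pass if the challenger beats the champion on the gate's metric
    by at least `threshold`, in the configured direction."""
    delta = challenger[gate["metric"]] - champion[gate["metric"]]
    if gate["direction"] == "lower_is_better":
        delta = -delta
    return delta >= gate["threshold"]

def run_gates(gates: list, challenger: dict, champion: dict):
    """Run all gates sequentially; promotion proceeds only if every gate passes."""
    results = {g["gate_name"]: evaluate_gate(g, challenger, champion) for g in gates}
    return all(results.values()), results
```

The per-gate `results` dict is what would be persisted on the version record and included in the failure report to the model owner.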

Database Design

PostgreSQL schema:

  • models: model_id, name, description, owner_team, framework, input_schema_json, output_schema_json, created_at
  • model_versions: version_id, model_id, version_number, artifact_s3_path, artifact_size_bytes, training_run_id, metrics_json, feature_group_ids (JSON), status, created_at, created_by
  • stage_transitions: transition_id, version_id, from_stage, to_stage, transitioned_by, gate_results_json, transition_at
  • deployments: deployment_id, version_id, serving_endpoint, traffic_percent, deployed_at, is_active
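The two tables central to the audit trail can be sketched in SQL. This uses SQLite as a stand-in for PostgreSQL, JSON columns become TEXT, and only a subset of columns is shown:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE model_versions (
    version_id       INTEGER PRIMARY KEY,
    model_id         INTEGER NOT NULL,
    version_number   INTEGER NOT NULL,
    artifact_s3_path TEXT NOT NULL,
    metrics_json     TEXT,
    status           TEXT NOT NULL DEFAULT 'CANDIDATE'
);
CREATE TABLE stage_transitions (
    transition_id    INTEGER PRIMARY KEY,
    version_id       INTEGER NOT NULL REFERENCES model_versions(version_id),
    from_stage       TEXT NOT NULL,
    to_stage         TEXT NOT NULL,
    transitioned_by  TEXT,
    transition_at    TEXT DEFAULT CURRENT_TIMESTAMP
);
""")

# Register a version, then record its promotion for the audit trail.
conn.execute("INSERT INTO model_versions (model_id, version_number, artifact_s3_path) "
             "VALUES (1, 1, 's3://model-registry/ranker/1/')")
conn.execute("INSERT INTO stage_transitions (version_id, from_stage, to_stage, transitioned_by) "
             "VALUES (1, 'CANDIDATE', 'STAGING', 'ci-pipeline')")
```

Every transition is an append-only row, which is what makes the 7-year audit requirement a retention policy rather than a schema change.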

API Design

  • POST /models/{model_name}/versions — Register a new model version with artifact upload URL, metrics, and lineage metadata.
  • POST /models/{model_name}/versions/{version_id}/promote — Request stage promotion; triggers gate evaluation and returns gate results.
  • GET /models/{model_name}/production — Return the current production version metadata and serving endpoint.
  • POST /models/{model_name}/rollback — Immediately roll back to the previous PRODUCTION version.
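A thin client sketch showing how these endpoints compose; the class, base URL, and return shape are assumptions, and the actual HTTP calls are omitted:

```python
class RegistryClient:
    """Builds (method, URL) request targets for the registry's REST API."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def register_version(self, model_name: str):
        return "POST", f"{self.base_url}/models/{model_name}/versions"

    def promote(self, model_name: str, version_id: str):
        return "POST", f"{self.base_url}/models/{model_name}/versions/{version_id}/promote"

    def production(self, model_name: str):
        return "GET", f"{self.base_url}/models/{model_name}/production"

    def rollback(self, model_name: str):
        return "POST", f"{self.base_url}/models/{model_name}/rollback"
```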

Scaling & Bottlenecks

Artifact upload throughput for large models (100 GB transformer models) requires multipart S3 upload with 100 MB parts uploaded in parallel. A presigned URL approach offloads upload bandwidth from the registry API server directly to S3, removing the server from the data path entirely. Artifact download for model serving uses S3 Transfer Acceleration and CloudFront CDN for serving clusters in multiple regions, reducing model load time from 5 minutes to under 60 seconds for large models.
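The multipart split can be sketched as a part planner. With 100 MB parts, even a 100 GB artifact needs only 1,024 parts, well under S3's 10,000-part limit; each (offset, length) range would be uploaded in parallel via its own presigned URL:

```python
PART_SIZE = 100 * 1024 * 1024  # 100 MB parts, per the design above

def plan_parts(artifact_size_bytes: int):
    """Split an artifact into (offset, length) byte ranges for parallel
    multipart upload; the final part may be smaller than PART_SIZE."""
    parts = []
    offset = 0
    while offset < artifact_size_bytes:
        length = min(PART_SIZE, artifact_size_bytes - offset)
        parts.append((offset, length))
        offset += length
    return parts
```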

Metadata service query throughput is modest (50,000 requests/day) and not a bottleneck. However, the full-text search for model discovery degrades as the model count grows; migrating to Elasticsearch for full-text search while keeping PostgreSQL for transactional metadata separates concerns and provides better search relevance ranking.

Key Trade-offs

  • Immutable artifact references vs. mutable: Content-addressed artifact paths guarantee reproducibility but require re-uploading even minor model updates; mutable paths are simpler but risk accidental overwriting of production artifacts.
  • Automated vs. human approval gates: Fully automated promotion gates move faster and reduce toil but cannot catch business logic issues or ethical concerns; human approval adds a quality check at the cost of deployment velocity.
  • Centralized registry vs. per-team: Centralized provides cross-team model discovery and reuse; per-team registries give autonomy but fragment visibility and risk duplicate model development.
  • Model card requirements: Mandatory model cards (documentation of intended use, limitations, training data) improve governance and prevent misuse but add overhead; making them optional initially with progressive enforcement balances adoption speed and governance goals.
