System Design: Container Registry (Docker Hub-scale)

Learn how to design a container registry like Docker Hub that stores, distributes, and manages container images at massive scale. Covers content-addressable storage, layer deduplication, and geo-replication.

Requirements

Functional Requirements:

  • Push and pull container images using the OCI Distribution Specification
  • Store image layers with content-addressable hashing (SHA-256)
  • Support image tagging, versioning, and tag immutability policies
  • Provide access control: public registries, private repos, team-level permissions
  • Support image vulnerability scanning and metadata storage
  • Implement garbage collection to reclaim storage from unreferenced layers

Non-Functional Requirements:

  • Pull latency under 100ms for cache hits; push throughput 500 MB/s per client
  • 99.99% availability with geo-replication across 3+ regions
  • Store petabytes of image data with layer deduplication reducing actual storage by 60–70%
  • Support 10 million registered users and 1 billion pulls per month

Scale Estimation

At Docker Hub scale: 10M images, average image size 500 MB (compressed layers), total raw storage ~5 PB before deduplication. With 70% layer sharing across images, effective unique storage is ~1.5 PB. Pull traffic: 1 billion/month = ~385 pulls/sec average, with peaks around 5,000 pulls/sec during business hours when CI/CD pipelines run. Metadata (manifests, tags) is stored separately: ~50 GB. Traffic is highly asymmetric: pulls outnumber pushes by roughly 100:1, so the read path dominates the design.
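
A quick back-of-envelope check of those numbers (the inputs are the estimates above, not measured values):

```go
package main

import "fmt"

func main() {
	const (
		images        = 10_000_000 // stored images
		avgImageMB    = 500.0      // compressed layer bytes per image
		layerSharing  = 0.70       // fraction of layer bytes shared across images
		pullsPerMonth = 1_000_000_000
		secsPerMonth  = 30 * 24 * 3600
	)

	rawPB := images * avgImageMB / 1e9     // 1 PB = 1e9 MB
	uniquePB := rawPB * (1 - layerSharing) // storage left after deduplication
	pullsPerSec := float64(pullsPerMonth) / secsPerMonth

	fmt.Printf("raw storage     ~%.1f PB\n", rawPB)      // ~5.0 PB
	fmt.Printf("unique storage  ~%.1f PB\n", uniquePB)   // ~1.5 PB
	fmt.Printf("average pulls/s ~%.1f\n", pullsPerSec)   // ~385.8
}
```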

High-Level Architecture

The registry exposes an HTTP API conforming to the OCI Distribution Spec. Clients (Docker CLI, containerd) authenticate via an OAuth2/JWT token service, then interact with the registry API for manifest and blob operations. The API layer is stateless and horizontally scalable behind a load balancer.

Blob storage is offloaded to object storage (S3-compatible) using content-addressable keys of the form sha256:&lt;digest&gt;. The registry API servers act as a thin control plane: they validate auth tokens, look up manifests in a metadata database, and redirect or proxy blob requests to object storage. For large blobs, clients receive a signed URL for direct upload/download, bypassing the API servers entirely.
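
A minimal sketch of that redirect path for blob downloads, assuming an S3-style presigner behind a small interface; the handler shape and names are illustrative, not the actual Docker Hub implementation.

```go
package registry

import (
	"fmt"
	"net/http"
	"time"
)

// Presigner abstracts the object store's signed-URL API (e.g. S3 presigned GETs).
type Presigner interface {
	PresignGet(key string, ttl time.Duration) (string, error)
}

type BlobHandler struct {
	store Presigner
}

// ServeBlob handles GET /v2/<name>/blobs/<digest>. After auth and a metadata
// check (omitted here), it redirects the client straight to object storage so
// large layer downloads never flow through the API servers.
func (h *BlobHandler) ServeBlob(w http.ResponseWriter, r *http.Request, digest string) {
	key := fmt.Sprintf("blobs/%s", digest) // content-addressable key, e.g. blobs/sha256:...
	url, err := h.store.PresignGet(key, 15*time.Minute)
	if err != nil {
		http.Error(w, "storage unavailable", http.StatusServiceUnavailable)
		return
	}
	w.Header().Set("Docker-Content-Digest", digest)
	http.Redirect(w, r, url, http.StatusTemporaryRedirect) // 307; client follows to S3/CDN
}
```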

Geo-replication is achieved by replicating blobs to regional object storage buckets. A CDN sits in front of pull traffic, caching layers at edge PoPs. Cache hit rates exceed 90% for popular base images (ubuntu, alpine, node), dramatically reducing origin load.

Core Components

Token Authentication Service

Issues short-lived JWT tokens (Bearer tokens) encoding the requested scope (repository:name:pull,push). The registry validates tokens on every request without calling back to the auth service, keeping latency minimal. Tokens include expiry (typically 5 minutes) and are signed with RS256. A separate authorization service checks team membership and repository ACLs backed by a relational database (PostgreSQL).
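A minimal sketch of issuing such a token with the golang-jwt library; the claim layout follows the Docker-style token format (an "access" claim listing type/name/actions), but the audience value, function names, and surrounding wiring are assumptions for illustration.

```go
package auth

import (
	"crypto/rsa"
	"time"

	"github.com/golang-jwt/jwt/v5"
)

// AccessEntry mirrors the "access" claim used by Docker-style token auth:
// one entry per granted scope, e.g. {repository, library/nginx, [pull]}.
type AccessEntry struct {
	Type    string   `json:"type"`
	Name    string   `json:"name"`
	Actions []string `json:"actions"`
}

// IssueToken signs a short-lived RS256 token for scopes the authorization
// service has already approved (team/ACL checks happen before this point).
func IssueToken(key *rsa.PrivateKey, issuer, subject string, access []AccessEntry) (string, error) {
	now := time.Now()
	claims := jwt.MapClaims{
		"iss":    issuer,
		"sub":    subject,
		"aud":    "registry.example.com", // assumed audience value
		"iat":    now.Unix(),
		"exp":    now.Add(5 * time.Minute).Unix(), // short expiry limits the replay window
		"access": access,
	}
	return jwt.NewWithClaims(jwt.SigningMethodRS256, claims).SignedString(key)
}
```

On the serving path, registry nodes verify the RS256 signature with the token service's published public key and check exp and the access entries locally, which is what keeps per-request auth off the network.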

Manifest Store

Image manifests (JSON documents listing layers and config) are stored in PostgreSQL keyed by digest and referenced by mutable tags. Tag writes use optimistic locking to prevent race conditions during concurrent pushes. A manifest can be referenced by multiple tags; garbage collection only deletes manifests that have zero tag references and no attached referrers (signatures or SBOMs linked via the OCI referrers API). Manifests are small (~2 KB) and heavily read, so a read-through cache (Redis) with a 5-minute TTL cuts DB load by roughly 95%.
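
One way to implement the optimistic tag write, sketched with database/sql against the tag columns described in the schema section below; ErrTagConflict and the exact SQL are illustrative.

```go
package manifests

import (
	"context"
	"database/sql"
	"errors"
)

var ErrTagConflict = errors.New("tag was modified concurrently")

// UpdateTag points an existing tag at a new manifest, but only if the tag
// still references the digest the pusher last saw (optimistic locking).
func UpdateTag(ctx context.Context, db *sql.DB, repoID int64, tag, oldDigest, newDigest string) error {
	res, err := db.ExecContext(ctx,
		`UPDATE tags
		    SET manifest_digest = $1, updated_at = now()
		  WHERE repo_id = $2 AND name = $3 AND manifest_digest = $4`,
		newDigest, repoID, tag, oldDigest)
	if err != nil {
		return err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return err
	}
	if n == 0 {
		return ErrTagConflict // caller re-reads the tag, then retries or aborts
	}
	return nil
}
```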

Blob Storage Engine

Blobs (image layers) are stored in S3/GCS using the content digest as the key. The registry checks blob existence before accepting a push (using a HEAD request to object storage), enabling cross-repository blob mounting: if layer sha256:abc already exists, the upload is skipped entirely, regardless of which repository is pushing it. Resumable uploads use the OCI chunked upload protocol: clients POST to initiate, PATCH to append chunks, and PUT to finalize. Each chunk is written to a temporary staging area, then assembled and moved to the canonical key on completion.
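
A sketch of the pre-push existence check that makes cross-repository mounting cheap; the ObjectStore interface stands in for an S3/GCS client and is an assumption, not a specific SDK.

```go
package blobs

import (
	"context"
	"fmt"
)

// ObjectStore abstracts the object-storage existence check (a HEAD request).
type ObjectStore interface {
	Exists(ctx context.Context, key string) (bool, error)
}

// StartUpload is called when a client POSTs /v2/<name>/blobs/uploads/?mount=<digest>.
// If the blob already exists under its content-addressable key, no bytes move:
// the caller records the repo-to-blob link and answers 201 Created.
func StartUpload(ctx context.Context, store ObjectStore, digest string) (mounted bool, err error) {
	key := fmt.Sprintf("blobs/%s", digest)
	ok, err := store.Exists(ctx, key)
	if err != nil {
		return false, err
	}
	if ok {
		// Blob already present, possibly pushed to a different repository.
		return true, nil
	}
	// Otherwise fall through to the normal POST/PATCH/PUT chunked upload flow.
	return false, nil
}
```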

Database Design

PostgreSQL stores relational metadata: repositories, tags, manifests, and access control. Key tables: repositories (id, name, owner_id, visibility), manifests (digest, media_type, payload, size, created_at), tags (repo_id, name, manifest_digest, updated_at). Blob payloads themselves are not rows in SQL; their existence is determined by object-storage HEAD checks. A separate blob_references table links manifest digests to blob digests for GC traversal.
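
The same tables expressed as Go structs with database tags, a sketch using the column names above (field types and tags are assumptions):

```go
package metadata

import "time"

type Repository struct {
	ID         int64  `db:"id"`
	Name       string `db:"name"`
	OwnerID    int64  `db:"owner_id"`
	Visibility string `db:"visibility"` // "public" or "private"
}

type Manifest struct {
	Digest    string    `db:"digest"`     // sha256:..., primary key
	MediaType string    `db:"media_type"` // OCI or Docker manifest media type
	Payload   []byte    `db:"payload"`    // raw manifest JSON (~2 KB)
	Size      int64     `db:"size"`
	CreatedAt time.Time `db:"created_at"`
}

type Tag struct {
	RepoID         int64     `db:"repo_id"`
	Name           string    `db:"name"` // e.g. "latest"
	ManifestDigest string    `db:"manifest_digest"`
	UpdatedAt      time.Time `db:"updated_at"`
}

// BlobReference links a manifest to each layer/config blob it needs, so
// garbage collection can walk manifest-to-blob edges without touching S3.
type BlobReference struct {
	ManifestDigest string `db:"manifest_digest"`
	BlobDigest     string `db:"blob_digest"`
}
```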

For vulnerability scan results and image metadata (labels, annotations), a document store (Elasticsearch or a JSONB column in Postgres) provides flexible querying. Time-series metrics (pull counts per image) are stored in a time-series database (TimescaleDB) for trending and rate limiting.

API Design
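
The external API follows the OCI Distribution Specification, so any compliant client (Docker CLI, containerd, and similar tooling) can push and pull without registry-specific logic. The core endpoints are:

  • GET /v2/: API version check and authentication challenge
  • GET/HEAD /v2/&lt;name&gt;/manifests/&lt;reference&gt;: fetch a manifest by tag or digest; PUT pushes one, DELETE removes it
  • GET/HEAD /v2/&lt;name&gt;/blobs/&lt;digest&gt;: fetch a layer or config blob (served via the signed-URL redirect described above)
  • POST /v2/&lt;name&gt;/blobs/uploads/: start an upload; ?mount=&lt;digest&gt;&from=&lt;repo&gt; performs a cross-repository blob mount
  • PATCH /v2/&lt;name&gt;/blobs/uploads/&lt;id&gt; and PUT with ?digest=&lt;digest&gt;: append chunks and finalize a resumable upload
  • GET /v2/&lt;name&gt;/tags/list: enumerate tags in a repository
  • GET /v2/&lt;name&gt;/referrers/&lt;digest&gt;: list artifacts (signatures, SBOMs) attached to an image via the OCI 1.1 referrers API

Pulling by digest (name@sha256:...) always returns immutable, infinitely cacheable content; pulling by tag resolves through the manifest store and is subject to tag mutability policies.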

Scaling & Bottlenecks

The primary bottleneck during mass pull events (e.g., a new Kubernetes node pool spinning up) is blob egress bandwidth. Mitigations: CDN caching at the edge, and P2P layer distribution (Dragonfly, Kraken) within a datacenter so that not every node hits the registry simultaneously. API servers scale horizontally; aside from object storage, the main stateful component is PostgreSQL, which is scaled with read replicas and connection pooling (PgBouncer).

Garbage collection is a challenge at scale: a naive mark-and-sweep requires locking or causes inconsistencies during concurrent pushes. Production systems use a two-phase approach — first mark unreferenced blobs, then soft-delete with a grace period (24–48 hours) before physical deletion, allowing in-flight uploads to complete without interference.
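
A condensed sketch of that two-phase flow; the blob_gc_candidates staging table and the deleteBlob callback are assumptions layered on top of the blob_references table described earlier.

```go
package gc

import (
	"context"
	"database/sql"
	"time"
)

const gracePeriod = 48 * time.Hour

// MarkUnreferenced takes blob digests enumerated from object storage and
// records the ones no manifest references. Nothing is deleted yet; a push
// that re-references a marked blob before the sweep simply wins.
func MarkUnreferenced(ctx context.Context, db *sql.DB, digests []string) error {
	for _, d := range digests {
		var refs int
		err := db.QueryRowContext(ctx,
			`SELECT count(*) FROM blob_references WHERE blob_digest = $1`, d).Scan(&refs)
		if err != nil {
			return err
		}
		if refs > 0 {
			continue // still referenced by at least one manifest
		}
		_, err = db.ExecContext(ctx,
			`INSERT INTO blob_gc_candidates (blob_digest, marked_at)
			 VALUES ($1, now()) ON CONFLICT (blob_digest) DO NOTHING`, d)
		if err != nil {
			return err
		}
	}
	return nil
}

// Sweep deletes candidates whose grace period has expired, re-checking
// references at deletion time so a blob re-used during the window survives.
func Sweep(ctx context.Context, db *sql.DB, deleteBlob func(context.Context, string) error) error {
	rows, err := db.QueryContext(ctx,
		`SELECT c.blob_digest
		   FROM blob_gc_candidates c
		   LEFT JOIN blob_references r ON r.blob_digest = c.blob_digest
		  WHERE c.marked_at < $1 AND r.blob_digest IS NULL`,
		time.Now().Add(-gracePeriod))
	if err != nil {
		return err
	}
	defer rows.Close()
	var victims []string
	for rows.Next() {
		var d string
		if err := rows.Scan(&d); err != nil {
			return err
		}
		victims = append(victims, d)
	}
	if err := rows.Err(); err != nil {
		return err
	}
	for _, d := range victims {
		if err := deleteBlob(ctx, d); err != nil { // remove the object from storage
			return err
		}
		if _, err := db.ExecContext(ctx,
			`DELETE FROM blob_gc_candidates WHERE blob_digest = $1`, d); err != nil {
			return err
		}
	}
	return nil
}
```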

Key Trade-offs

  • Direct upload vs. proxied upload: Signed URLs reduce API server load but require clients to have direct object storage access; proxied uploads simplify networking but bottleneck on API servers
  • Tag mutability: Mutable tags enable latest semantics but complicate reproducibility; immutable tags (enforced policy) improve supply-chain security at the cost of workflow changes
  • Strong vs. eventual consistency for tags: Strong consistency (serializable writes in Postgres) prevents tag conflicts but adds latency; eventual consistency allows faster geo-distributed writes with risk of tag clobber
  • CDN TTL vs. freshness: Long CDN TTLs reduce origin load but mean tag updates (e.g., patching latest) don't propagate immediately; digest-based pulls are always fresh since content is immutable
