System Design: Package Registry (npm/PyPI-scale)
Design a package registry at npm or PyPI scale supporting billions of package downloads per day, semantic versioning, dependency resolution, and global distribution via CDN. Covers immutable package storage, dependency graphs, and security scanning.
Requirements
Functional Requirements:
- Publish, retrieve, and deprecate versioned software packages
- Semantic versioning with dependency resolution
- Search packages by name, keyword, and author
- Package metadata: README, changelog, license, dependencies
- Security: vulnerability scanning, malware detection, typosquatting protection
- Scoped packages and private registries for organizations
Non-Functional Requirements:
- 10 billion package downloads per day (npm scale)
- Sub-100ms latency for package metadata; sub-500ms for package tarball download start
- Immutable package versions — published versions cannot be modified
- 99.99% download availability
- 5 PB of package storage
Scale Estimation
npm serves ~10 billion package downloads/day ≈ 115,000 downloads/sec. The npm registry has 2.5 million packages with 20 million total versions. At a 50 KB average compressed tarball, 20M versions × 50 KB = 1 TB — but the mean is skewed far above that by multi-MB packages, so tarball-plus-metadata storage in practice runs into the petabytes (hence the 5 PB target). With CDN caching at a 90% hit rate, origin receives ~11,500 download requests/sec while the CDN serves ~103,500 requests/sec from edge. Metadata API load: npm install reads package.json and fetches metadata for all transitive deps, so a typical install resolving 100 packages issues ~100 metadata API calls. At 1 million concurrent npm installs, that is on the order of 100 million metadata requests in flight.
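The back-of-envelope numbers above can be written out as checkable arithmetic. The 50 KB average tarball size and 90% CDN hit rate are the assumptions stated in this section, not measured values:

```python
# Capacity estimation from the figures above. AVG_TARBALL_BYTES is the
# assumed 50 KB average; real storage is far larger because the mean is
# dominated by large packages.
DOWNLOADS_PER_DAY = 10_000_000_000
downloads_per_sec = DOWNLOADS_PER_DAY / 86_400           # ~115,741/s

VERSIONS = 20_000_000
AVG_TARBALL_BYTES = 50 * 1024                            # 50 KB (assumed)
tarball_storage_tb = VERSIONS * AVG_TARBALL_BYTES / 1024**4  # ~0.93 TiB

CDN_HIT_RATE = 0.90
origin_rps = downloads_per_sec * (1 - CDN_HIT_RATE)      # ~11,574/s hit origin
edge_rps = downloads_per_sec * CDN_HIT_RATE              # ~104,167/s from edge
```

The takeaway matches the section's rounding: roughly 115K downloads/sec overall, with only ~10% of that reaching origin storage.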
High-Level Architecture
The registry has two primary surfaces: the metadata API (package.json, version manifests, search) and the tarball download service (the actual package files). The metadata API is served by an application tier backed by a database (CouchDB in npm's case, or PostgreSQL with JSON columns). Tarball downloads are served primarily from CDN, with origin fallback to object storage. A publish pipeline handles incoming package uploads: validation, security scanning, metadata extraction, and storage. A dependency resolver aids clients in computing compatible version ranges.
Package publish flow: developer runs npm publish. The CLI sends a tarball + package.json to the registry API. The API authenticates the publisher (npm token → user lookup), validates the package manifest (valid semver, required fields), checks for name squatting (is the package name already taken by another user?), runs a fast synchronous validation (size limit 50 MB, no native binaries without disclosure), queues the package for async security scanning (SAST, known-malware hashes, dependency audit), and immediately stores the tarball in object storage + writes metadata to the database. The package is published (visible to downloaders) after synchronous validation passes; security scan results are surfaced asynchronously.
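The synchronous part of that publish flow can be sketched as a single handler. This is a hypothetical outline, not npm's implementation; the `registry` helper methods (`authenticate`, `store_tarball`, `enqueue_scan`, and so on) are illustrative names:

```python
# Hypothetical publish handler following the flow above: authenticate,
# validate, store immutably, then queue the async security scan.
import hashlib

MAX_TARBALL_BYTES = 50 * 1024 * 1024  # synchronous size limit from the text

def publish(token: str, manifest: dict, tarball: bytes, registry) -> dict:
    user = registry.authenticate(token)                   # npm token -> user
    name, version = manifest["name"], manifest["version"]
    if not registry.is_valid_semver(version):
        raise ValueError("invalid semver")
    owner = registry.owner_of(name)
    if owner is not None and owner != user:               # name-squatting check
        raise PermissionError("package name owned by another user")
    if len(tarball) > MAX_TARBALL_BYTES:
        raise ValueError("tarball exceeds 50 MB limit")
    sha = hashlib.sha256(tarball).hexdigest()
    registry.store_tarball(sha, tarball)                  # immutable, content-addressed
    registry.write_version(name, version, manifest, sha)  # visible immediately
    registry.enqueue_scan(name, version, sha)             # async SAST / malware / audit
    return {"name": name, "version": version, "integrity": f"sha256-{sha}"}
```

Note the ordering: the version becomes downloadable as soon as the synchronous checks pass, and the scan result arrives later — exactly the trade-off discussed under "Sync vs. async security scanning" below.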
Package download flow: npm install express@4.18.0 → registry API resolves the exact version (or version range), returns the dist-tags manifest pointing to the tarball URL. The tarball URL points to the CDN (e.g., https://registry.npmjs.org/express/-/express-4.18.0.tgz). The CDN checks its cache; on hit, streams the tarball directly. On miss, the CDN fetches from origin object storage, caches it with a long TTL (tarballs are immutable — no TTL limit needed), and streams to the client. Package tarballs are treated as immutable: once published, the content never changes, enabling aggressive indefinite CDN caching.
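The edge-cache behavior described above — serve on hit, fill from origin on miss, cache indefinitely because tarballs are immutable — reduces to a few lines. A minimal sketch, with an in-memory dict standing in for the CDN cache and a callable standing in for origin object storage:

```python
# Cache-aside logic for immutable tarballs: no TTL, integrity verified on
# cache population.
import hashlib

class EdgeCache:
    def __init__(self, origin_fetch):
        self.origin_fetch = origin_fetch   # callable: url -> bytes (origin storage)
        self.cache = {}                    # url -> bytes; never expires: immutable

    def get(self, url: str, expected_sha256: str) -> bytes:
        if url in self.cache:              # hit: stream straight from edge
            return self.cache[url]
        body = self.origin_fetch(url)      # miss: fall back to origin
        if hashlib.sha256(body).hexdigest() != expected_sha256:
            raise IOError("integrity check failed on cache fill")
        self.cache[url] = body             # immutable -> cache indefinitely
        return body
```

Because published content never changes, there is no invalidation path at all — the only eviction is capacity-driven.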
Core Components
Package Metadata Store
Package metadata is stored in a document-oriented store (CouchDB in npm's production system). Each package has a top-level document (the "package document") containing all versions' metadata: {name, description, keywords, latest, dist-tags, versions: {"1.0.0": {dependencies, peerDependencies, devDependencies, dist: {tarball, shasum, integrity}, engines, scripts}}}. Package documents can be large (lodash has 100+ versions, making its document several MB). For PostgreSQL-based implementations, a normalized schema separates packages (id, name, description, created_at, downloads_total) from package_versions (id, package_id, version, manifest JSON, tarball_sha256, published_at). The manifest JSON is stored as a JSONB column for flexible querying.
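For concreteness, here is an abridged package document in the shape described above. The dependency versions and digests are placeholders, not real values:

```python
# Abridged "packument": one document holding every version's manifest,
# so a resolver gets everything it needs for one package in a single read.
packument = {
    "name": "express",
    "description": "Fast, unopinionated, minimalist web framework",
    "dist-tags": {"latest": "4.18.0"},
    "versions": {
        "4.18.0": {
            "dependencies": {"body-parser": "1.20.0"},  # illustrative
            "dist": {
                "tarball": "https://registry.npmjs.org/express/-/express-4.18.0.tgz",
                "integrity": "sha512-...",              # placeholder digest
            },
            "engines": {"node": ">= 0.10.0"},
        },
    },
}
```

The cost of this shape is visible in the structure itself: every published version adds an entry under `versions`, which is why the lodash document grows to several MB.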
Immutable Tarball Storage
Tarballs are stored in object storage (S3/GCS) keyed by content hash (SHA-256 of the tarball contents). Immutability is enforced at the storage layer: once a tarball is written under a content hash key, the content never changes (object storage versioning or WORM policy prevents overwrites). The metadata record stores the expected SHA-256 (the integrity field in package.json). Clients verify the downloaded tarball's hash against the expected value, detecting any tampering or corruption in transit. CDN edge nodes also verify content integrity on cache population. Package "unpublish" does not delete the tarball from storage — it only removes the metadata record, making the package undiscoverable (per npm's unpublish policy: packages older than 72 hours cannot be fully deleted).
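The content-addressed, write-once scheme can be illustrated with a minimal in-memory store. In production the write-once guarantee comes from object-storage WORM/object-lock policies, not application code; this sketch only shows the keying and self-verification properties:

```python
# Content-addressed store: the key IS the SHA-256 of the bytes, so an
# existing key can never map to different content, and every read is
# self-verifying against the key.
import hashlib

class ContentStore:
    def __init__(self):
        self._objects = {}  # sha256 hex -> bytes

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        if key in self._objects:        # same bytes, same key: idempotent
            return key
        self._objects[key] = data       # first write wins; never mutated
        return key

    def get(self, key: str) -> bytes:
        data = self._objects[key]
        if hashlib.sha256(data).hexdigest() != key:
            raise IOError("stored object corrupted")
        return data
```

Unpublish then only has to delete the metadata row pointing at the key; the blob itself stays put, matching the policy described above.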
Security Scanning Pipeline
Every published package version is scanned by an async security pipeline. Stage 1: static analysis — scan the package source for known malware signatures (typosquatting against popular package names using string edit distance), suspicious patterns (outbound network calls in install scripts, file system access outside package directory). Stage 2: dependency audit — check all declared dependencies against the CVE/GHSA (GitHub Security Advisory) database; surface vulnerable transitive dependencies. Stage 3: provenance verification (npm provenance attestation, 2023) — verify the package was built from a known source repo via a trusted CI/CD system using SLSA attestations. Flagged packages are held for human review before becoming publicly downloadable.
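The typosquatting check in stage 1 is essentially an edit-distance comparison against a list of popular names. A sketch, where the threshold of 1 and the popular-package list are assumptions for illustration:

```python
# Flag a new package name within edit distance 1 of a popular package.
def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein dynamic program, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

POPULAR = {"lodash", "express", "react"}  # illustrative top-package list

def typosquat_suspects(name: str, threshold: int = 1):
    return [p for p in POPULAR
            if p != name and edit_distance(name, p) <= threshold]
```

A real pipeline would also normalize confusable characters (e.g. `1` vs `l`) and check transpositions explicitly, since plain Levenshtein scores a swap as distance 2.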
Database Design
Package metadata uses PostgreSQL (JSONB for flexibility, relational for structured queries): packages (id, name, latest_version, description, homepage, keywords[], downloads_last_week, created_at), package_versions (id, package_id, version, manifest JSONB, tarball_url, tarball_sha256, published_at, deprecated: bool, deprecated_message). A full-text search index on (name, description, keywords) enables package search. A dependency graph is stored as an edge table: (dependent_package_id, dependent_version, dependency_package_name, version_range) — used for vulnerability impact analysis ("which packages are affected by a vulnerability in lodash@4.17.15?"). Download counts are stored in a time-series database (InfluxDB or BigQuery) partitioned by day, queried for the "weekly downloads" badge.
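The vulnerability-impact query over the edge table is a reverse reachability walk: start from the vulnerable release and follow dependency edges backwards, transitively. A simplified sketch with exact-match version ranges (real ranges like `^4.17.0` need a semver matcher) and made-up edge rows:

```python
# Reverse walk over (dependent, dependent_version, dependency_name,
# version_range) rows to answer "which packages are affected?"
from collections import defaultdict

EDGES = [  # illustrative rows, not real data
    ("express",  "4.18.0", "lodash",  "4.17.15"),
    ("webpack",  "5.0.0",  "lodash",  "4.17.21"),
    ("cli-tool", "1.2.0",  "express", "4.18.0"),
]

def affected_by(pkg: str, bad_version: str) -> set:
    reverse = defaultdict(list)
    for dep, dep_ver, name, rng in EDGES:
        reverse[(name, rng)].append((dep, dep_ver))
    seen, frontier = set(), [(pkg, bad_version)]
    while frontier:                      # transitive closure over dependents
        node = frontier.pop()
        for parent in reverse.get(node, []):
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return seen
```

Here a flaw in lodash@4.17.15 reaches express directly and cli-tool transitively, while webpack (pinned to 4.17.21) is untouched.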
API Design
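The endpoints implied by the flows in this document can be summarized as a small route table. Paths follow npm's public registry conventions; the descriptions are this document's, and any handler details beyond the paths are assumptions:

```python
# Registry API surface as (method, path, purpose) rows.
ROUTES = [
    ("GET", "/{package}",                            "full packument (all versions)"),
    ("GET", "/{package}/{version}",                  "single version manifest"),
    ("GET", "/{package}/-/{package}-{version}.tgz",  "tarball download (CDN-served)"),
    ("PUT", "/{package}",                            "publish a new version"),
    ("GET", "/-/v1/search?text={query}",             "search by name/keyword/author"),
]
```

The split between the first two rows and the tarball row mirrors the two surfaces in the high-level architecture: metadata API versus download service.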
Scaling & Bottlenecks
Metadata API is the primary scaling bottleneck for npm install. When npm install resolves 100 packages, it makes 100 sequential or parallel metadata API requests. With 1 million concurrent installs, the metadata API receives 100 million requests in a short window (all starting their install at the same time, e.g., after a deployment kicks off 10,000 parallel CI jobs). Metadata documents for popular packages (react, lodash, express) are cached in Redis (TTL: 60 seconds). Cache hit rate for the top 100 packages exceeds 99%. For cache misses, a database read replica serves the request. Per-IP rate limiting (1,000 requests/sec) prevents individual bad actors from overwhelming the registry.
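The Redis layer above is a cache-aside pattern with a short TTL so metadata updates still propagate. A minimal sketch, with a dict standing in for Redis and a callable standing in for the replica read; the injectable clock exists only to make the TTL testable:

```python
# Cache-aside packument reads with a 60-second TTL.
import time

class MetadataCache:
    def __init__(self, db_fetch, ttl: float = 60.0, clock=time.monotonic):
        self.db_fetch, self.ttl, self.clock = db_fetch, ttl, clock
        self.store = {}                    # name -> (expires_at, packument)

    def get(self, name: str) -> dict:
        hit = self.store.get(name)
        now = self.clock()
        if hit and hit[0] > now:           # fresh entry: serve from cache
            return hit[1]
        doc = self.db_fetch(name)          # miss or stale: hit the read replica
        self.store[name] = (now + self.ttl, doc)
        return doc
```

With >99% hit rates on the top packages, almost all of the 100-million-request burst never reaches the database.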
Dependency resolution at npm scale is a graph problem. Running npm install on a project with 100 direct dependencies might transitively pull in 1,000 packages, and each version range (^4.18.0) requires checking available versions against the range constraint. Client-side lock files (package-lock.json) cache the full resolution per project, eliminating repeat resolution work on subsequent installs. Registry-side dependency resolution APIs (returning the full resolved graph for a given package.json) are offered as a performance optimization, computing the resolution once and caching it for repeated requests.
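The core step in checking a range like ^4.18.0 is picking the highest published version with the same major and at least the base version. A sketch of that caret rule, ignoring pre-release tags and the special-case semantics for major version 0 (where ^0.x pins the minor):

```python
# Resolve a caret range: ^4.18.0 means >=4.18.0 and <5.0.0.
def parse(v: str):
    return tuple(int(x) for x in v.split("."))

def satisfies_caret(version: str, base: str) -> bool:
    v, b = parse(version), parse(base)
    return v[0] == b[0] and v >= b

def resolve(range_base: str, available: list):
    matches = [v for v in available if satisfies_caret(v, range_base)]
    return max(matches, key=parse) if matches else None
```

A lock file records the output of this step for every package in the graph, which is why installs from a lock file skip resolution entirely.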
Key Trade-offs
- Immutability vs. deletion rights: Immutable packages allow indefinite CDN caching (no invalidation needed) and prevent supply-chain attacks via version mutation; but immutability also prevents publishers from removing accidentally published secrets or compliance violations; npm's 72-hour unpublish window balances both concerns
- Centralized vs. federated registry: A central registry (npm, PyPI) simplifies discovery and security scanning but creates a single point of trust and failure; federated registries (private registries, scoped packages) improve organizational control but fragment the ecosystem
- Sync vs. async security scanning: Blocking publish until scanning completes ensures no flagged package ever becomes downloadable but adds 30–120 seconds to publish latency; async scanning keeps publishes fast but leaves a brief window during which a malicious package is downloadable
- Package document size vs. query flexibility: Storing all versions in one document (CouchDB style) enables fetching all version data in one read but produces large documents for popular packages; normalized relational storage is more query-flexible but requires joins for full package manifests