System Design: GitLab
Design a self-hosted and cloud Git platform like GitLab with integrated CI/CD, container registry, and DevSecOps capabilities. Covers runner architecture, pipeline execution, and multi-tenant isolation.
Requirements
Functional Requirements:
- Git repository hosting with branch protection and code review
- Integrated CI/CD pipelines defined as YAML (.gitlab-ci.yml)
- Container registry for Docker images
- Issue tracking, epics, and project boards
- Package registry (npm, Maven, PyPI, Docker)
- SAST/DAST security scanning integrated in pipelines
Non-Functional Requirements:
- Support 10 million projects and 30 million users (GitLab.com scale)
- CI pipeline job latency: under 60 seconds from trigger to job start
- 99.95% availability
- Multi-tenant isolation: one customer's pipeline cannot affect another's
- Horizontal scaling: add GitLab Runners to increase CI capacity
Scale Estimation
GitLab.com runs about 10 million CI/CD pipeline jobs per day (~115 jobs/sec). Each job averages 5 minutes of compute, so roughly 115 × 300 ≈ 34,500 jobs run concurrently at steady state. Assuming 1 vCPU per job and 4 vCPUs per runner VM, that's ~8,625 runner VMs. Peak load (Monday morning, post-merge rushes) is 3x average: ~26,000 runner VMs. Repository storage: 10 million projects at 100 MB average = 1 PB. Container registry: 100 million Docker image layers at 50 MB each = 5 PB. PostgreSQL stores all structured metadata (projects, users, MRs, issues, pipeline records), an estimated 10 TB of database data.
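The fleet sizing above is an application of Little's law (concurrent jobs = arrival rate × job duration); a quick sketch with the stated figures, where the 1-vCPU-per-job figure is an assumption implied by the estimate:

```python
# Back-of-envelope runner fleet sizing using the figures from the estimate.
# Assumption (implied above): each job consumes 1 vCPU.
JOBS_PER_DAY = 10_000_000
JOB_SECONDS = 5 * 60
VCPU_PER_VM = 4
PEAK_FACTOR = 3

jobs_per_sec = JOBS_PER_DAY / 86_400          # ~115.7 jobs/sec
concurrent_jobs = jobs_per_sec * JOB_SECONDS  # Little's law: L = lambda * W
runner_vms = concurrent_jobs / VCPU_PER_VM    # ~8,700 VMs at steady state
peak_vms = runner_vms * PEAK_FACTOR           # ~26,000 VMs at Monday peak
```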
High-Level Architecture
GitLab is a Rails monolith with service extraction in progress. Core components: Workhorse (Go; handles large file uploads/downloads, bypassing Rails for performance), Gitaly (Go; the Git RPC service that owns all repository disk access and exposes it over gRPC), Sidekiq (background job processing), GitLab Runner (CI/CD job executor), Container Registry (Docker registry), and PostgreSQL + Redis for data storage. A coordinator service dispatches CI jobs from a queue to available runners: runners poll the coordinator, receive job assignments, execute jobs in isolated environments (Docker containers or VMs), and report results back.
CI/CD pipeline flow: a git push invokes Gitaly's post-receive hook, which notifies the GitLab API. The pipeline YAML is parsed, validating syntax and resolving includes. Pipeline stages and jobs are created as database records, and pending jobs are queued in a Redis-backed priority queue. GitLab Runners (registered with the coordinator) poll the queue via long polling (/api/v4/jobs/request). The coordinator assigns the next queued job to the polling runner and returns the job configuration (image, script, variables, artifacts). The runner starts a Docker container (or VM), executes the job script, streams logs back to GitLab via a streaming log API, and uploads artifacts to object storage on job completion.
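A minimal .gitlab-ci.yml showing the stage/job structure this flow parses (job names, images, and scripts are illustrative):

```yaml
stages:
  - build
  - test
  - deploy

build-job:
  stage: build
  image: node:20          # job runs in a fresh container from this image
  script:
    - npm ci
    - npm run build
  artifacts:
    paths:
      - dist/             # uploaded to object storage on completion

unit-tests:
  stage: test             # starts only after the build stage succeeds
  image: node:20
  script:
    - npm test

deploy-job:
  stage: deploy
  script:
    - ./deploy.sh
  rules:
    - if: $CI_COMMIT_BRANCH == "main"   # only deploy from main
```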
Multi-tenant isolation is achieved via Docker-in-Docker (DinD) or Kubernetes pod sandboxing. Each CI job runs in an ephemeral Docker container or Kubernetes pod with no persistent state. Network policies (Kubernetes NetworkPolicy) prevent inter-job communication. Resource limits (CPU, memory, disk) are enforced via container cgroups. Privileged containers are disallowed by default (except for specific Docker-related jobs). GitLab.com uses dedicated auto-scaling runner groups for different job types (Linux, macOS, Windows, GPU) powered by GCP, AWS, or bare-metal.
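The inter-job network isolation can be expressed as a default-deny policy; a minimal sketch (the namespace and names are illustrative, not GitLab.com's actual configuration):

```yaml
# Default-deny ingress for all CI job pods in a namespace: with Ingress
# listed in policyTypes and no ingress rules given, all inbound pod-to-pod
# traffic is blocked, preventing one tenant's job from reaching another's.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ci-jobs-default-deny
  namespace: ci-jobs
spec:
  podSelector: {}          # applies to every pod in the namespace
  policyTypes:
    - Ingress
```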
Core Components
GitLab Runner Architecture
Runners are the CI/CD compute agents. Each runner is a Go binary that supports multiple executors: Shell (run scripts directly on runner host), Docker (run each job in a fresh container), Kubernetes (create a pod per job), VirtualBox (run in a VM). Runners register with the GitLab coordinator using a registration token, establishing a long-lived relationship. Multiple runners can be registered per project (for parallelism) or shared across a GitLab instance (instance runners). Auto-scaling runners (via gitlab-runner on AWS/GCP) use cloud instance APIs to spin up new VMs when queue depth exceeds thresholds and terminate them when idle, enabling elastic CI capacity.
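A toy version of the auto-scaling decision described above, assuming queue depth and idle-runner counts come from the coordinator's metrics (the thresholds and jobs-per-VM figure are illustrative):

```python
def scale_decision(queue_depth: int, idle_runners: int,
                   jobs_per_vm: int = 4, max_idle: int = 2) -> int:
    """Return the number of runner VMs to add (positive) or remove (negative).

    Illustrative policy: provision enough VMs to drain any backlog, and
    terminate standby VMs beyond a small idle buffer.
    """
    if queue_depth > 0:
        # Backlog: ceil(queue_depth / jobs_per_vm) new VMs to absorb it.
        return -(-queue_depth // jobs_per_vm)
    if idle_runners > max_idle:
        # Surplus idle capacity: scale the extra VMs back in.
        return -(idle_runners - max_idle)
    return 0
```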
Pipeline Scheduler & Queue
The pipeline scheduler determines job execution order from stage dependencies: jobs in the same stage run in parallel, and the next stage starts only after all jobs in the current stage succeed. Jobs are stored in PostgreSQL with state transitions (created → pending → running → success/failed). Pending jobs are mirrored to a Redis priority queue for fast runner polling. Runners call the coordinator's /api/v4/jobs/request endpoint roughly every 3 seconds, with each request held open as a long poll for up to 50 seconds when no job is available. The coordinator dequeues the next eligible job (checking runner tags and job-runner compatibility) and returns it to the polling runner atomically, using RPOPLPUSH (LMOVE in Redis 6.2+) for atomic dequeue plus in-flight tracking.
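An in-memory sketch of that atomic dequeue-plus-in-flight pattern: Redis performs the move in one server-side operation (RPOPLPUSH/LMOVE), so a crash between "dequeue" and "assign" cannot lose the job; here a plain function stands in for it, and job names and tags are illustrative:

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Set

pending: Deque[str] = deque(["job-1", "job-2", "job-3"])  # FIFO job queue
in_flight: Dict[str, List[str]] = {}                      # runner -> claimed jobs

def request_job(runner_id: str, runner_tags: Set[str],
                job_tags: Dict[str, Set[str]]) -> Optional[str]:
    """Return the next queued job whose required tags the runner satisfies."""
    for _ in range(len(pending)):
        job = pending.popleft()                  # dequeue head of queue
        if job_tags.get(job, set()) <= runner_tags:
            in_flight.setdefault(runner_id, []).append(job)  # track in-flight
            return job
        pending.append(job)                      # incompatible: requeue at tail
    return None                                  # queue empty or no match
```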
Container Registry
GitLab's container registry is a fork of Docker Distribution (now CNCF Distribution). Images are stored as content-addressed layers in object storage (GCS or S3). The manifest (list of layers + config) is stored in the registry database (PostgreSQL). On docker push, layers are streamed directly to object storage via the Workhorse upload bypass (Rails is not in the data path for large uploads). On docker pull, the manifest is fetched from PostgreSQL and layers are redirected (HTTP 307) to presigned object storage URLs, enabling client download directly from storage without proxying through the registry service. Image garbage collection (removing unreferenced layers) runs as a background job.
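Content addressing is what makes layer dedup and garbage collection tractable; a greatly simplified model, with a dict standing in for object storage:

```python
import hashlib
from typing import Dict, List

# A layer is keyed by the SHA-256 of its bytes, so identical layers are
# stored once and a manifest is just a list of digests.
blob_store: Dict[str, bytes] = {}

def push_layer(data: bytes) -> str:
    digest = "sha256:" + hashlib.sha256(data).hexdigest()
    blob_store.setdefault(digest, data)    # dedup: same bytes, same address
    return digest

def push_manifest(layers: List[bytes]) -> List[str]:
    return [push_layer(layer) for layer in layers]

# Two images sharing a base layer store that layer exactly once.
app_v1 = push_manifest([b"base-os-layer", b"app-v1-layer"])
app_v2 = push_manifest([b"base-os-layer", b"app-v2-layer"])
```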
Database Design
PostgreSQL stores all GitLab metadata with extensive normalization: projects (id, namespace_id, path, visibility, repository_size, ci_config_path), merge_requests (id, target_project_id, source_branch, target_branch, state, author_id), ci_pipelines (id, project_id, sha, ref, status, created_at), ci_builds (id, pipeline_id, stage, name, status, runner_id, started_at, finished_at), users (id, username, email, created_at). PostgreSQL is scaled via read replicas (PgBouncer connection pooling + streaming replication) and Patroni for HA/automatic failover. GitLab uses Gitaly for repository data and PostgreSQL exclusively for relational data — no MongoDB or Cassandra.
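A trimmed-down sketch of the ci_pipelines/ci_builds tables and a job state transition, with SQLite standing in for PostgreSQL and most columns omitted:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE ci_pipelines (
  id         INTEGER PRIMARY KEY,
  project_id INTEGER NOT NULL,
  sha        TEXT    NOT NULL,
  ref        TEXT    NOT NULL,
  status     TEXT    NOT NULL DEFAULT 'created'
);
CREATE TABLE ci_builds (
  id          INTEGER PRIMARY KEY,
  pipeline_id INTEGER NOT NULL REFERENCES ci_pipelines(id),
  stage       TEXT    NOT NULL,
  name        TEXT    NOT NULL,
  status      TEXT    NOT NULL DEFAULT 'created',
  runner_id   INTEGER
);
""")
# A push creates a pipeline record and its job rows...
db.execute("INSERT INTO ci_pipelines (project_id, sha, ref) VALUES (1, 'abc123', 'main')")
db.execute("INSERT INTO ci_builds (pipeline_id, stage, name) VALUES (1, 'test', 'unit-tests')")
# ...and the scheduler walks each job through the state machine.
db.execute("UPDATE ci_builds SET status = 'pending' WHERE name = 'unit-tests'")
```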
API Design
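The key endpoint in this design is the runner job-request exchange on /api/v4/jobs/request. A simplified model of its request/response shapes as Python dicts (field names are abbreviated and illustrative, not the exact GitLab API payloads):

```python
# Runner -> coordinator: authenticate and advertise capabilities.
job_request = {
    "token": "runner-registration-token",   # authenticates the runner
    "info": {"executor": "docker", "features": {"artifacts": True}},
}

# Coordinator -> runner: 204 No Content when the queue is empty;
# otherwise a job payload roughly shaped like this.
job_response = {
    "id": 12345,
    "token": "per-job-token",               # scoped credential for this job only
    "image": "node:20",
    "steps": [{"name": "script", "script": ["npm ci", "npm test"]}],
    "variables": [{"key": "CI_COMMIT_SHA", "value": "abc123"}],
    "artifacts": [{"paths": ["dist/"]}],
}
```

The per-job token matters for multi-tenant isolation: the runner uses it (not the long-lived registration token) when streaming logs and uploading artifacts, so a compromised job can act only on itself.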
Scaling & Bottlenecks
PostgreSQL is GitLab's most significant scaling bottleneck. GitLab.com's PostgreSQL cluster handles 100,000+ queries/sec at peak. Optimizations: (1) aggressive query optimization (pg_stat_statements to identify slow queries); (2) partial indexes (e.g., an index on ci_builds WHERE status = 'pending' for fast queue scans); (3) table partitioning (ci_builds partitioned by created_at, with old partitions archived and detached); (4) read replicas for analytics queries; (5) Sidekiq async processing to defer non-critical writes. Connection pooling via PgBouncer (transaction mode) is essential: each Rails process holds its own connection pool, and PgBouncer multiplexes ~10,000 client-side connections onto ~200 PostgreSQL server connections.
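Two of these ideas sketched together, with SQLite standing in for PostgreSQL (both support this syntax): a partial index over pending builds, and the status-guarded UPDATE that makes a job claim atomic so no build can be assigned to two runners:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ci_builds (id INTEGER PRIMARY KEY, status TEXT, runner_id INTEGER)")
# Partial index: covers only the (small) pending subset, so queue scans stay
# fast even when the table holds hundreds of millions of finished builds.
db.execute("CREATE INDEX idx_pending ON ci_builds(id) WHERE status = 'pending'")
db.execute("INSERT INTO ci_builds VALUES (1, 'pending', NULL)")

def claim(runner_id: int, build_id: int) -> bool:
    """Atomically claim a build; the status guard makes double-claims fail."""
    cur = db.execute(
        "UPDATE ci_builds SET status = 'running', runner_id = ? "
        "WHERE id = ? AND status = 'pending'",
        (runner_id, build_id))
    return cur.rowcount == 1   # 0 rows updated => someone else claimed it
```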
CI/CD queue latency spikes during mass push events (e.g., midnight automated commits). A 1,000-job spike needs 1,000 concurrent job slots immediately (roughly 300,000 runner-seconds of work at 5 minutes per job). Auto-scaling absorbs this within 3–5 minutes (VM provisioning time), but queue wait times stay elevated until the new capacity arrives. Pre-provisioning a pool of warm runner VMs (always-on standby capacity) shortens the response to spikes at the cost of idle VM expense. Spot/preemptible instances reduce that idle-capacity cost by about 70%.
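A rough sense of the warm-pool cost trade-off; the VM price and pool size here are illustrative assumptions, not GitLab figures, with only the ~70% spot discount taken from the text:

```python
ON_DEMAND_HOURLY = 0.20     # $/VM-hour for a 4 vCPU VM (assumed)
SPOT_DISCOUNT = 0.70        # the ~70% spot/preemptible saving cited above
WARM_POOL_VMS = 500         # standby fleet size (assumed)
HOURS_PER_MONTH = 730

# Monthly cost of keeping the standby fleet always on.
warm_on_demand = WARM_POOL_VMS * ON_DEMAND_HOURLY * HOURS_PER_MONTH
warm_spot = warm_on_demand * (1 - SPOT_DISCOUNT)
```

At these assumed prices the on-demand pool runs $73,000/month versus about $21,900/month on spot capacity, which is why spot instances are attractive for standby fleets despite the risk of preemption mid-job.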
Key Trade-offs
- Shared runners vs. dedicated runners: Shared runners on GitLab.com provide elastic capacity but share compute with all tenants; dedicated runners give teams full control and predictable performance but require self-managed infrastructure
- Docker executor vs. Kubernetes executor: the Docker executor with autoscaled VMs is slower to start a job (VM boot time) but can dedicate a full VM to each job, giving stronger isolation; the Kubernetes executor starts pods in seconds, but jobs on the same node share a kernel, a weaker boundary
- Monolith vs. microservices for GitLab: The Rails monolith allows rapid feature development and is easier to operate; gradual service extraction (Gitaly, Container Registry) improves scalability for specific bottlenecks without full decomposition complexity
- PostgreSQL vs. NoSQL for CI data: PostgreSQL's ACID transactions ensure reliable CI state transitions (a job cannot be double-assigned to two runners); NoSQL would require application-level transaction logic for the same guarantees