System Design: GitLab
Design a self-hosted and cloud Git platform like GitLab with integrated CI/CD, container registry, and DevSecOps capabilities. Covers runner architecture, pipeline execution, and multi-tenant isolation.
Requirements
Functional Requirements:
- Git repository hosting with branch protection and code review
- Integrated CI/CD pipelines defined as YAML (.gitlab-ci.yml)
- Container registry for Docker images
- Issue tracking, epics, and project boards
- Package registry (npm, Maven, PyPI, Docker)
- SAST/DAST security scanning integrated in pipelines
Non-Functional Requirements:
- Support 10 million projects and 30 million users (GitLab.com scale)
- CI pipeline job latency: under 60 seconds from trigger to job start
- 99.95% availability
- Multi-tenant isolation: one customer's pipeline cannot affect another's
- Horizontal scaling: add GitLab Runners to increase CI capacity
Scale Estimation
GitLab.com runs about 10 million CI/CD pipeline jobs per day (~115 jobs/sec). Each job averages 5 minutes of compute, so roughly 115 × 300 ≈ 34,500 jobs run concurrently at steady state. Assuming 1 vCPU per job and 4 vCPUs per runner VM, that's ~8,625 runner VMs. Peak load (Monday morning, post-merge rushes) is 3x average: ~26,000 runner VMs. Repository storage: 10 million projects at 100 MB average = 1 PB. Container registry: 100 million Docker image layers at 50 MB each = 5 PB. PostgreSQL stores all structured metadata (projects, users, MRs, issues, pipeline records), an estimated 10 TB of database data.
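The fleet sizing above is an application of Little's law (concurrent jobs = arrival rate × job duration); a quick sketch with the stated figures, where the 1-vCPU-per-job figure is an assumption implied by the estimate:

```python
# Back-of-envelope runner fleet sizing using the figures from the estimate.
# Assumption (implied above): each job consumes 1 vCPU.
JOBS_PER_DAY = 10_000_000
JOB_SECONDS = 5 * 60
VCPU_PER_VM = 4
PEAK_FACTOR = 3

jobs_per_sec = JOBS_PER_DAY / 86_400          # ~115.7 jobs/sec
concurrent_jobs = jobs_per_sec * JOB_SECONDS  # Little's law: L = lambda * W
runner_vms = concurrent_jobs / VCPU_PER_VM    # ~8,700 VMs at steady state
peak_vms = runner_vms * PEAK_FACTOR           # ~26,000 VMs at Monday peak
```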
High-Level Architecture
GitLab is a Rails monolith with service extraction in progress. Core components: Workhorse (Go; handles large file uploads/downloads, bypassing Rails for performance), Gitaly (Go; the Git RPC service that owns all repository disk access and exposes it over gRPC), Sidekiq (background job processing), GitLab Runner (CI/CD job executor), Container Registry (Docker registry), and PostgreSQL + Redis for data storage. A coordinator service dispatches CI jobs from a queue to available runners: runners poll the coordinator, receive job assignments, execute jobs in isolated environments (Docker containers or VMs), and report results back.
CI/CD pipeline flow: a git push invokes Gitaly's post-receive hook, which notifies the GitLab API. The pipeline YAML is parsed, validating syntax and resolving includes. Pipeline stages and jobs are created as database records, and pending jobs are queued in a Redis-backed priority queue. GitLab Runners (registered with the coordinator) poll the queue via long polling (/api/v4/jobs/request). The coordinator assigns the next queued job to the polling runner and returns the job configuration (image, script, variables, artifacts). The runner starts a Docker container (or VM), executes the job script, streams logs back to GitLab via a streaming log API, and uploads artifacts to object storage on job completion.
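A minimal .gitlab-ci.yml showing the stage/job structure this flow parses (job names, images, and scripts are illustrative):

```yaml
stages:
  - build
  - test
  - deploy

build-job:
  stage: build
  image: node:20          # job runs in a fresh container from this image
  script:
    - npm ci
    - npm run build
  artifacts:
    paths:
      - dist/             # uploaded to object storage on completion

unit-tests:
  stage: test             # starts only after the build stage succeeds
  image: node:20
  script:
    - npm test

deploy-job:
  stage: deploy
  script:
    - ./deploy.sh
  rules:
    - if: $CI_COMMIT_BRANCH == "main"   # only deploy from main
```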
Multi-tenant isolation is achieved via Docker-in-Docker (DinD) or Kubernetes pod sandboxing. Each CI job runs in an ephemeral Docker container or Kubernetes pod with no persistent state. Network policies (Kubernetes NetworkPolicy) prevent inter-job communication. Resource limits (CPU, memory, disk) are enforced via container cgroups. Privileged containers are disallowed by default (except for specific Docker-related jobs). GitLab.com uses dedicated auto-scaling runner groups for different job types (Linux, macOS, Windows, GPU) powered by GCP, AWS, or bare-metal.
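The inter-job network isolation can be expressed as a default-deny policy; a minimal sketch (the namespace and names are illustrative, not GitLab.com's actual configuration):

```yaml
# Default-deny ingress for all CI job pods in a namespace: with Ingress
# listed in policyTypes and no ingress rules given, all inbound pod-to-pod
# traffic is blocked, preventing one tenant's job from reaching another's.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ci-jobs-default-deny
  namespace: ci-jobs
spec:
  podSelector: {}          # applies to every pod in the namespace
  policyTypes:
    - Ingress
```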
Core Components
GitLab Runner Architecture
Runners are the CI/CD compute agents. Each runner is a Go binary that supports multiple executors: Shell (run scripts directly on runner host), Docker (run each job in a fresh container), Kubernetes (create a pod per job), VirtualBox (run in a VM). Runners register with the GitLab coordinator using a registration token, establishing a long-lived relationship. Multiple runners can be registered per project (for parallelism) or shared across a GitLab instance (instance runners). Auto-scaling runners (via gitlab-runner on AWS/GCP) use cloud instance APIs to spin up new VMs when queue depth exceeds thresholds and terminate them when idle, enabling elastic CI capacity.
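A toy version of the auto-scaling decision described above, assuming queue depth and idle-runner counts come from the coordinator's metrics (the thresholds and jobs-per-VM figure are illustrative):

```python
def scale_decision(queue_depth: int, idle_runners: int,
                   jobs_per_vm: int = 4, max_idle: int = 2) -> int:
    """Return the number of runner VMs to add (positive) or remove (negative).

    Illustrative policy: provision enough VMs to drain any backlog, and
    terminate standby VMs beyond a small idle buffer.
    """
    if queue_depth > 0:
        # Backlog: ceil(queue_depth / jobs_per_vm) new VMs to absorb it.
        return -(-queue_depth // jobs_per_vm)
    if idle_runners > max_idle:
        # Surplus idle capacity: scale the extra VMs back in.
        return -(idle_runners - max_idle)
    return 0
```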
Pipeline Scheduler & Queue
The pipeline scheduler determines job execution order from stage dependencies: jobs in the same stage run in parallel, and the next stage starts only after all jobs in the current stage succeed. Jobs are stored in PostgreSQL with state transitions (created → pending → running → success/failed). Pending jobs are mirrored to a Redis priority queue for fast runner polling. Runners call the coordinator's /api/v4/jobs/request endpoint roughly every 3 seconds, with each request held open as a long poll for up to 50 seconds when no job is available. The coordinator dequeues the next eligible job (checking runner tags and job-runner compatibility) and returns it to the polling runner atomically, using RPOPLPUSH (LMOVE in Redis 6.2+) for atomic dequeue plus in-flight tracking.
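An in-memory sketch of that atomic dequeue-plus-in-flight pattern: Redis performs the move in one server-side operation (RPOPLPUSH/LMOVE), so a crash between "dequeue" and "assign" cannot lose the job; here a plain function stands in for it, and job names and tags are illustrative:

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Set

pending: Deque[str] = deque(["job-1", "job-2", "job-3"])  # FIFO job queue
in_flight: Dict[str, List[str]] = {}                      # runner -> claimed jobs

def request_job(runner_id: str, runner_tags: Set[str],
                job_tags: Dict[str, Set[str]]) -> Optional[str]:
    """Return the next queued job whose required tags the runner satisfies."""
    for _ in range(len(pending)):
        job = pending.popleft()                  # dequeue head of queue
        if job_tags.get(job, set()) <= runner_tags:
            in_flight.setdefault(runner_id, []).append(job)  # track in-flight
            return job
        pending.append(job)                      # incompatible: requeue at tail
    return None                                  # queue empty or no match
```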
Container Registry
GitLab's container registry is a fork of Docker Distribution (now CNCF Distribution). Images are stored as content-addressed layers in object storage (GCS or S3). The manifest (list of layers + config) is stored in the registry database (PostgreSQL). On docker push, layers are streamed directly to object storage via the Workhorse upload bypass (Rails is not in the data path for large uploads). On docker pull, the manifest is fetched from PostgreSQL and layers are redirected (HTTP 307) to presigned object storage URLs, enabling client download directly from storage without proxying through the registry service. Image garbage collection (removing unreferenced layers) runs as a background job.
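Content addressing is what makes layer dedup and garbage collection tractable; a greatly simplified model, with a dict standing in for object storage:

```python
import hashlib
from typing import Dict, List

# A layer is keyed by the SHA-256 of its bytes, so identical layers are
# stored once and a manifest is just a list of digests.
blob_store: Dict[str, bytes] = {}

def push_layer(data: bytes) -> str:
    digest = "sha256:" + hashlib.sha256(data).hexdigest()
    blob_store.setdefault(digest, data)    # dedup: same bytes, same address
    return digest

def push_manifest(layers: List[bytes]) -> List[str]:
    return [push_layer(layer) for layer in layers]

# Two images sharing a base layer store that layer exactly once.
app_v1 = push_manifest([b"base-os-layer", b"app-v1-layer"])
app_v2 = push_manifest([b"base-os-layer", b"app-v2-layer"])
```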
Database Design
PostgreSQL stores all GitLab metadata with extensive normalization: projects (id, namespace_id, path, visibility, repository_size, ci_config_path), merge_requests (id, target_project_id, source_branch, target_branch, state, author_id), ci_pipelines (id, project_id, sha, ref, status, created_at), ci_builds (id, pipeline_id, stage, name, status, runner_id, started_at, finished_at), users (id, username, email, created_at). PostgreSQL is scaled via read replicas (PgBouncer connection pooling + streaming replication) and Patroni for HA/automatic failover. GitLab uses Gitaly for repository data and PostgreSQL exclusively for relational data — no MongoDB or Cassandra.
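A trimmed-down sketch of the ci_pipelines/ci_builds tables and a job state transition, with SQLite standing in for PostgreSQL and most columns omitted:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE ci_pipelines (
  id         INTEGER PRIMARY KEY,
  project_id INTEGER NOT NULL,
  sha        TEXT    NOT NULL,
  ref        TEXT    NOT NULL,
  status     TEXT    NOT NULL DEFAULT 'created'
);
CREATE TABLE ci_builds (
  id          INTEGER PRIMARY KEY,
  pipeline_id INTEGER NOT NULL REFERENCES ci_pipelines(id),
  stage       TEXT    NOT NULL,
  name        TEXT    NOT NULL,
  status      TEXT    NOT NULL DEFAULT 'created',
  runner_id   INTEGER
);
""")
# A push creates a pipeline record and its job rows...
db.execute("INSERT INTO ci_pipelines (project_id, sha, ref) VALUES (1, 'abc123', 'main')")
db.execute("INSERT INTO ci_builds (pipeline_id, stage, name) VALUES (1, 'test', 'unit-tests')")
# ...and the scheduler walks each job through the state machine.
db.execute("UPDATE ci_builds SET status = 'pending' WHERE name = 'unit-tests'")
```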
API Design
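The key endpoint in this design is the runner job-request exchange on /api/v4/jobs/request. A simplified model of its request/response shapes as Python dicts (field names are abbreviated and illustrative, not the exact GitLab API payloads):

```python
# Runner -> coordinator: authenticate and advertise capabilities.
job_request = {
    "token": "runner-registration-token",   # authenticates the runner
    "info": {"executor": "docker", "features": {"artifacts": True}},
}

# Coordinator -> runner: 204 No Content when the queue is empty;
# otherwise a job payload roughly shaped like this.
job_response = {
    "id": 12345,
    "token": "per-job-token",               # scoped credential for this job only
    "image": "node:20",
    "steps": [{"name": "script", "script": ["npm ci", "npm test"]}],
    "variables": [{"key": "CI_COMMIT_SHA", "value": "abc123"}],
    "artifacts": [{"paths": ["dist/"]}],
}
```

The per-job token matters for multi-tenant isolation: the runner uses it (not the long-lived registration token) when streaming logs and uploading artifacts, so a compromised job can act only on itself.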
Scaling & Bottlenecks
PostgreSQL is GitLab's most significant scaling bottleneck. GitLab.com's PostgreSQL cluster handles 100,000+ queries/sec at peak. Optimizations: (1) aggressive query optimization (pg_stat_statements to identify slow queries); (2) partial indexes (e.g., an index on ci_builds WHERE status = 'pending' for fast queue scans); (3) table partitioning (ci_builds partitioned by created_at, with old partitions archived and detached); (4) read replicas for analytics queries; (5) Sidekiq async processing to defer non-critical writes. Connection pooling via PgBouncer (transaction mode) is essential: each Rails process holds its own connection pool, and PgBouncer multiplexes ~10,000 client-side connections onto ~200 PostgreSQL server connections.
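Two of these ideas sketched together, with SQLite standing in for PostgreSQL (both support this syntax): a partial index over pending builds, and the status-guarded UPDATE that makes a job claim atomic so no build can be assigned to two runners:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ci_builds (id INTEGER PRIMARY KEY, status TEXT, runner_id INTEGER)")
# Partial index: covers only the (small) pending subset, so queue scans stay
# fast even when the table holds hundreds of millions of finished builds.
db.execute("CREATE INDEX idx_pending ON ci_builds(id) WHERE status = 'pending'")
db.execute("INSERT INTO ci_builds VALUES (1, 'pending', NULL)")

def claim(runner_id: int, build_id: int) -> bool:
    """Atomically claim a build; the status guard makes double-claims fail."""
    cur = db.execute(
        "UPDATE ci_builds SET status = 'running', runner_id = ? "
        "WHERE id = ? AND status = 'pending'",
        (runner_id, build_id))
    return cur.rowcount == 1   # 0 rows updated => someone else claimed it
```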
CI/CD queue latency spikes during mass push events (e.g., midnight automated commits). A 1,000-job spike needs 1,000 concurrent job slots immediately (roughly 300,000 runner-seconds of work at 5 minutes per job). Auto-scaling absorbs this within 3–5 minutes (VM provisioning time), but queue wait times stay elevated until the new capacity arrives. Pre-provisioning a pool of warm runner VMs (always-on standby capacity) shortens the response to spikes at the cost of idle VM expense. Spot/preemptible instances reduce that idle-capacity cost by about 70%.
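A rough sense of the warm-pool cost trade-off; the VM price and pool size here are illustrative assumptions, not GitLab figures, with only the ~70% spot discount taken from the text:

```python
ON_DEMAND_HOURLY = 0.20     # $/VM-hour for a 4 vCPU VM (assumed)
SPOT_DISCOUNT = 0.70        # the ~70% spot/preemptible saving cited above
WARM_POOL_VMS = 500         # standby fleet size (assumed)
HOURS_PER_MONTH = 730

# Monthly cost of keeping the standby fleet always on.
warm_on_demand = WARM_POOL_VMS * ON_DEMAND_HOURLY * HOURS_PER_MONTH
warm_spot = warm_on_demand * (1 - SPOT_DISCOUNT)
```

At these assumed prices the on-demand pool runs $73,000/month versus about $21,900/month on spot capacity, which is why spot instances are attractive for standby fleets despite the risk of preemption mid-job.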
Key Trade-offs
- Shared runners vs. dedicated runners: Shared runners on GitLab.com provide elastic capacity but share compute with all tenants; dedicated runners give teams full control and predictable performance but require self-managed infrastructure
- Docker executor vs. Kubernetes executor: the Docker executor with autoscaled VMs is slower to start a job (VM boot time) but can dedicate a full VM to each job, giving stronger isolation; the Kubernetes executor starts pods in seconds, but jobs on the same node share a kernel, a weaker boundary
- Monolith vs. microservices for GitLab: The Rails monolith allows rapid feature development and is easier to operate; gradual service extraction (Gitaly, Container Registry) improves scalability for specific bottlenecks without full decomposition complexity
- PostgreSQL vs. NoSQL for CI data: PostgreSQL's ACID transactions ensure reliable CI state transitions (a job cannot be double-assigned to two runners); NoSQL would require application-level transaction logic for the same guarantees