System Design: Container Orchestration (Kubernetes-scale)
Design a container orchestration system like Kubernetes that schedules, scales, and manages containerized workloads across thousands of nodes with self-healing, rolling updates, and declarative configuration.
Requirements
Functional Requirements:
- Schedule containers onto nodes based on resource requests, affinity rules, and taints/tolerations
- Maintain desired replica count: restart failed containers, reschedule evicted pods
- Roll out updates with zero downtime: rolling updates, blue-green, canary strategies
- Auto-scale workloads based on CPU, memory, and custom metrics (HPA/VPA)
- Service discovery and load balancing for containerized services
- Manage configuration and secrets injection into containers
Non-Functional Requirements:
- Support clusters of up to 5,000 nodes and 150,000 pods
- Scheduling latency under 1 second per pod from submission to node assignment
- Control plane availability 99.99% — cluster must tolerate control plane failures without disrupting running workloads
- API server throughput: 1,000 write requests/sec and 10,000 read requests/sec
Scale Estimation
At Kubernetes' documented limits: 5,000 nodes, 150,000 pods, 300,000 containers. The etcd cluster storing all state holds ~1.5 GB of data (all pod specs, node status, ConfigMaps, Secrets). The API server handles 10,000 watch connections from controllers and kubelets — every state change fans out to every watcher of that resource. The scheduler evaluates pod placement: at 150,000 pods on 5,000 nodes, a naive O(pods × nodes) algorithm would perform 750M feasibility checks, so aggressive filtering and node sampling are required. Control plane: 3-5 API server instances behind a load balancer; 3-node etcd cluster.
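As a sanity check on these numbers, a quick back-of-envelope calculation (the per-object size is an assumed average, not a measured figure):

```go
package main

import "fmt"

func main() {
	const (
		nodes      = 5_000
		pods       = 150_000
		objSizeKiB = 10 // assumed average encoded object size
	)

	// Naive scheduling: every node evaluated for every pod.
	fmt.Printf("naive feasibility checks: %d\n", nodes*pods) // 750,000,000

	// Rough etcd footprint if pod objects dominate at ~10 KiB each.
	fmt.Printf("approx state size: %.1f GiB\n", float64(pods*objSizeKiB)/(1024*1024))
}
```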
High-Level Architecture
Kubernetes follows a hub-and-spoke control plane model. The API server is the single entry point — all components read and write through it. State is persisted in etcd, a distributed strongly-consistent key-value store using Raft consensus. The API server is stateless; multiple replicas run behind a load balancer. Controllers (Deployment controller, ReplicaSet controller, Node controller, etc.) are reconciliation loops that watch the API server for desired state and take actions to converge actual state toward desired state.
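The reconciliation pattern is simple enough to sketch. Below is a minimal, hypothetical version: the types and the Cluster interface are invented for illustration, while a real controller consumes cached state through client-go informers and enqueues keys into a rate-limited work queue.

```go
package controller

// Hypothetical, stripped-down types: a real controller reads state through
// API-server watches (client-go informers), not direct calls.
type Deployment struct {
	Name     string
	Replicas int // desired replica count
}

type Cluster interface {
	RunningPods(deployment string) int
	CreatePod(deployment string)
	DeletePod(deployment string)
}

// reconcile is level-triggered: each pass compares current actual state with
// desired state, so missed or duplicated watch events are harmless.
func reconcile(c Cluster, d Deployment) {
	actual := c.RunningPods(d.Name)
	for ; actual < d.Replicas; actual++ { // too few replicas: create
		c.CreatePod(d.Name)
	}
	for ; actual > d.Replicas; actual-- { // too many replicas: delete
		c.DeletePod(d.Name)
	}
}
```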
The scheduler watches for unscheduled pods (pods with no nodeName assigned) and selects a node for each. It does not execute the placement itself — it records the decision by writing the nodeName back through the API server, via the pod's binding subresource. Each node runs a kubelet that watches for pods assigned to its node and starts/stops containers via the container runtime (containerd or CRI-O). The kube-proxy on each node maintains iptables/ipvs rules for Service VIP routing.
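A sketch of that handoff with client-go (pod and node names are placeholders; error handling elided): the scheduler posts a Binding rather than mutating the pod directly.

```go
package main

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, _ := rest.InClusterConfig() // error handling elided for brevity
	client, _ := kubernetes.NewForConfig(cfg)

	// The decision is recorded through the API server as a Binding on the
	// pod's binding subresource; the scheduler never contacts the node.
	binding := &v1.Binding{
		ObjectMeta: metav1.ObjectMeta{Name: "mypod", Namespace: "default"},
		Target:     v1.ObjectReference{Kind: "Node", Name: "node-42"},
	}
	_ = client.CoreV1().Pods("default").Bind(context.Background(), binding, metav1.CreateOptions{})
}
```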
The watch mechanism is the central nervous system: components don't poll the API server — they establish long-lived HTTP/2 streaming watches. The API server buffers recent events in a watch cache (backed by etcd's watch API) and delivers them to watchers. This reduces etcd load and enables sub-second propagation of state changes across the entire system.
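A minimal watch consumer using client-go might look like this (assuming in-cluster credentials; a production component would use an informer, which layers relisting and reconnection on top of raw watches):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, _ := rest.InClusterConfig() // error handling elided for brevity
	client, _ := kubernetes.NewForConfig(cfg)

	// Long-lived streaming watch on pods. ResourceVersion "0" asks the API
	// server to serve the initial state from its watch cache, not etcd.
	w, _ := client.CoreV1().Pods("default").Watch(context.Background(),
		metav1.ListOptions{ResourceVersion: "0"})
	defer w.Stop()

	for ev := range w.ResultChan() {
		fmt.Println(ev.Type) // ADDED, MODIFIED, or DELETED
	}
}
```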
Core Components
Scheduler
The scheduler uses a two-phase algorithm. Filtering: eliminate nodes that cannot run the pod (insufficient CPU/memory, missing labels for nodeSelector, taints without matching tolerations, violated affinity rules). Scoring: rank the remaining feasible nodes using weighted scoring functions (LeastRequestedPriority: prefer nodes with most free resources; BalancedResourceAllocation: prefer nodes where CPU and memory usage are balanced; NodeAffinityPriority: prefer nodes matching soft affinity). The highest-scoring node wins. At scale, running all scoring plugins against all 5,000 nodes is expensive — the scheduler stops searching once it has found a configurable percentage of feasible nodes (percentageOfNodesToScore; the default scales down from 50% at 100 nodes to 5% at 5,000 nodes), preserving placement quality while bounding latency.
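A toy version of the two phases, reduced to resource-fit predicates and a LeastRequested-style score (the real scheduler runs chains of pluggable filter and score plugins over much richer state):

```go
package scheduler

// Toy models for illustration only.
type Node struct {
	Name             string
	FreeCPU, FreeMem int64 // millicores / MiB
}

type Pod struct {
	ReqCPU, ReqMem int64
}

// filter drops nodes that cannot fit the pod (feasibility predicates).
func filter(nodes []Node, p Pod) []Node {
	var feasible []Node
	for _, n := range nodes {
		if n.FreeCPU >= p.ReqCPU && n.FreeMem >= p.ReqMem {
			feasible = append(feasible, n)
		}
	}
	return feasible
}

// score is a LeastRequested-style priority: more headroom after placing
// the pod means a higher score.
func score(n Node, p Pod) int64 {
	return (n.FreeCPU - p.ReqCPU) + (n.FreeMem - p.ReqMem)
}

// schedule returns the best-scoring feasible node, or "" if none fits.
func schedule(nodes []Node, p Pod) string {
	best, bestScore := "", int64(-1)
	for _, n := range filter(nodes, p) {
		if s := score(n, p); s > bestScore {
			best, bestScore = n.Name, s
		}
	}
	return best
}
```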
Controller Manager
All controllers run in a single binary (kube-controller-manager) with leader election — only one instance is active at a time, preventing split-brain. Each controller is an independent reconciliation loop: watch for objects of its type, compute the delta between desired and actual state, and take corrective actions. The Deployment controller watches Deployments and manages ReplicaSets (creating and scaling them). The ReplicaSet controller watches ReplicaSets and manages Pods (creating and deleting them). The Node controller watches node heartbeats (the kubelet updates node.status.conditions every 10 seconds) and marks a node NotReady after 40 seconds of missed heartbeats, triggering eviction of its pods after 5 minutes.
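Leader election is available as a client-go library; a sketch using its Lease-based lock (the lease name, namespace, identity, and timings here are illustrative):

```go
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, _ := rest.InClusterConfig() // error handling elided for brevity
	client, _ := kubernetes.NewForConfig(cfg)

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "my-controller", Namespace: "kube-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: "instance-1"},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // how long a lease is valid
		RenewDeadline: 10 * time.Second, // leader must renew within this window
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* run reconciliation loops */ },
			OnStoppedLeading: func() { /* stop work; a standby instance takes over */ },
		},
	})
}
```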
etcd
etcd uses Raft for strong consistency. All writes to Kubernetes state go through etcd — a pod creation is a write to etcd, acknowledged only after a majority of etcd nodes commit. Kubernetes stores all objects as protobuf-encoded values under path-structured keys (/registry/pods/default/mypod). The API server's watch cache keeps a ring buffer of recent events (default 100 events per resource type) to serve watch catch-up without hitting etcd for every reconnecting watcher. etcd compaction runs periodically to remove old revisions and bound memory growth.
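Interacting with etcd directly shows the primitives Kubernetes builds on; a sketch with the etcd v3 client (the endpoint and value are placeholders, and a real API server writes protobuf-encoded objects):

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to the etcd cluster (endpoint is a placeholder).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-0:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Kubernetes-style path-structured key.
	cli.Put(ctx, "/registry/pods/default/mypod", "pod-spec-bytes")

	// Prefix watch: this is what the API server's watch cache sits on top of.
	for resp := range cli.Watch(ctx, "/registry/pods/", clientv3.WithPrefix()) {
		for _, ev := range resp.Events {
			fmt.Printf("%s %s (rev %d)\n", ev.Type, ev.Kv.Key, resp.Header.Revision)
		}
	}
}
```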
Database Design
etcd is the only persistent store. All Kubernetes objects (Pods, Deployments, Services, ConfigMaps, Secrets, etc.) are stored as protobuf-encoded values with JSON API compatibility. Keys follow a hierarchical pattern: /registry/{resource}/{namespace}/{name}. Resource versions (etcd revision numbers) are used for optimistic concurrency — every API write includes the last-known resourceVersion; if the object has been modified since, the write fails with a 409 Conflict, and the client must re-read and retry.
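The canonical client-side pattern is a read-modify-write loop that retries on conflict; client-go ships a helper for exactly this (the pod name and label below are illustrative):

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/retry"
)

func main() {
	cfg, _ := rest.InClusterConfig() // error handling elided for brevity
	client, _ := kubernetes.NewForConfig(cfg)
	ctx := context.Background()

	// Read-modify-write with retry: if another writer bumped the object's
	// resourceVersion in the meantime, Update fails with 409 Conflict and
	// the closure runs again with a fresh read.
	_ = retry.RetryOnConflict(retry.DefaultRetry, func() error {
		pod, err := client.CoreV1().Pods("default").Get(ctx, "mypod", metav1.GetOptions{})
		if err != nil {
			return err
		}
		if pod.Labels == nil {
			pod.Labels = map[string]string{}
		}
		pod.Labels["touched"] = "true" // mutate the freshly read copy
		_, err = client.CoreV1().Pods("default").Update(ctx, pod, metav1.UpdateOptions{})
		return err
	})
}
```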
For large clusters, etcd performance is preserved by keeping individual objects small (ConfigMaps and Secrets are capped at 1 MiB), bounding total keyspace size (etcd's recommended storage limit is 8 GB), splitting high-churn resources onto separate etcd clusters (Events typically get their own cluster, configured via the API server's --etcd-servers-overrides flag, so event churn cannot starve Pod state), and running etcd on NVMe SSDs (fsync latency is the primary write bottleneck).
API Design
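The API is resource-oriented REST over HTTPS. Core resources live under /api/v1 (e.g., GET /api/v1/namespaces/default/pods/mypod); grouped resources live under /apis/{group}/{version} (e.g., /apis/apps/v1/namespaces/default/deployments). The standard verbs are get, list, watch, create (POST), update (PUT), patch, and delete; adding ?watch=true to a list request turns the response into the streaming event feed described above. Every object carries apiVersion, kind, and metadata (including the resourceVersion used for optimistic concurrency), plus a spec written by clients to declare desired state and a status written by controllers to report observed state.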
Scaling & Bottlenecks
The API server becomes a bottleneck at large scale due to watch fan-out: 10,000 watchers × every pod event = high CPU serializing and sending events. Mitigation: the watch cache serves watches from memory without hitting etcd; API priority and fairness (APF) throttles low-priority requests (bulk list operations) to protect high-priority scheduling traffic. etcd write latency grows with cluster size as more objects are written per second — keeping etcd on fast local NVMe storage (not networked storage) is critical.
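APF's core idea, isolating concurrency budgets per priority level, can be illustrated with a toy seat-limited executor (an analogy only; the real APF machinery adds flow distinction, fair queuing, and request cost estimation):

```go
package apiserver

// Each priority level gets its own bounded concurrency budget, so a flood of
// low-priority list requests cannot exhaust the seats needed by scheduling
// traffic.
type priorityLevel struct{ seats chan struct{} }

func newPriorityLevel(concurrency int) *priorityLevel {
	return &priorityLevel{seats: make(chan struct{}, concurrency)}
}

func (p *priorityLevel) Do(handler func()) {
	p.seats <- struct{}{}        // acquire a seat (blocks when the level is saturated)
	defer func() { <-p.seats }() // release the seat
	handler()
}
```

With, say, newPriorityLevel(200) for scheduling traffic and newPriorityLevel(20) for bulk lists, a misbehaving batch client can saturate only its own budget.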
The scheduler's throughput limits the pod creation rate. A single scheduler instance handles roughly 100 pods/sec; above this, the pending-pod queue grows. Mitigation: the scheduler uses worker goroutines for parallel node scoring, and multiple scheduler instances can run under different scheduler names for different workload classes (batch jobs vs. latency-sensitive services).
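A sketch of that fan-out, reusing the toy Node/Pod/score definitions from the scheduler sketch above (the worker-pool shape mirrors kube-scheduler's parallelism setting, which defaults to 16):

```go
package scheduler

import "sync"

// scoreParallel fans node scoring out over a fixed pool of goroutines and
// collects per-node scores by index, so no locking is needed on the slice.
func scoreParallel(nodes []Node, p Pod, workers int) []int64 {
	scores := make([]int64, len(nodes))
	work := make(chan int)
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range work {
				scores[i] = score(nodes[i], p)
			}
		}()
	}
	for i := range nodes {
		work <- i
	}
	close(work)
	wg.Wait()
	return scores
}
```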
Key Trade-offs
- Centralized etcd vs. distributed storage: etcd's strong consistency ensures controllers never act on stale state, but its single-leader write path limits write throughput; a horizontally scalable store (e.g., CockroachDB) would improve write scaling but adds operational complexity
- Imperative vs. declarative API: Declarative (desired state) enables self-healing and GitOps workflows but requires controllers to reconcile state continuously; imperative would be simpler but loses self-healing
- In-tree vs. out-of-tree plugins: Moving cloud-provider and storage plugins out-of-tree (CSI, CCM) decouples their release cycles but fragments the ecosystem
- Pod-level vs. container-level scheduling: Scheduling at the pod level (all containers in a pod are co-located) enables efficient sidecar patterns but reduces scheduling flexibility compared to per-container placement