System Design: Health Check & Heartbeat System

Design a health check and heartbeat system that monitors service availability, detects failures within seconds, and integrates with load balancers and service discovery to automatically route traffic away from unhealthy instances.

14 min read · Updated Jan 15, 2025
Tags: system-design, health-check, heartbeat, observability, infrastructure

Requirements

Functional Requirements:

  • Expose health check endpoints on every service: liveness (is the process alive?) and readiness (is it ready to serve traffic?)
  • Detect unhealthy instances within 10 seconds and remove from load balancer rotation
  • Aggregate health status for dependencies: database connections, cache reachability, downstream services
  • Distinguish between temporary degradation (not ready) and permanent failure (not alive)
  • Provide a global health dashboard: overall system health, per-service status, dependency graph
  • Support scheduled maintenance: mark instances as draining without triggering alerts

Non-Functional Requirements:

  • Health check probes complete in under 100ms (must not block on slow operations)
  • False positive rate under 0.1%: healthy instances incorrectly marked as failing
  • Monitor 100,000 service instances across 1,000 services
  • Health status propagates to load balancers and service discovery within 5 seconds of failure

Scale Estimation

100,000 instances, each health checked every 10 seconds = 10,000 health checks/sec. Each check is an HTTP GET to /healthz returning a JSON payload (~500 bytes). Network: 10,000 × 500 bytes = 5 MB/sec of health check responses, which is negligible. Health check infrastructure: if centralized (one checker polls all instances), that is 10,000 HTTP requests/sec per checker server, easily handled by a Go or C++ HTTP client with connection pooling. Decentralized checking (each load balancer node checks its own backends) distributes the load: 100 LB nodes × 1,000 backends each covers all 100,000 instances, and at one check per backend every 10 seconds each LB node issues only 100 checks/sec.
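
As a quick sanity check, the same arithmetic in Go (the constants are just the figures above; nothing new is introduced):

```go
package main

import "fmt"

// Back-of-the-envelope load numbers for the health check fleet.
func main() {
	const (
		instances   = 100_000 // fleet size
		intervalSec = 10      // one check per instance every 10 seconds
		payloadB    = 500     // ~500-byte /healthz JSON response
		lbNodes     = 100     // decentralized split across LB nodes
	)
	fmt.Println("checks/sec total:      ", instances/intervalSec)          // 10,000
	fmt.Println("response bytes/sec:    ", instances/intervalSec*payloadB) // 5,000,000 (~5 MB/s)
	fmt.Println("checks/sec per LB node:", instances/lbNodes/intervalSec)  // 100
}
```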

High-Level Architecture

Health checks operate at two levels: active (the health check system initiates requests to services) and passive (the service itself reports status via heartbeats or the load balancer infers health from real traffic error rates).

Active health checks: the load balancer or a dedicated health check service sends HTTP/TCP/gRPC probes to each service instance at regular intervals. If N consecutive probes fail (typically N = 2 or 3), the instance is marked unhealthy. Active checks are the most reliable way to detect dead instances, but they add probe traffic and require network reachability between the checker and the checked instance.
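
A minimal sketch of such an active prober in Go. The /healthz path and 10-second interval come from the estimates above; the 2-second per-probe timeout, the function names, and the ejection callback are assumptions of this sketch:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// activeProbe polls one instance's /healthz on a fixed interval and fires
// onUnhealthy after failThreshold consecutive failures (a timeout, a
// connection error, or a non-200 response all count as failures).
func activeProbe(ctx context.Context, url string, interval time.Duration,
	failThreshold int, onUnhealthy func()) {
	client := &http.Client{Timeout: 2 * time.Second} // assumed per-probe timeout
	failures := 0
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			resp, err := client.Get(url)
			if err == nil {
				resp.Body.Close()
			}
			if err == nil && resp.StatusCode == http.StatusOK {
				failures = 0
				continue
			}
			failures++
			if failures == failThreshold {
				onUnhealthy() // e.g. remove the backend from the LB pool
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	activeProbe(ctx, "http://10.0.0.1:8080/healthz", 10*time.Second, 3,
		func() { fmt.Println("backend unhealthy; ejecting") })
}
```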

Passive health checks (outlier detection): infer health from real traffic. If an upstream backend returns 5xx responses for >10% of requests in a rolling 10-request window, Envoy's outlier detection ejects it from the load balancing pool for a configurable period (30 seconds base, doubling on repeated ejections). Passive checks detect real-traffic failures (not just probe failures) and respond faster than waiting for the next active probe cycle.
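
A simplified model of passive outlier detection in Go. The 10-request window, the >10% 5xx threshold, and the doubling 30-second base ejection mirror the Envoy behavior described above; the type and method names are illustrative, not Envoy's API:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// outlierDetector tracks the last 10 real responses for one backend and
// ejects it when 5xx responses exceed 10% of the window (2 or more of 10).
type outlierDetector struct {
	mu        sync.Mutex
	window    []bool // true = 5xx response
	ejections uint   // consecutive ejections; the backoff doubles each time
	ejectedTo time.Time
}

// Record feeds one real response's status code into the rolling window.
func (d *outlierDetector) Record(status int) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.window = append(d.window, status >= 500)
	if len(d.window) > 10 {
		d.window = d.window[1:]
	}
	errs := 0
	for _, is5xx := range d.window {
		if is5xx {
			errs++
		}
	}
	if len(d.window) == 10 && errs >= 2 { // >10% of a 10-request window
		d.ejectedTo = time.Now().Add(30 * time.Second << d.ejections) // 30s, 60s, 120s, ...
		d.ejections++
		d.window = d.window[:0]
	}
}

// Ejected reports whether the backend is currently out of the pool.
func (d *outlierDetector) Ejected() bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	return time.Now().Before(d.ejectedTo)
}

func main() {
	d := &outlierDetector{}
	for _, s := range []int{502, 502, 200, 200, 200, 200, 200, 200, 200, 200} {
		d.Record(s)
	}
	fmt.Println("ejected:", d.Ejected()) // true: 2/10 = 20% of the window
}
```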

The Kubernetes probe model separates two failure modes. Liveness probes: if one fails repeatedly, the kubelet restarts the container (the process is stuck, perhaps deadlocked). Readiness probes: if one fails, the pod is removed from the Service's endpoints (the process is alive but not ready: still warming up, or its DB connection is failing). This separation prevents restart loops for temporarily overloaded services while still removing them from traffic.

Core Components

Liveness vs. Readiness Probe Design

Liveness checks must be fast and independent of external dependencies: they should verify only that the process is alive and not deadlocked. A liveness endpoint that checks database connectivity will cause container restarts whenever the DB is temporarily slow, which is exactly the wrong remediation. Liveness: return 200 if the process can respond to HTTP requests. Readiness: check actual serving capability. Can the service connect to its primary DB? Is its in-memory cache warm? Is current request latency within SLO? Startup probe: a third probe type with a longer initial timeout (for slow-starting services such as JVM applications) that hands off to the liveness probe after its first success.
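
A sketch of that split using Go's net/http. The paths /livez and /readyz, the database/sql handle, and the 100ms readiness budget are assumptions of this sketch; the point is that the liveness handler makes no external calls:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"net/http"
	"time"
	// A real deployment also needs a driver import,
	// e.g. _ "github.com/lib/pq", and a valid DSN below.
)

func registerProbes(mux *http.ServeMux, db *sql.DB) {
	// Liveness: if this handler runs at all, the process is alive and the
	// HTTP event loop is not deadlocked. Deliberately no external calls.
	mux.HandleFunc("/livez", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: verify actual serving capability (here, the primary DB),
	// with a tight budget so the probe itself stays well under 100ms.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 100*time.Millisecond)
		defer cancel()
		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "not ready: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	mux := http.NewServeMux()
	registerProbes(mux, db)
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```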

Dependency Health Aggregation

A /healthz/detailed endpoint returns the health of each dependency: {status: degraded, checks: [{name: postgresql, status: ok, latency_ms: 5}, {name: redis, status: failing, error: connection refused}, {name: payment-service, status: ok}]}. The aggregated status uses a severity model: if any critical dependency is failing → unhealthy (remove from LB); if a non-critical dependency is failing → degraded (keep in LB but reduce weight). Critical vs. non-critical is defined by the service team. This granularity enables targeted debugging — a dashboard shows exactly which dependency is causing a service's health degradation.
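
A sketch of the severity model in Go. Which dependencies count as critical is hard-coded here for illustration; a real service would read that from configuration:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Check is one dependency's result; Critical decides whether its failure
// makes the whole instance unhealthy or merely degraded.
type Check struct {
	Name     string `json:"name"`
	Status   string `json:"status"` // "ok" or "failing"
	Error    string `json:"error,omitempty"`
	Critical bool   `json:"-"`
}

// aggregate applies the severity model: any failing critical dependency
// makes the instance "unhealthy"; a failing non-critical one, "degraded".
func aggregate(checks []Check) string {
	status := "ok"
	for _, c := range checks {
		if c.Status != "failing" {
			continue
		}
		if c.Critical {
			return "unhealthy"
		}
		status = "degraded"
	}
	return status
}

func detailedHandler(w http.ResponseWriter, r *http.Request) {
	// In a real service each entry would come from probing the dependency
	// with a short timeout; results are hard-coded here for illustration.
	checks := []Check{
		{Name: "postgresql", Status: "ok", Critical: true},
		{Name: "redis", Status: "failing", Error: "connection refused"},
		{Name: "payment-service", Status: "ok"},
	}
	overall := aggregate(checks)
	if overall == "unhealthy" {
		w.WriteHeader(http.StatusServiceUnavailable) // signals the LB to eject
	}
	json.NewEncoder(w).Encode(map[string]any{"status": overall, "checks": checks})
}

func main() {
	http.HandleFunc("/healthz/detailed", detailedHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```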

Health Status Propagation

When a health check fails, the status must propagate to all consumers (load balancers, service discovery, alerting) within 5 seconds. In Kubernetes: when a readiness probe fails, the kubelet updates the pod's readiness condition; the endpoint controller watches pod conditions and removes the pod's IP from the service's Endpoints object; kube-proxy on each node watches Endpoints and updates iptables/ipvs rules. Total propagation: probe failure → kubelet update (0-10s, depending on probe period) → endpoint controller (1-2s) → kube-proxy (1-2s), a worst case of roughly 14 seconds. To reduce detection latency, shorten the readiness probe period (5s instead of 10s) while keeping a failure threshold of 2-3 consecutive failures so transient blips don't eject healthy pods; the period × threshold product is the knob that trades detection speed against false positives.
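
The same budget as a worked calculation in Go, using the per-hop figures from the paragraph above:

```go
package main

import (
	"fmt"
	"time"
)

// worstCase: up to period x failureThreshold for the kubelet to accumulate
// enough failed probes, plus endpoint-controller and kube-proxy watch lag.
func worstCase(period time.Duration, failureThreshold int, controller, proxy time.Duration) time.Duration {
	return period*time.Duration(failureThreshold) + controller + proxy
}

func main() {
	// 10s period, threshold 1, 2s + 2s watch lag: the ~14s worst case above.
	fmt.Println(worstCase(10*time.Second, 1, 2*time.Second, 2*time.Second)) // 14s
	// 5s period with threshold 2 keeps the same ~14s ceiling but with far
	// fewer false positives; 5s with threshold 1 cuts the ceiling to 9s.
	fmt.Println(worstCase(5*time.Second, 2, 2*time.Second, 2*time.Second)) // 14s
	fmt.Println(worstCase(5*time.Second, 1, 2*time.Second, 2*time.Second)) // 9s
}
```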

Database Design

Health check state is ephemeral — no database needed for the health check values themselves. Current health status is stored in Redis with TTL (key = service:instance_id, value = {status, last_check, failure_count}, TTL = 30 seconds). If a health check daemon crashes and stops updating, keys expire and the instance is treated as unhealthy (fail-safe).
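
A sketch of the fail-safe write path, assuming the go-redis client. The key layout and 30-second TTL follow the paragraph above; the Redis address and identifiers are placeholders:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// instanceStatus matches the value shape from the paragraph above.
type instanceStatus struct {
	Status       string    `json:"status"`
	LastCheck    time.Time `json:"last_check"`
	FailureCount int       `json:"failure_count"`
}

// writeStatus rewrites the key with a fresh 30s TTL on every check. If the
// checker crashes and stops writing, the key expires and readers treat the
// instance as unhealthy by default (fail-safe).
func writeStatus(ctx context.Context, rdb *redis.Client, service, instanceID string, s instanceStatus) error {
	payload, err := json.Marshal(s)
	if err != nil {
		return err
	}
	return rdb.Set(ctx, service+":"+instanceID, payload, 30*time.Second).Err()
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"}) // placeholder address
	err := writeStatus(context.Background(), rdb, "checkout", "i-0abc123",
		instanceStatus{Status: "healthy", LastCheck: time.Now()})
	fmt.Println(err)
}
```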

For the global health dashboard, PostgreSQL stores historical health data: health_events (service, instance_id, status: healthy/unhealthy/degraded, detected_at, resolved_at, failure_reason). This data powers SLA reporting (uptime % per service per month), incident post-mortems, and trend analysis (services with increasing failure frequency need investigation).
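
A sketch of the table and an uptime query, embedded in Go string constants. The columns are the ones named above; the index, the 30-day window, and the exact uptime definition are assumptions:

```go
package main

import "fmt"

// Schema for historical health data; columns follow the design above.
const healthEventsDDL = `
CREATE TABLE health_events (
    id             BIGSERIAL PRIMARY KEY,
    service        TEXT NOT NULL,
    instance_id    TEXT NOT NULL,
    status         TEXT NOT NULL CHECK (status IN ('healthy', 'unhealthy', 'degraded')),
    detected_at    TIMESTAMPTZ NOT NULL,
    resolved_at    TIMESTAMPTZ,          -- NULL while the event is ongoing
    failure_reason TEXT
);
CREATE INDEX ON health_events (service, detected_at);`

// Monthly uptime fraction per service: 1 minus time spent in resolved
// unhealthy events over a 30-day window. Services with no unhealthy events
// simply don't appear (100% uptime); ongoing events (resolved_at IS NULL)
// are ignored in this sketch.
const uptimeQuery = `
SELECT service,
       1.0 - EXTRACT(EPOCH FROM COALESCE(SUM(resolved_at - detected_at), INTERVAL '0'))
             / EXTRACT(EPOCH FROM INTERVAL '30 days') AS uptime_fraction
FROM health_events
WHERE status = 'unhealthy'
  AND detected_at >= now() - INTERVAL '30 days'
GROUP BY service;`

func main() { fmt.Println(healthEventsDDL, uptimeQuery) }
```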

API Design

The health API surface is small. GET /healthz is the liveness endpoint (200 when the process is responsive, 503 otherwise), GET /healthz/detailed returns the per-dependency breakdown described above, and a separate readiness endpoint backs the load balancer and kubelet checks. For the scheduled-maintenance requirement, an administrative endpoint (for example POST /admin/drain; the name is illustrative) marks an instance as draining so the LB stops sending it new traffic without any alert firing.

Scaling & Bottlenecks

Centralized active health checking (one service checks all 100,000 instances) creates a bottleneck if the checker is slow. Mitigations: shard the checker by service or instance range, or use co-located checking (each load balancer checks its own backends rather than relying on a central service). At 10,000 checks/sec with a 100ms timeout, Little's law puts 10,000 × 0.1 = 1,000 requests in flight at any moment, so the checker needs about 1,000 concurrent HTTP connections; a connection-pooled HTTP client handles this easily.

False positive mitigation is an operational scaling challenge: network glitches cause transient check failures. A failure threshold of two consecutive failures (rather than one), paired with a recovery threshold of two consecutive successes before re-adding, provides hysteresis that eliminates most false positives while keeping failure detection fast.
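
That hysteresis as a tiny state machine in Go; the 2-failure and 2-success thresholds are the ones from the paragraph above:

```go
package main

import "fmt"

// health flips to unhealthy after failN consecutive failures and back to
// healthy only after okN consecutive successes, giving hysteresis.
type health struct {
	healthy             bool
	failures, successes int
	failN, okN          int
}

func (h *health) observe(ok bool) {
	if ok {
		h.successes++
		h.failures = 0
		if !h.healthy && h.successes >= h.okN {
			h.healthy = true
		}
	} else {
		h.failures++
		h.successes = 0
		if h.healthy && h.failures >= h.failN {
			h.healthy = false
		}
	}
}

func main() {
	h := &health{healthy: true, failN: 2, okN: 2}
	// A single glitch (fail then pass) never ejects; two consecutive
	// failures do, and two consecutive successes are needed to re-add.
	for _, ok := range []bool{false, true, false, false, true, true} {
		h.observe(ok)
		fmt.Printf("probe ok=%v -> healthy=%v\n", ok, h.healthy)
	}
}
```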

Key Trade-offs

  • Active vs. passive health checks: Active checks detect failures even during idle traffic but add overhead; passive checks are free (inferred from real traffic) but require sufficient traffic volume to detect failures quickly
  • Tight vs. loose health thresholds: Low failure threshold (1 failed check) detects issues faster but increases false positives; higher threshold (3 failures) is more stable but delays detection
  • Liveness combined with dependency checks: Combining DB connectivity into liveness probes seems thorough but causes container restarts during DB outages — the wrong response to a DB failure; keep liveness minimal and dependency checks in readiness only
  • Push vs. pull health reporting: Services pushing their health status to a central registry is more reliable for detecting silent failures (if push stops, service is assumed unhealthy) but requires all services to implement push; pull (central checker) is easier to add to existing services but relies on network reachability
