Software Monitoring and Observability · Chapter 32 of 42

Heartbeats and Healthchecks

Akhil Sharma
20 min

💓 Health Checks and Heartbeats: Is Your System Alive?

Imagine if your heart stopped beating and nobody noticed for an hour. Terrifying! That's why we have health checks and heartbeats.

Health Checks: The Wellness Checkup

What is a Health Check?

The Doctor Visit Analogy: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Doctor asks:

  • Are you breathing? (Process running?)

  • Is your heart beating? (Database connected?)

  • Can you walk? (APIs responding?)

  • Do you feel okay? (Error rate normal?)

If all answers are "yes" → Healthy ✓

If any answer is "no" → Unhealthy ❌

Your Web Server: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The health check endpoint is usually GET /health.

It returns the server's status in JSON format.

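As a rough sketch, the response body might be built like this. The field names (status, timestamp, uptimeSeconds) are illustrative assumptions, not a standard:

```javascript
// Builds an illustrative health payload. Field names are assumptions.
function buildHealthPayload(startedAtMs, nowMs = Date.now()) {
  return {
    status: 'healthy',
    timestamp: new Date(nowMs).toISOString(),
    uptimeSeconds: Math.floor((nowMs - startedAtMs) / 1000),
  };
}

// Pretend the server started 90 seconds ago.
console.log(JSON.stringify(buildHealthPayload(Date.now() - 90_000)));
```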

Types of Health Checks:

1. Shallow Health Check (Quick Check)

Purpose: "Is the server responding?"

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Implementation: GET /health

Response time: 1-5ms

Checks: Just server process

Example:

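A minimal sketch of a shallow check using Node's built-in http module. The handler name and the 404 fallback are assumptions made for illustration:

```javascript
const http = require('http');

// Shallow check: touches no database or downstream service, so it
// only proves that the process is up and able to answer requests.
function shallowHealth(req, res) {
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ status: 'ok' }));
}

// Hypothetical wiring: answer GET /health, 404 for anything else.
function createServer() {
  return http.createServer((req, res) => {
    if (req.method === 'GET' && req.url === '/health') {
      return shallowHealth(req, res);
    }
    res.writeHead(404);
    res.end();
  });
}
```

Calling createServer().listen(3000) would start it; the port is an arbitrary choice.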

Use case:

  • Load balancer health check

  • Called every 5 seconds

  • Must be FAST

Pro: Fast, low overhead

Con: Doesn't check dependencies

2. Deep Health Check (Thorough Check)

Purpose: "Is everything working?"

━━━━━━━━━━━━━━━━━━━━

Implementation: GET /health/deep

Response time: 100-500ms

Checks: Server process plus its dependencies (database, downstream services, disk space)

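One way to sketch the "check everything" step is to run each dependency check with a timeout and aggregate the results. The function name, the 500ms default, and the stand-in checks are all assumptions:

```javascript
// Runs each dependency check with a timeout and aggregates the results.
// `checks` maps a name to an async function that resolves when the
// dependency is reachable and rejects (or hangs) when it is not.
async function runDeepChecks(checks, timeoutMs = 500) {
  const results = {};
  await Promise.all(
    Object.entries(checks).map(async ([name, check]) => {
      const started = Date.now();
      let timer;
      const timeout = new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error('timed out')), timeoutMs);
      });
      try {
        await Promise.race([check(), timeout]);
        results[name] = { ok: true, latencyMs: Date.now() - started };
      } catch (err) {
        results[name] = { ok: false, error: err.message };
      } finally {
        clearTimeout(timer);
      }
    })
  );
  const healthy = Object.values(results).every((r) => r.ok);
  return { status: healthy ? 'healthy' : 'unhealthy', checks: results };
}

// Stand-in checks; a real service would ping its database, cache, etc.
const exampleChecks = {
  database: async () => {},
  cache: async () => { throw new Error('connection refused'); },
};

// The cache check fails, so the overall status is "unhealthy".
runDeepChecks(exampleChecks).then((report) => console.log(report.status));
```

The timeout matters because a hung dependency would otherwise hang the health check itself.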

Use case:

  • Manual debugging

  • Detailed monitoring

  • Called less frequently than shallow checks

Pro: Comprehensive; covers dependencies as well as the server process

Con: Slower, more resource intensive

Real Health Check Implementation:

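The following is a sketch rather than a production implementation. It serves both endpoints from one handler and maps an unhealthy result to HTTP 503. The dependency names are hypothetical, and per-check timeouts are omitted for brevity:

```javascript
const http = require('http');

// Hypothetical dependency checks; each resolves when the dependency is
// reachable and rejects otherwise. Replace with real pings.
const dependencies = {
  database: async () => {},
  queue: async () => {},
};

// Returns a request handler serving /health and /health/deep.
function createHealthHandler(deps) {
  return async function handle(req, res) {
    const send = (code, body) => {
      res.writeHead(code, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify(body));
    };
    if (req.url === '/health') {
      return send(200, { status: 'ok' });
    }
    if (req.url === '/health/deep') {
      const checks = {};
      for (const [name, check] of Object.entries(deps)) {
        try {
          await check();
          checks[name] = 'ok';
        } catch (err) {
          checks[name] = `failed: ${err.message}`;
        }
      }
      const healthy = Object.values(checks).every((v) => v === 'ok');
      // 503 lets load balancers and monitors treat the instance as
      // down without parsing the body.
      return send(healthy ? 200 : 503, {
        status: healthy ? 'healthy' : 'unhealthy',
        checks,
      });
    }
    send(404, { error: 'not found' });
  };
}

const server = http.createServer(createHealthHandler(dependencies));
// server.listen(3000); // port is an arbitrary choice
```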

Connection to Previous Topics: Health checks help eliminate single points of failure (SPOF). If one server fails, the load balancer detects it via health checks and routes traffic to the remaining healthy servers.
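
To make the load balancer side concrete, here is a small sketch of how a pool might track health check results. The class name and the threshold of three consecutive failures are assumptions; real load balancers expose this as configuration:

```javascript
// Tracks backend health the way a load balancer might: a backend is
// removed from rotation after `failThreshold` consecutive failed checks
// and restored after the next success.
class BackendPool {
  constructor(backends, failThreshold = 3) {
    this.failThreshold = failThreshold;
    this.failures = new Map(backends.map((b) => [b, 0]));
    this.next = 0;
  }

  // Record the outcome of one health check for a known backend.
  report(backend, ok) {
    this.failures.set(backend, ok ? 0 : this.failures.get(backend) + 1);
  }

  healthy() {
    return [...this.failures.keys()].filter(
      (b) => this.failures.get(b) < this.failThreshold
    );
  }

  // Round-robin over healthy backends only; null if none are left.
  pick() {
    const up = this.healthy();
    if (up.length === 0) return null;
    return up[this.next++ % up.length];
  }
}

const pool = new BackendPool(['app-1', 'app-2']);
for (let i = 0; i < 3; i++) pool.report('app-1', false);
console.log(pool.healthy()); // app-1 is out of rotation after 3 failures
```

Requiring several consecutive failures avoids pulling a server out of rotation because of one slow response.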

Heartbeats: The Continuous Pulse

What is a Heartbeat?

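A heartbeat is a periodic signal that a service sends to prove it is still running. A health check is pulled by the monitor; a heartbeat is pushed by the service.

The Difference: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━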

Health Check (Pull):

Monitor ────────> Server

Heartbeat (Push):

Server ────────> Monitor

Real-World Heartbeat Example:

Background Worker Process:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

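A minimal sketch of a worker that pushes heartbeats while it processes jobs. The store here is a plain Map standing in for Redis or a monitoring API, and the function names and key format are assumptions:

```javascript
// Starts sending heartbeats and returns a function that stops them.
// `store` is any object with a set(key, value) method.
function startHeartbeat(store, workerId, intervalMs = 30_000) {
  const beat = () => store.set(`worker:${workerId}:heartbeat`, Date.now());
  beat(); // announce immediately instead of waiting one interval
  const timer = setInterval(beat, intervalMs);
  return () => clearInterval(timer);
}

// Hypothetical worker loop: the heartbeat runs alongside the real work.
async function runWorker(store, workerId, jobs) {
  const stopHeartbeat = startHeartbeat(store, workerId);
  try {
    for (const job of jobs) await job();
  } finally {
    stopHeartbeat(); // a clean shutdown stops the pulse too
  }
}

const store = new Map();
runWorker(store, '123', [async () => console.log('processing job')]);
```

Because the timer shares the event loop with the jobs, a job that blocks the loop also delays the heartbeat, so a stuck worker tends to go quiet.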

Heartbeat Patterns:

Pattern 1: Simple Timestamp ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Worker → Redis: SET worker:123:heartbeat 1697712345000

Monitor checks: "Last heartbeat 20 seconds ago" ✓
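
On the monitor side, the timestamp check can be sketched as a comparison against a timeout. The 60-second timeout and the second worker are assumptions added for illustration:

```javascript
// Returns the ids of workers whose last heartbeat is older than
// timeoutMs. `heartbeats` maps worker id to a millisecond timestamp.
function findStaleWorkers(heartbeats, nowMs, timeoutMs = 60_000) {
  const stale = [];
  for (const [workerId, lastBeatMs] of heartbeats) {
    if (nowMs - lastBeatMs > timeoutMs) stale.push(workerId);
  }
  return stale;
}

const now = 1697712365000; // 20 seconds after worker 123's heartbeat
const heartbeats = new Map([
  ['123', 1697712345000], // 20s ago: alive
  ['456', 1697712275000], // 90s ago: stale
]);
console.log(findStaleWorkers(heartbeats, now)); // [ '456' ]
```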

Pattern 2: Detailed Status ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Worker →

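A sketch of what a detailed heartbeat message might contain. Every field name here is an illustrative assumption about what a monitor would want to see:

```javascript
// Builds a detailed heartbeat message from the worker's current state.
function buildHeartbeat(worker, nowMs = Date.now()) {
  return {
    workerId: worker.id,
    status: worker.currentJob ? 'busy' : 'idle',
    currentJob: worker.currentJob ?? null,
    jobsCompleted: worker.jobsCompleted,
    memoryMb: Math.round(process.memoryUsage().rss / 1024 / 1024),
    timestamp: nowMs,
  };
}

const message = buildHeartbeat({
  id: 'worker-123',
  currentJob: 'resize-image-42',
  jobsCompleted: 17,
});
console.log(JSON.stringify(message, null, 2));
```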

Monitor knows what the worker is doing right now, not just that it is alive.

Pattern 3: Dead Man's Switch ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Worker updates key every 30 seconds

Key expires after 60 seconds

If worker dies: Key expires automatically

Monitor: "Key missing = worker dead" ❌
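
The expiry behaviour can be sketched with an in-memory store whose keys carry a TTL. This class is a hypothetical stand-in for a store such as Redis, with an injectable clock so the expiry is easy to see:

```javascript
// In-memory stand-in for a key-value store with expiring keys.
class ExpiringStore {
  constructor(clock = Date.now) {
    this.clock = clock;
    this.entries = new Map();
  }

  set(key, value, ttlMs) {
    this.entries.set(key, { value, expiresAt: this.clock() + ttlMs });
  }

  // Expired keys behave as if they were never written.
  has(key) {
    const entry = this.entries.get(key);
    if (!entry) return false;
    if (this.clock() >= entry.expiresAt) {
      this.entries.delete(key);
      return false;
    }
    return true;
  }
}

let fakeNow = 0;
const store = new ExpiringStore(() => fakeNow);

store.set('worker:123:alive', '1', 60_000); // worker refreshes the key
fakeNow = 30_000;
console.log(store.has('worker:123:alive')); // true: still within the TTL
fakeNow = 61_000; // the worker died and never refreshed
console.log(store.has('worker:123:alive')); // false: key expired
```

The appeal of this pattern is that the monitor needs no timestamp arithmetic: a missing key is the signal.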

Real System Design Example

Distributed Task Queue System: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Architecture:

[Queue] ← Tasks waiting

[Worker 1] [Worker 2] [Worker 3]

Health Checks:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Each worker exposes: GET /health

Returns:

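A sketch of what a worker's health response might look like. The state fields, and the choice to treat a lost queue connection as unhealthy, are assumptions:

```javascript
// Illustrative worker state; in a real worker these values would be
// updated by the job loop and the queue client.
const workerState = {
  id: 'worker1',
  queueConnected: true,
  activeJobs: 2,
  startedAtMs: Date.now(),
};

// Maps worker state to an HTTP status code and JSON body.
function workerHealth(state, nowMs = Date.now()) {
  const healthy = state.queueConnected;
  return {
    statusCode: healthy ? 200 : 503,
    body: {
      status: healthy ? 'healthy' : 'unhealthy',
      workerId: state.id,
      activeJobs: state.activeJobs,
      uptimeSeconds: Math.floor((nowMs - state.startedAtMs) / 1000),
    },
  };
}

console.log(workerHealth(workerState));
```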

Load balancer checks every 10 seconds

Heartbeats: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Every 30 seconds, each worker:

Redis SET heartbeat:worker1


TTL: 60 seconds (auto-expires if worker dies)
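
A sketch of the write itself. The client is assumed to expose set(key, value, { EX: seconds }), which is the shape node-redis v4 uses; other Redis clients spell the expiry option differently. The payload fields are hypothetical:

```javascript
// Writes one heartbeat that expires after 60 seconds unless refreshed.
async function sendHeartbeat(client, workerId, status) {
  const value = JSON.stringify({ ...status, timestamp: Date.now() });
  await client.set(`heartbeat:${workerId}`, value, { EX: 60 });
}

// Typical use, refreshing at half the TTL so a single missed beat
// does not expire the key:
//   setInterval(() => sendHeartbeat(client, 'worker1', { activeJobs: 2 }), 30_000);
```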

Monitor Dashboard:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Worker 1: ✓ Alive (last heartbeat: 15s ago), processing 2 jobs

Worker 2: ✓ Alive (last heartbeat: 8s ago), processing 3 jobs

Worker 3: ❌ DEAD (last heartbeat: 75s ago)

  • Alert sent

  • Auto-restart initiated

  • Jobs reassigned to Workers 1 and 2
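
The monitor's decision above can be sketched as a single pass over the workers. The function name, the worker object shape, and round-robin reassignment are assumptions; alerting and restarting are left to the caller:

```javascript
// One monitoring pass: classify workers by heartbeat age and move the
// jobs of dead workers onto live ones, round-robin.
function sweep(workers, nowMs, timeoutMs = 60_000) {
  const alive = workers.filter((w) => nowMs - w.lastHeartbeatMs <= timeoutMs);
  const dead = workers.filter((w) => nowMs - w.lastHeartbeatMs > timeoutMs);
  if (alive.length > 0) {
    let i = 0;
    for (const worker of dead) {
      // splice(0) empties the dead worker's list and returns its jobs.
      for (const job of worker.jobs.splice(0)) {
        alive[i++ % alive.length].jobs.push(job);
      }
    }
  }
  return { alive, dead };
}

const now = 100_000;
const workers = [
  { id: 'worker1', lastHeartbeatMs: now - 15_000, jobs: ['a', 'b'] },
  { id: 'worker2', lastHeartbeatMs: now - 8_000, jobs: ['c', 'd', 'e'] },
  { id: 'worker3', lastHeartbeatMs: now - 75_000, jobs: ['f', 'g'] },
];
const { dead } = sweep(workers, now);
console.log(dead.map((w) => w.id)); // [ 'worker3' ]
console.log(workers[0].jobs, workers[1].jobs);
```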

The Complete Monitoring Stack

Real Production Setup:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Layer 1: Infrastructure Health

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✓ Server CPU, memory, disk

✓ Network connectivity

✓ Load balancer health checks

Tool: AWS CloudWatch, Prometheus

Layer 2: Application Health

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✓ /health endpoints

✓ Error rates

✓ Response times

Tool: Datadog APM, New Relic

Layer 3: Business Metrics

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✓ Orders per minute

✓ Revenue tracking

✓ User signups

Tool: Custom dashboard, Grafana

Layer 4: Logs

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✓ Application logs

✓ Error traces

✓ Audit logs

Tool: ELK Stack, Splunk

Layer 5: Alerts

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✓ PagerDuty (on-call)

✓ Slack notifications

✓ Email alerts

Tool: PagerDuty, Opsgenie


Key Takeaways

  1. Health checks verify a service is running and can serve requests — load balancers and orchestrators use them to route traffic
  2. Liveness checks detect if a process is alive, readiness checks detect if it can accept traffic — Kubernetes uses both
  3. Deep health checks verify dependencies — database connectivity, downstream service availability, disk space
  4. Heartbeats are periodic signals that prove a service is still running — absence of heartbeats triggers failover