Imagine if your heart stopped beating and nobody noticed for an hour. Terrifying! That's why we have health checks and heartbeats.
What is a Health Check?
The Doctor Visit Analogy: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Doctor asks:
Are you breathing? (Process running?)
Is your heart beating? (Database connected?)
Can you walk? (APIs responding?)
Do you feel okay? (Error rate normal?)
If all answers are "yes" → Healthy ✓
If any answer is "no" → Unhealthy ❌
Your Web Server: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The health check endpoint is usually: GET /health
It returns the status of the server in JSON format.
Types of Health Checks:
1. Shallow Health Check (Quick Check)
Purpose: "Is the server responding?"
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Implementation: GET /health
Response time: 1-5ms
Checks: Just server process
Example:
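A shallow check can be sketched with nothing but Python's standard library. This is a minimal illustration, not a production server: the handler name, the port, and the `{"status": "healthy"}` body are assumptions chosen for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def shallow_health() -> dict:
    """Report only that the process is up; no dependency is touched."""
    return {"status": "healthy"}


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers GET /health with a JSON body."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps(shallow_health()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)


def make_server(port: int = 8080) -> HTTPServer:
    """Build the server; call .serve_forever() on the result to run it."""
    return HTTPServer(("127.0.0.1", port), HealthHandler)
```

Because the handler does no I/O beyond writing the response, it stays cheap enough to be polled every few seconds.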
Use case:
Load balancer health check
Called every 5 seconds
Must be FAST
Pro: Fast, low overhead
Con: Doesn't check dependencies
2. Deep Health Check (Thorough Check)
Purpose: "Is everything working?"
━━━━━━━━━━━━━━━━━━━━
Implementation: GET /health/deep
Response time: 100-500ms
Checks: Server process and its dependencies (database, downstream APIs)
Use case:
Manual debugging
Detailed monitoring
Called less frequently
Pro: Comprehensive, covers dependencies
Con: Slower, more resource intensive
Real Health Check Implementation:
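One way a deep check might be put together is sketched below. The probe functions (`database_connected`, `payment_api_responding`) are hypothetical stand-ins for real dependency probes, and the 200/503 status codes and the response shape are assumptions, not a fixed standard.

```python
import json
from typing import Callable, Dict, Tuple


def run_deep_check(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every dependency check; return (HTTP status code, JSON-ready body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:  # a crashing check counts as a failure
            results[name] = f"error: {exc}"
    healthy = all(value == "ok" for value in results.values())
    body = {"status": "healthy" if healthy else "unhealthy", "checks": results}
    return (200 if healthy else 503), body


# Hypothetical stand-ins for real probes (e.g. a trivial query to the database).
def database_connected() -> bool:
    return True


def payment_api_responding() -> bool:
    raise TimeoutError("no response")


status, body = run_deep_check(
    {"database": database_connected, "payment_api": payment_api_responding}
)
print(status, json.dumps(body))
# 503 {"status": "unhealthy", "checks": {"database": "ok", "payment_api": "error: no response"}}
```

Any single failing dependency flips the overall status to unhealthy, which mirrors the "any answer is no" rule from the doctor analogy.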
Connection to Previous Topics: Health checks help eliminate a single point of failure (SPOF). If one server fails, the load balancer detects it via health checks and routes traffic only to the healthy servers.
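The routing behaviour described above can be illustrated with a toy round-robin balancer. This is a sketch under stated assumptions: the `LoadBalancer` class and the server names are invented for the example, and real load balancers usually require several consecutive failures before marking a server down.

```python
from typing import Dict, List


class LoadBalancer:
    """Toy round-robin balancer that skips servers marked unhealthy."""

    def __init__(self, servers: List[str]) -> None:
        self.servers = servers
        self.healthy: Dict[str, bool] = {name: True for name in servers}
        self._next = 0

    def record_health_check(self, server: str, passed: bool) -> None:
        """Store the result of the latest health check for one server."""
        self.healthy[server] = passed

    def pick_server(self) -> str:
        """Round-robin over the servers, skipping any marked unhealthy."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if self.healthy[server]:
                return server
        raise RuntimeError("no healthy servers available")


lb = LoadBalancer(["server-a", "server-b", "server-c"])
lb.record_health_check("server-b", passed=False)
picks = [lb.pick_server() for _ in range(4)]
print(picks)  # ['server-a', 'server-c', 'server-a', 'server-c']
```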
What is a Heartbeat?
The Difference: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Health Check (Pull):
Monitor ────────> Server
Heartbeat (Push):
Server ────────> Monitor
Real-World Heartbeat Example:
Background Worker Process:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
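A background worker that pushes heartbeats might look roughly like the sketch below. The function name `run_worker`, the 30-second interval, and the injectable `clock` are assumptions made so the example can run without waiting in real time.

```python
import time
from typing import Callable, Iterable, List, Optional


def run_worker(
    jobs: Iterable[Callable[[], None]],
    send_heartbeat: Callable[[float], None],
    interval_seconds: float = 30.0,
    clock: Callable[[], float] = time.time,
) -> None:
    """Process jobs, pushing a heartbeat whenever the interval has elapsed."""
    last_sent: Optional[float] = None
    for job in jobs:
        now = clock()
        if last_sent is None or now - last_sent >= interval_seconds:
            send_heartbeat(now)  # push: the worker initiates the message
            last_sent = now
        job()


# Demo with a fake clock that advances 20 seconds on every call.
fake_time = iter(range(0, 200, 20))
received: List[float] = []
run_worker(
    jobs=[lambda: None] * 5,
    send_heartbeat=received.append,
    clock=lambda: float(next(fake_time)),
)
print(received)  # [0.0, 40.0, 80.0]
```

Sending heartbeats from the job loop is the simplest design, but a single long job would delay the next heartbeat; a separate timer thread is a common alternative.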
Heartbeat Patterns:
Pattern 1: Simple Timestamp ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Worker → Redis: SET worker:123:heartbeat 1697712345000
Monitor checks: "Last heartbeat 20 seconds ago" ✓
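The monitor side of this pattern is a subtraction and a comparison. In the sketch below a plain dictionary stands in for Redis, and the 60-second timeout is an assumption; the key and timestamp are the ones from the example above.

```python
from typing import Dict


def seconds_since(last_heartbeat_ms: int, now_ms: int) -> float:
    """Age of the most recent heartbeat, in seconds."""
    return (now_ms - last_heartbeat_ms) / 1000.0


def is_alive(last_heartbeat_ms: int, now_ms: int, timeout_seconds: float = 60.0) -> bool:
    """A worker counts as alive while its last heartbeat is within the timeout."""
    return seconds_since(last_heartbeat_ms, now_ms) <= timeout_seconds


# A dict stands in for the key-value store the worker writes to.
store: Dict[str, int] = {"worker:123:heartbeat": 1697712345000}
now_ms = 1697712345000 + 20_000  # the monitor looks 20 seconds later

last = store["worker:123:heartbeat"]
message = f"Last heartbeat {seconds_since(last, now_ms):.0f} seconds ago, alive={is_alive(last, now_ms)}"
print(message)  # Last heartbeat 20 seconds ago, alive=True
```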
Pattern 2: Detailed Status ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Worker →
Monitor knows: What worker is doing right now!
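A detailed heartbeat could carry a small JSON document instead of a bare timestamp. Every field name in this sketch (`state`, `current_job`, `jobs_completed`, and so on) is hypothetical, chosen only to show the kind of information a monitor could read.

```python
import json
import time
from typing import Optional


def build_heartbeat(
    worker_id: str,
    state: str,
    current_job: Optional[str],
    jobs_completed: int,
    now: Optional[float] = None,
) -> str:
    """Serialize a status-carrying heartbeat; all field names are illustrative."""
    timestamp = time.time() if now is None else now
    payload = {
        "worker_id": worker_id,
        "timestamp_ms": int(timestamp * 1000),
        "state": state,
        "current_job": current_job,
        "jobs_completed": jobs_completed,
    }
    return json.dumps(payload)


heartbeat = build_heartbeat("123", "processing", "job-42", 17, now=1697712345.0)
print(heartbeat)
```

The trade-off against Pattern 1 is a larger write on every heartbeat in exchange for a monitor that can show what each worker is doing.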
Pattern 3: Dead Man's Switch ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Worker updates key every 30 seconds
Key expires after 60 seconds
If worker dies: Key expires automatically
Monitor: "Key missing = worker dead" ❌
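The dead man's switch can be demonstrated without a running Redis by using a small in-memory store with per-key expiry. `ExpiringStore` is a hypothetical stand-in written for this sketch; with Redis itself the same effect comes from setting the key with an expiry (for example `SET key value EX 60`).

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ExpiringStore:
    """In-memory stand-in for a key-value store with per-key expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        """Write the key and (re)start its expiry countdown."""
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave as if the key never existed
            return None
        return value


now = [0.0]  # fake clock so the demo does not have to sleep
store = ExpiringStore(clock=lambda: now[0])
key = "worker:123:heartbeat"

store.set(key, "alive", ttl_seconds=60)   # t=0: first heartbeat
now[0] = 30.0
store.set(key, "alive", ttl_seconds=60)   # t=30: refresh, expiry moves to t=90
now[0] = 80.0
alive_at_80 = store.get(key) is not None  # still within the TTL
now[0] = 95.0                             # worker died after t=30, no refresh
alive_at_95 = store.get(key) is not None
print(alive_at_80, alive_at_95)  # True False
```

The monitor never has to compare timestamps here: the absence of the key is the signal.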
Distributed Task Queue System: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Architecture:
[Queue] ← Tasks waiting
[Worker 1] [Worker 2] [Worker 3]
Health Checks:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Each worker exposes: GET /health
Returns:
Load balancer checks every 10 seconds
Heartbeats: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Every 30 seconds, each worker:
Redis SET heartbeat:worker1
TTL: 60 seconds (auto-expires if worker dies)
Monitor Dashboard:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Worker 1: ✓ Alive (last heartbeat: 15s ago)
  Processing 2 jobs
Worker 2: ✓ Alive (last heartbeat: 8s ago)
  Processing 3 jobs
Worker 3: ❌ DEAD (last heartbeat: 75s ago)
  ALERT SENT
  Auto-restart initiated
  Jobs reassigned to Worker 1 and 2
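The monitor's decision logic for this dashboard might be sketched as follows. The heartbeat ages match the dashboard above; the job names, the number of jobs held by Worker 3, and the "give each job to the least-loaded survivor" rule are assumptions made for illustration.

```python
from typing import Dict, List


def find_dead_workers(
    seconds_since_heartbeat: Dict[str, float], timeout_seconds: float = 60.0
) -> List[str]:
    """Workers whose last heartbeat is older than the timeout."""
    return [w for w, age in seconds_since_heartbeat.items() if age > timeout_seconds]


def reassign_jobs(
    jobs_by_worker: Dict[str, List[str]], dead: List[str]
) -> Dict[str, List[str]]:
    """Move each job from a dead worker to the survivor with the fewest jobs."""
    survivors = {w: list(jobs) for w, jobs in jobs_by_worker.items() if w not in dead}
    if not survivors:
        raise RuntimeError("no live workers to take over jobs")
    for worker in dead:
        for job in jobs_by_worker.get(worker, []):
            target = min(survivors, key=lambda w: len(survivors[w]))
            survivors[target].append(job)
    return survivors


ages = {"worker1": 15, "worker2": 8, "worker3": 75}
jobs = {
    "worker1": ["a", "b"],
    "worker2": ["c", "d", "e"],
    "worker3": ["f", "g", "h"],  # hypothetical jobs held when the worker died
}
dead = find_dead_workers(ages)
reassigned = reassign_jobs(jobs, dead)
print(dead)  # ['worker3']
print(reassigned)
```

Alerting and auto-restart would hang off the same `dead` list; they are left out here because they depend on the tooling in use.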
Real Production Setup:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Layer 1: Infrastructure Health
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✓ Server CPU, memory, disk
✓ Network connectivity
✓ Load balancer health checks
Tool: AWS CloudWatch, Prometheus
Layer 2: Application Health
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✓ /health endpoints
✓ Error rates
✓ Response times
Tool: DataDog APM, New Relic
Layer 3: Business Metrics
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✓ Orders per minute
✓ Revenue tracking
✓ User signups
Tool: Custom dashboard, Grafana
Layer 4: Logs
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✓ Application logs
✓ Error traces
✓ Audit logs
Tool: ELK Stack, Splunk
Layer 5: Alerts
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✓ PagerDuty (on-call)
✓ Slack notifications
✓ Email alerts
Tool: PagerDuty, OpsGenie