
System Design: Distributed Scheduled Job Service

System design of a distributed scheduled job service covering cron expression parsing, time-wheel scheduling, at-least-once execution guarantees, and horizontal scaling for millions of scheduled tasks.


Requirements

Functional Requirements:

  • Users register jobs with cron expressions (standard 5-field + extended 6-field with seconds) or fixed-interval schedules
  • Jobs are HTTP callbacks (webhooks) to user-specified endpoints with configurable headers and body
  • At-least-once execution guarantee: every scheduled invocation fires at least once
  • Retry logic with configurable backoff (exponential, linear, or fixed) and max retry count
  • Job management: pause, resume, delete, update schedule, and view execution history
  • Timezone-aware scheduling with DST transition handling

Non-Functional Requirements:

  • Support 50 million registered jobs with 1 million job executions/minute at peak
  • Schedule accuracy: jobs fire within 1 second of their scheduled time (P95)
  • 99.99% availability; missed job executions can have critical business impact
  • Horizontal scalability: adding scheduler nodes increases capacity linearly
  • Idempotency: the system guarantees at-least-once delivery, so consumer endpoints must be prepared to handle duplicate invocations

Scale Estimation

50M registered jobs, with an assumed schedule mix of 40% daily, 30% hourly, 20% every 5 minutes, and 10% every minute.

  • Firing volume: (50M × 0.1) + (50M × 0.2 / 5) + (50M × 0.3 / 60) + (50M × 0.4 / 1440) = 5M + 2M + 250K + 14K ≈ 7.26M firings/minute (~121K/sec) if every registered job is active; the stated design peak is 1M executions/minute ≈ 16,667/sec
  • Outbound concurrency: each execution is an HTTP POST to a user endpoint averaging 500 ms, so 16,667 × 0.5 ≈ 8,334 concurrent outbound HTTP connections at peak (Little's law)
  • Storage: 50M jobs × 500 bytes = 25GB of job metadata; execution history at 7.26M records/minute × 200 bytes × 30-day retention ≈ 63TB
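
These figures can be reproduced with a few lines of arithmetic. The sketch below (plain Python; the distribution percentages and byte sizes are the assumptions stated above) recomputes the firing rate, outbound concurrency, and storage estimates.

```python
# Back-of-the-envelope estimates from the stated assumptions.
TOTAL_JOBS = 50_000_000

# (fraction of jobs, period in minutes)
distribution = [
    (0.40, 1440),  # daily
    (0.30, 60),    # hourly
    (0.20, 5),     # every 5 minutes
    (0.10, 1),     # every minute
]

per_minute = sum(TOTAL_JOBS * frac / period for frac, period in distribution)
print(f"{per_minute:,.0f} firings/minute")             # ~7.26M
print(f"{per_minute / 60:,.0f} firings/second (avg)")  # ~121K

# Outbound concurrency at the stated 1M/min peak, 500 ms mean latency (Little's law)
peak_per_second = 1_000_000 / 60
print(f"{peak_per_second * 0.5:,.0f} concurrent HTTP calls")  # ~8,333

# Storage
print(f"{TOTAL_JOBS * 500 / 1e9:.0f} GB job metadata")                              # 25 GB
print(f"{per_minute * 200 * 60 * 24 * 30 / 1e12:.0f} TB execution history (30d)")   # ~63 TB
```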

High-Level Architecture

The architecture has three components: the Scheduler, the Dispatcher, and the Executor. The Scheduler maintains the global schedule — it tracks when each job should next fire. Instead of scanning 50M jobs every second, the system uses a hierarchical time-wheel algorithm. The time wheel has three levels: a second-level wheel (60 slots), a minute-level wheel (60 slots), and an hour-level wheel (24 slots). Jobs are placed in the appropriate slot based on their next fire time. As the clock ticks, the scheduler processes all jobs in the current slot, hands them to the Dispatcher (which batches them onto the work queue for the Executor), and recomputes the next fire time for recurring jobs, advancing them to their next slot.

The Scheduler is distributed across N nodes using consistent hashing on job_id. Each node owns a partition of the job space and maintains time wheels for its assigned jobs. A Coordination Service (etcd or ZooKeeper) manages partition assignments and handles rebalancing when nodes join or leave. When a scheduler node fails, its partitions are redistributed to surviving nodes, which load the affected jobs from the database and populate their time wheels. The failure detection window is 10 seconds (etcd lease TTL); jobs that were due during the failover window are fired immediately after takeover (late but not missed).

The Executor is a fleet of stateless HTTP workers. When the scheduler determines a job should fire, the Dispatcher creates an execution record in the database (status: pending) and publishes an execution message to a work queue (Redis-backed, with a visibility timeout so unacknowledged messages reappear). An executor picks up the message, makes the HTTP callback to the user's endpoint, records the result (status code, response body, latency), and updates the execution record. If the call fails (timeout, 5xx, network error), the executor applies the configured retry policy and requeues the execution with a delay.
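
A minimal sketch of the queue hand-off, assuming the redis-py client; the key names (queue:pending, queue:processing, queue:inflight) and the visibility timeout value are illustrative, not a fixed schema. The executor atomically moves a message to a processing list, acknowledges after the callback completes, and a reaper returns messages whose deadline has passed to the pending list.

```python
import json
import time
import uuid

import redis  # redis-py client (assumed)

r = redis.Redis()
VISIBILITY_TIMEOUT = 60  # seconds an executor may hold a message before it is re-queued

def enqueue(job_id: str, scheduled_time: float) -> None:
    msg = json.dumps({"execution_id": str(uuid.uuid4()), "job_id": job_id,
                      "scheduled_time": scheduled_time})
    r.lpush("queue:pending", msg)

def reserve(timeout: int = 5):
    # Atomically move one message onto the processing list so it cannot be lost.
    msg = r.brpoplpush("queue:pending", "queue:processing", timeout=timeout)
    if msg:
        # Record when this message should be considered abandoned.
        r.zadd("queue:inflight", {msg: time.time() + VISIBILITY_TIMEOUT})
    return msg

def ack(msg: bytes) -> None:
    # Callback finished (success or terminal failure): drop the in-flight copy.
    r.lrem("queue:processing", 1, msg)
    r.zrem("queue:inflight", msg)

def reap_expired() -> None:
    # Messages held past their deadline go back to pending -- this is what makes
    # the pipeline at-least-once when an executor crashes mid-callback.
    for msg in r.zrangebyscore("queue:inflight", 0, time.time()):
        r.lrem("queue:processing", 1, msg)
        r.zrem("queue:inflight", msg)
        r.lpush("queue:pending", msg)
```

This is the visibility-timeout pattern referenced again in the trade-offs section: a crashed executor's message reappears after the deadline instead of being silently lost.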

Core Components

Hierarchical Time Wheel

The time wheel is the core scheduling data structure, providing O(1) scheduling and firing operations per job. The second-level wheel has 60 slots; each slot holds a linked list of jobs that fire at that second. A tick handler runs every second, processing all jobs in the current slot. Jobs with fire times beyond 60 seconds are stored in the minute-level wheel (cascading: when the minute wheel ticks, jobs in its current slot are redistributed into the appropriate second-level slots). Similarly, hour-level wheel jobs cascade down to the minute wheel. This hierarchical design handles jobs with fire times up to 24 hours out with O(1) insertion and amortized O(1) per-job tick processing. Jobs with fire times beyond 24 hours are held in a Redis sorted set (scanned with ZRANGEBYSCORE) and loaded into the time wheel once they enter the 24-hour horizon.
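
A simplified sketch of the wheel mechanics, with only two levels (seconds and minutes) shown for brevity; class and method names are illustrative. Insertion picks a slot from the delay's magnitude, and a minute-level tick cascades its jobs down into second-level slots.

```python
import time
from collections import defaultdict

class TimeWheel:
    """Two-level time wheel: 60 second slots fed by 60 minute slots."""

    def __init__(self):
        self.seconds = defaultdict(list)  # slot -> [(fire_time, job), ...] due within 60 s
        self.minutes = defaultdict(list)  # slot -> [(fire_time, job), ...] due within 1 h
        self.now = int(time.time())

    def add(self, job, fire_time: int) -> None:
        delay = fire_time - self.now
        if delay < 60:
            slot = max(fire_time, self.now + 1) % 60  # overdue jobs fire on the next tick
            self.seconds[slot].append((fire_time, job))
        elif delay < 3600:
            self.minutes[(fire_time // 60) % 60].append((fire_time, job))
        else:
            self.park_overflow(job, fire_time)  # beyond the wheel horizon (e.g. Redis ZSET)

    def tick(self) -> list:
        """Advance one second and return the jobs that are due now."""
        self.now += 1
        if self.now % 60 == 0:
            # Cascade: this minute's jobs are redistributed into second-level slots.
            for fire_time, job in self.minutes.pop((self.now // 60) % 60, []):
                self.add(job, fire_time)
        # Entries in a second slot are always due: insertion only places jobs that fire
        # less than 60 seconds out, so a slot never holds a wrapped-around future job.
        return [job for _, job in self.seconds.pop(self.now % 60, [])]

    def park_overflow(self, job, fire_time: int) -> None:
        pass  # sketch only; the real system stores these in a sorted set
```

After firing, the scheduler recomputes a recurring job's next fire time and calls add() again, which is how jobs advance through the wheel.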

Partition Management & Failover

Consistent hashing partitions 50M jobs across N scheduler nodes. Each node is assigned a range of job_id hashes using virtual nodes (256 virtual nodes per physical node for even distribution). On startup, a scheduler node loads its assigned jobs from PostgreSQL (conceptually, SELECT * FROM jobs WHERE hash(job_id) IN assigned_range AND status = 'active'), populates its time wheel, and begins ticking. Failover uses an etcd lease mechanism: each scheduler node holds a lease (10-second TTL) for its partitions. If the node fails to renew (crash or network partition), etcd revokes the lease, triggering a rebalancing callback on surviving nodes. The takeover node loads the orphaned partition's jobs and fires any overdue jobs immediately. A catch-up mechanism scans for overdue jobs (next_fire_time < now()) on startup to handle multi-node failures.
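
A sketch of the job-to-node assignment, assuming MD5 as the ring hash and the stated 256 virtual nodes per physical node; the class and helper names are illustrative.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Any stable, well-distributed hash works; MD5 is used here for illustration.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes, vnodes: int = 256):
        self._ring = sorted((_hash(f"{node}#{i}"), node)
                            for node in nodes for i in range(vnodes))
        self._points = [h for h, _ in self._ring]

    def owner(self, job_id: str) -> str:
        """Return the scheduler node responsible for this job_id."""
        idx = bisect.bisect(self._points, _hash(job_id)) % len(self._ring)
        return self._ring[idx][1]

# On startup, or after etcd reports a membership change, a node keeps only its own jobs:
# ring = HashRing(["sched-1", "sched-2", "sched-3"])
# mine = [job for job in active_jobs if ring.owner(job.job_id) == MY_NODE_ID]
```

When membership changes, every node rebuilds the ring from the new member list; only jobs whose owner changed need to be loaded or dropped, which is the K/N rebalancing property noted in the trade-offs.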

Execution & Retry Engine

The execution engine handles the HTTP callback with configurable reliability. Each execution attempt records: attempt_number, started_at, completed_at, http_status, response_body (truncated to 4KB), error_message. Retry policies: exponential backoff (delay = base_delay × 2^attempt, max 1 hour), linear backoff (delay = base_delay × attempt), or fixed delay. The executor respects the user's endpoint response: 2xx = success, 4xx = permanent failure (no retry), 5xx = transient failure (retry). A special case: if the endpoint returns 429 (Rate Limited) with a Retry-After header, the executor uses that delay. The maximum retry count (default 5) and total retry window (default 24 hours) are configurable per job. After exhausting retries, the job is marked as failed and a notification is sent to the job owner.
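
A sketch of the retry decision described above; the policy field names (strategy, base_delay_s, max_retries) are illustrative rather than a documented schema, and attempt numbering starts at 1.

```python
def next_retry_delay(policy: dict, attempt: int, status, retry_after=None):
    """Return seconds until the next attempt, or None if the execution must not be retried.

    status is the HTTP status code, or None for a timeout / network error.
    retry_after is the Retry-After header value (assumed to be in seconds) if present.
    """
    if status is not None and 200 <= status < 300:
        return None                              # success: nothing to retry
    if status is not None and 400 <= status < 500 and status != 429:
        return None                              # permanent failure: no retry
    if attempt >= policy.get("max_retries", 5):
        return None                              # retries exhausted: mark failed, notify owner

    if status == 429 and retry_after is not None:
        return float(retry_after)                # honour the endpoint's own hint

    base = policy.get("base_delay_s", 10)
    strategy = policy.get("strategy", "exponential")
    if strategy == "exponential":
        return min(base * 2 ** attempt, 3600)    # delay = base x 2^attempt, capped at 1 hour
    if strategy == "linear":
        return base * attempt                    # delay = base x attempt
    return base                                  # fixed delay
```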

Database Design

PostgreSQL stores job metadata: jobs (job_id UUID PK, user_id, name, schedule_expression VARCHAR, schedule_type ENUM(cron, interval, one_time), timezone VARCHAR, http_config JSONB containing url/method/headers/body, retry_policy JSONB, status ENUM(active, paused, deleted), next_fire_time TIMESTAMPTZ, created_at, updated_at). Index on (status, next_fire_time) for the catch-up scanner. The http_config JSONB: {"url": "https://...", "method": "POST", "headers": {"Authorization": "Bearer ..."}, "body": "{}", "timeout_ms": 30000}.
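
Because next_fire_time is stored in UTC (TIMESTAMPTZ) while schedules are defined in the job's own timezone, the recomputation has to evaluate the cron expression in local time so DST transitions keep the wall-clock schedule intact. A sketch, assuming the croniter library for cron parsing (any equivalent parser works):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

from croniter import croniter  # third-party cron parser (assumed dependency)

def compute_next_fire_time(schedule_expression: str, tz_name: str) -> datetime:
    """Next fire time evaluated in the job's timezone, returned in UTC for storage."""
    local_now = datetime.now(tz=ZoneInfo(tz_name))
    next_local = croniter(schedule_expression, local_now).get_next(datetime)
    return next_local.astimezone(timezone.utc)

# "0 9 * * *" in America/New_York keeps firing at 09:00 local time across DST
# transitions, because the expression is evaluated against a zone-aware clock.
print(compute_next_fire_time("0 9 * * *", "America/New_York"))
```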

Execution history: executions (execution_id UUID PK, job_id, scheduled_time TIMESTAMPTZ, status ENUM(pending, running, success, failed, retrying), attempts JSONB array of {attempt_number, started_at, http_status, latency_ms, error}, created_at). Partitioned by scheduled_time (daily partitions) with automatic drop of partitions older than 30 days. The JSONB attempts array stores the full retry history for each execution. Redis stores the execution work queue (list with visibility timeout) and the distributed lock for partition assignments.

API Design

  • POST /api/v1/jobs — Register a scheduled job; body contains schedule_expression, timezone, http_config, retry_policy; returns job_id (example request after this list)
  • GET /api/v1/jobs/{job_id} — Fetch job configuration and next scheduled fire time
  • GET /api/v1/jobs/{job_id}/executions?limit=20 — Fetch recent execution history with attempt details
  • PUT /api/v1/jobs/{job_id}/pause — Pause a job; it will not fire until resumed
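
A sketch of a registration call, assuming the request body mirrors the jobs-table fields described above; the base URL, token, and field values are illustrative.

```python
import requests  # assumed HTTP client

payload = {
    "name": "nightly-report",
    "schedule_type": "cron",
    "schedule_expression": "0 2 * * *",          # 02:00 every day
    "timezone": "America/New_York",
    "http_config": {
        "url": "https://example.com/hooks/report",
        "method": "POST",
        "headers": {"Authorization": "Bearer <token>"},
        "body": "{}",
        "timeout_ms": 30000,
    },
    "retry_policy": {"strategy": "exponential", "base_delay_s": 10, "max_retries": 5},
}

resp = requests.post("https://scheduler.example.com/api/v1/jobs", json=payload)
resp.raise_for_status()
job_id = resp.json()["job_id"]   # use this id for the GET / pause endpoints above
```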

Scaling & Bottlenecks

The scheduler tier scales horizontally: adding a node triggers rebalancing, distributing jobs evenly. Each node can handle 5M jobs in its time wheel (the time wheel is an in-memory data structure consuming ~500MB for 5M entries). With 10 nodes, the system handles 50M jobs. The bottleneck is the tick processing: if 100K jobs fire simultaneously (e.g., "every minute" jobs at the start of a minute), the scheduler must dispatch all 100K within that second. The dispatcher uses batched queue writes (Redis LPUSH with pipelining, 50K ops/sec per pipeline) — two pipelines handle the burst. Staggering job assignments (spreading "every minute" jobs across the 60 seconds using a per-job random offset) reduces tick-level bursts by 60x.
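
A sketch of both burst-mitigation techniques (key names illustrative, redis-py assumed): a deterministic per-job offset spreads "every minute" jobs across the 60 seconds of each minute, and the dispatcher batches one tick's queue writes into a single pipelined round trip.

```python
import hashlib
import json

import redis  # redis-py client (assumed)

def stagger_offset(job_id: str) -> int:
    """Deterministic 0-59 second offset so 'every minute' jobs don't all fire at :00."""
    return int(hashlib.md5(job_id.encode()).hexdigest(), 16) % 60

def dispatch_batch(r: redis.Redis, due_jobs) -> None:
    """Push one tick's worth of due jobs onto the work queue with a single pipeline."""
    pipe = r.pipeline(transaction=False)
    for job in due_jobs:
        pipe.lpush("queue:pending", json.dumps(
            {"job_id": job["job_id"], "scheduled_time": job["fire_time"]}))
    pipe.execute()
```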

The executor fleet scales with the outbound HTTP connection requirement. At 8,334 concurrent connections (peak), 10 executor instances (each handling 1,000 concurrent HTTP calls using async I/O) provide the needed capacity. The primary external bottleneck is user endpoints: if a user's endpoint is slow (> 30 seconds), it ties up an executor slot. A per-job timeout (default 30 seconds, max 5 minutes) bounds this. Circuit breaking per user endpoint prevents a single slow user from consuming all executor capacity.
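
A minimal per-endpoint circuit breaker sketch; the threshold and cooldown values are illustrative. After a run of consecutive failures the breaker opens, and the executor re-queues or fails the execution fast instead of tying up a slot on a known-bad endpoint.

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; allows probes again after `cooldown_s`."""

    def __init__(self, threshold: int = 10, cooldown_s: float = 60.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True               # closed: calls proceed normally
        # Half-open: let attempts through again once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None         # close the breaker

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()   # (re)open and restart the cooldown

# One breaker per callback host, consulted before each attempt:
# if not breakers[endpoint_host].allow(): requeue_with_delay(execution)
```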

Key Trade-offs

  • Time wheel vs sorted set (priority queue) for scheduling: The time wheel provides O(1) tick processing vs O(log N) for a sorted set, critical when processing 100K+ jobs per tick — the trade-off is higher memory usage for the multi-level wheel structure
  • At-least-once vs exactly-once execution: At-least-once is achievable with the visibility timeout pattern (if the executor crashes, the message reappears and is retried), while exactly-once would require distributed transactions between the scheduler and the user's endpoint — at-least-once is the practical choice, with idempotency guidance for consumers
  • HTTP callbacks vs pull-based job execution: Push (HTTP callbacks) is simpler for users (no SDK needed) but requires their endpoint to be always available — pull-based (user workers poll for assigned jobs) would be more reliable but adds integration complexity
  • Consistent hashing vs leader-based partition assignment: Consistent hashing enables fast rebalancing (only K/N keys move when a node changes) without a central coordinator bottleneck — the trade-off is that rebalancing can temporarily create hot spots during the transition
