Use a leader-election pattern (e.g., etcd or ZooKeeper) to pick a coordinator that partitions the job schedule among workers. Each worker executes its assigned jobs and reports completion. For durability, store job state in a distributed database like PostgreSQL with advisory locks or use a queue like SQS with delayed delivery.
✦ Definition~90s read
What is Design a Distributed Job Scheduler?
A distributed job scheduler is a system that reliably executes scheduled tasks across multiple machines, handling failures, concurrency, and scale without a single point of failure. It's cron for the cloud era.
★
Imagine a team of chefs in a busy kitchen.
Plain-English First
Imagine a team of chefs in a busy kitchen. One chef (the scheduler) writes the orders on a whiteboard. Each cook (worker) picks an order, cooks it, and marks it done. If a cook gets sick, another cook picks up their unfinished orders. The whiteboard is the shared state — everyone sees the same list. That's a distributed job scheduler: a shared todo list that survives people leaving and new people joining.
Your cron job runs fine on one server. Then you scale to ten. Suddenly jobs run twice, or not at all. The 3 AM pager goes off because a payment batch ran on two boxes and double-charged customers. That's the problem a distributed job scheduler solves — reliable, exactly-once execution across a fleet. This article walks you through designing one that won't burn you at 3 AM. By the end, you'll know how to pick the right coordination primitive, handle failures without data loss, and avoid the thundering herd that takes down your database.
Why You Can't Just Use Cron
Cron is great for a single machine. But in a distributed system, you need coordination. Without it, you get duplicate executions, missed executions on node failure, and no visibility into job status. The core problem: multiple nodes must agree on who runs what, when. This is a consensus problem, but you don't need Paxos — a simple leader election with a shared lease works for most cases. The hack people used: a shared file on NFS with a lock. That breaks when NFS goes down or the lock isn't released. Don't do that.
LeaderElection.etcdSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
// io.thecodeforge — SystemDesign tutorial
// Leader election using etcd lease
// Each node tries to create a key with a TTL. Only one succeeds.
const LEADER_KEY = '/scheduler/leader';
const TTL_SECONDS = 10;
async function tryBecomeLeader(etcdClient, nodeId) {
// Grant a lease with TTLconst lease = await etcdClient.lease.grant({ TTL: TTL_SECONDS });
try {
// Try to create the leader key with the lease. If it exists, thisthrows.
await etcdClient.kv.put(LEADER_KEY, nodeId, {
lease: lease.ID,
prevKv: false, // ensure key doesn't exist
});
console.log(`Node ${nodeId} became leader`);
return lease;
} catch (err) {
if (err.message.includes('key already exists')) {
console.log(`Node ${nodeId} is follower`);
lease.revoke(); // don't need the lease
returnnull;
}
throw err;
}
}
// Keep lease alive while leader
async function maintainLease(lease) {
setInterval(async () => {
await lease.keepAliveOnce();
}, TTL_SECONDS * 1000 / 2); // renew at half TTL
}
Output
Node node-1 became leader
Node node-2 is follower
Node node-3 is follower
Production Trap: Lease TTL Too Short
If your lease TTL is shorter than your job's max execution time, the leader loses leadership mid-job. Another node takes over and runs the same job. Set TTL to at least 2x the longest job duration. I've seen this cause a cascade of duplicate runs in a billing system.
thecodeforge.io
Distributed Job Scheduler Design
Design Distributed Job Scheduler
Job Partitioning: Who Runs What
Once you have a leader, it needs to assign jobs to workers. The naive approach: leader pushes jobs to a queue. But if the leader dies, in-flight jobs are lost. Better: workers pull jobs from a shared database. The leader's only job is to mark which jobs are ready to run. Workers compete for jobs using optimistic locking. This avoids the single point of failure — if the leader dies, workers keep running jobs they've already claimed. The leader just coordinates the schedule.
JobClaim.postgresqlSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
-- io.thecodeforge — SystemDesign tutorial
-- Job table with status and version for optimistic locking
CREATETABLEjobs (
id UUIDPRIMARYKEY,
schedule_time TIMESTAMPTZNOTNULL,
status TEXTNOTNULLDEFAULT'pending', -- pending, claimed, running, done, failed
claimed_by TEXT,
version INTNOTNULLDEFAULT1
);
-- Worker claims a job atomically
-- Uses advisory lock to prevent two workers from claiming the same job
UPDATE jobs
SET status = 'claimed',
claimed_by = 'worker-1',
version = version + 1WHERE id = 'job-uuid-here'AND status = 'pending'ANDpg_try_advisory_xact_lock(hashtext(id::text)); -- advisory lock on job ID
-- If0 rows updated, another worker got it. Retry with another job.
Output
UPDATE 1 -- worker-1 claimed the job
UPDATE 0 -- worker-2 tried, but job already claimed
Senior Shortcut: Use PostgreSQL Advisory Locks
Advisory locks are lightweight and don't block other rows. They're perfect for distributed job claiming without a separate locking service. Just be careful: they're session-level, so if a worker crashes, the lock is released automatically. That's exactly what you want.
Handling Worker Failures: The Heartbeat Protocol
A worker claims a job, starts executing, then dies. The job stays 'claimed' forever. You need a heartbeat mechanism. Each worker periodically updates a 'last_heartbeat' timestamp in the job row. A separate watchdog (or the leader) scans for jobs where heartbeat is too old and resets them to 'pending'. The classic rookie mistake: using a fixed timeout. If the job takes longer than the timeout, it gets falsely re-queued. Solution: workers update heartbeat during long-running jobs, or use a per-job timeout stored in the job definition.
HeartbeatCheck.postgresqlSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
-- io.thecodeforge — SystemDesign tutorial
-- Worker updates heartbeat every 30 seconds
UPDATE jobs
SET last_heartbeat = NOW()
WHERE id = 'job-uuid-here'AND claimed_by = 'worker-1';
-- Watchdog (runs every minute) resets stale jobs
UPDATE jobs
SET status = 'pending',
claimed_by = NULL,
version = version + 1WHERE status = 'claimed'AND last_heartbeat < NOW() - INTERVAL'90 seconds' -- 3x heartbeat interval
ANDpg_try_advisory_xact_lock(hashtext(id::text));
-- Return the count of reset jobs for alerting
SELECTCOUNT(*) AS stale_jobs_reset FROM jobs WHERE ... ;
Output
UPDATE 0 -- no stale jobs
-- or
UPDATE 2 -- two jobs were stuck, now reset
Never Do This: Rely on TCP Timeouts
Don't assume a crashed worker's TCP connection will close immediately. It can take minutes. Always use application-level heartbeats with a timeout. I've seen a job stay 'claimed' for 15 minutes because the worker's process was killed but the TCP socket wasn't cleaned up.
thecodeforge.io
Worker Failure Recovery Flow
Design Distributed Job Scheduler
Exactly-Once Semantics: The Holy Grail
Exactly-once is a lie in distributed systems. You can only get at-most-once or at-least-once. For a job scheduler, at-least-once is acceptable if jobs are idempotent. But idempotency is hard. Better: design your jobs to be idempotent by using a unique idempotency key (e.g., job run UUID). The job checks if it already processed that key before doing work. This is the pattern used by payment gateways and message queues. If your job isn't idempotent, you need a distributed transaction — and that's a whole other level of pain.
IdempotencyCheck.pythonSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
# io.thecodeforge — SystemDesign tutorial
import uuid
from redis importRedis
redis = Redis()
def process_job(job_id: str, run_id: str):
# Checkifthis run was already processed
key = f"job:{job_id}:run:{run_id}"if redis.setnx(key, "done"): # set if not exists, atomic
# This is the first time we see this run
# Do the actual work here
print(f"Processing job {job_id} run {run_id}")
# ...
# SetTTL to avoid memory leak
redis.expire(key, 86400) # 24 hours
else:
print(f"Job {job_id} run {run_id} already processed, skipping")
# Usageprocess_job("invoice-batch-123", str(uuid.uuid4()))
Output
Processing job invoice-batch-123 run a1b2c3d4-...
Processing job invoice-batch-123 run a1b2c3d4-... skipped (duplicate)
Interview Gold: Idempotency Key Pattern
This pattern is the answer to 'how do you handle duplicate messages in a distributed system?' Use a unique key per operation and a set-if-not-exists operation (Redis SETNX, PostgreSQL ON CONFLICT DO NOTHING). The key must be deterministic — e.g., hash of the input parameters.
Thundering Herd: When All Jobs Fire at Once
The top of the hour. Every job scheduled for 00:00 fires simultaneously. Your database melts. This is the thundering herd problem. Solution: add jitter. Randomize the start time of each job within a window. For example, if a job is scheduled for 00:00, actually start it at 00:00 + random(0, 60) seconds. But be careful: some jobs must run at exact times (e.g., financial close). For those, use a staggered release: the scheduler releases jobs in batches over a few seconds.
JitterScheduler.pythonSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
# io.thecodeforge — SystemDesign tutorial
import random
from datetime import datetime, timedelta
def schedule_with_jitter(job_definition: dict, now: datetime) -> datetime:
"""
Returns the actual execution time with jitter.
job_definition has 'scheduled_time' and 'jitter_window_seconds' (default60).
"""
base_time = job_definition['scheduled_time']
jitter = random.randint(0, job_definition.get('jitter_window_seconds', 60))
actual_time = base_time + timedelta(seconds=jitter)
# Don't schedule in the past
if actual_time < now:
actual_time = now
return actual_time
# Example: job scheduled for midnight
job = {'scheduled_time': datetime(2025, 1, 1, 0, 0), 'jitter_window_seconds': 120}
print(schedule_with_jitter(job, datetime(2025, 1, 1, 0, 0, 30)))
# Output: 2025-01-0100:01:45 (random, between 0 and 120 seconds after midnight)
Output
2025-01-01 00:01:45
Production Trap: Jitter on Retries
Don't forget jitter on retries. If a job fails and retries immediately, it hits the same resource at the same time. Always add exponential backoff with jitter. I've seen a retry storm take down an API because every retry came at exactly 1 second, 2 seconds, 4 seconds — all synchronized.
thecodeforge.io
Thundering Herd: With vs Without Jitter
Design Distributed Job Scheduler
Monitoring: What to Watch
You can't debug a distributed scheduler without metrics. Track: jobs scheduled vs executed (gap = missed), execution duration (p99, p999), failure rate by job type, leader election count (should be low), heartbeat staleness. Alert on: no heartbeat for 2x interval, job not executed within 2x schedule interval, leader election flapping (more than 1 per hour). Use a time-series database like Prometheus. Log every state transition with a correlation ID.
Metrics.prometheusSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
# io.thecodeforge — SystemDesign tutorial
# Prometheus metrics for job scheduler
# Counter: jobs started
job_starts_total = Counter('scheduler_job_starts_total', 'Total number of job starts', ['job_name'])
# Histogram: job duration
job_duration_seconds = Histogram('scheduler_job_duration_seconds', 'Job execution duration', ['job_name'], buckets=[1, 5, 10, 30, 60, 120, 300])
# Gauge: leader status (1 = leader, 0 = follower)
leader_status = Gauge('scheduler_leader_status', '1 if this node is leader, 0 otherwise')
# In job execution:
job_starts_total.labels(job_name=job.name).inc()
with job_duration_seconds.labels(job_name=job.name).time():
# run job
pass
# In leader election:
leader_status.set(1if is_leader else0)
Output
No output — metrics are scraped by Prometheus
Senior Shortcut: Log Correlation IDs
Include a unique run ID in every log line. When a job fails, you can grep for that ID across all workers. Without it, debugging a distributed failure is like finding a needle in a haystack of timestamps.
● Production incidentPOST-MORTEMseverity: high
The Double-Charge Disaster
Symptom
Customers complained of duplicate charges. The payment batch job ran twice, processing the same invoices.
Assumption
We assumed the job was idempotent. It wasn't — the payment gateway didn't dedupe on our transaction ID.
Root cause
Two scheduler nodes both thought they were the leader due to a split-brain in ZooKeeper (network partition + stale sessions). Both picked up the same job from the database.
Fix
Switched to etcd with a tighter lease TTL (5 seconds) and added a 'running' flag in the job row with a unique execution UUID. Workers check the flag before starting.
Key lesson
Idempotency is a lie.
Always assume your job will run twice and design for it.
Production debug guideSystematic recovery paths for the failure modes engineers actually hit.3 entries
Symptom · 01
Job runs multiple times on different nodes
→
Fix
1. Check leader election logs for flapping. 2. Verify lease TTL is longer than max job duration. 3. Check for network partitions in coordination service. 4. Add idempotency key to job execution.
Symptom · 02
Job never runs after node failure
→
Fix
1. Check watchdog process is running. 2. Verify heartbeat timeout is correctly configured. 3. Check that the watchdog has permissions to update job status. 4. Manually reset stuck jobs with SQL.
Symptom · 03
Database CPU spikes at the start of every minute
→
Fix
1. Check job schedule for many jobs at the same second. 2. Add jitter to job start times. 3. Increase database connection pool size. 4. Consider batching job claims.
★ Distributed Job Scheduler Triage Cheat SheetFirst-response commands for when things go wrong — copy-paste ready.
Duplicate job execution−
Immediate action
Check if leader election is stable
Commands
etcdctl get /scheduler/leader --prefix
etcdctl lease list
Fix now
Increase lease TTL to 2x max job duration. Add idempotency key.
Stuck jobs (status='claimed' for hours)+
Immediate action
Check worker heartbeats
Commands
SELECT * FROM jobs WHERE status='claimed' AND last_heartbeat < NOW() - INTERVAL '5 minutes';
SELECT claimed_by, COUNT(*) FROM jobs WHERE status='claimed' GROUP BY claimed_by;
Fix now
Manually UPDATE jobs SET status='pending', claimed_by=NULL WHERE status='claimed' AND last_heartbeat < NOW() - INTERVAL '5 minutes';
High database CPU at schedule times+
Immediate action
Check for thundering herd
Commands
SELECT COUNT(*), DATE_TRUNC('minute', schedule_time) FROM jobs WHERE status='pending' GROUP BY 1 ORDER BY 1;
SHOW max_connections;
Fix now
Add jitter to job schedule times. Increase connection pool. Use advisory locks.
Leader election flapping+
Immediate action
Check network latency between nodes
Commands
ping -c 10 <other-node-ip>
etcdctl endpoint health --cluster
Fix now
Increase lease TTL. Ensure etcd cluster has odd number of nodes. Check for network congestion.
Feature / Aspect
Leader Election (etcd/ZK)
Database Polling (PostgreSQL)
Consistency
Strong (linearizable)
Eventual (read committed)
Failure detection
Lease expiry (seconds)
Heartbeat timeout (configurable)
Scalability
Good up to ~1000 nodes
Good up to ~100 workers (DB bottleneck)
Operational complexity
Requires etcd/ZK cluster
Uses existing DB
Split-brain risk
Low (with proper quorum)
Higher (network partition can cause two leaders)
Key takeaways
1
Leader election with leases is the foundation
but always handle split-brain with quorum.
2
Idempotency is not optional. Use idempotency keys even if you think your job is idempotent.
3
Add jitter to everything
start times, retries, heartbeats. Your database will thank you.
4
The simplest distributed scheduler that works is better than the perfect one that's too complex to operate.
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01SENIOR
How does your distributed job scheduler handle a network partition that ...
Q02SENIOR
When would you choose a database-based scheduler (PostgreSQL polling) ov...
Q03SENIOR
What happens when a worker claims a job, starts executing, and then the ...
Q04JUNIOR
What is a cron job and how does it work?
Q05SENIOR
You notice that a job runs twice even though you have a unique constrain...
Q06SENIOR
Design a job scheduler that can handle 1 million scheduled jobs per day ...
Q01 of 06SENIOR
How does your distributed job scheduler handle a network partition that splits the cluster into two groups, each thinking they're the leader?
ANSWER
Use a quorum-based coordination service like etcd or ZooKeeper. Only the group with the majority of nodes can elect a leader. The minority group's leader election fails because it can't get a quorum. This prevents split-brain. Additionally, use fencing tokens (epoch numbers) to ensure stale leaders can't write to the database.
Q02 of 06SENIOR
When would you choose a database-based scheduler (PostgreSQL polling) over a coordination-service-based one (etcd)?
ANSWER
Choose database-based when your team already runs PostgreSQL, you have fewer than 100 workers, and you want minimal operational overhead. Choose etcd when you need strong consistency, faster failure detection, and can tolerate running a separate cluster. Example: a small internal cron replacement vs a global payment scheduler.
Q03 of 06SENIOR
What happens when a worker claims a job, starts executing, and then the database connection drops? How do you prevent the job from being lost or duplicated?
ANSWER
The worker should update the job status to 'running' with a heartbeat. If the connection drops, the worker's heartbeat stops. The watchdog resets the job to 'pending' after a timeout. To prevent duplication, the worker checks if the job is still claimed by it before proceeding (using the version column). If another worker already reset it, the original worker aborts.
Q04 of 06JUNIOR
What is a cron job and how does it work?
ANSWER
A cron job is a scheduled task on Unix-like systems. The cron daemon reads a crontab file and executes commands at specified times. It's simple but not distributed — it runs on a single machine and has no failure handling.
Q05 of 06SENIOR
You notice that a job runs twice even though you have a unique constraint on the job run ID. What's the likely cause and how do you fix it?
ANSWER
The unique constraint might not be enforced due to a race condition: two workers insert the same run ID in separate transactions before the constraint check. Fix: use a database-level INSERT ... ON CONFLICT DO NOTHING or a pessimistic lock (SELECT FOR UPDATE) before the insert. Alternatively, use a distributed lock around the idempotency check.
Q06 of 06SENIOR
Design a job scheduler that can handle 1 million scheduled jobs per day with sub-minute precision. What's your architecture?
ANSWER
Use a two-tier approach: a time-wheel in memory for near-future jobs (next 5 minutes) and a database for the rest. The leader loads the next 5 minutes of jobs into a priority queue. Workers pull from the queue. For persistence, write job state to a distributed log (Kafka) and replay on leader failover. Use sharding by job ID to distribute load. Monitor queue depth and worker lag.
01
How does your distributed job scheduler handle a network partition that splits the cluster into two groups, each thinking they're the leader?
SENIOR
02
When would you choose a database-based scheduler (PostgreSQL polling) over a coordination-service-based one (etcd)?
SENIOR
03
What happens when a worker claims a job, starts executing, and then the database connection drops? How do you prevent the job from being lost or duplicated?
SENIOR
04
What is a cron job and how does it work?
JUNIOR
05
You notice that a job runs twice even though you have a unique constraint on the job run ID. What's the likely cause and how do you fix it?
SENIOR
06
Design a job scheduler that can handle 1 million scheduled jobs per day with sub-minute precision. What's your architecture?
SENIOR
FAQ · 4 QUESTIONS
Frequently Asked Questions
01
How do I design a distributed job scheduler that runs exactly once?
You can't guarantee exactly-once in a distributed system. Aim for at-least-once with idempotent jobs. Use a unique idempotency key (e.g., job run UUID) and check it before processing. For critical jobs, use a distributed transaction with two-phase commit, but that's complex and slow.
Was this helpful?
02
What's the difference between a job scheduler and a message queue?
A job scheduler executes tasks at specific times. A message queue delivers messages to consumers as soon as possible. Schedulers need time-based triggers; queues don't. You can combine them: scheduler enqueues a message at the scheduled time, and workers consume from the queue.
Was this helpful?
03
How do I handle a job that takes longer than the scheduler's heartbeat interval?
The worker should periodically update the heartbeat timestamp during execution. Set the watchdog timeout to at least 3x the heartbeat interval. Alternatively, use a per-job timeout stored in the job definition, and have the worker extend the lease if the job is still running.
Was this helpful?
04
What happens if the leader dies mid-schedule? How do jobs get picked up?
The leader's lease expires. Other nodes detect this and hold a new leader election. The new leader scans for jobs that were scheduled but not yet claimed (status='pending') and assigns them. Jobs that were claimed by the dead leader will be reset by the watchdog after the heartbeat timeout.