Skip to content
Home Interview Job Scheduler Design — Preventing Duplicate Execution

Job Scheduler Design — Preventing Duplicate Execution

Where developers are forged. · Structured learning · Free forever.
📍 Part of: System Design Interview → Topic 6 of 7
Recovery scripts re-queued processed jobs, causing duplicate invoices.
🔥 Advanced — solid Interview foundation required
In this tutorial, you'll learn
Recovery scripts re-queued processed jobs, causing duplicate invoices.
  • Use a distributed metadata store to ensure job persistence across scheduler restarts.
  • Implement 'At Least Once' delivery combined with 'Idempotent Workers' to handle network failures safely.
  • Scale horizontally by sharding the delay queue and using distributed coordination for job picking.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer
  • A job scheduler manages time-based execution of tasks across distributed workers
  • Core components: scheduler, metadata store, delay queue, worker pool
  • Redis Sorted Set with score = scheduled timestamp enables O(log n) insertion and O(1) due-check
  • Without idempotent workers, network retries cause duplicate executions
  • Biggest mistake: assuming exactly-once delivery is possible — aim for at-least-once + idempotency
🚨 START HERE

Quick Debug Cheat Sheet for Job Scheduler

One-liner commands and fixes for the top 3 scheduler issues.
🟡

Scheduler not polling delay queue

Immediate ActionCheck if scheduler process is alive. Restart the scheduler container.
Commands
kubectl get pods -l app=scheduler
docker compose logs scheduler --tail 100
Fix Nowkubectl rollout restart deployment/scheduler
🟡

Duplicate job execution

Immediate ActionCheck idempotency key usage. Inspect logs for duplicate job IDs.
Commands
grep -i 'duplicate' /var/log/scheduler.log
redis-cli ZSCORE delay_queue <jobId>
Fix NowEnable idempotency key check and set appropriate visibility timeout.
🟡

Worker timeout / job stuck

Immediate ActionIdentify the job ID and check lease expiry.
Commands
redis-cli GET lease:<jobId>
redis-cli TTL lease:<jobId>
Fix NowManually release lease: redis-cli DEL lease:<jobId>. Then adjust visibility timeout.
Production Incident

The Case of the Duplicate Invoice

A finance scheduler sent duplicate invoices after a recovery from backup.
SymptomCustomers received identical invoices twice for the same billing period.
AssumptionThe scheduler's recovery process is safe to restart from a DB snapshot.
Root causeThe recovery script re-queued all jobs that were in 'PENDING' state at backup time, including jobs that had already been processed but whose status update hadn't been flushed to disk.
FixAdd an idempotency key check on job submission. Ensure recovery skips jobs that were already completed by querying a processed_jobs dedup table.
Key Lesson
Never re-queue jobs during recovery without checking if they've already been executed.Use idempotency keys to prevent duplicate submissions even in normal operations.
Production Debug Guide

Symptoms and immediate actions for common scheduler failures

Job is not executing at scheduled timeCheck the delay queue in Redis: ZRANGEBYSCORE delay_queue 0 <now>. If job missing, check if scheduler is polling. Check scheduler health checks and logs.
Job executes multiple timesLook for duplicate job submissions (idempotency key collisions). Check visibility timeout: if worker takes longer than timeout, job may be re-queued. Verify heartbeat mechanism.
Worker pool is idle but queue is growingCheck if workers have proper queue subscription. Verify worker autoscaling policies. Check for poison pill jobs that cause worker crashes.
Jobs are stuck in 'PROCESSING' status for hoursIdentify the worker that claimed the job. If worker is dead, the lease should expire. If lease TTL is too long, reduce it. Use dead-letter queue for stale processing jobs.

Every production system you've ever used is secretly running a job scheduler behind the scenes. GitHub Actions triggers your CI pipeline. Netflix re-encodes video in background workers. Your bank sends monthly statements at 2 AM. Uber's surge-pricing model re-trains on fresh data every few minutes. None of these happen because someone clicked a button — they happen because a scheduler decided it was time, found a free worker, handed the job over, and made sure it finished. Scheduling is the silent engine of the internet.

The problem a job scheduler solves is deceptively simple: 'run this thing at this time.' But the moment you add scale, reliability, and fairness requirements, the surface area explodes. What happens when the machine running the scheduler dies? What if a job takes ten times longer than expected? How do you stop one noisy tenant from starving everyone else? What if the same job fires twice because of a clock drift? These aren't hypotheticals — they're Tuesday in any company running infrastructure at scale.

By the end of this article you'll be able to walk into a system design interview and confidently sketch a distributed job scheduler from first principles. You'll understand how to choose between push and pull delivery, how to build a reliable delay queue using sorted sets, how to design idempotent workers, how to handle retries with exponential back-off, and how to reason about exactly-once execution guarantees — and why that last one is almost always a lie.

Core Architecture: From Delay Queues to Distributed Execution

A job scheduler isn't just a timer; it's a state machine. To design one that won't lose data, you need four distinct components: an API for job submission, a Metadata Store (Postgres or DynamoDB) to track job states, a Delay Queue (Redis Sorted Sets are the industry standard here) to handle the 'waiting' room, and a Worker Pool to do the heavy lifting.

In a distributed setup, multiple Schedulers poll the Delay Queue. When a job's execution time matches the current timestamp, the Scheduler moves the job from the 'Delayed' set to an 'Active' queue (like RabbitMQ or Kafka). Workers then pull from this queue. To prevent 'Zombie Jobs' (tasks that are assigned but never finish because a worker crashed), we implement a visibility timeout—if the worker doesn't send an 'ACK' within 5 minutes, the job is re-queued.

In production, ensure that the scheduler's polling interval is less than the visibility timeout to avoid latency. Set the visibility timeout based on the 99th percentile job execution time plus a buffer.

io.thecodeforge.scheduler.JobController.java · JAVA
1234567891011121314151617181920212223242526272829303132333435363738394041
package io.thecodeforge.scheduler;

import java.util.UUID;
import java.time.Instant;

/**
 * Production-grade Job Submission Logic.
 * Note the use of Idempotency Keys to prevent duplicate submissions.
 */
public class JobController {
    
    public JobResponse scheduleJob(JobRequest request) {
        String jobId = UUID.randomUUID().toString();
        long scheduledTime = Instant.parse(request.getStartTime()).getEpochSecond();
        
        // In the Forge, we use Redis ZSET for delay management
        // ZADD delay_queue <scheduledTime> <jobId>
        System.out.println("Job [" + jobId + "] added to Redis ZSet at priority: " + scheduledTime);
        
        return new JobResponse(jobId, "ACCEPTED");
    }

    public static void main(String[] args) {
        JobController forgeScheduler = new JobController();
        JobRequest emailJob = new JobRequest("2026-03-15T14:00:00Z", "SEND_EMAIL");
        forgeScheduler.scheduleJob(emailJob);
    }
}

class JobRequest { 
    private String startTime; 
    private String type; 
    public JobRequest(String t, String type) { this.startTime = t; this.type = type; }
    public String getStartTime() { return startTime; }
}

class JobResponse { 
    private String id; 
    private String status; 
    public JobResponse(String id, String s) { this.id = id; this.status = s; }
}
▶ Output
Job [550e8400-e29b-41d4-a716-446655440000] added to Redis ZSet at priority: 1773669600
🔥Forge Tip: Use Two-Phase Commit for Queue/DB Consistency
The most common failure is updating the Database but failing to push to the Queue. Use the Transactional Outbox pattern: write the job to your DB, then a separate process (the Relay) pushes it to the queue once the DB transaction is safe.
📊 Production Insight
The most common production failure is a scheduler crash between DB update and queue push.
Use the Transactional Outbox pattern to guarantee both writes succeed.
Rule: never update the DB and push to queue in the same transaction unless using XA.
🎯 Key Takeaway
Separate job metadata from queue state.
Use idempotency keys to prevent duplicate submissions.
Visibility timeout protects against zombie workers.

Handling Scale: Partitioning and Fault Tolerance

When you have 10 million jobs per second, a single Redis instance becomes a bottleneck. We solve this by sharding the Delay Queue based on a Job_ID hash. Furthermore, we use a distributed locking mechanism (like Redlock or Zookeeper) to ensure that only one Scheduler instance processes a specific time-slice of the queue at once, preventing duplicate job firing.

For worker reliability, we utilize a 'DLQ' (Dead Letter Queue). If a job fails 3 times after exponential back-off, we move it to the DLQ for manual intervention rather than retrying forever and wasting CPU cycles.

In practice, use Redis Cluster for horizontal scaling of the delay queue. For sharding, assign each scheduler a hash slot range to avoid lock contention.

docker-compose.yml · DOCKER
12345678910111213141516
version: '3.8'
services:
  redis-delay-queue:
    image: redis:7.2
    container_name: forge-delay-node
    command: ["redis-server", "--save", "60", "1"] # Persist for reliability
    ports:
      - "6379:6379"

  worker-service:
    image: io.thecodeforge/scheduler-worker:latest
    environment:
      - REDIS_HOST=redis-delay-queue
      - MAX_RETRIES=3
    deploy:
      replicas: 5 # Horizontal scaling in action
▶ Output
Infrastructure ready. Redis persistence enabled for durability.
⚠ Watch Out: The Precision Problem
No distributed scheduler has millisecond precision. If an interviewer asks how to handle a job that must run at exactly 12:00:00.001, explain the trade-offs of clock drift (NTP) and network jitter.
📊 Production Insight
Clock drift across servers causes jobs to fire early or late by hundreds of milliseconds.
Use a centralized monotonic clock service or logical timestamps for high-precision requirements.
Rule: never assume synchronized clocks; always build in tolerance.
🎯 Key Takeaway
Shard by job ID hash to avoid single-node bottlenecks.
Use distributed locks to avoid duplicate polling.
Dead Letter Queue prevents infinite retries.

Worker Lifecycle and Idempotency Design

Workers must be designed to tolerate failure mid-execution. The job status in the metadata store (e.g., 'PROCESSING') is updated before work begins, and a heartbeat mechanism refreshes a 'leases' record in Redis. If the heartbeat stops, another worker can pick up the task after the lease expires. Idempotency is achieved by including a deterministic job ID in all external side effects (e.g., DB insert or API call). The worker checks if the effect already occurred before performing it again. This is especially critical for financial operations where duplicate charges are unacceptable.

A common pattern is to use a 'dedup' table keyed by job ID. On each execution attempt, the worker runs a conditional insert. If the insert fails (unique constraint violation), the work has already been done, and the worker can safely skip the job.

io.thecodeforge.scheduler.Worker.java · JAVA
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455
package io.thecodeforge.scheduler;

import redis.clients.jedis.Jedis;
import java.util.UUID;

public class Worker {
    private static final String DEDUP_PREFIX = "dedup:";
    private static final String LEASE_PREFIX = "lease:";
    private static final int LEASE_TTL_SECONDS = 120;

    public boolean tryAcquireLease(String jobId) {
        try (Jedis jedis = new Jedis("redis-delay-queue")) {
            String result = jedis.set(LEASE_PREFIX + jobId, workerId, "NX", "EX", LEASE_TTL_SECONDS);
            return "OK".equals(result);
        }
    }

    public boolean isAlreadyProcessed(String jobId) {
        try (Jedis jedis = new Jedis("redis-delay-queue")) {
            return jedis.exists(DEDUP_PREFIX + jobId);
        }
    }

    public void markProcessed(String jobId) {
        try (Jedis jedis = new Jedis("redis-delay-queue")) {
            jedis.setex(DEDUP_PREFIX + jobId, 86400, "1"); // expire after 1 day
        }
    }

    public void processJob(Job job) {
        if (!tryAcquireLease(job.getId())) {
            return; // another worker has the lease
        }
        if (isAlreadyProcessed(job.getId())) {
            return; // job already executed successfully
        }
        // execute the actual task
        try {
            performWork(job);
            markProcessed(job.getId());
        } catch (Exception e) {
            // job fails – lease will expire allowing retry
            throw e;
        }
    }

    private void performWork(Job job) { 
        // actual work here 
    }
}

class Job { 
    private String id; 
    public String getId() { return id; } 
}
▶ Output
Lease acquired, dedup checked, work completed.
🔥Forge Tip: Lease-based Task Ownership
Use a Redis key with TTL equal to the visibility timeout. The worker periodically extends the TTL while alive. If the worker dies, the key expires and the job is eligible for re-assignment.
📊 Production Insight
Without heartbeats, a long-running job that is still valid gets re-assigned to another worker after the visibility timeout expires, causing duplicate work.
Always extend TTL proactively every few seconds.
Rule: the lease TTL must be greater than the heartbeat interval.
🎯 Key Takeaway
Idempotent workers + lease-based ownership = at-least-once delivery without duplicates.
Use a dedup table to prevent side effects from reruns.
Heartbeats prevent premature lease expiry.

Handling Retries, Backoff and Dead Letter Queues

When a worker fails to complete a job, the scheduler must retry. Simple immediate retries can cause thundering herd. The standard approach is exponential backoff with jitter. For each job, track the retry count in the metadata store. After each failure, the scheduler increments the count and reschedules the job at time = now + (base_delay * 2^attempt) + random_jitter. After a maximum number of retries, the job moves to a Dead Letter Queue (DLQ) – a separate queue or table that requires manual inspection. Monitoring the DLQ depth is essential to detect systemic failures.

Choose the base delay and maximum retries based on your job's latency tolerance. For critical jobs, you might retry 5 times with a base delay of 30 seconds. For batch jobs, you might retry 3 times with a base delay of 5 minutes.

io.thecodeforge.scheduler.RetryScheduler.java · JAVA
12345678910111213141516171819202122232425262728293031
package io.thecodeforge.scheduler;

import java.time.Instant;
import java.util.Random;
import redis.clients.jedis.Jedis;

public class RetryScheduler {
    private static final int BASE_DELAY_SECONDS = 30;
    private static final int MAX_RETRIES = 5;
    private static final String RETRY_COUNT_KEY = "retries:";
    private static final String DLQ_KEY = "dlq:jobs";

    public void handleFailure(String jobId, Jedis jedis) {
        int retryCount = getRetryCount(jobId, jedis);
        if (retryCount >= MAX_RETRIES) {
            jedis.lpush(DLQ_KEY, jobId);
            System.out.println("Job [" + jobId + "] moved to DLQ after " + retryCount + " retries.");
            return;
        }
        long delay = (long) (BASE_DELAY_SECONDS * Math.pow(2, retryCount) + new Random().nextInt(1000));
        long newScheduledTime = Instant.now().getEpochSecond() + delay;
        jedis.zadd("delay_queue", newScheduledTime, jobId);
        jedis.incr(RETRY_COUNT_KEY + jobId);
        System.out.println("Job [" + jobId + "] rescheduled in " + delay + "s (attempt " + (retryCount+1) + ")");
    }

    private int getRetryCount(String jobId, Jedis jedis) {
        String val = jedis.get(RETRY_COUNT_KEY + jobId);
        return val == null ? 0 : Integer.parseInt(val);
    }
}
▶ Output
Job [job123] moved to DLQ after 5 retries.
⚠ Watch Out: Retry Storm
If all workers fail due to a transient dependency (e.g., DB outage), exponential backoff reduces load. But if you skip jitter, all retries happen at the same time – effectively a self-inflicted DDoS.
📊 Production Insight
A misconfigured retry policy with large base delay can cause unacceptable latency for time-sensitive jobs.
For low-latency requirements, use a separate fast queue with immediate retries and a small max attempt count.
Rule: always use jitter to stagger retry times.
🎯 Key Takeaway
Exponential backoff + jitter prevents retry storms.
DLQ isolates permanently failing jobs.
Monitor DLQ depth as a key operational metric.

Monitoring and Observability for Job Scheduler

A job scheduler is a black box until a job doesn't run. Instrument every component: (1) Queue depth and lag: how many jobs are waiting and how long they've been delayed beyond their scheduled time. (2) Worker pool utilization: percentage of workers busy, queue backlog. (3) Job completion rate: processed jobs per second, success/failure ratio. (4) Retry count distribution: number of jobs on first attempt vs multiple retries. (5) Scheduler health: check that the scheduler process is alive and polling. Use Prometheus metrics and Grafana dashboards. Also export job-level logs with a structured format (job ID, execution time, result) to a central log aggregation system.

Define SLOs: e.g., 99% of jobs start within 5 seconds of their scheduled time. Alert on any deviation. Use distributed tracing to correlate scheduling events with worker execution.

io.thecodeforge.scheduler.MetricsExporter.java · JAVA
12345678910111213141516171819202122232425262728293031323334
package io.thecodeforge.scheduler;

import io.prometheus.client.Counter;
import io.prometheus.client.Gauge;
import io.prometheus.client.Histogram;
import io.prometheus.client.exporter.HTTPServer;

public class MetricsExporter {
    static final Gauge queueDepth = Gauge.build()
        .name("scheduler_queue_depth")
        .help("Number of jobs in delay queue.")
        .register();

    static final Counter jobsProcessed = Counter.build()
        .name("jobs_processed_total")
        .help("Total jobs processed.")
        .labelNames("status")
        .register();

    static final Histogram jobLatency = Histogram.build()
        .name("job_execution_latency_seconds")
        .help("Time from scheduled time to job start.")
        .buckets(0.1, 0.5, 1.0, 2.0, 5.0, 10.0)
        .register();

    public static void main(String[] args) throws Exception {
        HTTPServer server = new HTTPServer(1234);
        // In a real scheduler, call update methods periodically
        queueDepth.set(245);
        jobsProcessed.labels("success").inc(1423);
        jobsProcessed.labels("failure").inc(12);
        System.out.println("Metrics server running on port 1234");
    }
}
▶ Output
Metrics server running on port 1234
🔥Forge Tip: Job Execution Trace
Emit a tracing span for each job execution with parent schedule span. This lets you pinpoint where time is spent – queue wait, worker execution, or retry loops.
📊 Production Insight
Without queue depth monitoring, a scheduler that silently fails to poll the delay queue will cause all jobs to miss their scheduled time until an outage is reported by customers.
Set an alert on queue depth > 10x normal and zero processing rate.
Rule: measure what you need to debug at 3 AM.
🎯 Key Takeaway
Define SLOs for job delivery latency and completion rate.
Alert on queue depth growth and zero worker utilization.
Distributed tracing is the only way to debug end-to-end delays.
🗂 Push vs Pull Delivery
Choosing the right job delivery mechanism
FeaturePush-Based (e.g., Webhooks)Pull-Based (e.g., SQW/Kafka)
ComplexityHigh (Requires tracking state/retries)Low (Worker manages its own pace)
Worker ControlScheduler pushes; can overwhelm workerWorker pulls when ready (Back-pressure)
LatencyNear real-timeDetermined by polling interval
ReliabilityHarder to guarantee deliveryHigh (Job stays in queue until ACK)

🎯 Key Takeaways

  • Use a distributed metadata store to ensure job persistence across scheduler restarts.
  • Implement 'At Least Once' delivery combined with 'Idempotent Workers' to handle network failures safely.
  • Scale horizontally by sharding the delay queue and using distributed coordination for job picking.
  • Always implement exponential back-off and Dead Letter Queues (DLQ) for failed tasks.
  • Instrument every component: queue depth, processing rate, retry count, and latency. Set alerts on anomalies.
  • Heartbeat mechanism prevents premature re-assignment of long-running jobs.

⚠ Common Mistakes to Avoid

    Not making workers Idempotent
    Symptom

    Networks fail. A worker might finish the job but fail to send the ACK. The scheduler will re-run it. If your code isn't idempotent, you'll charge the customer twice.

    Fix

    Include a dedup check before executing side effects. Use deterministic job IDs to detect duplicates. Ensure that repeating an operation has no additional impact.

    Using cron on a single server
    Symptom

    If that server goes down, the entire business logic stops. No jobs run until manual recovery.

    Fix

    Use a distributed job scheduler with replicated metadata and multiple scheduler instances. Implement high availability via leader election (e.g., Zookeeper or etcd).

    Ignoring Clock Drift
    Symptom

    '12:00 PM' isn't the same on every machine. Jobs may fire early or late by hundreds of milliseconds.

    Fix

    Use relative offsets from a monotonic clock, or base timing on a centralized time source (e.g., NTP with tight bounds). For high precision, use logical timestamps.

    Lack of Monitoring
    Symptom

    Queue Depth increases silently. Jobs accumulate faster than workers can process them, but no alert fires. Eventually system falls over.

    Fix

    Export queue depth, processing rate, and latency metrics to Prometheus. Set alerts on queue depth > expected max and processing rate = 0 for 2 minutes.

    Visibility Timeout Too Short
    Symptom

    Long-running jobs timeout before completion. Another worker picks up the same job, causing duplicate processing.

    Fix

    Set visibility timeout based on the 99th percentile job execution time plus a buffer. Implement heartbeat mechanism to extend timeout dynamically.

Interview Questions on This Topic

  • QHow would you design a delay queue for a distributed job scheduler?SeniorReveal
    Use Redis Sorted Sets where the score is the scheduled Unix timestamp. The scheduler polls with ZRANGEBYSCORE to fetch jobs due now. For scale, shard the set by job ID hash across multiple Redis nodes. Ensure durability with AOF persistence. Use a separate active queue (e.g., Kafka) for ready jobs to decouple scheduling from execution.
  • QHow do you ensure exactly-once execution in a job scheduler?Mid-levelReveal
    Exactly-once is theoretically impossible in distributed systems. Instead, aim for at-least-once delivery combined with idempotent workers. Idempotency means that executing the same job multiple times has the same effect as executing it once. Use dedup keys, optimistic locking, or conditional inserts to detect and skip duplicates.
  • QHow do you handle a job that takes longer than the visibility timeout?SeniorReveal
    Implement a heartbeat mechanism: the worker periodically updates a lease key in Redis with a new TTL. The scheduler checks the lease before re-queuing. If the lease is still valid, it leaves the job alone. If the lease expires, the job is re-available for another worker. Ensure the heartbeat interval is less than the visibility timeout.
  • QWhat is the Transactional Outbox pattern and why is it important for job schedulers?SeniorReveal
    The Transactional Outbox pattern ensures atomicity when writing to the database and sending a message to a queue. Instead of directly publishing to the queue, you insert a row in an 'outbox' table within the same DB transaction. A separate relay process reads the outbox and pushes messages to the queue. This guarantees that either both the DB write and queue push happen, or neither does.
  • QHow do you scale the delay queue beyond a single Redis instance?SeniorReveal
    Shard the delay queue based on a hash of the job ID. Use Redis Cluster with proper key distribution. Alternatively, use consistent hashing to map job IDs to shards. Each scheduler instance owns a subset of shards and polls only those. This eliminates lock contention and allows linear scaling.

Frequently Asked Questions

How do you handle 'Heavy' jobs that take hours?

Don't block the worker. The worker should update the job status to 'IN_PROGRESS' in the DB and then spawn a separate process or heartbeat mechanism. If the heartbeat stops, the scheduler assumes the worker died and resets the job.

What is the best way to implement a Delay Queue?

For most scale, Redis Sorted Sets (ZSet) where the score is the unix timestamp. Use ZRANGEBYSCORE to pull jobs that are due now. For extreme scale, specialized databases like RocksDB or tiered storage in Kafka are used.

How do you ensure 'Exactly Once' execution?

In distributed systems, you cannot achieve truly exactly-once delivery. You aim for 'At Least Once' delivery and ensure the 'Execution' is idempotent—meaning if the job runs twice, the second run has no effect.

What happens if the Redis delay queue goes down?

Jobs in memory are lost if Redis is not persisted. Use AOF persistence and snapshotting. Consider a fallback queue (e.g., a table in the metadata store) that the scheduler can fall back to during Redis outages. Also implement a reconciliation process that checks the metadata store for pending jobs after recovery.

How do you handle priority jobs in a job scheduler?

Use separate delay queues for different priority levels (e.g., high, medium, low). The scheduler polls high-priority queues more frequently. Alternatively, use a single queue with a weighted score where score = scheduled_time - priority_boost. Workers pick jobs by scanning the top of the queue.

🔥
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← PreviousDesign a Caching SystemNext →Design a Leaderboard System
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged