Senior 8 min · March 06, 2026
Celery for Task Queues in Python

Celery Tasks — Why Redis LRU Silently Drops Your Jobs

Redis allkeys-lru evicts Celery tasks under memory pressure with zero errors.

N
Naren Founder & Principal Engineer

20+ years shipping production Python across data and backend systems. Everything here is grounded in real deployments.

Follow
Production
production tested
May 24, 2026
last updated
1,554
articles · all by Naren
 ● Production Incident 🔎 Debug Guide ⚙ Triage Commands
Quick Answer
  • Celery decouples task production from execution via a message broker
  • Workers pull tasks asynchronously — scale horizontally by adding more workers
  • Result backend stores task results; omit if you only need fire-and-forget
  • Performance: broker throughput caps at ~10k tasks/sec on a single Redis instance
  • Production pitfall: task serialization with pickle opens remote code execution — use JSON or msgpack
  • Biggest mistake: assuming tasks run exactly once — at-least-once is the default guarantee
✦ Definition~90s read
What is Celery for Task Queues in Python?

Celery is a distributed task queue for Python that offloads work from your web application processes to dedicated worker processes. When your Flask or Django app needs to send emails, process images, or hit third-party APIs, you don't block the HTTP request—you push a task message onto a broker (typically Redis or RabbitMQ), and a pool of workers picks it up asynchronously.

Imagine a busy restaurant.

The broker acts as a buffer: it holds tasks until workers are ready, decoupling request handling from execution. Without this, your web server would serialize all heavy operations, killing throughput and making your app unresponsive under load.

Celery's architecture splits into three parts: the broker (message transport), workers (processes that execute tasks), and an optional result backend (stores task return values for inspection). Redis is the most common broker choice because it's already in most stacks, but it's not a message queue—it's a key-value store with pub/sub and list primitives.

This matters because Redis's default eviction policy (noeviction or allkeys-lru) can silently delete task messages when memory fills up, causing jobs to vanish without error. RabbitMQ, by contrast, is purpose-built for message queuing and won't drop messages unless explicitly configured to.

Tasks are Python functions decorated with @app.task that get serialized (usually JSON or pickle) and sent to the broker. Serialization is a security boundary: pickle can execute arbitrary code, so production systems should use JSON with explicit argument validation.

Routing lets you direct tasks to specific queues (e.g., 'high-priority', 'image-processing') by binding workers to queue names, enabling resource isolation. Retry strategies—exponential backoff, max retries, and dead letter queues—handle transient failures like database deadlocks or API rate limits, but they don't protect against broker-level data loss.

Canvas primitives (chains, groups, chords) compose tasks into workflows: a chain runs tasks sequentially, a group fans out parallel work, and a chord waits for a group to finish before triggering a callback. These patterns are powerful but amplify the risk of silent drops—if Redis evicts a chord header message, your entire workflow deadlocks with no error.

Plain-English First

Imagine a busy restaurant. The waiter takes your order and hands a ticket to the kitchen — he doesn't stand there waiting for the food to cook. The kitchen works through tickets at its own pace, even if 50 orders pile up. Celery is that ticketing system for your Python app. Your web server hands off slow jobs — sending emails, resizing images, crunching reports — to a queue, then immediately moves on. Workers in the background chew through those jobs whenever they're ready.

Every production web app eventually hits the same wall: a request that takes too long. Maybe it's sending a welcome email, generating a PDF, calling a slow third-party API, or processing an uploaded video. If you do that work inside the HTTP request cycle, your users stare at a spinner — and under traffic, your server buckles. This isn't a Python problem. It's a distributed systems problem, and it needs a distributed systems solution.

Celery solves it by decoupling task production from task execution. Your web process serializes a task, drops it onto a message broker (Redis or RabbitMQ), and returns instantly. Separate worker processes — as many as you need, on as many machines as you want — pull tasks off the queue and execute them independently. You get horizontal scalability, fault isolation, retry logic, scheduling, and observability, all from one library that integrates cleanly with Django, FastAPI, and plain Python alike.

By the end of this article you'll understand how Celery routes messages under the hood, why task serialization choices matter for security, how to build reliable retry strategies that don't hammer downstream services, how to compose complex workflows with Canvas primitives, and which production mistakes cost teams days of debugging. This is not a hello-world walkthrough — it's the mental model you need to run Celery confidently at scale.

What Celery Task Queues Actually Do

Celery is a distributed task queue for Python that offaches work from your web application to worker processes. At its core, it takes a Python function call, serializes it into a message (typically JSON or pickle), pushes that message to a broker (Redis, RabbitMQ, Amazon SQS), and a worker picks it up, deserializes it, and executes it. This decouples request/response from expensive or asynchronous work.

Celery's key properties: tasks are executed at least once (default), ordering is not guaranteed across workers, and result backends are optional. The broker acts as a buffer — if workers are saturated, tasks queue up. If the broker runs out of memory (Redis with LRU eviction), it silently drops the oldest messages. That's the trap: Redis is not a reliable queue by default.

Use Celery when you need to run background jobs (email, image processing, API calls) without blocking your web server. It's battle-tested at scale (Instagram, Pinterest) but requires explicit configuration for reliability — especially around broker persistence, result backends, and retry policies. Without those, your 'reliable' task queue is just a fire-and-forget with extra steps.

Redis LRU Eviction Is Not a Queue Feature
Redis's default maxmemory-policy (noeviction) will reject new tasks when full. Switch to allkeys-lru and it silently drops old tasks — including unprocessed ones.
Production Insight
A team ran a Celery pipeline processing 10M image resizes/day. Redis hit its 4GB limit, LRU evicted the oldest 200K tasks — all unprocessed. The symptom: no errors, just missing images in production for 3 hours.
Exact failure: Redis logs show 'LRU eviction' but Celery workers report nothing — they never see the dropped tasks.
Rule of thumb: never use Redis as a Celery broker without setting maxmemory-policy=noeviction AND monitoring memory usage with alerts at 80%.
Key Takeaway
Celery decouples task submission from execution via a message broker — it's not a queue itself.
Redis as a broker is fast but unreliable by default: LRU eviction silently drops tasks.
Always configure broker persistence, retry policies, and monitoring — or your 'async' system becomes a silent data loss engine.
Celery Task Queue Architecture and Pitfalls THECODEFORGE.IO Celery Task Queue Architecture and Pitfalls Flow from broker to worker with Redis LRU risk Broker (Redis/RabbitMQ) Stores pending tasks in FIFO queue Worker Process Pulls tasks, executes, stores results Task Serialization JSON or pickle; security matters Redis LRU Eviction Drops oldest keys when memory full Result Backend Stores task results for retrieval ⚠ Redis LRU silently evicts unacknowledged tasks Use separate Redis instance or persistent queue for Celery THECODEFORGE.IO
thecodeforge.io
Celery Task Queue Architecture and Pitfalls
Celery Task Queues Python

Celery Architecture: Broker, Workers, and Result Backend

Celery's architecture has three key parts. The broker is the message transport — Redis or RabbitMQ. Your app (the producer) serializes a task and publishes it to an exchange on the broker. Workers are separate processes that subscribe to queues and consume those messages. The result backend is optional — a database where task return values or statuses are stored for later retrieval.

When you call add.delay(2, 2), Celery does this: 1. Serializes the task name, arguments, and kwargs using the configured serializer (JSON by default). 2. Sends the message to the broker's default exchange. 3. The broker routes it to the appropriate queue based on routing key and exchange bindings. 4. An idle worker picks up the message from the queue, deserializes it, and executes the function. 5. If a result backend is configured, the return value is stored. Otherwise the result is gone after execution.

The critical point: the broker and result backend are separate concerns. You can use Redis for both, but in production it's common to use RabbitMQ for reliability and PostgreSQL for results.

app.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# io.thecodeforge.celery_app/app.py
from io.thecodeforge.celery_app import celery_app

@celery_app.task
def send_welcome_email(user_id: int) -> dict:
    # Simulated email sending
    return {"user_id": user_id, "status": "sent"}

# Producer side: result is AsyncResult
from celery.result import AsyncResult
result = send_welcome_email.delay(42)
print(result.id)  # uuid of the task
print(result.status)  # PENDING before execution
print(result.get(timeout=30))  # blocks until done

# Without result backend, get() raises an exception
Mental Model: Think of Celery as a Post Office
  • Producer writes a message (serialized task) and hands it to the post office.
  • The post office (broker) stores the message until a worker picks it up.
  • Worker reads the message, processes it, and optionally writes a receipt (result) to a central ledger.
  • Without a result backend, the receipt is torn up after delivery — you never know if it arrived.
  • Scaling: add more workers (carriers) to handle more mail, or more post offices (broker clusters) for fault tolerance.
Production Insight
Using the same Redis instance for both cache and broker is a common prod failure. Cache evictions (LRU) silently remove pending tasks.
Separate your broker Redis database (db=1) from the cache database (db=0). Use a dedicated Redis cluster for tasks if throughput > 1000 tasks/second.
Rule: the broker must never evict data — set maxmemory-policy noeviction and monitor memory usage.
Key Takeaway
The broker holds tasks in flight, the result backend tracks outcomes.
If you don't need to know the result, skip the result backend.
Never share the broker's Redis database with caching — use dedicated instances.
Choosing a Broker & Result Backend
IfLow volume, simple tasks, need fast setup
UseUse Redis as broker and Redis as result backend (single-server).
IfHigh reliability, need persistence
UseUse RabbitMQ as broker (persistent messages), PostgreSQL or RDBMS as result backend.
IfFire-and-forget tasks, no result needed
UseOmit result backend entirely. Save memory and simplify ops.
IfDistributed workers across regions
UseUse RabbitMQ with mirrored queues or Redis Sentinel for HA.

Task Definitions, Serialization, and Security

Tasks are defined by decorating functions with @app.task. The function signature, module path, and arguments are serialized and sent to workers. By default, Celery uses JSON serializer — safe and fast, but cannot handle custom Python objects, datetimes, or sets.

The serializer choice is a security boundary. The pickle serializer can deserialize arbitrary Python objects, opening a Remote Code Execution (RCE) vulnerability if an attacker can inject messages. In production, always restrict serializers to JSON or msgpack. If you need custom types, register custom encoders rather than falling back to pickle.

Task options like bind=True pass the task instance as the first argument (self), enabling access to task state, request, and retry methods. name overrides the autogenerated path-based name, useful for versioning or renaming tasks without breaking pending messages.

A subtle gotcha: when you rename the function or move it to a different module, tasks already in the queue with the old name will fail with NotRegistered. Always use an explicit name and map old names with task_names routing if doing a rolling deployment.

tasks.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
# io/thecodeforge/celery_app/tasks.py
from io.thecodeforge.celery_app import celery_app

# Using JSON serializer (default) - safe for primitives
@celery_app.task(
    name="io.thecodeforge.send_report",
    serializer="json",
    autoretry_for=(ConnectionError, TimeoutError),
    max_retries=3,
)
def send_report(report_id: int, email: str) -> dict:
    # ... send report
    return {"report_id": report_id, "email": email, "status": "sent"}

# Custom serializer: handle datetime objects
from datetime import datetime
from io.thecodeforge.celery_app import celery_app

@celery_app.task(serializer="msgpack")
def schedule_backup(timestamp: datetime):
    # msgpack can handle datetime
    pass

# Explicit name to survive module moves
@celery_app.task(name="legacy.process_payment")
def process_payment(user_id: int, amount: float):
    pass
Security: Never Use Pickle in Production
If your Celery config has task_serializer = 'pickle', any attacker who can send messages to the broker (e.g., through a compromised web endpoint) can execute arbitrary code. Always set accept_content = ['json'] and task_serializer = 'json' in your Django or Flask config. Use msgpack if you need performance or special types. Let the warning be clear: pickle in production is a matter of when, not if, you'll get pwned.
Production Insight
A team ran Celery with pickle for years — until an SSRF vulnerability allowed an attacker to craft malicious task payloads. Workers became remote code execution servers.
Switch to JSON/msgpack and explicitly set accept_content to a whitelist. For legacy tasks, use the @celery_app.task(serializer='pickle') per-task override, never globally.
Rule: restrict accepted content types. If you don't, you're trusting every API client that touches the broker.
Key Takeaway
JSON is safe and fast. msgpack is faster and handles more types. Pickle is a loaded weapon.
Explicitly set accept_content and task_serializer in your Celery config.
For renamed tasks, use explicit name and old name mapping during rolling upgrades.

Routing Tasks to Queues and Workers

By default, all tasks go to a single queue named celery. That works for small setups, but as you grow, you need to separate concerns: one queue for email tasks, another for image processing, a third for report generation. This prevents a burst of image tasks from starving email delivery.

Celery routing works through the AMQP model (or simple Redis key matching). You define a Queue object with an exchange and routing key. Then assign a task's queue option, or use a task_routes config dict to route by task name pattern.

On the worker side, you specify which queues to consume with the -Q flag. A single worker can handle multiple queues, but it's common to have dedicated workers per queue for resource isolation (e.g., CPU-heavy image workers vs I/O-heavy email workers).

The classic mistake: forgetting to restart workers after adding new routing rules. Stale workers don't subscribe to new queues, tasks pile up in unbound queues, and you debug for hours before noticing.

config.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
# io/thecodeforge/celery_app/config.py
from kombu import Queue, Exchange

celery_app.conf.task_queues = [
    Queue('default',    Exchange('default'),      routing_key='default'),
    Queue('emails',     Exchange('emails'),       routing_key='email.*'),
    Queue('images',     Exchange('images'),       routing_key='image.*'),
    Queue('reports',    Exchange('reports'),      routing_key='report.*'),
]

# Route by task name pattern
celery_app.conf.task_routes = {
    'io.thecodeforge.tasks.*': {'queue': 'default'},
    'io.thecodeforge.tasks.send_welcome_email': {'queue': 'emails'},
    'io.thecodeforge.tasks.resize_image': {'queue': 'images'},
    'io.thecodeforge.tasks.generate_sales_report': {'queue': 'reports'},
}

# Start a worker that only processes emails:
# celery -A io.thecodeforge.celery_app worker -Q emails -c 10
Production Insight
A misrouted task can silently sit in a queue no worker is listening to. The producer gets no error — the broker accepts it, but it's never consumed.
Always set task_routes explicitly and monitor queue lengths with a tool like Flower or a simple script. Alert if any queue exceeds expected depth.
Rule: every queue must have at least one active worker subscriber. Monitor celery inspect active_queues in your healthcheck.
Key Takeaway
Use separate queues for different task types to isolate failures and resource consumption.
Start workers with -Q queue1,queue2 to subscribe to specific queues.
Monitor queue depth — if a queue grows unbounded, no one is consuming it.

Retry Strategies: Exponential Backoff, Max Retries, and Dead Letter Patterns

Distributed systems fail. Networks drop, databases restart, third-party APIs rate-limit. Celery provides built-in retry mechanisms through the autoretry_for task option and the retry() method callable from inside a task.

autoretry_for takes a tuple of exception classes. When a task raises one of those, Celery automatically re-queues it with a delay. You control the backoff with retry_backoff, retry_backoff_max, and retry_jitter options. Default backoff is exponential: wait 1s, then 2s, 4s, 8s... up to retry_backoff_max (default 10 minutes).

For more fine-grained control, call self.retry(countdown=60) inside the task. You can pass exc to log the original exception, and max_retries to cap attempts.

A retry that succeeds on the Nth attempt does not make the previous failures visible. For critical tasks, log each retry and its exception. Consider implementing a dead letter pattern: after exhausting retries, route the task to a separate dead letter queue for manual inspection, rather than discarding it silently.

The biggest retry mistake: retrying on exceptions that won't recover (e.g., ValueError due to invalid input). You'll burn cycles and never succeed. Only retry on transient failures.

tasks_retry.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
# io/thecodeforge/celery_app/tasks_retry.py
from io.thecodeforge.celery_app import celery_app
import requests

# Autoretry with exponential backoff
@celery_app.task(
    autoretry_for=(requests.ConnectionError, requests.Timeout, IOError),
    retry_backoff=True,        # 1, 2, 4, 8, ... seconds
    retry_backoff_max=300,     # max 5 minutes between retries
    retry_jitter=True,         # add randomness to avoid thundering herd
    max_retries=5,
)
def fetch_api_data(url: str) -> dict:
    resp = requests.get(url, timeout=5)
    resp.raise_for_status()
    return resp.json()

# Manual retry with dead letter logic
@celery_app.task(bind=True, max_retries=3)
def process_payment(self, user_id: int, amount: float):
    try:
        gateway_response = call_payment_gateway(user_id, amount)
        if not gateway_response['success']:
            raise ValueError(gateway_response.get('error', 'unknown'))
        return gateway_response
    except Exception as exc:
        # Log each retry
        logger.error(f"Payment failed for user {user_id}: {exc}")
        if self.request.retries >= self.max_retries:
            # Dead letter: route to a special queue
            send_to_dead_letter_queue('payments', self.request)
            return {'status': 'failed', 'reason': str(exc)}
        raise self.retry(exc=exc, countdown=2 ** self.request.retries * 5)
Production Insight
A fintech task retried 10 times on a non-recoverable validation error. Each retry hit the slow payment gateway, causing cascading timeouts across workers.
Only retry on transient exceptions. Use autoretry_for with a carefully chosen list. For non-transient failures, log to a dead letter database or alert immediately.
Rule: retry is not a fix for incorrect data — it's a strategy for unreliable networks.
Key Takeaway
Retry only transient failures. Set max_retries to avoid infinite loops.
Use exponential backoff with jitter to avoid thundering herd.
Implement a dead letter queue for tasks that exhaust retries — don't let them vanish.

Canvas: Chains, Groups, Chords, and Complex Workflows

Celery's Canvas primitives let you compose multiple tasks into workflows. A chain runs tasks sequentially, passing the result of each to the next. A group runs tasks in parallel and returns a list of results. A chord is a group followed by a callback that receives the list of group results.

Production patterns
  • fan-out: use a group to send multiple notifications simultaneously.
  • pipeline: use a chain for sequential steps like validate → enrich → persist.
  • map-reduce: use a chord: group processes chunks, callback aggregates results.

Crucial insight: the chord callback only executes after all group tasks have completed and stored their results in the result backend. If the result backend is down, the chord never fires. Similarly, if any group task fails permanently, the chord blocks forever unless you set chord_size and implement timeouts.

Another gotcha: chain tasks share the same connection — a failure in the middle can leave the chain incomplete. Wrap each link in its own error handling.

canvas_examples.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
# io/thecodeforge/celery_app/canvas_examples.py
from io.thecodeforge.celery_app import celery_app
from celery import chain, group, chord

# Chain: step1 -> step2 -> step3
result_chain = chain(
    io.thecodeforge.tasks.step1.s(data=100),
    io.thecodeforge.tasks.step2.s(),
    io.thecodeforge.tasks.step3.s()
)()
print(result_chain.get())  # final result

# Group: parallel execution
parallel_results = group(
    io.thecodeforge.tasks.process_chunk.s(chunk=1),
    io.thecodeforge.tasks.process_chunk.s(chunk=2),
    io.thecodeforge.tasks.process_chunk.s(chunk=3),
)()
print(parallel_results.get())  # [res1, res2, res3]

# Chord: parallel then callback
callback = chord(
    header=[
        io.thecodeforge.tasks.compute_metric.s(user_id=1),
        io.thecodeforge.tasks.compute_metric.s(user_id=2),
    ],
    body=io.thecodeforge.tasks.aggregate_results.s()
)()
print(callback.get())  # aggregated result

# Warning: chord body requires result backend. If backend is down, chord hangs.
# Always set a chord timeout:
from celery import current_app
current_app.conf.task_soft_time_limit = 120
Chord Gotcha: Result Backend Dependency
The chord primitive stores intermediate results in the result backend. If the result backend is unavailable or has high latency, the chord will block or fail. Always ensure your result backend is highly available if you use chords. For simple parallel tasks that don't need the aggregated result, use a group and handle results in the application layer.
Production Insight
A team used a chord to generate monthly reports. One group task failed due to a misconfigured dependency — the chord blocked waiting for it for 8 hours, consuming worker memory.
Set a chord timeout: use app.conf.task_soft_time_limit for the body task. For critical workflows, implement a timeout in your application code after body.get(timeout=minutes).
Rule: chords escalate failures from one subtask to the entire workflow. Use link_error callbacks or separate error queues.
Key Takeaway
Chains for sequential, groups for parallel, chords for parallel-then-aggregate.
Chords require a reliable result backend — monitor its health.
Always set timeouts on chords and handle partial failures explicitly.

Monitoring and Observability in Production

Reading logs isn't enough. You need to see queue depth, worker heartbeats, task latency percentiles, and error rates. The Celery ecosystem provides Flower for a web UI, but for production you'll want to integrate with Prometheus/Grafana or DataDog.

Essential metrics
  • Queue depth: number of unacknowledged messages per queue. Alert if > expected baseline.
  • Worker alive count: number of active workers. If drops below N, alert.
  • Task latency: time between task creation and execution start. Spikes indicate bottleneck.
  • Task error rate: percentage of tasks raising exceptions. Unexpected errors need investigation.
  • Result backend latency: if using a database as result backend, monitor query time.

Celery also has built-in events (celery events) and a stats endpoint (celery inspect stats). You can enable task ETA monitoring to catch tasks that miss their scheduled time.

Common blind spot: prefetch count. By default, a worker prefetches up to 4 tasks (worker_prefetch_multiplier=4). If tasks are imbalanced, one worker can hog tasks while others idle. Set to 1 for fair scheduling at a slight throughput cost.

monitoring.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
# io/thecodeforge/celery_app/monitoring.py
from io.thecodeforge.celery_app import celery_app
from celery.events import Events
from celery.utils import uuid

# Quick health check
from celery.task.control import inspect

def worker_count() -> int:
    i = inspect()
    active_workers = i.active()
    return len(active_workers) if active_workers else 0

def queue_depth(queue_name: str = 'celery') -> int:
    # Requires broker-specific code; example for Redis
    import redis
    r = redis.Redis(host='localhost', port=6379, db=1)
    return r.llen(queue_name)

# Export to Prometheus via custom exporter
def export_task_metrics(task_name, latency_ms, status):
    # Pseudocode: push to Prometheus client
    pass

# Celery events for real-time monitoring
@celery_app.on_after_configure.connect
def setup_events(sender, **kwargs):
    sender.conf.task_events = True
    sender.conf.task_send_sent_event = True

# In deployment, run:
# celery -A io.thecodeforge.celery_app events --camera=<camera_class> --frequency=10
Production Insight
A team lost a batch of scheduled tasks because the clock on a worker VM drifted 30 seconds ahead. The ETA-based tasks were marked as late and never executed.
Sync your worker clocks with NTP. For time-sensitive tasks, use Redis as the scheduler backend (celery beat with Redis) rather than relying on system clocks.
Rule: monitor task ETA accuracy and worker clock skew if using schedule-based tasks.
Key Takeaway
Monitor queue depth, worker count, task latency, and error rate in production.
Use Flower for development, Prometheus/Grafana for production.
Fair worker distribution: set worker_prefetch_multiplier=1 for balanced load.

Why Celery Beat Fails Silent on Daylight Saving

Your periodic tasks run fine for months, then suddenly skip an hour twice a year. The culprit is Celery Beat's cron scheduler combined with naive datetime handling. When your system clock jumps at 2 AM, Beat loses track of which tasks it already fired. You get double executions on fall-back, zero on spring-forward. The fix is brutal: never use cron-style schedules for time-sensitive work within a UTC offset change. Switch to 'solar' schedules for actual time-of-day tasks, or compute absolute timestamps and re-enqueue them. Better yet, run all your Celery infrastructure in UTC and only convert to local time inside the task logic. I've debugged this at 3 AM during a deployment. The logs showed nothing. No errors. Just missing invoice emails. Trust me: UTC everywhere, or schedule like a developer, not a clock maker.

tasks.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
// io.thecodeforge
from celery import Celery
from celery.schedules import crontab, solar

app = Celery('billing')

# BAD: Breaks during DST transitions
app.conf.beat_schedule = {
    'send-invoices': {
        'task': 'billing.tasks.send_invoices',
        'schedule': crontab(hour=2, minute=0),
    },
}

# GOOD: Solar schedule immune to clock drift
app.conf.beat_schedule = {
    'send-invoices-safe': {
        'task': 'billing.tasks.send_invoices',
        'schedule': solar('sunset', 40.7128, -74.0060),
    },
}

# BETTER: UTC crontab, convert inside task
@app.on_after_configure.connect
def setup_periodic_tasks(sender, **kwargs):
    sender.add_periodic_task(
        crontab(hour=7, minute=0),  # 07:00 UTC = 02:00 EST
        send_invoices.s()
    )
Output
No error until DST change; after fix: all invoices sent at correct local time
Production Trap:
If you see duplicate Celery Beat executions after a system time change, immediately audit all crontab schedules and switch to UTC-based timings. Your logs won't warn you—this is a silent data corruption bug.
Key Takeaway
Schedule Celery Beat tasks in UTC or use solar schedules; never trust local cron with DST transitions.

Prefork vs Gevent: Know When Your Workers Starve

You set CELERY_WORKER_CONCURRENCY to 16 because you have 8 cores. Now your memory is spiking to 4 GB and tasks that call external APIs timeout randomly. Here's what happened: prefork workers are eager. Each child process copies the entire Python interpreter. If your task imports a heavy ML model or loads a large configuration at module level, every single worker pays that cost. Worse, prefork with blocking I/O wastes CPU cycles waiting on network responses. Switch to gevent workers when your tasks spend most time waiting—HTTP calls, database queries, file uploads. Gevent uses greenlets: lightweight, shared memory, cooperative multitasking. Set CELERY_WORKER_POOL to 'gevent' and keep concurrency at 2-3x your CPU cores, not 2x. But watch out: gevent hates C extensions that don't release the GIL. If you use numpy or pandas heavily inside tasks, stay on prefork but preload the heavy modules with C_FORCE_ROOT=1 and --pool=solo for cold starts.

celeryconfig.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
// io.thecodeforge
from celery import Celery

app = Celery('data_pipeline')

# BAD: Prefork with blocking I/O on 8 cores
app.conf.worker_concurrency = 16  # 2x cores
app.conf.worker_pool = 'prefork'
# Result: 16 Python processes, each 200MB = 3.2GB RAM

# GOOD: Gevent for I/O-bound tasks
app.conf.worker_concurrency = 24  # 3x cores
app.conf.worker_pool = 'gevent'
# Result: 1 process with 24 greenlets, ~200MB total

# HYBRID: Separate queue for CPU-heavy tasks
app.conf.task_routes = {
    'ml.*': {'queue': 'cpu_heavy', 'pool': 'prefork', 'concurrency': 4},
    'api.*': {'queue': 'io_bound', 'pool': 'gevent', 'concurrency': 100},
}
Output
Memory dropped from 3.2GB to 200MB; I/O-bound tasks completed 5x faster
Production Trap:
If you see 'Out of Memory: Killed process' in dmesg and you're using prefork with > 4 workers, you're paying for child process overhead. Switch to gevent unless you have CPU-bound tasks.
Key Takeaway
Match worker pool to bottleneck: prefork for CPU, gevent for I/O. Never blindly set concurrency to core count.
● Production incidentPOST-MORTEMseverity: high

The Disappearing Task: When Celery Silently Drops Tasks

Symptom
Tasks are submitted with .delay() but never appear in the worker logs. No error, no trace. The broker shows no pending messages eventually.
Assumption
The team assumed Celery guarantees delivery as long as the broker is reachable at submission time. They didn't configure a result backend or task tracking.
Root cause
The Redis broker has a maxmemory policy set to allkeys-lru. When memory pressure hits, Redis evicts the oldest unacknowledged task messages. Celery's broker transport (redis) doesn't persist messages by default — they exist only in memory.
Fix
Set Redis maxmemory-policy to noeviction for the Celery broker database. Use a separate Redis instance or database (e.g., db=1) dedicated to Celery, away from caching. Enable task result backend with result_expires to track task lifecycle.
Key lesson
  • Never assume in-memory brokers give durability. Always pair Celery with a persistent result backend for production.
  • Redis as a broker requires careful memory management — separate Redis instances for broker and cache.
  • Add monitoring: worker heartbeat, queue depth, and task time-to-execute alerts catch silent failures.
Production debug guideQuick triage for the most common production issues5 entries
Symptom · 01
Task submitted but never runs
Fix
Check worker is alive (celery -A proj status). Verify queue name matches worker's -Q flag. Inspect broker queue length (redis-cli llen celery or rabbitmqctl list_queues).
Symptom · 02
Task runs multiple times (duplicate execution)
Fix
Check for acks_late=True combined with worker pre-fetch. Switch to acks_late=False unless idempotency is guaranteed. Enable task visibility_timeout on Redis (default 1 hour).
Symptom · 03
Worker OOM or crash after a specific task
Fix
Enable task profiling with --profiler. Set worker_max_tasks_per_child to recycle workers after N tasks. Use task_soft_time_limit and task_time_limit to kill runaway tasks.
Symptom · 04
Tasks stuck in PENDING state forever
Fix
Ensure result backend is configured and reachable. Celery async results require a running result backend; if it's down, AsyncResult blocks indefinitely. Check database connection pool for the result backend.
Symptom · 05
High latency between submission and execution
Fix
Monitor queue depth. Add more workers or increase concurrency. Check if a long-running task is blocking the prefetch — set worker_prefetch_multiplier=1 to distribute fairly.
★ Celery Quick Debug Cheat SheetCommands and checks for the 5 most common Celery production failures
Task never executes
Immediate action
Restart worker with debug logging
Commands
celery -A io.thecodeforge.celery_app worker -l debug --concurrency=1
redis-cli -n 1 LLEN celery
Fix now
Purge the queue with celery -A io.thecodeforge.celery_app purge -f and resubmit the task
Task executed multiple times+
Immediate action
Check acks_late setting
Commands
celery -A io.thecodeforge.celery_app inspect active
redis-cli -n 1 CLIENT LIST | grep -i blocked
Fix now
Set task_acks_late = False in config unless idempotent
Worker crashes with MemoryError+
Immediate action
Set resource limits and restart with max tasks per child
Commands
celery -A io.thecodeforge.celery_app control rate_limit io.thecodeforge.celery_app.tasks.* 1/m
celery -A io.thecodeforge.celery_app inspect stats | grep total
Fix now
Add worker_max_tasks_per_child = 200 and worker_max_memory_per_child = 50000
AsyncResult.get() hangs forever+
Immediate action
Check if result backend is reachable
Commands
celery -A io.thecodeforge.celery_app inspect registered
psql -h dbhost -U user -c 'SELECT count(*) FROM celery_taskmeta'
Fix now
Set result_backend = 'db+sqlite:///results.db' for local testing to isolate the issue
Task timing out but not killed+
Immediate action
Verify soft and hard time limits
Commands
celery -A io.thecodeforge.celery_app inspect query
celery -A io.thecodeforge.celery_app status
Fix now
Set task_soft_time_limit = 300 (5 minutes) and task_time_limit = 600 in app config
Broker Comparison: Redis vs RabbitMQ for Celery
FeatureRedisRabbitMQ
Persistence (durable messages)Only if AOF/RDB enabled; memory-firstFully persistent with confirmed publishes
Throughput (default config)~10k tasks/sec~20k tasks/sec
High AvailabilityRedis Sentinel or ClusterMirrored queues, quorum queues
Latency (p99)~1ms~2ms
Ease of setupVery simple, single binaryRequires config, management UI
Security (RCE risk)Pickle still possibleSame risk if pickle enabled
Result backend usageNative support, fastNot recommended; use separate backend

Key takeaways

1
Celery decouples task production from execution via a broker
workers scale independently.
2
Choose the right broker
Redis for simplicity, RabbitMQ for reliability.
3
Never use pickle in production
it's a remote code execution vector.
4
Use separate queues for different task types to isolate failures.
5
Retry only transient exceptions; implement dead letter queues for permanent failures.
6
Monitor queue depth, worker count, and task latency
not just logs.
7
Canvas workflows (chains, groups, chords) are powerful but depend on a healthy result backend.

Common mistakes to avoid

5 patterns
×

Using pickle serializer in production

Symptom
A security audit reveals that workers can execute arbitrary Python code. An SSRF vulnerability in a web endpoint allows attackers to inject malicious task payloads.
Fix
Set task_serializer = 'json' and accept_content = ['json'] in your Celery app config. Migrate existing tasks to JSON or msgpack; use custom encoders for complex types.
×

Not configuring a result backend but then calling .get() on results

Symptom
Task returns successfully, but AsyncResult.get() raises OperationalError or hangs forever.
Fix
Either configure a result backend (e.g., Redis, database) or don't call .get(). For fire-and-forget tasks, omit the result backend entirely.
×

Ignoring worker prefetch multiplier for unbalanced tasks

Symptom
Some workers sit idle while others have 10+ tasks queued. Overall throughput is lower than expected.
Fix
Set worker_prefetch_multiplier = 1 to distribute tasks more evenly, especially if tasks vary in duration. Monitor with celery inspect active.
×

Retrying on non-transient exceptions (e.g., invalid input data)

Symptom
A task fails 5 times with the same ValueError before exhausting retries. Logs show repeated identical failures.
Fix
Only use autoretry_for with transient exceptions (network timeouts, database disconnects). For validation errors, log and fail fast. Implement dead letter queue for persistent failures.
×

Not setting timeouts on chords, causing workflows to block forever

Symptom
A chord with 10 subtasks hangs because one subtask never completes. The callback never fires, and the entire pipeline stalls.
Fix
Set task_soft_time_limit and task_time_limit on the chord body task. Implement external timeout in your application code for the chord result. Use link_error to handle subtask failures.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
How does Celery guarantee at-least-once delivery, and what scenarios can...
Q02SENIOR
Explain how Celery routing works with AMQP and how you would set up sepa...
Q03SENIOR
What are the security implications of using pickle as a serializer in Ce...
Q04SENIOR
Describe the difference between a chain, a group, and a chord in Celery ...
Q05SENIOR
Your Celery tasks are piling up in the queue but workers show zero memor...
Q01 of 05SENIOR

How does Celery guarantee at-least-once delivery, and what scenarios can cause duplicate execution?

ANSWER
By default, Celery acknowledges a task after execution (acks_late=False). If the worker crashes after acknowledgement but before the result is stored, the task is lost. With acks_late=True, the task is acknowledged after execution, so a crash during execution causes the task to be redelivered to another worker. This creates duplicates. Celery never guarantees exactly-once delivery; you must implement idempotency in your tasks. Duplicates can also occur if the broker (e.g., Redis) redelivers due to timeout or if you manually re-queue a task.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What is Celery for Task Queues in Python in simple terms?
02
When should I use Celery vs just using Python threads or asyncio?
03
Can I use Celery with Django or FastAPI?
04
How do I monitor Celery in production?
05
How do I handle long-running tasks without blocking other tasks?
N
Naren Founder & Principal Engineer

20+ years shipping production Python across data and backend systems. Everything here is grounded in real deployments.

Follow
Verified
production tested
May 24, 2026
last updated
1,554
articles · all by Naren
🔥

That's Python Libraries. Mark it forged?

8 min read · try the examples if you haven't

Previous
FastAPI Basics
17 / 51 · Python Libraries
Next
Pytest Fixtures