Python Threading vs Multiprocessing: Race Condition Gotcha
Duplicate entries and HTTP 429 errors from concurrent list pops.
20+ years shipping production Python across data and backend systems. Drawn from code that ran under real load.
- Threading shares memory, but CPython's GIL lets only one thread execute Python bytecode at a time.
- Multiprocessing runs separate processes, each with its own GIL, enabling full CPU parallelism.
- Threads excel at I/O-bound tasks; processes excel at CPU-bound tasks.
- Sharing data between processes requires serialization (pickle) which adds overhead.
- Deadlocks and race conditions occur in both; consistent lock ordering prevents deadlocks, and synchronized access to shared state prevents races.
- Profile before optimizing: a wrong choice can slow your app by 10x.
Imagine you're running a restaurant. Threading is like having one chef who switches rapidly between cooking multiple dishes — they look busy simultaneously, but only one hand moves at a time. Multiprocessing is like hiring several completely separate chefs, each with their own kitchen, stove, and ingredients. The single chef (threading) works great for waiting on the oven timer or a delivery; the separate kitchens (multiprocessing) shine when every chef needs to chop vegetables at full speed simultaneously. Python's quirky rule — the GIL — is the reason one chef model exists at all.
Every Python developer eventually hits the wall: their code is slow, the CPU is barely breaking a sweat, and adding a loop makes it worse. At that moment, concurrency stops being a theory and becomes urgent. Threading and multiprocessing are Python's two primary answers to that problem, and choosing the wrong one doesn't just cost performance — it can introduce bugs that only appear in production at 3 AM under heavy load.
The core problem both tools solve is the same: doing more than one thing at a time. But the reason your choice matters so much is the Global Interpreter Lock — the GIL. CPython, the standard Python interpreter, uses a mutex that allows only one thread to execute Python bytecode at any given moment. This single design decision splits Python's concurrency world in two: threads that share memory but battle the GIL, and processes that sidestep the GIL entirely by running in separate interpreter instances at the cost of higher overhead and no shared memory by default.
By the end of this article you'll understand exactly when threads win, when processes win, how to safely share data between both, how to avoid the race conditions and deadlocks that bite even experienced engineers, and how to profile your choice to confirm it actually helps. We'll go deep into the CPython internals that explain the behaviour, not just the surface-level API.
Why Python Threading and Multiprocessing Are Not Interchangeable
Threading and multiprocessing are two strategies for achieving concurrency in Python, but they differ fundamentally in how they handle execution and memory. Threading runs multiple tasks within a single process, sharing the same memory space, while multiprocessing spawns separate processes, each with its own memory. The core mechanic is that threads are lightweight and share state, but the Global Interpreter Lock (GIL) prevents true parallel execution of Python bytecode across threads. Multiprocessing bypasses the GIL by using separate processes, enabling true parallelism on multi-core systems.
In practice, threading is best for I/O-bound tasks—like network requests or file reads—where threads spend most time waiting, not computing. Multiprocessing shines for CPU-bound tasks—like numerical simulations or image processing—where you need to saturate all cores. The key property to remember: threads share memory, so race conditions are a real risk; processes have isolated memory, requiring explicit communication (e.g., queues, pipes). Overhead matters: spawning a process is far heavier than starting a thread.
Use threading when your bottleneck is I/O latency (e.g., serving 10,000 concurrent HTTP requests) and multiprocessing when your bottleneck is CPU throughput (e.g., processing 1 million log entries per second). Choosing wrong can degrade performance: using threads for CPU-bound work adds GIL contention, which can make it slower than single-threaded execution. In production, this distinction is critical for building responsive, scalable systems.
The Global Interpreter Lock (GIL) — Why Threads Are Not Parallel
CPython's GIL ensures only one thread executes Python bytecode at any instant. This isn't a bug — it simplified CPython's memory management and made C extension modules easier to write. But it also means that CPU-bound Python threads do not run in parallel on multiple cores; they take turns. For I/O-bound tasks, threads are still useful because they release the GIL while waiting for I/O, allowing other threads to run.
Let's visualise: imagine a lock that each thread must hold before it can do any Python work. When a thread makes a blocking I/O call, such as reading from a socket, it releases the lock, and another thread can grab it. This is why threading works for web scraping, database queries, and file downloads. But if you're doing pure math in a loop, no I/O happens, so the thread gives up the lock only when the interpreter forces a switch, and you get no parallelism. The result is often slower than single-threaded code because of the lock handoffs and context switches.
Since Python 3.2, a running thread is asked to release the GIL after a fixed time interval (5 ms by default) or when it blocks on I/O; earlier versions switched every 100 bytecode "ticks". The interval is adjustable via sys.setswitchinterval(), but don't change it unless profiling shows a reason. Even with shorter intervals, CPU-bound threads still fight for the lock.
- Key = GIL: only one thread holds it at any moment.
- If the thread is waiting for I/O, it voluntarily releases the key.
- If the thread is computing (CPU-bound), it holds the key until its time slice ends.
- Multiple cores don't help because the key isn't divisible.
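To make the turn-taking concrete, here is a minimal sketch that times the same CPU-bound loop run twice in sequence and then in two threads. The function names and the loop size are illustrative assumptions, and the exact timings depend on your machine; on a standard GIL-enabled CPython build the threaded run is usually no faster.

```python
import sys
import threading
import time


def count_down(n):
    """Pure-Python CPU-bound loop: no I/O, so the GIL is only handed over on forced switches."""
    while n > 0:
        n -= 1
    return n


def timed(label, fn):
    """Run fn once and report wall-clock time."""
    start = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return elapsed


def run_sequential(n, repeats):
    for _ in range(repeats):
        count_down(n)


def run_threaded(n, repeats):
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(repeats)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    print("switch interval:", sys.getswitchinterval(), "seconds")
    timed("sequential", lambda: run_sequential(2_000_000, 2))
    timed("two threads", lambda: run_threaded(2_000_000, 2))
```

Compare the two printed timings on your own hardware rather than trusting a rule of thumb.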
Multiprocessing — True Parallelism but Higher Overhead
The multiprocessing module spawns separate Python processes, each with its own interpreter, memory space, and — crucially — its own GIL. This means you can actually use all CPU cores for parallel computation. But this freedom comes at a cost: creating a process is expensive (forking or spawning takes tens of milliseconds), and sharing data between processes requires serialization (pickling) which adds overhead and limits what can be shared.
Common patterns: Pool for map-reduce style parallelism, Process for long-running workers, and Queue or Pipe for inter-process communication. Shared memory via multiprocessing.Value or Array can avoid serialization but only works for primitive types. Manager objects allow sharing Python objects across processes but are slower due to a server process mediating access.
When you call Pool.map(), the data is split into chunks, each chunk is pickled, sent over a pipe to a worker process, unpickled, computed, re-pickled, and sent back. This overhead can dominate if each task is tiny. Use the chunksize parameter to batch multiple tasks per call, reducing IPC overhead.
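The chunking described above can be sketched as follows. The pick_chunksize helper is a hypothetical heuristic (a few chunks per worker), not part of the standard library, and the task is deliberately trivial so that IPC cost is visible.

```python
import math
import os
from multiprocessing import Pool


def square(x):
    """Tiny task: on its own, pickling and IPC would cost more than the work."""
    return x * x


def pick_chunksize(n_items, n_workers, chunks_per_worker=4):
    """Hypothetical heuristic: aim for a handful of chunks per worker."""
    return max(1, math.ceil(n_items / (n_workers * chunks_per_worker)))


if __name__ == "__main__":
    items = range(100_000)
    workers = os.cpu_count() or 1
    chunksize = pick_chunksize(len(items), workers)
    # Each worker receives `chunksize` items per message instead of one.
    with Pool(processes=workers) as pool:
        results = pool.map(square, items, chunksize=chunksize)
    print(len(results), results[:5], "chunksize =", chunksize)
```

Timing the same call with chunksize=1 is a quick way to see how much of the runtime is serialization rather than computation.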
- Pickling: use dill for advanced cases, but prefer simple data structures. If you get a PicklingError, move the function to the module level.
- Process count: os.cpu_count() or lower.
- Use the with Pool() as pool: context manager to ensure cleanup.
- Pool for many small tasks, Process for long-running workers.
- Pool with processes=os.cpu_count().
- Process and Queue for results.
- Pool.map with appropriate chunksize to reduce IPC overhead.
Sharing State Between Threads and Processes
Threads share everything: same address space, same Python objects. That's convenient but dangerous. Without proper synchronization, two threads can read and write the same variable in unpredictable ways — a race condition. Python's threading.Lock is the basic tool to protect critical sections. Use with lock: blocks around all access to shared mutable state.
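A minimal sketch of protecting shared mutable state with a lock. The Counter class and the hammer helper are illustrative names, not from the original code.

```python
import threading


class Counter:
    """Shared counter whose read-modify-write is guarded by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The with block is the critical section: one thread at a time.
        with self._lock:
            self.value += 1


def hammer(counter, n_threads, n_increments):
    """Increment the counter from several threads and return the final value."""

    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(hammer(Counter(), n_threads=8, n_increments=10_000))
```

Because every increment happens inside the lock, the final value always equals n_threads * n_increments.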
Processes do not share memory by default. To share data, you must use explicit IPC mechanisms:
- multiprocessing.Queue: thread- and process-safe FIFO, great for producer-consumer.
- multiprocessing.Pipe: faster but only for two endpoints.
- multiprocessing.Value / multiprocessing.Array: raw shared memory for C types (ctypes). Requires locking on writes.
- multiprocessing.Manager: creates a server process that proxies Python objects, easier but much slower.
Each approach has trade-offs between speed and flexibility. Default to Queue unless you have a strong reason not to. Manager objects are convenient but add 2-5x latency per access because every attribute access crosses a pipe.
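As a sketch of the default recommendation, the producer-consumer pattern with multiprocessing.Queue might look like this. The worker function, the sentinel convention, and the squaring task are assumptions made for illustration.

```python
import multiprocessing as mp

SENTINEL = None  # tells a worker there is no more work


def worker(tasks, results):
    """Consume numbers until the sentinel arrives; works with any queue exposing get()/put()."""
    while True:
        item = tasks.get()
        if item is SENTINEL:
            break
        results.put((item, item * item))


if __name__ == "__main__":
    tasks, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(tasks, results)) for _ in range(2)]
    for p in procs:
        p.start()
    for n in range(10):
        tasks.put(n)
    for _ in procs:
        tasks.put(SENTINEL)  # one sentinel per worker
    # Drain results before join(): a child cannot exit while its queue buffer is unread.
    squares = dict(results.get() for _ in range(10))
    for p in procs:
        p.join()
    print(squares)
```

Every item crosses the process boundary as a pickled message, which is why the worker only ever touches what it receives from the queue.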
For thread safety beyond locks, consider threading.local() data or immutable data structures. Avoid relying on the GIL to protect shared state: it doesn't protect against context switches between bytecode instructions.
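Two small sketches in that spirit: per-thread storage with threading.local(), and a lock acquisition that gives up after a timeout instead of hanging. The helper names are hypothetical; note that Lock.acquire(timeout=...) reports failure by returning False.

```python
import threading

per_thread = threading.local()  # each thread sees its own attributes
lock = threading.Lock()


def thread_local_demo(n_threads):
    """Each thread stores its id in thread-local storage and reads it back."""
    results = {}
    barrier = threading.Barrier(n_threads)

    def work(i):
        per_thread.value = i
        barrier.wait()  # every thread has written before anyone reads
        results[i] = per_thread.value  # still this thread's own value

    threads = [threading.Thread(target=work, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def try_update(timeout=0.1):
    """Return False instead of blocking forever when the lock is stuck."""
    if not lock.acquire(timeout=timeout):
        return False
    try:
        return True  # the critical section would go here
    finally:
        lock.release()


if __name__ == "__main__":
    print(thread_local_demo(3))
    print(try_update())
```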
- Pass a timeout to acquire() to detect a suspected deadlock. Example: lock.acquire(timeout=5) returns False (it does not raise) if the lock is not acquired in time.
- threading.Lock around all access.
Choosing Between Threading, Multiprocessing, and asyncio
Python offers three main concurrency tools: threading, multiprocessing, and asyncio. The right choice depends on the nature of your workload.
- Threading: best for I/O-bound tasks with many concurrent operations that spend most of their time waiting (e.g., web scraping, database queries). Threads are lightweight and share memory, making coordination simple if done correctly.
- Multiprocessing: best for CPU-bound tasks where you need to leverage multiple cores. Each process runs independently, so you avoid the GIL. Overhead is higher, and inter-process communication is slower.
- asyncio: best for I/O-bound tasks with a single thread, using cooperative multitasking via an event loop. It avoids the overhead of thread switching and most races on shared state, because tasks switch only at await points, but you must use async-friendly libraries and cannot block the event loop.
In practice, many senior developers mix these: use asyncio for network I/O, and farm out CPU-heavy work to a multiprocessing pool (using loop.run_in_executor). This gives you the scalability of async I/O with the parallelism of processes.
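A minimal sketch of that hybrid, assuming a hypothetical cpu_heavy function as the CPU-bound step. The coroutine takes the executor as a parameter so the same code runs against a process pool or a thread pool.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def cpu_heavy(n):
    """Stand-in for CPU-bound work; must be module-level so it can be pickled."""
    return sum(i * i for i in range(n))


async def compute_all(executor, inputs):
    """Offload each input to the executor without blocking the event loop."""
    loop = asyncio.get_running_loop()
    jobs = [loop.run_in_executor(executor, cpu_heavy, n) for n in inputs]
    return await asyncio.gather(*jobs)


if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        print(asyncio.run(compute_all(pool, [10_000, 20_000, 30_000])))
```

While the pool works, the event loop stays free to service network I/O; gather returns the results in input order.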
One more nuance: if your I/O-bound task involves many concurrent connections (thousands), asyncio scales better than threading because each thread reserves its own stack (commonly up to 8 MB of virtual address space on Linux), while an asyncio task costs on the order of a few kilobytes. For 10,000 connections, asyncio is the clear winner.
run_in_executor with multiprocessing pool
I/O-bound vs CPU-bound: Quick Decision Table
Before choosing a concurrency model, you must classify your workload. Two broad categories exist: I/O-bound tasks spend most of their time waiting for external resources (network, disk, user input), while CPU-bound tasks spend most of their time computing. Python's GIL punishes CPU-bound work when using threads, but I/O-bound work benefits from threads because the GIL is released during waits.
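One rough way to classify a workload is to compare CPU time with wall-clock time for the same call. The sketch below is an assumption-laden heuristic: the 0.5 threshold and the function names are made up for illustration, and time.process_time() counts CPU time for the whole process, so measure in isolation.

```python
import time


def cpu_fraction(fn):
    """Fraction of wall-clock time the process spent on the CPU while running fn."""
    wall_start, cpu_start = time.perf_counter(), time.process_time()
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return cpu / wall if wall > 0 else 0.0


def classify(fn, threshold=0.5):
    """Hypothetical rule of thumb: mostly computing means CPU-bound."""
    return "cpu-bound" if cpu_fraction(fn) >= threshold else "io-bound"


def fake_io():
    time.sleep(0.2)  # stands in for a network or disk wait


def busy():
    sum(i * i for i in range(1_000_000))  # pure computation


if __name__ == "__main__":
    print("fake_io:", classify(fake_io))
    print("busy:", classify(busy))
```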
Use this decision table to match your workload to the right concurrency tool. The table shows typical scenarios and recommended approaches based on real-world performance characteristics.
Time it: put time.perf_counter() calls around the I/O and compute sections.
Using concurrent.futures for a Unified Interface
Python's concurrent.futures module provides a high-level abstraction for running tasks asynchronously using thread or process pools. It exposes the same API for both: ThreadPoolExecutor and ProcessPoolExecutor. This unified interface lets you switch between threading and multiprocessing with minimal code changes — often just the class name.
- Executor.submit(fn, *args, **kwargs): returns a Future that represents the pending result.
- Executor.map(fn, *iterables, timeout=None): returns an iterator of results, preserving order.
- Future.result(timeout=None): blocks until the result is available.
- as_completed(futures): yields Futures as they complete (order not guaranteed).
Using concurrent.futures is idiomatic Python. It handles worker lifecycle, task distribution, and result collection. It also supports callbacks via future.add_done_callback().
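A short sketch of the API in use. word_count and count_all are illustrative names; the executor class is passed in as a parameter to show that switching between threads and processes is a one-argument change.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def word_count(text):
    """Module-level function so it also works with ProcessPoolExecutor."""
    return len(text.split())


def count_all(executor_cls, texts, max_workers=4):
    """Run word_count over texts via submit/as_completed, map, and a done callback."""
    completed = []
    with executor_cls(max_workers=max_workers) as executor:
        # submit() returns a Future per task; remember which text it belongs to.
        futures = {executor.submit(word_count, t): t for t in texts}
        for future in futures:
            future.add_done_callback(lambda f: completed.append(f.result()))
        # as_completed() yields futures in finish order, not submit order.
        unordered = {futures[f]: f.result() for f in as_completed(futures)}
        # map() yields results in input order.
        ordered = list(executor.map(word_count, texts))
    return unordered, ordered, completed


if __name__ == "__main__":
    print(count_all(ThreadPoolExecutor, ["a b", "c", "d e f"]))
```

Passing ProcessPoolExecutor instead of ThreadPoolExecutor runs the same tasks in worker processes, provided the call happens under an if __name__ == "__main__": guard.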
- Use the with statement for automatic cleanup. For long-running services, create the executor once and reuse it. Use max_workers based on your hardware and workload: for I/O, set high (e.g., 10-20); for CPU, set to os.cpu_count(). Avoid submitting millions of tasks; batch them.
- The earlier version used threading.Thread and a shared list. Replacing it with ThreadPoolExecutor eliminated the race condition (the executor passes tasks via an internal queue) and cut code by 60%. The unified interface also made it trivial to switch to processes when we later added a CPU-heavy parsing step.
- concurrent.futures provides a clean, unified API for both threading and multiprocessing.
Visualizing Worker Pools: How Tasks Are Distributed
In practice, the pool dispatcher (the executor) manages a fixed-size pool of workers. When you call submit(), the task is placed in an internal queue. As soon as a worker becomes idle, it grabs the next task from the queue. With map(), the entire iterable is chunked and distributed. The diagram above abstracts the key components: tasks enter a queue, the pool dispatcher routes them to workers, and results are gathered in order.
For ProcessPoolExecutor, the queue is a multiprocessing.Queue (based on pipes) and each worker is a separate process. For ThreadPoolExecutor, the queue is a queue.Queue (thread-safe) and workers are threads. The dispatcher logic is essentially the same — only the backend differs.
Understanding this flow helps debug two common pitfalls:
1. Starvation: If all workers are blocked on a long task, no new tasks can run. Ensure tasks are reasonably sized or use chunksize for small tasks.
2. Queue overload: If tasks arrive faster than workers can process them, the queue grows without bound and memory consumption spikes. The concurrent.futures executors do not bound their internal queues, so cap the number of in-flight tasks yourself and monitor the backlog.
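A sketch of limiting in-flight tasks with a semaphore, so a fast producer cannot build an unbounded backlog. BoundedSubmitter is a hypothetical wrapper, not a standard-library class.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BoundedSubmitter:
    """Wraps an executor so at most max_in_flight tasks are queued or running."""

    def __init__(self, executor, max_in_flight):
        self._executor = executor
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def submit(self, fn, *args, **kwargs):
        self._slots.acquire()  # blocks the producer once the limit is reached
        try:
            future = self._executor.submit(fn, *args, **kwargs)
        except BaseException:
            self._slots.release()  # submission failed, give the slot back
            raise
        # Free the slot when the task finishes, whether it succeeded or raised.
        future.add_done_callback(lambda _f: self._slots.release())
        return future


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=2) as pool:
        bounded = BoundedSubmitter(pool, max_in_flight=3)
        futures = [bounded.submit(pow, n, 2) for n in range(10)]
        print([f.result() for f in futures])
```

The producer slows to the pace of the workers instead of filling memory, which is the throttling trade-off to keep in mind.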
- executor.submit() does not block in concurrent.futures. If you add your own bound, the producer blocks until space is available. This prevents memory blow-up but can throttle the producer. For high-throughput systems, consider using a Semaphore to limit in-flight tasks.
- Set chunksize so each worker picks up multiple tasks per queue read, reducing lock contention. Monitoring with strace showed fewer read syscalls after the change.
Performance Profiling and Debugging Concurrency Issues
Never assume your concurrency choice makes things faster. Always profile before and after. Python's built-in cProfile works with multithreaded programs, but it only profiles the thread it was started in (usually the main thread). For multiprocessing, profile each child process separately. To see what each thread is doing, log threading.current_thread().name.
- Too many threads/processes: context switching or memory exhaustion.
- Chunksizes too small in Pool.map: IPC overhead dominates.
- Locks held too long: reduce scope of critical sections.
- Pickling overhead for large data: consider shared memory or array-based solutions.
Debugging deadlocks or hangs: run with python -u to disable output buffering, and call faulthandler.register(signal.SIGQUIT) at startup so that sending SIGQUIT (Ctrl+\) dumps a traceback of all threads; without a handler, SIGQUIT simply terminates the process. For processes, use ps or strace to see where they're blocked. Call faulthandler.enable() in your code to dump threads on a crash.
One production technique: wrap each thread's main loop in a try/except that logs the thread name and exception. This helps identify which thread is failing without a full core dump.
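A minimal sketch of that wrapper. The guarded helper and the failures list are illustrative; in a real service the log call alone may be enough.

```python
import logging
import threading


def guarded(fn, failures):
    """Wrap a thread target so any exception is logged with the thread's name."""

    def wrapper(*args, **kwargs):
        try:
            fn(*args, **kwargs)
        except Exception as exc:
            name = threading.current_thread().name
            logging.exception("thread %s failed", name)  # includes the traceback
            failures.append((name, exc))

    return wrapper


def flaky():
    raise ValueError("boom")


if __name__ == "__main__":
    failures = []
    t = threading.Thread(target=guarded(flaky, failures), name="worker-1")
    t.start()
    t.join()
    print(failures)
```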
Use len(pickle.dumps(obj)) to estimate pickle size before deciding; sys.getsizeof reports only the in-memory size of the top-level object.
Contexts and Start Methods: Why Your Multiprocessing Code Crashes on macOS but Works on Linux
You wrote a multiprocessing pipeline. Tests pass on your Ubuntu dev box. Deploy to a Mac or FreeBSD server and suddenly child processes hang or deadlock. That's because you ignored start methods.
Python's multiprocessing has three ways to spawn processes: fork, spawn, and forkserver. Fork copies the parent process memory as-is — fast but dangerous. Lock objects get duplicated in an unpredictable state. Spawn starts a fresh Python interpreter, safe but slower. forkserver is a hybrid: it forks from a clean process, not the main one.
On Linux, fork has long been the default (Python 3.14 changes the POSIX default to forkserver). On macOS, spawn is the default since Python 3.8 because fork is unreliable there, and Windows supports only spawn. If you rely on state inherited through fork, your code breaks cross-platform. The fix: explicitly set a start method early in your __main__ block using multiprocessing.set_start_method('spawn'). Then test your shared objects (locks, queues, events) under spawn semantics.
Production Trap: never call multiprocessing.set_start_method() at import time in a library. It can be called only once per program, so it locks in a start method before the user can choose. Inside a library, use multiprocessing.get_context() instead.
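A sketch of pinning the start method with a context object, which leaves the global default untouched. The run_with helper is a hypothetical name.

```python
import multiprocessing as mp


def square(x):
    """Must live at module level: spawn re-imports the module in each child."""
    return x * x


def run_with(method, values):
    """Run the pool under an explicit start method without changing the global default."""
    ctx = mp.get_context(method)
    with ctx.Pool(processes=2) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    print("available:", mp.get_all_start_methods())
    print(run_with("spawn", [1, 2, 3]))
```

Running the test suite under "spawn" on every platform is a cheap way to catch code that silently depends on fork.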
Pipes and Queues: Don't Share Memory, Share Pickled Messages
Newcomers treat multiprocessing like threading on steroids. They try to share a dict or list between processes using a global variable. It works in dev. In production, they get stale reads, crashes, or silent corruption. Here's the rule: processes don't share memory. Python shares serialized copies via pipes and queues.
multiprocessing.Queue is a thread-safe, process-safe FIFO built on a pipe and locks. Use it to send work items from a producer to worker processes, or results back. multiprocessing.Pipe is lower-level — a duplex or simplex channel between two endpoints. Use Pipe when you have exactly two processes; Queue when you have N workers.
Both Queue and Pipe pickle every object you send. That means: 1) your objects must be picklable (no lambdas, no class instances with unpicklable attributes). 2) Serialization overhead matters: sending 10MB of data through a Queue is slower than writing to a shared file. 3) Queued items are consumed once. If you need broadcast, use multiprocessing.Manager or a pub-sub pattern.
The takeaway: pretend shared memory doesn't exist. Design your process boundaries as message-passing interfaces. Test with large payloads early.
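Two small helpers, offered as a sketch, for checking a payload before it goes through a Queue or Pipe: whether it pickles at all, and how many bytes it becomes. The helper names are assumptions.

```python
import pickle
import threading


def is_picklable(obj):
    """True if obj can be serialized for a Queue or Pipe."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


def pickled_size(obj):
    """Bytes that would actually cross the process boundary."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


if __name__ == "__main__":
    print(is_picklable({"url": "https://example.com"}))
    print(is_picklable(lambda x: x))        # lambdas cannot be pickled by name
    print(is_picklable(threading.Lock()))   # neither can OS-level locks
    print(pickled_size(list(range(1_000))))
```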
Barrier: The Synchronization Primitive Your Boss Expects You to Know
You don't want your workers running past each other's finish lines. That's what a Barrier does — it makes N threads or processes wait until all N have arrived before any proceed. Think of it as a mandatory code review gate before merging a PR.
Why this matters in production: you're distributing a dataset across 4 workers that each need to initialize some shared resource (open a DB connection, load a model, warm a cache). Without a Barrier, one worker finishes initialization and immediately starts processing data that isn't ready yet. With a Barrier, everyone waits — then proceeds simultaneously.
Barriers are your go-to when you need to split a multi-phase problem: train on chunks, barrier, evaluate on chunks, barrier, aggregate. They cost almost nothing. Use them. Don't spin up a custom busy-wait loop.
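A minimal sketch of a two-phase job gated by a barrier, using threads for brevity (multiprocessing.Barrier has the same shape). The phase names and run_phases helper are illustrative.

```python
import threading


def run_phases(n_workers):
    """Every worker finishes 'init' before any worker starts 'work'."""
    barrier = threading.Barrier(n_workers, timeout=30)
    events = []
    events_lock = threading.Lock()

    def worker(worker_id):
        with events_lock:
            events.append(("init", worker_id))
        barrier.wait()  # raises BrokenBarrierError if a peer never arrives in time
        with events_lock:
            events.append(("work", worker_id))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


if __name__ == "__main__":
    for phase, worker_id in run_phases(4):
        print(phase, worker_id)
```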
wait() never comes. Your entire pool hangs forever. Always count your workers or use a timeout: Barrier(4, timeout=30).
Why This Isn't a Silly Example: Real Damage from Ignoring Synchronization
Every junior engineer has copy-pasted a threading example that uses a simple counter to demonstrate a race condition. And then they think "I fixed it with a lock." They're wrong — not about the lock, but about the problem space.
The real damage isn't a counter going off by one. It's a payment system processing the same transaction twice. It's a logging pipeline writing garbled JSON. It's a data loader corrupting a shared cache because two threads read and write to the same dict without synchronization. Those examples aren't "silly" — they're the exact bugs that get paged at 2 AM.
We do not teach race conditions with counters because we're lazy. We teach them because the mechanism is identical to the one that sinks a production database write. Fix the small dumb counter bug in training, and you won't tear your hair out fixing the real one on Friday night.
Similarities: Threading and Multiprocessing Share More Than You Think
Both threading and multiprocessing run tasks concurrently. Both create workers — threads or processes — that execute a target function. Both support daemon workers that die when the main program exits. Both provide Lock, Semaphore, Event, and Barrier for synchronization. Both spawn workers via similar APIs: Thread(target=fn) vs Process(target=fn). Both can use concurrent.futures executors that hide the underlying worker pool. Both suffer from race conditions when shared state is mutated without locks — the GIL does not protect you from inconsistent data, it only prevents parallel bytecode execution. Both require careful cleanup: unjoined daemon threads or orphaned processes cause resource leaks. The critical similarity is that neither gives you free lunch — you must reason about concurrent access, deadlocks, and starvation regardless of whether you pick threads or processes.
Differences: Processes Isolate Memory, Threads Share It
The root difference: threads share the same memory space; each process gets its own address space. This means threads can access the same variable directly (with locks needed for safety), while processes must use IPC (pipes, queues, shared memory) to communicate. Threads are lightweight and cheap to create; each process has high overhead (fork or spawn, a separate Python interpreter, its own GIL). Threads are limited by the GIL: only one thread executes Python bytecode at a time; processes bypass the GIL, achieving true CPU parallelism. A hard crash in one thread (for example, a segfault in a C extension) takes down the entire process, whereas a crashed worker process leaves its parent and siblings running. On macOS, multiprocessing defaults to 'spawn' (slower, safer) while Linux has traditionally used 'fork' (fast, but dangerous with locks). Threading scales for I/O-bound work (network, disk); multiprocessing scales for CPU-bound work (math, video encoding). Context switching between threads is cheaper than between processes, which also pay for switching address spaces.
Race Condition in Threaded Web Scraper at Peak Traffic
Replace the shared list with a thread-safe queue (queue.Queue, or multiprocessing.Queue if processes are involved), or use a threading.Lock around all queue accesses. Implemented a Lock that wraps every pop and append, reducing race window to zero.
- Never assume shared mutable state is safe in Python threads.
- Use thread-safe data structures like Queue, or serialize access with locks.
- Profile with and without locks to ensure you're not over-locking and killing performance.
- Always validate idempotency: if a URL is processed twice, downstream systems must handle deduplication.
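The lessons above can be sketched as a worker pool that pulls URLs from a queue.Queue and deduplicates under a lock. The crawl function and the fetch callable are hypothetical stand-ins for the real scraper, which is not shown in the source.

```python
import queue
import threading


def crawl(urls, fetch, n_workers=8):
    """Process each distinct URL exactly once using a thread-safe queue."""
    todo = queue.Queue()
    for url in urls:
        todo.put(url)

    processed = []
    seen = set()
    state_lock = threading.Lock()

    def worker():
        while True:
            try:
                url = todo.get_nowait()  # atomic: two threads never get the same item
            except queue.Empty:
                return
            with state_lock:
                if url in seen:
                    continue  # duplicate entry in the input, skip it
                seen.add(url)
            result = fetch(url)  # slow I/O happens outside the lock
            with state_lock:
                processed.append((url, result))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed


if __name__ == "__main__":
    pages = [f"https://example.com/{i}" for i in range(5)] * 2
    print(crawl(pages, fetch=len, n_workers=4))
```

Keeping the fetch outside the lock means the lock guards only the bookkeeping, so it does not serialize the network calls.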
- Switch to multiprocessing.Pool and measure speedup. Profile with cProfile to confirm GIL contention.
- Put a threading.Lock around all reads/writes. Use queue.Queue for producer-consumer.
- Run python -u script.py, then send SIGQUIT (Ctrl+\) to get thread/process dumps. Check lock ordering. Use timeout on lock acquisitions.
- Use with Pool() as pool: as context manager. Limit number of processes to CPU count. Inspect zombie processes with ps aux | grep defunct.
- dill for complex cases.
- python -c "import threading; print(threading.enumerate())"
- gdb python <PID>, then (gdb) bt full
- Add lock timeouts (Lock.acquire(timeout=5)). Review lock ordering.
Key takeaways
Common mistakes to avoid
Memorising syntax before understanding the concept
Skipping practice and only reading theory
Using threads for CPU-bound tasks expecting parallel speedup
Not protecting shared mutable state with locks in threads
Creating too many processes (e.g., 1000 processes on 4-core machine)
Fix: cap the pool at os.cpu_count() using multiprocessing.Pool(processes=os.cpu_count())
Forgetting to call `join()` on multiprocessing.Process
Assuming `multiprocessing.Queue` is as fast as `queue.Queue`
Fix: use multiprocessing.Pipe for simple two-way communication, or shared memory for primitive types.
Interview Questions on This Topic
Explain how the GIL affects Python threading. When would you still choose threads over processes?