Python GIL — CPU Below 15% on 16 Cores
CPU utilization below 15% on a 16-core machine with 20 threads.
20+ years shipping production Python across data and backend systems. Lessons pulled from things that broke in production.
- The GIL is a mutex that prevents multiple native threads from executing Python bytecode at once.
- It protects CPython's reference counting from race conditions: without it, one thread could drop a refcount to zero and free an object while another thread is still using it.
- CPU-bound threads are serialized by the GIL — you get zero parallelism no matter how many cores you have.
- I/O-bound threads still benefit because the GIL is released during blocking I/O calls.
- The GIL is not a language feature — it's specific to CPython. Jython and IronPython don't have it.
- Python 3.13 introduces an experimental no-GIL build (free-threaded) — but it's not production-ready yet.
Imagine a single microphone at a conference with 10 speakers. Every speaker wants to talk, but only one can hold the mic at a time — even if two of them could theoretically talk about completely different topics simultaneously. That microphone is Python's GIL. Your CPU might have 8 cores (8 potential simultaneous conversations), but the GIL forces every Python thread to queue up and take turns at that one mic, one at a time. The crowd (your CPU) sits mostly idle while speakers wait their turn.
If you've ever spun up a Python web scraper with 20 threads expecting a 20x speedup and instead got a 1.2x improvement, you've met the GIL — and you probably didn't know it. The Global Interpreter Lock is one of the most misunderstood performance constraints in any mainstream programming language. It's not a bug. It's not laziness. It's a deliberate architectural decision made in 1991 that solved a genuinely hard problem — and whose consequences we're still navigating in 2024.
CPython, the reference Python interpreter, manages memory using reference counting. Every Python object tracks how many references point to it, and when that count hits zero, the object gets deallocated. Reference counting is fast and simple, but it's also dangerously thread-unsafe. Without protection, two threads could simultaneously decrement the same reference count, race each other to zero, and cause a double-free — a memory corruption bug that would make your program crash in ways that are nearly impossible to debug. The GIL is the lock that prevents exactly this class of disaster. One lock to rule them all: only the thread holding the GIL can execute Python bytecode.
By the end of this article you'll understand exactly what the GIL protects and why, how to measure its impact on real code, when threading is still useful despite the GIL, when to reach for multiprocessing or asyncio instead, and — critically — how Python 3.13's experimental no-GIL build changes the picture. You'll walk away able to make informed concurrency decisions in production Python code and answer GIL questions in a senior engineering interview with confidence.
What is GIL — Global Interpreter Lock?
The Global Interpreter Lock (GIL) is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecode simultaneously. It's the reason Python threads don't give you parallelism for CPU-bound tasks — and the reason your multi-threaded web server can still handle concurrent requests without corrupting memory.
The GIL exists primarily to make CPython's memory management simple and fast. Without it, reference counting would require fine-grained locking on every object operation, which would be both slower and far more error-prone. The GIL is a pragmatic trade-off: it sacrifices parallel CPU throughput for simplicity, speed in single-threaded code, and safety in C extensions.
Why the GIL Exists: Reference Counting and Thread Safety
CPython's memory management is based on reference counting: every Python object has an ob_refcnt field that tracks how many references point to it. When a reference is created, ob_refcnt is incremented; when destroyed, decremented. When it hits zero, the object is deallocated immediately.
This is fast, but it's not thread-safe, because a refcount update is a read-modify-write rather than a single atomic step. Imagine two threads that each hold a reference to the same object (refcount 2) and release it at the same time. Both read 2 and both write back 1: one decrement is lost, the count never reaches zero, and the object is never freed (memory leak). The same lost update on an increment is worse: the count ends up one too low, the object is freed while a thread still holds a reference, and the next access is a use-after-free crash.
The GIL prevents all of this by ensuring only one thread modifies any reference count at any moment. It's a coarse-grained lock — one lock for the entire interpreter — but it's simple and it works.
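To make the reference-counting mechanics above concrete, here is a minimal sketch that watches a count move from Python using sys.getrefcount(). The names obj and holder are illustrative, and the absolute numbers differ between CPython versions, so only the differences are compared.

```python
import sys

obj = object()                       # a fresh object that nothing else refers to
baseline = sys.getrefcount(obj)      # absolute value depends on the CPython version

holder = [obj]                       # the list now stores one more reference to obj
with_holder = sys.getrefcount(obj)

del holder                           # destroying the list releases that reference
after_del = sys.getrefcount(obj)

print("extra references while the list exists:", with_holder - baseline)
print("extra references after the list is gone:", after_del - baseline)
```

Every one of these increments and decrements is the kind of update the GIL serializes in the default build.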
Alternative approaches exist: fine-grained locking per object (complex, overhead), atomic operations (limited), or garbage collection without reference counting (like PyPy or Jython). CPython chose the GIL, and it's been the default for 30+ years.
How the GIL Affects CPU-bound vs I/O-bound Tasks
This is the most practical distinction to understand. The GIL only protects Python bytecode execution. When a thread is waiting for I/O (disk, network, socket), it releases the GIL so another thread can run. That's why multi-threaded web servers and file readers work fine: the GIL is released during sleep(), recv(), send(), read(), write(), etc.
For CPU-bound tasks — number crunching, parsing, encryption — the thread never yields the GIL voluntarily. It runs until its bytecode slice expires (every 100 interpreter ticks in Python 2, every ~5ms in Python 3 via sys.setswitchinterval). Other threads must wait. If you have 8 CPU-bound threads on a 4-core machine, only one runs at a time — you get effectively single-core performance.
This is not a problem in many real-world Python workloads because the hot loops are often in C extensions (numpy, pandas, lxml) that release the GIL during computation. But pure Python CPU loops will be serialized.
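The I/O-bound half of this distinction can be checked with a small sketch. It assumes time.sleep() is a fair stand-in for a blocking network or disk call (both release the GIL while waiting); fake_io and run_threads are hypothetical helper names.

```python
import threading
import time

def fake_io(delay):
    # time.sleep() releases the GIL while it waits, like a blocking socket read
    time.sleep(delay)

def run_threads(num_threads, delay):
    """Start num_threads sleeping threads and return the total wall-clock seconds."""
    threads = [threading.Thread(target=fake_io, args=(delay,))
               for _ in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    elapsed = run_threads(4, 0.2)
    print(f"4 threads x 0.2s of waiting took {elapsed:.2f}s")
```

If the waits overlapped, the total stays close to one delay rather than four, which is the concurrency the GIL still allows.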
GIL Lock/Release Flow Sequence Diagram
Understanding exactly when the GIL is acquired and released helps you predict whether threading will benefit your workload. The sequence diagram below shows two threads competing for the GIL: one performing a CPU-bound calculation and the other performing an I/O-bound operation (e.g., a network read). The CPU-bound thread holds the GIL continuously, while the I/O-bound thread releases it during the blocking call, allowing the other thread to run.
Use sys.setswitchinterval() to tune the switch interval, but lowering it increases context-switch overhead. For I/O-heavy services, threading scales well because the GIL is released during waits.
GIL Impact on I/O vs CPU Bound Tasks
The following table summarizes how the GIL affects each type of workload, what speedup you can expect from threading, and the recommended approach.
| Aspect | CPU-bound Task | I/O-bound Task |
|---|---|---|
| GIL effect | Held continuously → serial execution | Released during blocking calls → concurrency |
| Threading speedup | ~1x (no parallel gain) | Nearly linear up to thread count |
| CPU utilization | Only one core active | May use multiple cores when GIL is released |
| Example | Parsing HTML, mathematical loops, encryption | Reading files, making HTTP requests, waiting for DB queries |
| Python threading recommendation | Avoid — use multiprocessing | Good — works well |
| Alternative | multiprocessing or asyncio + subprocess | asyncio for high concurrency |
The critical insight: threading in Python is not universally useless. It's excellent for I/O-bound programs (web servers, scrapers, file watchers) where the GIL is released frequently. It's useless for CPU-bound pure Python loops.
Measuring GIL Contention in Practice
Before optimizing around the GIL, you must measure it. Blindly switching to multiprocessing can add copy overhead (pickle serialization) that kills performance for certain workloads.
Tools:
- perf top -p <pid> shows where CPU time is spent. A high percentage in _PyEval_EvalFrameDefault means the time is going to Python bytecode, which is exactly the work the GIL serializes.
- /proc/<pid>/status shows voluntary_ctxt_switches; high values indicate thread contention.
- strace -e trace=futex -p <pid> shows futex calls; GIL acquisition triggers FUTEX_WAIT when the lock is held by another thread.
- py-spy (a sampling profiler) can show the call stack of all threads and highlight GIL blocking.
- sys._current_frames() in a signal handler can dump all thread stacks; look for threads stuck in take_gil.
Native GIL detection: Python 3.2+ exposes sys.getswitchinterval() (default 5ms). You can lower it with sys.setswitchinterval() to make threads switch more often, but that increases overhead. Instead, estimate the number of GIL acquisitions per second using perf stat -e syscalls:sys_enter_futex.
Micro-benchmark pattern: Run a CPU-bound loop (pure Python) with 1 thread, then N threads. If time grows linearly with N, the GIL is fully serializing.
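A minimal sketch of that micro-benchmark follows. count_down and time_threads are hypothetical names, and the work size is arbitrary. Each thread does the same amount of work, so total time that grows roughly in proportion to the thread count points to full serialization.

```python
import threading
import time

def count_down(n):
    # Pure-Python busy loop: it never blocks, so it gives up the GIL only when forced to
    while n > 0:
        n -= 1

def time_threads(num_threads, n):
    """Run count_down(n) in num_threads threads; return wall-clock seconds."""
    threads = [threading.Thread(target=count_down, args=(n,))
               for _ in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    work = 5_000_000  # iterations per thread; adjust for your machine
    for num_threads in (1, 2, 4):
        seconds = time_threads(num_threads, work)
        print(f"{num_threads} thread(s): {seconds:.2f}s")
```

Run it on the interpreter you actually deploy, since the result depends on the build and the hardware.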
Use perf stat -e migrations to see threads moving across cores; GIL contention causes migrations.
Beating the GIL: Threading, Multiprocessing, asyncio
Three main strategies to work around (or avoid) the GIL:
Multiprocessing — The most common approach. Each Python process has its own GIL, so N processes give you nearly Nx speedup for CPU-bound work. Use concurrent.futures.ProcessPoolExecutor or multiprocessing.Pool. Downside: overhead of serializing data between processes via pickle. If you pass large data structures, that can dominate runtime.
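As a rough sketch of the multiprocessing approach, the example below fans a CPU-bound function out over worker processes with concurrent.futures.ProcessPoolExecutor. sum_of_squares and run_parallel are hypothetical names, and the worker count and chunksize are placeholder values to tune by profiling.

```python
from concurrent.futures import ProcessPoolExecutor

def sum_of_squares(n):
    # CPU-bound pure-Python work; each call runs in a worker process with its own GIL
    return sum(i * i for i in range(n))

def run_parallel(inputs, max_workers=4):
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # chunksize sends several inputs per pickle round-trip to cut IPC overhead
        return list(pool.map(sum_of_squares, inputs, chunksize=2))

if __name__ == "__main__":
    # The guard matters: with the spawn start method, workers re-import this module
    print(run_parallel([1_000_000] * 8))
```

Only small integers cross the process boundary here, which keeps pickle cost negligible. Passing large objects instead would shift the balance described above.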
asyncio — Cooperative multitasking with a single thread. No GIL contention because there's only one thread. Great for I/O-bound workloads that spend most time waiting. Use await for all I/O. Downside: all code must be async — can't easily integrate blocking calls.
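A small sketch of the asyncio model, assuming Python 3.9+ for asyncio.to_thread(). Here asyncio.sleep() stands in for an awaitable network call, and fetch and main are hypothetical names. The third task shows one way to fold a blocking call into async code by pushing it to a worker thread.

```python
import asyncio
import time

async def fetch(name, delay):
    # asyncio.sleep() stands in for an awaitable network request
    await asyncio.sleep(delay)
    return name

async def main():
    # gather() runs all three concurrently on the single event-loop thread
    return await asyncio.gather(
        fetch("a", 0.1),
        fetch("b", 0.1),
        # A blocking call is handed to a worker thread so it does not stall the loop
        asyncio.to_thread(time.sleep, 0.1),
    )

if __name__ == "__main__":
    print(asyncio.run(main()))
```

Results come back in the order the awaitables were passed to gather(), regardless of which finished first.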
C Extensions with nogil — Write performance-critical code in Cython or C and release the GIL explicitly. The with nogil: block in Cython runs without the GIL, giving true parallelism. Downside: complexity, C interop.
Which to pick?
- I/O-bound, many concurrent tasks → asyncio (single thread, no GIL fight)
- CPU-bound, pure Python → multiprocessing
- CPU-bound, mostly C extensions → threading may work (if ext releases GIL)
- Mixed workload → multiprocessing for CPU parts, thread pool for I/O parts
The choice also depends on overhead tolerance. For small tasks (millisecond computation), multiprocessing overhead (process spawn, pickle) often outweighs parallel speedup. Profile before committing.
- asyncio: One person makes coffee for many. Great when waiting for water to boil (I/O).
- Threading: Many people share one machine. Only one can use it at a time. Fast if they spend most time away from it (I/O-bound).
- Multiprocessing: Each person has their own machine. Expensive to set up, but they never wait.
- Cython nogil: Hire a barista who works alone — never uses the coffee machine (releases GIL).
Python 3.13 'Free-threading' (No-GIL) Status
PEP 703 introduced an experimental free-threaded build of CPython 3.13 that removes the GIL entirely. Instead of a single global lock, it uses per-object reference counting with atomic operations and deferred memory deallocation. This allows true multi-core parallelism for pure Python CPU-bound code without switching to multiprocessing.
How to enable: Build CPython with --disable-gil or use a pre-built free-threaded package (e.g., python3.13t on conda-forge). At runtime, sys._is_gil_enabled() returns False.
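A hedged sketch of how a script might check which kind of interpreter it is running on. gil_status is a hypothetical helper. It assumes that an interpreter without sys._is_gil_enabled() (anything older than 3.13) has the GIL on, and it reads the Py_GIL_DISABLED build variable through sysconfig.

```python
import sys
import sysconfig

def gil_status():
    """Return (free_threaded_build, gil_enabled_now) for the running interpreter."""
    # Py_GIL_DISABLED is 1 on free-threaded builds and 0 or None elsewhere
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() only exists on 3.13+; older versions always have a GIL
    checker = getattr(sys, "_is_gil_enabled", None)
    gil_enabled_now = checker() if checker is not None else True
    return free_threaded_build, gil_enabled_now

if __name__ == "__main__":
    build, enabled = gil_status()
    print(f"free-threaded build: {build}, GIL currently enabled: {enabled}")
```

The two values can differ: a free-threaded build may still report the GIL as enabled at runtime.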
Current limitations:
- Not production-ready: many C extensions assume the GIL protects them and will crash or corrupt data.
- Single-threaded overhead of 5–15% due to atomic operations.
- The Python C API has new requirements (e.g., PyThreadState_EnterTracing must be used correctly).
- Only a subset of popular packages are compatible (numpy, pandas, pyarrow).
When to test it: If you have CPU-bound pure Python code that cannot be moved to C or multiprocessing (e.g., dynamic code generation, complex business logic), try free-threaded Python in a staging environment. But do not deploy to production until Python 3.14 or later when the feature stabilizes.
The free-threaded build is a glimpse of Python's future — eventually the GIL will be optional by default, and you'll get parallelism for free.
Python 3.13: The No-GIL Build (Free-Threaded Python)
Python 3.13 introduced an experimental build configuration called "free-threaded" that removes the GIL entirely. This is the result of PEP 703 ("Making the Global Interpreter Lock Optional") and years of work to make CPython's memory management thread-safe without a global lock.
How it works: Instead of one lock for all objects, CPython now uses per-object reference counting with atomic operations, plus a deferred reference counting approach for object deallocation. The GIL is eliminated.
Current status (2026): It's still experimental. Activate with --disable-gil at build time. Not all C extensions are compatible — those that assume the GIL protects them will crash. Known working: numpy, pandas, pyarrow. Known incompatible: many Cython extensions, lxml, some database drivers.
Performance: For pure Python CPU-bound code, free-threaded Python can achieve near-linear scaling on multi-core machines. But single-threaded performance is slightly worse (5-15% overhead) due to atomic operations in reference counting.
Production readiness: Not yet. Unless you control every C extension in your stack, stay with the GIL-enabled build for now. But this is the future: Python will eventually make the GIL optional by default.
Why a Single Global Lock Instead of Per-Object Mutexes?
Take the same kind of race on a shared container: two threads calling list.append on one list at once. Any sane C developer would slap a per-object mutex on it and move on. Python didn't. Why?
Performance. Pure and simple. In the early 90s, when Guido van Rossum wrote CPython, computers had one core. Threading was for I/O concurrency, not CPU parallelism. Adding a mutex to every single object operation — every attribute access, every dict lookup, every list append — would have killed single-threaded performance dead. Each mutex acquire/release costs tens of nanoseconds. That adds up fast when you're doing millions of operations per second.
The GIL is one lock, held for the duration of a bytecode instruction or a short C call. No lock contention in single-threaded code. No cascading lock overhead on every object. It was a pragmatic trade-off: sacrifice multi-core parallelism (which didn't exist yet) for single-threaded speed (which mattered).
And it worked. CPython became the reference implementation, and the GIL baked itself into the language's DNA. By the time multi-core CPUs became standard, the GIL was a core assumption in every C extension, every internal data structure, every thread-unsafe optimization. Removing it would mean rebuilding the whole interpreter.
How Python 3.13 Finally Breaks the Curse (Without Breaking Your Code)
The No-GIL build in Python 3.13 is not a flag you flip. It's a completely separate build of CPython — --disable-gil — that ships alongside the regular GIL'd interpreter. You opt in per interpreter binary, not per script. This avoids a thousand C extensions suddenly catching fire.
The trick? They didn't remove the GIL and hope for the best. They added per-object locks — exactly what the original CPython skipped. But now, those locks are fine-grained: one lock per PyObject, not one lock per interpreter. The list.append race condition? Now it's protected by a per-list mutex, acquired only when the internal state actually changes.
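But here's the rub: every C extension ever written assumed the GIL protected it. numpy, pandas, scipy: they all grew up calling internal C APIs that mutate shared state without locking. So the free-threaded build does not trust extensions by default. An extension has to be rebuilt for the free-threaded ABI and declare that it is safe without the GIL (the Py_mod_gil module slot). If it does not, importing it makes the interpreter switch the GIL back on for the whole process and emit a warning. Result: unported extensions still run, but with zero parallelism gains. You only get the speedup if your code is pure Python or every extension you import has been updated for free-threaded mode.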
It's a bridge. You can compile your existing code with the No-GIL interpreter today, verify it doesn't crash, and then incrementally migrate hot paths to lock-free or per-object-locked patterns. No rewrite from scratch. That's the real engineering win.
Why fork() and the GIL Are a Toxic Combination
You're running a web server. You fork() to handle requests. Suddenly, your workers deadlock or crash. The root cause? The GIL doesn't protect you from POSIX fork() semantics.
When fork() executes, the child process inherits a copy of the parent's memory, including mutexes and locks. But the GIL is a mutex. If the parent held the GIL at the exact moment of fork, the child now has a locked GIL with no thread to unlock it. Any Python thread trying to acquire the GIL in the child process blocks forever. This is a classic deadlock that wastes hours of debugging.
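The fix is brutal and simple: make sure the interpreter's post-fork hook runs in the child. os.fork() does this for you; C code that calls fork() directly must call PyOS_AfterFork_Child() (Python 3.7+) itself before touching the interpreter again. That resets the GIL, but not every other lock a vanished thread may have been holding, so the better option is multiprocessing with the spawn start method. spawn is already the default on macOS and Windows; on Linux, where fork has long been the default, request it explicitly. For production Python, never assume fork()+threads works. It doesn't. Check your process-start method, or you'll be measuring a production outage.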
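A minimal sketch of requesting spawn explicitly. It uses multiprocessing.get_context("spawn"), which scopes the choice to one pool instead of changing the process-wide default. square and run_with_spawn are hypothetical names.

```python
import multiprocessing as mp

def square(x):
    # Must live at module level so spawned workers can import it by name
    return x * x

def run_with_spawn(values):
    # A spawn context starts fresh interpreters instead of fork()ing this one,
    # so no lock state is inherited from threads that do not exist in the child
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        return pool.map(square, values)

if __name__ == "__main__":
    # Required with spawn: each worker re-imports this module on startup
    print(run_with_spawn([1, 2, 3]))
```

The trade-off is startup cost: each spawned worker boots a new interpreter and re-imports the module, which is slower than fork.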
If you must fork(), call 'threading._after_fork()' immediately in the child. But better: use 'multiprocessing.set_start_method("spawn")' to avoid the entire class of bugs.
Never fork() a multi-threaded Python process without reinitializing the GIL. Use spawn-based multiprocessing.
Mastering the Legacy API: sys.setswitchinterval() for GIL Control
Most devs treat the GIL as a black box. But Python exposes a legacy API that directly controls how often the GIL switches threads: sys.setswitchinterval(). This is your throttle for CPU-bound thread interleaving.
The switch interval (default 5ms in Python 3.2+) determines how long a thread holds the GIL before voluntarily yielding. Lower it to 1ms for more responsive interleaving (better for UI threads). Raise it to 100ms to reduce context-switch overhead in pure CPU work. This is not a hack—it's a documented tool. But it's global. Every thread in your process pays the cost.
Why does this matter in production? If you run CPU-bound tasks with threading, a high switchinterval starves I/O threads. A low one burns CPU on context switches. Profile your workload. For async or multiprocessing, this API is irrelevant—you've already beaten the GIL. But for legacy threaded systems, it's your only lever. Use it, or your production latency charts will mock you.
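For reference, a short sketch of reading and changing the switch interval. The 1ms value is only an example; the right number depends on profiling your own workload. The original setting is restored at the end because the interval is process-wide.

```python
import sys

original = sys.getswitchinterval()   # in seconds; 0.005 unless something changed it
print(f"current switch interval: {original * 1000:.1f} ms")

sys.setswitchinterval(0.001)         # ask waiting threads to get a turn more often
lowered = sys.getswitchinterval()
print(f"lowered switch interval: {lowered * 1000:.1f} ms")

sys.setswitchinterval(original)      # restore: the setting affects every thread
```

Because the value is global, libraries should generally leave it alone and let the application decide.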
Why Hasn’t the GIL Been Removed Yet?
The GIL persists because removing it breaks C extensions that dominate Python’s ecosystem. Libraries like NumPy, pandas, and TensorFlow rely on the C API, which assumes single-threaded memory management via PyThreadState. A no-GIL build would require rewriting every C extension to use atomic operations or fine-grained locks—a years-long effort with no backward compatibility. Additionally, Python’s reference counting is fundamentally thread-unsafe without the GIL. Alternative garbage collectors (like tracing GC) exist, but they introduce unpredictable pauses, degrade cache performance, and increase memory overhead. The core dev team’s decision is pragmatic: ship stability now, chase parallelism later. Python 3.13’s free-threaded build exists as an experimental flag (--disable-gil), but the default build retains the GIL to protect the 90% of users who depend on C extensions. Removing the GIL isn’t a technical impossibility; it’s an ecosystem engineering challenge.
Asynchronous Notifications
The GIL creates a hidden bottleneck for asynchronous notifications: signals, wake-up events, or inter-thread messages that must cross the GIL boundary. When a thread sends a notification (e.g., threading.Event.set()), the receiving thread cannot react until it acquires the GIL. If the sender or another thread is busy with CPU-bound Python code, the receiver may wait up to a full switch interval (default 5ms) even though the event is already set. This hurts real-time responsiveness. For I/O-bound systems like web servers, one fix is to avoid threads entirely: use asyncio with cooperative multitasking, which sidesteps GIL contention because a single thread runs every task and hands over control only at await points. Alternatively, keep inter-thread hand-offs cheap (for example, a collections.deque, whose append() and popleft() are thread-safe) to minimize time spent holding the GIL. Python 3.13's free-threaded build takes the GIL hand-off out of the notification path, but at the cost of slower atomic operations. Measure your notification latency with time.perf_counter_ns() before optimizing.
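One way to take that measurement is sketched below, assuming a single waiter thread and an otherwise idle process. measure_event_latency is a hypothetical helper, and the 10ms pause is an arbitrary allowance for the waiter to block. To see the worst case, run a CPU-bound thread alongside it.

```python
import threading
import time

def measure_event_latency(rounds=20):
    """Return per-round nanoseconds between Event.set() and the waiter waking up."""
    latencies_ns = []
    for _ in range(rounds):
        event = threading.Event()
        woke_at = []

        def waiter(ev=event, out=woke_at):
            ev.wait()                              # blocks without holding the GIL
            out.append(time.perf_counter_ns())     # needs the GIL again to run this

        thread = threading.Thread(target=waiter)
        thread.start()
        time.sleep(0.01)                           # give the waiter time to block
        sent_at = time.perf_counter_ns()
        event.set()
        thread.join()
        latencies_ns.append(woke_at[0] - sent_at)
    return latencies_ns

if __name__ == "__main__":
    samples = measure_event_latency()
    print(f"median wake-up latency: {sorted(samples)[len(samples) // 2] / 1000:.0f} us")
```

Look at the distribution rather than a single sample, since scheduler noise can dominate individual rounds.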
The 20-Thread Scraper That Crawled Like a Snail
- Threads are fine for I/O-bound work in Python, but useless for CPU-bound parallelism.
- Always profile CPU utilization before scaling threads.
- For CPU-bound workloads in CPython, use multiprocessing or asyncio + subprocess.
- The GIL is not going away soon — design your concurrency strategy around it.
perf top -p <pid> -K
cat /proc/<pid>/status | grep -i ctxt
Key takeaways
Common mistakes to avoid
- Assuming threads give parallelism for all work
- Using multiprocessing for tiny tasks
- Believing asyncio removes the GIL completely
- Ignoring C extension GIL release behavior
Interview Questions on This Topic
Explain what the Python GIL is and why it exists.
Frequently Asked Questions