Senior 13 min · March 05, 2026
GIL — Global Interpreter Lock

Python GIL — CPU Below 15% on 16 Cores

CPU utilization below 15% on a 16-core machine with 20 threads.

Naren Founder & Principal Engineer

20+ years shipping production Python across data and backend systems. Lessons pulled from things that broke in production.

May 23, 2026
last updated
Quick Answer
  • The GIL is a mutex that prevents multiple native threads from executing Python bytecode at once.
  • It protects CPython's reference counting from race conditions, such as one thread freeing an object while another thread is still using it.
  • CPU-bound threads are serialized by the GIL — you get zero parallelism no matter how many cores you have.
  • I/O-bound threads still benefit because the GIL is released during blocking I/O calls.
  • The GIL is not a language feature — it's specific to CPython. Jython and IronPython don't have it.
  • Python 3.13 introduces an experimental no-GIL build (free-threaded) — but it's not production-ready yet.
Plain-English First

Imagine a single microphone at a conference with 10 speakers. Every speaker wants to talk, but only one can hold the mic at a time — even if two of them could theoretically talk about completely different topics simultaneously. That microphone is Python's GIL. Your CPU might have 8 cores (8 potential simultaneous conversations), but the GIL forces every Python thread to queue up and take turns at that one mic, one at a time. The crowd (your CPU) sits mostly idle while speakers wait their turn.

If you've ever spun up 20 Python threads for a CPU-heavy job, such as parsing a pile of scraped pages, expecting a 20x speedup and got little or none, you've met the GIL, and you probably didn't know it. The Global Interpreter Lock is one of the most misunderstood performance constraints in any mainstream programming language. It's not a bug. It's not laziness. It's a deliberate architectural decision from the early 1990s that solved a genuinely hard problem, and whose consequences we're still navigating today.

CPython, the reference Python interpreter, manages memory using reference counting. Every Python object tracks how many references point to it, and when that count hits zero, the object gets deallocated. Reference counting is fast and simple, but it's also dangerously thread-unsafe. Without protection, two threads could simultaneously decrement the same reference count, race each other to zero, and cause a double-free — a memory corruption bug that would make your program crash in ways that are nearly impossible to debug. The GIL is the lock that prevents exactly this class of disaster. One lock to rule them all: only the thread holding the GIL can execute Python bytecode.

By the end of this article you'll understand exactly what the GIL protects and why, how to measure its impact on real code, when threading is still useful despite the GIL, when to reach for multiprocessing or asyncio instead, and — critically — how Python 3.13's experimental no-GIL build changes the picture. You'll walk away able to make informed concurrency decisions in production Python code and answer GIL questions in a senior engineering interview with confidence.

What is GIL — Global Interpreter Lock?

The Global Interpreter Lock (GIL) is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecode simultaneously. It's the reason Python threads don't give you parallelism for CPU-bound tasks — and the reason your multi-threaded web server can still handle concurrent requests without corrupting memory.

The GIL exists primarily to make CPython's memory management simple and fast. Without it, reference counting would require fine-grained locking on every object operation, which would be both slower and far more error-prone. The GIL is a pragmatic trade-off: it sacrifices parallel CPU throughput for simplicity, speed in single-threaded code, and safety in C extensions.

Production Insight
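The GIL keeps CPython safe but kills CPU-bound parallelism.
If your code spends most of its time inside C extensions that release the GIL (many NumPy operations, for example), threads can run in parallel. Not every extension does: the standard-library re module, for instance, holds the GIL while matching.
Rule: profile first, and never assume threads will parallelize your Python code.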
Key Takeaway
The GIL is a single mutex that serializes Python bytecode execution.
It's not a bug — it's a trade-off for simplicity and speed.
Know when it matters and design your concurrency accordingly.
Is the GIL Your Bottleneck?
If: CPU utilization < 30% on a multi-core machine with more than 4 threads
Use: The GIL is likely the bottleneck. Switch to multiprocessing.
If: CPU utilization near 100% on all cores
Use: The GIL is not a problem. You're already using multiprocessing or C extensions.
If: High context switches (voluntary_ctxt_switches > 10K/sec)
Use: Threads are fighting for the GIL. Reduce thread count or switch to asyncio.
[Figure: Python GIL: CPU Under 15% on 16 Cores. How the Global Interpreter Lock limits CPU-bound parallelism. Panels: Global Interpreter Lock (mutex preventing parallel bytecode execution); Reference Counting (thread-safe memory management requires the GIL); CPU-bound Task (GIL serializes execution, effectively single-core); I/O-bound Task (GIL released during I/O, allows concurrency); Multiprocessing Bypass (separate processes, each with its own GIL).]

Why the GIL Exists: Reference Counting and Thread Safety

CPython's memory management is based on reference counting: every Python object has an ob_refcnt field that tracks how many references point to it. When a reference is created, ob_refcnt is incremented; when destroyed, decremented. When it hits zero, the object is deallocated immediately.
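You can watch reference counts move from Python itself. The sketch below is a minimal illustration that uses sys.getrefcount(); the absolute number it returns is a CPython implementation detail (it includes temporary references made by the call itself), so the example only compares differences. The variable names are made up for the demonstration.

```python
import sys

obj = object()
base = sys.getrefcount(obj)  # absolute value varies by CPython version

# Create three more references: one alias plus two list slots.
alias = obj
holders = [obj, obj]
grown = sys.getrefcount(obj) - base

# Drop them again; the count returns to where it started.
del alias, holders
shrunk = sys.getrefcount(obj) - base

print(f"after adding references: +{grown}")
print(f"after deleting them:     +{shrunk}")
```

Every one of those increments and decrements is a write to the object's ob_refcnt field, which is why the field needs protection once threads are involved.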

This is fast, but it's not thread-safe. Updating ob_refcnt is a read-modify-write sequence, not a single atomic step. Imagine two threads each drop their reference to an object whose refcount is 2. Both read 2, both compute 1, and both write 1. One decrement is lost, the count never reaches zero, and the object is never freed (a memory leak). The opposite failure is worse: if two increments collide and one is lost, the count is too low, a later decrement frees the object while a live reference still points to it, and the next access is a use-after-free crash.
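A real refcount race is a matter of timing and hard to reproduce on demand, so here is a deterministic simulation instead. It is a toy model, not CPython code: run_schedule and the "load"/"store" step names are invented for illustration, and each thread's decrement is split into the two machine-level steps that can interleave.

```python
def run_schedule(schedule, refcnt=2):
    """Replay (thread, step) pairs against a shared counter.

    Each thread decrements in two steps: "load" copies the counter into
    a private register, "store" writes register - 1 back.
    """
    registers = {}
    for thread, step in schedule:
        if step == "load":
            registers[thread] = refcnt
        elif step == "store":
            refcnt = registers[thread] - 1
        else:
            raise ValueError(f"unknown step: {step}")
    return refcnt

# One thread finishes before the other starts: 2 -> 1 -> 0, object freed.
serial = [("A", "load"), ("A", "store"), ("B", "load"), ("B", "store")]

# Both threads load before either stores: one decrement is lost.
racy = [("A", "load"), ("B", "load"), ("A", "store"), ("B", "store")]

print("serial result:", run_schedule(serial))  # 0
print("racy result:  ", run_schedule(racy))    # 1 (never reaches zero)
```

The racy schedule ends at 1 even though both references are gone, which is the leak case. Holding one lock across the whole load/store pair, as the GIL does, makes the second schedule impossible.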

The GIL prevents all of this by ensuring only one thread modifies any reference count at any moment. It's a coarse-grained lock — one lock for the entire interpreter — but it's simple and it works.

Alternative approaches exist: fine-grained locking per object (complex, overhead), atomic operations (limited), or garbage collection without reference counting (like PyPy or Jython). CPython chose the GIL, and it's been the default for 30+ years.

Production Insight
The GIL is released during blocking I/O calls (read, write, sleep, connect).
But the moment your thread re-enters Python code, it must reacquire the GIL.
Rule: never assume a long-running C extension releases the GIL — check the docs or source.
Key Takeaway
Reference counting is simple but not thread-safe.
The GIL is CPython's solution to that problem.
You can't remove the GIL without rewriting CPython's memory management.

How the GIL Affects CPU-bound vs I/O-bound Tasks

This is the most practical distinction to understand. The GIL only protects Python bytecode execution. When a thread is waiting for I/O (disk, network, socket), it releases the GIL so another thread can run. That's why multi-threaded web servers and file readers work fine — the GIL is released during recv(), send(), read(), write(), sleep(), etc.
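To see the I/O side of that claim, here is a small sketch that stands in time.sleep() for a blocking network call (an assumption made so the example needs no network). The helper names fake_io, run_serial and run_threaded are illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(delay):
    """Stand-in for a blocking call; time.sleep releases the GIL."""
    time.sleep(delay)
    return delay

def run_serial(delays):
    start = time.perf_counter()
    for d in delays:
        fake_io(d)
    return time.perf_counter() - start

def run_threaded(delays):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        list(pool.map(fake_io, delays))
    return time.perf_counter() - start

delays = [0.2] * 4
serial_s = run_serial(delays)
threaded_s = run_threaded(delays)
print(f"serial:   {serial_s:.2f}s")
print(f"threaded: {threaded_s:.2f}s")
```

The threaded run should take roughly as long as one wait rather than four, because each thread gives up the GIL while it sleeps.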

For CPU-bound tasks — number crunching, parsing, encryption — the thread never yields the GIL voluntarily. It runs until its bytecode slice expires (every 100 interpreter ticks in Python 2, every ~5ms in Python 3 via sys.setswitchinterval). Other threads must wait. If you have 8 CPU-bound threads on a 4-core machine, only one runs at a time — you get effectively single-core performance.

This is not a problem in many real-world Python workloads because the hot loops are often in C extensions (numpy, pandas, lxml) that release the GIL during computation. But pure Python CPU loops will be serialized.

measure_gil_impact.py (Python)
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_heavy(_=None):
    """Pure Python CPU-bound work."""
    count = 0
    for _ in range(10**7):
        count += 1
    return count

def io_simulated(_=None):
    """I/O-bound work; pass this to timed() to see the contrast."""
    time.sleep(1.0)  # sleep releases the GIL
    return 1

def timed(task, n):
    """Run n copies of task on n threads and return the wall-clock time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as ex:
        list(ex.map(task, range(n)))
    return time.perf_counter() - start

# n threads do n units of work; perfect scaling would keep the time flat
baseline = timed(cpu_heavy, 1)
for n in [1, 2, 4, 8]:
    t = baseline if n == 1 else timed(cpu_heavy, n)
    # speedup = time n units would take serially / time they actually took
    print(f"CPU-bound with {n} threads: {t:.2f}s (speedup: {baseline * n / t:.2f}x)")
Output
CPU-bound with 1 threads: 1.23s (speedup: 1.00x)
CPU-bound with 2 threads: 2.45s (speedup: 1.00x)
CPU-bound with 4 threads: 4.91s (speedup: 1.00x)
CPU-bound with 8 threads: 9.80s (speedup: 1.00x)
Production Insight
CPU-bound pure Python loops get NO speedup with threads.
I/O-bound tasks get near-linear speedup because GIL is released during I/O.
Rule: classify your workload before choosing a concurrency model.
Key Takeaway
Threads work for I/O, not for Python CPU loops.
If 95% of time is in C extensions, threads may still parallelize.
Always benchmark — intuition about GIL impact is often wrong.

GIL Lock/Release Flow Sequence Diagram

Understanding exactly when the GIL is acquired and released helps you predict whether threading will benefit your workload. The sequence diagram below shows two threads competing for the GIL: one performing a CPU-bound calculation and the other performing an I/O-bound operation (e.g., a network read). The CPU-bound thread holds the GIL continuously, while the I/O-bound thread releases it during the blocking call, allowing the other thread to run.

Production Insight
In production, thread starvation can occur if a CPU-bound thread holds the GIL for longer than the switch interval (default 5ms). Use sys.setswitchinterval() to tune — but lowering it increases context-switch overhead. For I/O-heavy services, threading scales well because the GIL is released during waits.
Key Takeaway
The GIL is released during blocking I/O, allowing other threads to run. CPU-bound threads keep the GIL and serialize execution.
GIL Lock/Release Flow for CPU-bound vs I/O-bound Threads
Participants: Main Thread, CPU-bound Thread, I/O-bound Thread, GIL, I/O Device
1. Main Thread: acquire GIL, spawn the CPU-bound thread, spawn the I/O-bound thread, release GIL.
2. CPU-bound Thread: acquire GIL (starts computation) and execute Python bytecode.
3. I/O-bound Thread: try to acquire GIL (blocked).
4. CPU-bound Thread: release GIL (after timeslice or yield).
5. I/O-bound Thread: acquire GIL, execute Python bytecode (before I/O), initiate read() and release GIL during the I/O wait.
6. CPU-bound Thread: acquire GIL (runs while I/O is pending), more computation.
7. I/O Device: data ready.
8. I/O-bound Thread: try to acquire GIL (may be held by the CPU-bound thread).
9. CPU-bound Thread: release GIL.
10. I/O-bound Thread: acquire GIL, process received data.

GIL Impact on I/O vs CPU Bound Tasks

The following table summarizes how the GIL affects each type of workload, what speedup you can expect from threading, and the recommended approach.

Aspect | CPU-bound Task | I/O-bound Task
GIL effect | Held continuously → serial execution | Released during blocking calls → concurrency
Threading speedup | ~1x (no parallel gain) | Nearly linear up to thread count
CPU utilization | Only one core active | May use multiple cores when GIL is released
Example | Parsing HTML, mathematical loops, encryption | Reading files, making HTTP requests, waiting for DB queries
Python threading recommendation | Avoid; use multiprocessing | Good; works well
Alternative | multiprocessing or asyncio + subprocess | asyncio for high concurrency

The critical insight: threading in Python is not universally useless. It's excellent for I/O-bound programs (web servers, scrapers, file watchers) where the GIL is released frequently. It's useless for CPU-bound pure Python loops.

Production Insight
Always measure the ratio of I/O time to CPU time. If your task spends >90% of its time waiting on I/O, threading is a good choice. If CPU time dominates, multiprocessing is safer. For mixed workloads, consider using a thread pool for I/O parts and a process pool for CPU chunks.
Key Takeaway
Threading works for I/O-bound tasks because the GIL is released; for CPU-bound tasks, serialization kills parallelism.

Measuring GIL Contention in Practice

Before optimizing around the GIL, you must measure it. Blindly switching to multiprocessing can add copy overhead (pickle serialization) that kills performance for certain workloads.

Tools:
  • perf top -p <pid> shows where CPU time is spent. A high share in _PyEval_EvalFrameDefault means the time is going into the bytecode interpreter loop, which is the code the GIL serializes.
  • /proc/<pid>/status shows voluntary_ctxt_switches. A fast-growing value can indicate thread contention.
  • strace -f -e trace=futex -p <pid> shows futex calls. On Linux, a thread waiting for the GIL typically shows up as FUTEX_WAIT.
  • py-spy (a sampling profiler) can dump the call stacks of all threads and report which thread holds the GIL.
  • sys._current_frames() returns the current Python frame of every thread, so you can dump all stacks from a signal handler. The wait inside take_gil is a C-level function, so it only appears in native stacks (gdb, or py-spy with native frames enabled).
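The sys._current_frames() approach can be sketched in a few lines. This is a minimal version under stated assumptions: dump_thread_stacks, wait_for_signal and the "worker-1" thread name are invented for the demo, and in a real service you would call the dump from a signal handler or a debug endpoint rather than inline.

```python
import sys
import threading
import traceback

def dump_thread_stacks():
    """Return {thread name: formatted Python stack} for every live thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    report = {}
    for ident, frame in sys._current_frames().items():
        name = names.get(ident, f"unknown-{ident}")
        report[name] = "".join(traceback.format_stack(frame))
    return report

def wait_for_signal(ready, stop):
    ready.set()   # tell the main thread we are inside this function
    stop.wait()   # park here until asked to exit

ready, stop = threading.Event(), threading.Event()
worker = threading.Thread(target=wait_for_signal, args=(ready, stop), name="worker-1")
worker.start()
ready.wait()

report = dump_thread_stacks()
for name, stack in report.items():
    print(f"--- {name} ---\n{stack}")

stop.set()
worker.join()
```

This shows where each thread is in Python code. It does not show which thread holds the GIL; for that you need a native-level tool such as py-spy or gdb.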

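Native GIL detection: CPython has no built-in GIL contention counter. Python 3.2+ exposes sys.getswitchinterval() (default 5ms). You can lower it to make threads switch more often, but that increases overhead. A rough proxy on Linux is the futex call rate from perf stat -e syscalls:sys_enter_futex -p <pid>; other locks also use futexes, so treat it as a hint rather than an exact count of GIL acquisitions.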

Micro-benchmark pattern: Run a CPU-bound loop (pure Python) with 1 thread, then N threads. If time grows linearly with N, the GIL is fully serializing.

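detect_gil.py (Python)
import glob
import os
import sys
import time

def total_voluntary_switches(pid):
    """Sum voluntary context switches across all threads of a process."""
    total = 0
    # /proc/<pid>/status only covers the main thread; task/* covers every thread
    for path in glob.glob(f'/proc/{pid}/task/*/status'):
        with open(path) as f:
            for line in f:
                # startswith avoids matching nonvoluntary_ctxt_switches
                if line.startswith('voluntary_ctxt_switches'):
                    total += int(line.split(':')[1])
    return total

def check_gil_contention(pid, interval=1.0, threshold=10_000):
    if not os.path.exists(f'/proc/{pid}/task'):
        print("Can't check: not on Linux or no /proc")
        return
    # The counters are cumulative, so sample twice to get a per-second rate
    before = total_voluntary_switches(pid)
    time.sleep(interval)
    rate = (total_voluntary_switches(pid) - before) / interval
    if rate > threshold:
        print(f"HIGH context switches ({rate:.0f}/sec): GIL contention likely")
    else:
        print(f"Low context switches ({rate:.0f}/sec): GIL probably not the problem")

if __name__ == "__main__":
    # Usage: python detect_gil.py <pid>
    check_gil_contention(int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid())
Output (illustrative, for a process under heavy contention)
HIGH context switches (45000/sec): GIL contention likely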
Production Insight
High context switches don't always mean GIL — but if they correlate with low CPU, it's a strong signal.
Use perf stat -e migrations to see threads moving across cores — GIL contention causes migrations.
Rule: collect baseline metrics before any concurrency change.
Key Takeaway
Measure GIL impact with system tools, not guesses.
High voluntary context switches + low CPU utilization = GIL bottleneck.
Fix: use multiprocessing or asyncio, not more threads.

Beating the GIL: Threading, Multiprocessing, asyncio

Multiprocessing — The most common approach. Each Python process has its own GIL, so N processes give you nearly Nx speedup for CPU-bound work. Use concurrent.futures.ProcessPoolExecutor or multiprocessing.Pool. Downside: overhead of serializing data between processes via pickle. If you pass large data structures, that can dominate runtime.
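Before committing to a process pool, it helps to see what the pickle round trip costs for your actual payloads. The sketch below is a rough measurement, assuming plain Python lists as stand-in data; pickle_round_trip is an illustrative helper, and real numbers depend heavily on the object types you pass.

```python
import pickle
import time

def pickle_round_trip(payload):
    """Serialize and deserialize payload; return (bytes, seconds, copy)."""
    start = time.perf_counter()
    blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
    restored = pickle.loads(blob)
    elapsed = time.perf_counter() - start
    return len(blob), elapsed, restored

small = list(range(10))
large = list(range(1_000_000))

small_size, small_secs, small_copy = pickle_round_trip(small)
large_size, large_secs, large_copy = pickle_round_trip(large)

print(f"small: {small_size:>10,} bytes in {small_secs * 1000:.3f} ms")
print(f"large: {large_size:>10,} bytes in {large_secs * 1000:.3f} ms")
```

A process pool pays this cost for the arguments on the way in and for the results on the way out. If the round trip takes longer than the computation it feeds, multiprocessing makes things slower.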

asyncio: Cooperative multitasking with a single thread. No GIL contention because there's only one thread. Great for I/O-bound workloads that spend most time waiting. Use await for all I/O. Downside: your I/O libraries must be async-aware. A blocking call stalls the whole event loop unless you push it to a thread with asyncio.to_thread() or loop.run_in_executor().
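Here is a minimal asyncio sketch. The coroutine fetch and the function blocking_lookup are hypothetical stand-ins: asyncio.sleep() plays the role of an async network call, and time.sleep() plays the role of a blocking library call that is handed to a worker thread with asyncio.to_thread() (Python 3.9+).

```python
import asyncio
import time

def blocking_lookup(key):
    """Stand-in for a blocking library call (no async version available)."""
    time.sleep(0.1)
    return key.upper()

async def fetch(key):
    """Stand-in for an async network call."""
    await asyncio.sleep(0.1)
    return key * 2

async def main():
    # All three waits overlap; gather returns results in argument order.
    return await asyncio.gather(
        fetch("a"),
        fetch("b"),
        asyncio.to_thread(blocking_lookup, "c"),
    )

results = asyncio.run(main())
print(results)  # ['aa', 'bb', 'C']
```

The event loop itself stays on one thread. Only the blocking call borrows a thread, and since it spends its time waiting, it releases the GIL while it does.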

C Extensions with nogil — Write performance-critical code in Cython or C and release the GIL explicitly. The with nogil: block in Cython runs without the GIL, giving true parallelism. Downside: complexity, C interop.

Which to pick?
  • I/O-bound, many concurrent tasks → asyncio (single thread, no GIL fight)
  • CPU-bound, pure Python → multiprocessing
  • CPU-bound, mostly C extensions → threading may work (if the extension releases the GIL)
  • Mixed workload → multiprocessing for CPU parts, thread pool for I/O parts

The choice also depends on overhead tolerance. For small tasks (millisecond computation), multiprocessing overhead (process spawn, pickle) often outweighs parallel speedup. Profile before committing.

compare_strategies.py (Python)
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def cpu_task(n):
    """Simulate CPU work in pure Python."""
    sum(i**2 for i in range(n))
    return n

def timed(executor_cls, tasks, workers):
    """Run cpu_task over tasks with the given executor; return seconds."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as ex:
        list(ex.map(cpu_task, tasks))
    return time.perf_counter() - start

# The guard is required: with the spawn start method, workers re-import this module
if __name__ == "__main__":
    N_WORKERS = 8
    TASKS = [10_000_000] * N_WORKERS

    thread_time = timed(ThreadPoolExecutor, TASKS, N_WORKERS)
    proc_time = timed(ProcessPoolExecutor, TASKS, N_WORKERS)

    print(f"Threads: {thread_time:.2f}s")
    print(f"Processes: {proc_time:.2f}s ({thread_time / proc_time:.1f}x faster than threads)")
Output
Threads: 4.50s
Processes: 0.60s (7.5x faster than threads)
When to Use Each Concurrency Model
  • asyncio: One person makes coffee for many. Great when waiting for water to boil (I/O).
  • Threading: Many people share one machine. Only one can use it at a time. Fast if they spend most time away from it (I/O-bound).
  • Multiprocessing: Each person has their own machine. Expensive to set up, but they never wait.
  • Cython nogil: Hire a barista who works alone — never uses the coffee machine (releases GIL).
Production Insight
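Multiprocessing pays a per-task cost for pickling arguments and results, and that cost grows with payload size, so it rarely pays off for tiny CPU tasks.
asyncio has almost no overhead per task but requires async libraries.
Rule of thumb: send CPU-bound work to a process pool only when each task runs long enough (hundreds of milliseconds or more) to dwarf that cost; batch smaller tasks into chunks first.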
Key Takeaway
Threads don't parallelize Python CPU code.
Multiprocessing does, but at a cost.
asyncio avoids GIL entirely — but needs async code.
Pick based on workload profile and overhead tolerance.
Choosing the Right Strategy
If: Workload is I/O-bound (network, disk, sleep)
Use: asyncio (if code can be async) or threading (if blocking is rare)
If: Workload is CPU-bound, pure Python
Use: multiprocessing (ProcessPoolExecutor)
If: Workload is CPU-bound, mostly C extensions (numpy, pandas)
Use: Try threading first; if CPU saturates, stick with it; else switch to multiprocessing
If: Workload is mixed (some CPU, some I/O)
Use: multiprocessing for CPU chunks, asyncio/threading for I/O; combine with queues

Python 3.13 'Free-threading' (No-GIL) Status

PEP 703 introduced an experimental free-threaded build of CPython 3.13 that can run without the GIL. Instead of a single global lock, it uses biased reference counting (a fast non-atomic path for the thread that owns an object, atomic operations for other threads), deferred reference counting and immortalization for some objects, and per-object locks on built-in containers. This allows true multi-core parallelism for pure Python CPU-bound code without switching to multiprocessing.

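How to enable: Build CPython with --disable-gil or install a pre-built free-threaded binary, usually named python3.13t (the official Windows and macOS installers offer it as an option, and so do several package managers). At runtime, sys._is_gil_enabled() returns False when the GIL is actually off.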
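A small sketch for checking both things at runtime: whether the interpreter was built free-threaded, and whether the GIL is currently on. The helper name free_threading_status is made up; sysconfig.get_config_var("Py_GIL_DISABLED") and sys._is_gil_enabled() are the checks described in the CPython free-threading documentation, with a fallback for versions before 3.13 where the latter does not exist.

```python
import sys
import sysconfig

def free_threading_status():
    """Report build-time and run-time GIL state as plain booleans."""
    # 1 on free-threaded builds, 0 or None otherwise.
    built_free_threaded = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

    # sys._is_gil_enabled() was added in 3.13; older versions always have a GIL.
    is_gil_enabled = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = True if is_gil_enabled is None else bool(is_gil_enabled())

    return {"free_threaded_build": built_free_threaded, "gil_enabled": gil_enabled}

status = free_threading_status()
print(status)
```

The two values can disagree in one direction: a free-threaded build can still report gil_enabled as True if the GIL was switched back on at startup or by importing an extension that has not opted in.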

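Current limitations:
  • Not production-ready in 3.13. Many C extensions assume the GIL protects them. When an extension that has not declared free-threading support is imported, the interpreter re-enables the GIL for the whole process and emits a warning; forcing the GIL off anyway (PYTHON_GIL=0 or -X gil=0) risks crashes or data corruption.
  • Single-threaded overhead, commonly quoted at 5-15%, from atomic operations and extra locking.
  • The Python C API has new requirements: extension modules must declare support through the Py_mod_gil slot and stop relying on the GIL to guard their own shared state.
  • Only a subset of popular packages ship compatible builds (numpy, pandas, pyarrow among them).

When to test it: If you have CPU-bound pure Python code that cannot be moved to C or multiprocessing (e.g., dynamic code generation, complex business logic), try free-threaded Python in a staging environment. But treat 3.13 as experimental. PEP 779 moved the free-threaded build to officially supported (still optional) status in Python 3.14, so evaluate on 3.14 or later before considering production.

The free-threaded build is a glimpse of Python's future: the long-term plan is for it to become the default, at which point threads give you parallelism without a process pool.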

Not for production
Free-threaded Python 3.13 is experimental. Many C extensions are incompatible. Always run your test suite and stress tests before considering deployment.
Production Insight
If you rely on C extensions (most Python projects do), free-threaded Python is likely to break them. Even if the extension claims compatibility, test thoroughly. The performance gain from no-GIL is only visible for CPU-bound Python code — I/O-bound code sees no benefit.
Key Takeaway
Free-threaded Python (PEP 703) is a promising step toward removing the GIL, but it's not ready for production. Use multiprocessing for now.

Python 3.13: The No-GIL Build (Free-Threaded Python)

Python 3.13 introduced an experimental build configuration called "free-threaded" that removes the GIL entirely. This is the result of PEP 703 ("Making the Global Interpreter Lock Optional") and years of work to make CPython's memory management thread-safe without a global lock.

How it works: Instead of one lock for all objects, CPython uses biased reference counting with atomic operations for cross-thread access, plus deferred reference counting for some objects and per-object locks for containers. The GIL can then be switched off.

Current status (2026): In 3.13 it's experimental. Activate with --disable-gil at build time. Not all C extensions are compatible: importing one that hasn't declared free-threading support turns the GIL back on for the process. Reported as working: numpy, pandas, pyarrow. Reported as problematic at the time of writing: many Cython extensions, lxml, some database drivers. Check each project's current release notes.

Performance: For pure Python CPU-bound code, free-threaded Python can achieve near-linear scaling on multi-core machines. But single-threaded performance is slightly worse (5-15% overhead) due to atomic operations in reference counting.

Production readiness: Not yet on 3.13. Unless you control every C extension in your stack, stay with the default GIL build for now. But this is the direction of travel: the long-term plan is for free-threading to become the default build.

Production Insight
Free-threaded Python 3.13 is not faster for I/O-bound workloads.
It's only useful if you have CPU-bound pure Python loops that you can't move to C.
Rule: test with your exact C extension versions before deploying no-GIL.
Key Takeaway
Python 3.13 no-GIL is a big step but not production-ready.
If you need parallelism today, use multiprocessing.
Monitor Python 3.14+ for default free-threading.

Why a Single Global Lock Instead of Per-Object Mutexes?

Concurrent mutation of a shared object such as a list is a textbook race condition. Many C developers would slap a per-object mutex on it and move on. Python didn't. Why?

Performance. Pure and simple. In the early 90s, when Guido van Rossum wrote CPython, computers had one core. Threading was for I/O concurrency, not CPU parallelism. Adding a mutex to every single object operation — every attribute access, every dict lookup, every list append — would have killed single-threaded performance dead. Each mutex acquire/release costs tens of nanoseconds. That adds up fast when you're doing millions of operations per second.
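You can get a feel for that cost from Python. The sketch below wraps every increment in a threading.Lock and compares it with the bare loop. This is only a loose analogy: a Python-level lock carries interpreter overhead that a C-level per-object mutex would not, so read the ratio as a direction, not a measurement of CPython internals. The function names are illustrative.

```python
import threading
import time

def count_plain(n):
    """Increment a counter n times with no locking."""
    total = 0
    for _ in range(n):
        total += 1
    return total

def count_locked(n):
    """Same loop, but acquire and release a lock around every increment."""
    lock = threading.Lock()
    total = 0
    for _ in range(n):
        with lock:
            total += 1
    return total

def seconds(fn, n):
    start = time.perf_counter()
    fn(n)
    return time.perf_counter() - start

N = 200_000
plain_s = seconds(count_plain, N)
locked_s = seconds(count_locked, N)
print(f"plain:  {plain_s:.4f}s")
print(f"locked: {locked_s:.4f}s")
```

Both loops are single-threaded, so the lock is never contended. Whatever difference you see is pure acquire/release overhead, paid on every operation.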

The GIL is one lock, held for the duration of a bytecode instruction or a short C call. No lock contention in single-threaded code. No cascading lock overhead on every object. It was a pragmatic trade-off: sacrifice multi-core parallelism (which didn't exist yet) for single-threaded speed (which mattered).

And it worked. CPython became the reference implementation, and the GIL baked itself into the language's DNA. By the time multi-core CPUs became standard, the GIL was a core assumption in every C extension, every internal data structure, every thread-unsafe optimization. Removing it would mean rebuilding the whole interpreter.

MutexOverhead.py (Python)
# io.thecodeforge: python tutorial

import threading
import time


def hammer_list(count):
    """Append to a local list in a tight loop."""
    data = []
    for i in range(count):
        data.append(i)
    return data


if __name__ == "__main__":
    N = 5_000_000
    start = time.perf_counter()
    hammer_list(N)
    single_time = time.perf_counter() - start

    # Two threads contending for the GIL; together they do twice the work
    t1 = threading.Thread(target=hammer_list, args=(N,))
    t2 = threading.Thread(target=hammer_list, args=(N,))
    start = time.perf_counter()
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    concurrent_time = time.perf_counter() - start

    print(f"Single-threaded:  {single_time:.3f}s")
    print(f"Two threads:     {concurrent_time:.3f}s")
    print(f"Overhead factor: {concurrent_time / single_time:.2f}x")
Output
Single-threaded: 0.184s
Two threads: 0.913s
Overhead factor: 4.96x
Production Trap: Overhead Balloons with Thread Count
Each extra thread fighting for the GIL adds context-switch costs and lock-acquire overhead. You don't get 2x work done with 2 threads. You get worse throughput than a single thread. The GIL turns threads into overhead generators for CPU-bound work.
Key Takeaway
The GIL exists because per-object locks would have destroyed single-threaded performance: a trade-off that made sense in the early 1990s and still haunts us in the multi-core era.

How Python 3.13 Finally Breaks the Curse (Without Breaking Your Code)

The No-GIL build in Python 3.13 is not a flag you flip. It's a completely separate build of CPython — --disable-gil — that ships alongside the regular GIL'd interpreter. You opt in per interpreter binary, not per script. This avoids a thousand C extensions suddenly catching fire.

The trick? They didn't remove the GIL and hope for the best. They added per-object locks, exactly what the original CPython skipped, alongside a thread-safe reference counting scheme. Those locks are fine-grained: one lightweight lock per object that needs it, not one lock per interpreter. A concurrent list.append is now protected by that list's own lock, held only while its internal state changes.

But here's the rub: C extensions written before free-threading assumed the GIL protected them, and many mutate shared state without any locking of their own. So the free-threaded build makes extensions opt in. A module declares that it is safe without the GIL; if you import one that hasn't, the interpreter turns the GIL back on for the whole process and warns you. Result: the extension runs, but with zero parallelism gains. You only get the speedup if everything you import is pure Python or has been updated for free-threaded mode.

It's a bridge. You can run your existing code on the free-threaded interpreter today, verify it doesn't crash, and then incrementally migrate hot paths as your dependencies add support. No rewrite from scratch. That's the real engineering win.

NoGILComparison.py (Python)
# io.thecodeforge: python tutorial

import sys
import time
from concurrent.futures import ThreadPoolExecutor


def digest_vector(n):
    """Pure Python CPU-bound work: float ops."""
    total = 0.0
    for i in range(n):
        total += (i * 1.0001) ** 0.5
    return total


def gil_disabled():
    """True only when running on a free-threaded build with the GIL off."""
    # sys._is_gil_enabled() exists on Python 3.13+; older versions always have a GIL
    check = getattr(sys, "_is_gil_enabled", None)
    return check is not None and not check()


if __name__ == "__main__":
    N = 5_000_000
    WORKERS = 4

    print(f"Python: {sys.version.split()[0]}")
    print(f"Free-threaded: {gil_disabled()}")

    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        start = time.perf_counter()
        futures = [pool.submit(digest_vector, N) for _ in range(WORKERS)]
        results = [f.result() for f in futures]
        elapsed = time.perf_counter() - start

    print(f"{WORKERS} workers, {N:,} iterations each: {elapsed:.3f}s")
Output
# Regular CPython 3.13 (GIL on):
# Python: 3.13.0
# Free-threaded: False
# 4 workers, 5,000,000 iterations each: 5.241s
# (roughly the same as 1 thread — GIL bottleneck)
# No-GIL CPython 3.13 (free-threaded build):
# Python: 3.13.0
# Free-threaded: True
# 4 workers, 5,000,000 iterations each: 1.472s
# (3.5x speedup on 4 cores — real parallelism)
Senior Shortcut: Use sys._is_gil_enabled() to Detect No-GIL at Runtime
sys._is_gil_enabled() (Python 3.13+) reports whether the GIL is currently active; the leading underscore marks it as an implementation detail. To check whether the interpreter was built free-threaded at all, use sysconfig.get_config_var("Py_GIL_DISABLED"). Use either to conditionally enable parallel code paths.
Key Takeaway
Python 3.13's No-GIL build is safe to experiment with: importing an extension that hasn't opted in simply turns the GIL back on. Real speedup requires pure Python or extensions updated for free-threaded mode.

Why fork() and the GIL Are a Toxic Combination

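You're running a web server. You fork() to handle requests. Suddenly, your workers deadlock or crash. The root cause? fork() and threads don't mix, and the GIL does nothing to protect you from POSIX fork() semantics.

When fork() executes, the child process inherits a copy of the parent's memory, including every mutex in whatever state it was in, but only the thread that called fork() survives. CPython reinitializes the GIL itself in the child, so the GIL is not what bites you. The danger is every other lock: a threading.Lock, a logging handler lock, or a lock inside a C library that some other thread held at the moment of the fork. In the child that lock is still marked as held, and the thread that would release it no longer exists. Anything that tries to acquire it blocks forever. This is a classic deadlock that wastes hours of debugging.

The fix is brutal and simple: don't fork a process that has running threads. Python 3.12+ emits a DeprecationWarning when os.fork() is called in a multi-threaded process. If you must fork, register handlers with os.register_at_fork() (Python 3.7+) to reinitialize your own locks in the child; embedders calling fork() from C must call PyOS_AfterFork_Child(). Even better: use multiprocessing with the spawn start method, which is already the default on macOS and Windows. For production Python, never assume fork()+threads works. Check your process start method, or you'll be measuring a production outage.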
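One way to pin the start method without touching global state is to request a spawn context and hand it to the pool. This is a sketch, not a drop-in: make_spawn_pool and square are illustrative names, and the pool is deliberately not started here because spawn-based workers must be launched from under an if __name__ == "__main__": guard in a real script.

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

def square(x):
    """Example task; must live at module level so workers can import it."""
    return x * x

def make_spawn_pool(workers=2):
    """Build a process pool whose workers start fresh instead of forking."""
    ctx = mp.get_context("spawn")
    return ProcessPoolExecutor(max_workers=workers, mp_context=ctx)

spawn_ctx = mp.get_context("spawn")
print("available start methods:", mp.get_all_start_methods())
print("this context uses:", spawn_ctx.get_start_method())
```

In a script you would then write, under the main guard, something like: with make_spawn_pool(4) as pool: results = list(pool.map(square, range(10))). Spawned workers begin with a clean interpreter, so they inherit no half-held locks from the parent.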

fork_gil_deadlock.py (Python)
# io.thecodeforge: python tutorial

import os
import threading

def show_deadlock():
    # An ordinary threading.Lock, not the GIL: CPython resets the GIL after fork
    lock = threading.Lock()
    lock.acquire()

    pid = os.fork()  # POSIX only
    if pid == 0:
        # Child: the lock was copied in its held state. In a real program the
        # holder would be another thread, which does not exist in the child.
        print("Child: trying to acquire inherited lock...", flush=True)
        # acquire() returns False on timeout; it does not raise
        if not lock.acquire(timeout=1):
            print("Child: deadlocked as expected", flush=True)
        os._exit(0)  # exit immediately, skipping the parent's cleanup handlers
    else:
        os.wait()
        print("Parent: done")

if __name__ == "__main__":
    show_deadlock()
Output
Child: trying to acquire inherited lock...
Child: deadlocked as expected
Parent: done
Production Trap:
If you must fork(), use os.register_at_fork(after_in_child=...) to recreate your own locks in the child, and fork before starting any threads. But better: use multiprocessing.set_start_method("spawn") to avoid the entire class of bugs.
Key Takeaway
Never fork() a multi-threaded Python process and expect its locks to be usable in the child. Use spawn-based multiprocessing.

Mastering the Legacy API: sys.setswitchinterval() for GIL Control

Most devs treat the GIL as a black box. But Python exposes a small API that directly controls how often the GIL switches threads: sys.setswitchinterval(). This is your throttle for CPU-bound thread interleaving.

The switch interval (default 5ms in Python 3.2+) determines how long a running thread may keep the GIL once another thread is waiting for it. Lower it to 1ms for more responsive interleaving (better for UI threads). Raise it to 100ms to reduce context-switch overhead in pure CPU work. This is not a hack: it's a documented tool. But it's global. Every thread in your process pays the cost.

Why does this matter in production? If you run CPU-bound tasks with threading, a high switchinterval starves I/O threads. A low one burns CPU on context switches. Profile your workload. For async or multiprocessing, this API is irrelevant—you've already beaten the GIL. But for legacy threaded systems, it's your only lever. Use it, or your production latency charts will mock you.

switch_interval_demo.pyPYTHON
# io.thecodeforge - python tutorial

import sys
import time
import threading

def cpu_burn():
    start = time.perf_counter()
    for _ in range(10_000_000):
        _ = 2 ** 10  # constant-folded at compile time; the loop itself is the work
    elapsed = time.perf_counter() - start
    print(f"Thread done in {elapsed:.3f}s")

# Default is 0.005 (5ms); lower it to 1ms for more aggressive switching
sys.setswitchinterval(0.001)

threads = [threading.Thread(target=cpu_burn) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"Final switch interval: {sys.getswitchinterval()}s")
Output (timings vary by machine)
Thread done in 0.482s
Thread done in 0.491s
Thread done in 0.479s
Thread done in 0.500s
Final switch interval: 0.001s
Senior Shortcut:
Don't touch this in async code. For threaded CPU work, test intervals of 0.001, 0.005, 0.010. Plot your throughput vs latency and pick the sweet spot from your own measurements: it depends on the workload and the OS scheduler, not on a universal constant.
Key Takeaway
sys.setswitchinterval() is the only live GIL tuning knob. Change it per workload, not per preference.
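Because the setting is process-wide, it helps to scope a change and restore the previous value afterwards. A minimal sketch of that idea follows; the `switch_interval` context manager is a hypothetical helper, not part of the standard library, and it only wraps the real `sys.getswitchinterval()` and `sys.setswitchinterval()` calls.

```python
import sys
from contextlib import contextmanager

@contextmanager
def switch_interval(seconds: float):
    """Temporarily set the interpreter-wide switch interval, then restore it."""
    previous = sys.getswitchinterval()
    sys.setswitchinterval(seconds)
    try:
        yield previous
    finally:
        # Restore even if the benchmarked code raises.
        sys.setswitchinterval(previous)

# Usage: run one benchmark per candidate interval.
for candidate in (0.001, 0.005, 0.010):
    with switch_interval(candidate):
        pass  # start and join your CPU-bound threads here, then record timings
```

Every thread in the process sees the temporary value while the block is active, so this belongs in a benchmark harness rather than in request-handling code.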

Why Hasn’t the GIL Been Removed Yet?

The GIL persists because removing it breaks assumptions baked into the C extensions that dominate Python's ecosystem. Libraries like NumPy, pandas, and TensorFlow are built on the C API, and extension code has long assumed that holding the GIL keeps its own global state and borrowed references safe from other threads. A no-GIL build requires extension authors to audit that code, add atomic operations or fine-grained locks where needed, and ship separate builds for the new ABI: a years-long effort across the ecosystem. Additionally, plain reference counting is not thread-safe without the GIL. Replacing it with a tracing garbage collector is possible, but that brings pause-time, cache, and memory trade-offs, and it would change the object-lifetime behavior that C extensions rely on. The core dev team's decision is pragmatic: ship stability now, chase parallelism later. Python 3.13's free-threaded build (PEP 703) is an experimental, separate build configured with --disable-gil; the default build retains the GIL to protect the large majority of users who depend on C extensions. Removing the GIL isn't a technical impossibility; it's an ecosystem engineering challenge.

gil_c_ext_check.pyPYTHON
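# io.thecodeforge - python tutorial

import sys
import sysconfig

# Check whether the GIL is disabled (Python 3.13+)
def check_gil_status():
    # Py_GIL_DISABLED is 1 only on free-threaded builds
    if not sysconfig.get_config_var("Py_GIL_DISABLED"):
        print("GIL: Present (default build)")
        return
    # sys._is_gil_enabled() exists on 3.13+; the GIL can be re-enabled at
    # runtime via PYTHON_GIL=1 or by importing an extension that requires it
    if sys._is_gil_enabled():
        print("GIL: Present (free-threaded build, GIL re-enabled at runtime)")
    else:
        print("GIL: Disabled (free-threaded build active)")

check_gil_status()
Output
GIL: Present (default build)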
Production Trap:
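C extensions built for the default interpreter won't load in Python 3.13's free-threaded build: it uses a separate ABI (wheels tagged cp313t). An extension that is rebuilt but doesn't declare free-threading support makes the interpreter re-enable the GIL at import time, with a warning. Always rebuild against the free-threaded interpreter and test under it.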
Key Takeaway
The GIL stays the default because removing it today would break a large share of the C extensions running in production.

Asynchronous Notifications

The GIL creates a hidden bottleneck for asynchronous notifications: signals, wake-up events, or inter-thread messages whose receiver must reacquire the GIL before it can react. When a thread sends a notification (e.g., threading.Event.set()), the OS wakes the receiving thread, but that thread cannot run Python code until it gets the GIL. If a CPU-bound thread currently holds the GIL, the receiver has to wait for a forced release, so with the default 5ms switch interval a notification can take 5ms or more to be handled even though the event is already set. This kills real-time responsiveness. (Python-level signal handlers have a related constraint: they only run in the main thread, so they wait until the main thread next executes bytecode.) For I/O-bound systems like web servers, the fix is to avoid competing threads: use asyncio, whose single event-loop thread still holds the GIL but has nobody to contend with, so hand-offs between tasks happen at await points without GIL waits. Alternatively, move CPU-bound work into worker processes so the threads that need low-latency signaling never queue behind it. Python 3.13's free-threaded build removes the GIL wait from this path, at the cost of slower single-threaded execution. Measure your notification latency with time.perf_counter_ns() before optimizing.

async_notify_latency.pyPYTHON
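# io.thecodeforge - python tutorial

import threading
import time

event = threading.Event()
handled = threading.Event()
latency = []
last_set = [0]

def waiter():
    while True:
        event.wait()
        latency.append(time.perf_counter_ns() - last_set[0])
        event.clear()
        handled.set()  # tell the sender this round was recorded

def busy():
    # CPU-bound thread that keeps competing for the GIL
    while True:
        sum(range(10_000))

threading.Thread(target=waiter, daemon=True).start()
threading.Thread(target=busy, daemon=True).start()

for _ in range(5):
    handled.clear()
    last_set[0] = time.perf_counter_ns()
    event.set()
    handled.wait()  # avoids racing the waiter's event.clear()

print(f"Avg notification latency: {sum(latency)//len(latency)} ns")
Output (machine-dependent; varies from run to run)
Avg notification latency: individual rounds can approach or exceed the 5ms switch interval (5,000,000 ns) while busy() holds the GIL; without busy() the figure is far lower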
Production Trap:
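threading.Event notifications can suffer 5ms+ latency when a CPU-bound thread is holding the GIL. For sub-millisecond signaling, use asyncio.Event within a single event loop, or move the CPU-bound work to another process so the signaling threads never wait for the GIL.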
Key Takeaway
When CPU-bound threads hold the GIL, thread notifications can be delayed by about one switch interval (~5ms by default). Use asyncio for latency-sensitive signaling.
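For comparison, the same hand-off can be measured inside a single event loop. This is a rough sketch rather than a benchmark; `measure_asyncio_event` and its `rounds` parameter are hypothetical names, and the 1ms sleep is just an arbitrary pause that yields control to the waiter.

```python
import asyncio
import time

async def measure_asyncio_event(rounds: int = 5) -> list:
    """Return per-round latencies (ns) between event.set() and the waiter running."""
    event = asyncio.Event()
    latencies = []
    sent_at = 0

    async def waiter():
        for _ in range(rounds):
            await event.wait()
            latencies.append(time.perf_counter_ns() - sent_at)
            event.clear()

    task = asyncio.create_task(waiter())
    for _ in range(rounds):
        sent_at = time.perf_counter_ns()
        event.set()
        # Yield to the event loop so the waiter can run and clear the event.
        await asyncio.sleep(0.001)
    await task
    return latencies

if __name__ == "__main__":
    samples = asyncio.run(measure_asyncio_event())
    print(f"Avg asyncio notification latency: {sum(samples) // len(samples)} ns")
```

Only one thread runs here, so nothing waits on the GIL. The trade-off is that any CPU-heavy coroutine step delays every other task on the loop.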
● Production incident · POST-MORTEM · severity: high

The 20-Thread Scraper That Crawled Like a Snail

Symptom
CPU utilization stayed below 15% on a 16-core machine. Thread count was 20, but only one core was active at any time. Throughput was only slightly higher than single-threaded.
Assumption
More threads == more parallelism. The team assumed Python threads would spread across CPU cores like in C++ or Java.
Root cause
The scraper was CPU-bound (parsing HTML, extracting data). The GIL serialized all bytecode execution. Only one thread held the GIL at a time, so only one core was used.
Fix
Switched from threading to multiprocessing using concurrent.futures.ProcessPoolExecutor. Each process got its own GIL, allowing true parallel execution on all 16 cores. Throughput jumped 14x.
Key lesson
  • Threads are fine for I/O-bound work in Python, but useless for CPU-bound parallelism.
  • Always profile CPU utilization before scaling threads.
  • For CPU-bound workloads in CPython, use multiprocessing or asyncio + subprocess.
  • The GIL is not going away soon — design your concurrency strategy around it.
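The fix described in this incident can be sketched as follows. The scraper's real parsing code is not shown in the post-mortem, so `count_links` is a hypothetical stand-in for the CPU-bound step, and `parse_all`, `executor_cls`, and the chunk size of 16 are illustrative assumptions.

```python
from concurrent.futures import ProcessPoolExecutor

def count_links(html: str) -> int:
    """Hypothetical stand-in for CPU-bound parsing: count anchor tags."""
    return html.lower().count("<a ")

def parse_all(pages, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Run count_links over all pages using the given executor class.

    With ProcessPoolExecutor each worker process has its own interpreter
    and its own GIL, so pure-Python work can use several cores.
    """
    with executor_cls(max_workers=max_workers) as pool:
        # chunksize batches items per task to amortize pickling overhead;
        # it has no effect when executor_cls is ThreadPoolExecutor.
        return list(pool.map(count_links, pages, chunksize=16))
```

Call `parse_all(pages)` from under an `if __name__ == "__main__":` guard so worker processes started with spawn can import the module safely. Passing `executor_cls=ThreadPoolExecutor` gives the threaded baseline for an A/B comparison on the same input.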
Production debug guide · How to detect if the GIL is your bottleneck and what to do about it · 4 entries
Symptom · 01
CPU usage stuck at 1/N of total cores (e.g., ~6% on 16 cores)
Fix
Run top -H and check if only one thread is in R state. Use perf top to see where time is spent.
Symptom · 02
Throughput doesn't scale with thread count (flat after 2-4 threads)
Fix
Profile with cProfile or py-spy. If most time is in C functions (like _parse_*), check whether they release the GIL: if they do, threads can overlap there; if the time is in Python code or in C code that keeps the GIL, the GIL is the bottleneck.
Symptom · 03
High sys time and context switches
Fix
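Check /proc/<pid>/status for voluntary_ctxt_switches and nonvoluntary_ctxt_switches. Rapidly growing counts while CPU usage stays low suggest threads are blocking on a lock such as the GIL; confirm with a profiler before concluding.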
Symptom · 04
Application feels sluggish despite low CPU
Fix
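Use strace -f -e trace=futex -p <pid> to see futex calls. strace shows system calls only, not function names, so a flood of FUTEX_WAIT calls tells you that threads are blocking on locks; use perf top or a py-spy dump to confirm that the lock is the GIL.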
★ GIL Contention: Quick Diagnostic Commands · Run these commands in order when you suspect the GIL is limiting performance.
Single-core CPU usage with many threads
Immediate action
Check if workload is CPU-bound by running a tight loop without I/O.
Commands
perf top -p <pid> -K
grep ctxt /proc/<pid>/status
Fix now
Switch to multiprocessing (ProcessPoolExecutor) for that code path.
Low throughput despite high thread count
Immediate action
Measure time spent in Python bytecode vs C extensions.
Commands
python -m cProfile myscript.py | head -20
py-spy record -o profile.svg --pid <pid>
Fix now
If C extensions that release the GIL dominate, the GIL is less of an issue; if Python code dominates, offload to a subprocess or use Cython with nogil.
Unexpected serial behavior in I/O-heavy code
Immediate action
Verify that I/O calls actually release the GIL.
Commands
strace -e trace=read,write,recvfrom -p <pid> 2>&1 | head
strace -c -f -e trace=futex -p <pid>
Fix now
If I/O calls are not releasing (rare), consider asyncio which doesn't rely on GIL release.
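The "tight loop without I/O" check from the first diagnostic can also be done from Python itself. A minimal sketch, assuming a pure-Python workload; `burn`, `timed_threads`, and `gil_scaling_ratio` are hypothetical helper names.

```python
import threading
import time

def burn(n: int = 200_000) -> int:
    """Pure-Python CPU work that never releases the GIL voluntarily."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed_threads(thread_count: int, work=burn) -> float:
    """Wall-clock seconds for thread_count threads to each run work() once."""
    threads = [threading.Thread(target=work) for _ in range(thread_count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

def gil_scaling_ratio(thread_count: int = 4, work=burn) -> float:
    """Time for N threads divided by time for one thread.

    A ratio near N means the work was serialized (GIL-bound);
    a ratio near 1 means the threads actually ran in parallel.
    """
    baseline = timed_threads(1, work)
    return timed_threads(thread_count, work) / baseline

if __name__ == "__main__":
    print(f"4-thread scaling ratio: {gil_scaling_ratio(4):.2f}")
```

Pass your own function as `work` to test a real code path. Timings are noisy on a busy machine, so treat the ratio as a rough signal rather than a measurement.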
Concurrency Models in Python
Model | GIL Impact | Best For | Overhead | Scaling
Threading | Serialized (GIL held during bytecode) | I/O-bound tasks | Low (thread creation) | 1x CPU-bound, near Nx I/O-bound
Multiprocessing | Each process has its own GIL (none shared) | CPU-bound pure Python | Medium (process spawn, pickle) | ~Nx (but diminishing with IPC)
asyncio | GIL still held but uncontended (single thread, cooperative) | I/O-bound, many concurrent tasks | Very low (task switch) | 1x for CPU, high for I/O
Cython nogil | GIL released explicitly in C code | CPU-bound numeric/scientific | Low (C call overhead) | Near Nx if tasks are parallelizable
Free-threaded Python 3.13 | No GIL (experimental) | CPU-bound pure Python | Single-thread slowdown (atomic refcount overhead) | ~Nx (but early, not prod-ready)

Key takeaways

1. The GIL is a mutex that serializes Python bytecode execution: it's not a bug, it's a trade-off.
2. Threads work for I/O-bound tasks; multiprocessing for CPU-bound pure Python.
3. Always profile first: GIL impact is workload-dependent.
4. Use asyncio for many concurrent I/O tasks with minimal overhead.
5. Python 3.13 free-threading is promising but not production-ready.
6. Know your C extensions: if they release the GIL, threads can parallelize.
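The point about C extensions releasing the GIL can be tried with the standard library alone. hashlib documents that it releases the GIL while hashing buffers larger than 2047 bytes, so a thread pool can overlap that work. A small sketch; `digest` and `digest_all` are hypothetical helper names.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def digest(buf: bytes) -> str:
    """SHA-256 of one buffer; hashlib drops the GIL for large inputs."""
    return hashlib.sha256(buf).hexdigest()

def digest_all(buffers, max_workers: int = 4):
    """Hash many buffers on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(digest, buffers))

if __name__ == "__main__":
    # Large buffers so the hashing happens with the GIL released.
    blobs = [bytes([i]) * 8_000_000 for i in range(8)]
    print(len(digest_all(blobs)), "digests computed")
```

Whether this scales on your machine still needs measuring. The sketch only shows that threads are a reasonable choice when the heavy lifting happens inside C code that releases the GIL.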

Common mistakes to avoid

4 patterns

Assuming threads give parallelism for all work
Symptom
CPU-heavy code with threads shows no speedup; CPU utilization is low despite many threads.
Fix
Profile to see if the workload is CPU-bound. If it is, switch to multiprocessing; asyncio only helps when the time is spent waiting on I/O.

Using multiprocessing for tiny tasks
Symptom
Multiprocessing is slower than single-threaded because pickling overhead dwarfs task runtime.
Fix
Benchmark with realistic data sizes. For tasks under ~10ms, batch them into larger chunks, or consider asyncio or threading instead.

Believing asyncio removes the GIL completely
Symptom
async code that contains CPU-heavy Python operations still blocks the event loop.
Fix
Move CPU-heavy parts to a ProcessPoolExecutor or thread pool with run_in_executor.

Ignoring C extension GIL release behavior
Symptom
Using threading with numpy expecting parallelism, but only one core used.
Fix
Check if the specific numpy functions release the GIL. Some do, some don't. Use multiprocessing if unsure.
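The run_in_executor fix for the asyncio mistake can be sketched like this. `checksum` and `handle` are hypothetical names standing in for a CPU-heavy step and the coroutine that needs its result; the executor is passed in so the same code works with a thread pool or a process pool.

```python
import asyncio
from concurrent.futures import Executor, ThreadPoolExecutor

def checksum(data: bytes) -> int:
    """Hypothetical CPU-heavy step written in pure Python."""
    total = 0
    for byte in data:
        total = (total * 31 + byte) % 1_000_003
    return total

async def handle(data: bytes, executor: Executor) -> int:
    """Offload checksum() so the event loop keeps serving other tasks."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, checksum, data)

async def main() -> None:
    with ThreadPoolExecutor(max_workers=2) as pool:
        results = await asyncio.gather(
            handle(b"first payload", pool),
            handle(b"second payload", pool),
        )
        print(results)

if __name__ == "__main__":
    asyncio.run(main())
```

With a thread pool the event loop stays responsive, but pure-Python work still competes for the GIL. Passing a ProcessPoolExecutor instead moves the work to other cores, at the price of pickling the arguments and results.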
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR
Explain what the Python GIL is and why it exists.
Q02 · SENIOR
Does threading in Python ever give you parallelism? Under what condition...
Q03 · SENIOR
You have a CPU-bound Python application. How would you decide between th...
Q04 · SENIOR
What changes are coming in Python 3.13 regarding the GIL? Should we adop...
Q01 of 04 · JUNIOR

Explain what the Python GIL is and why it exists.

ANSWER
The GIL is a mutex that prevents multiple native threads from executing Python bytecode simultaneously. It exists because CPython uses reference counting for memory management, which is not thread-safe without protection. The GIL is a simple, coarse-grained lock that prevents race conditions on object reference counts. It makes single-threaded Python faster and C extension integration easier, at the cost of CPU-bound parallelism.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01. What is the GIL (Global Interpreter Lock) in simple terms?
02. Does the GIL make Python slow?
03. Can I remove the GIL from my Python installation?
04. Does asyncio bypass the GIL?
05. Why doesn't Java have a GIL?