Advanced 13 min · March 05, 2026

GIL — Global Interpreter Lock

Python GIL — CPU Below 15% on 16 Cores

Q: What is GIL — Global Interpreter Lock in simple terms?

The Global Interpreter Lock (GIL) is a single lock inside CPython that lets only one thread execute Python bytecode at a time. It keeps the interpreter's memory management safe, at the cost of CPU parallelism between threads.

Q: Does the GIL make Python slow?

Not always. For single-threaded or I/O-bound applications, the GIL has negligible impact. It only hurts you when you have CPU-bound pure Python code running in multiple threads. In those cases, you need multiprocessing; asyncio helps with I/O concurrency, not with CPU-bound work.

Q: Can I remove the GIL from my Python installation?

Yes, in Python 3.13 you can build with `--disable-gil` to get a free-threaded interpreter. But it's experimental and many C extensions are incompatible. For production, stick with the default GIL build.

Q: Does asyncio bypass the GIL?

Not exactly. asyncio code still runs under the GIL, but it uses a single thread, so there is no lock contention. It only helps with I/O-bound work: CPU-heavy coroutines block the event loop.

Q: Why doesn't Java have a GIL?

Java uses garbage collection (not reference counting) with sophisticated concurrent GC algorithms. It also has fine-grained locks per object. CPython's GIL is a simpler design choice that made early Python development faster and safer.

CPU utilization below 15% on a 16-core machine with 20 threads.

Naren Founder & Principal Engineer

20+ years shipping production Python across data and backend systems. Lessons pulled from things that broke in production.

✓ Production

production tested

July 19, 2026

last updated


Before you start⏱ 30 min

✓Deep production experience
✓Understanding of internals and trade-offs
✓Experience debugging complex systems


⚡Quick Answer

The GIL is a mutex that prevents multiple native threads from executing Python bytecode at once.
It protects CPython's reference counting from race conditions, such as one thread freeing an object while another thread is still using it.
CPU-bound threads are serialized by the GIL — you get zero parallelism no matter how many cores you have.
I/O-bound threads still benefit because the GIL is released during blocking I/O calls.
The GIL is not a language feature — it's specific to CPython. Jython and IronPython don't have it.
Python 3.13 introduces an experimental no-GIL build (free-threaded) — but it's not production-ready yet.

✦ Definition~90s read

What is GIL?

The Global Interpreter Lock (GIL) is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecode simultaneously. It's the reason Python threads don't give you parallelism for CPU-bound tasks — and the reason your multi-threaded web server can still handle concurrent requests without corrupting memory.


The GIL exists primarily to make CPython's memory management simple and fast. Without it, reference counting would require fine-grained locking on every object operation, which would be both slower and far more error-prone. The GIL is a pragmatic trade-off: it sacrifices parallel CPU throughput for simplicity, speed in single-threaded code, and safety in C extensions.

Plain-English First

Imagine a single microphone at a conference with 10 speakers. Every speaker wants to talk, but only one can hold the mic at a time — even if two of them could theoretically talk about completely different topics simultaneously. That microphone is Python's GIL. Your CPU might have 8 cores (8 potential simultaneous conversations), but the GIL forces every Python thread to queue up and take turns at that one mic, one at a time. The crowd (your CPU) sits mostly idle while speakers wait their turn.

If you've ever spun up a CPU-heavy Python job with 20 threads expecting a 20x speedup and instead got a 1.2x improvement, you've met the GIL, and you probably didn't know it. The Global Interpreter Lock is one of the most misunderstood performance constraints in any mainstream programming language. It's not a bug. It's not laziness. It's a deliberate architectural decision from the early 1990s that solved a genuinely hard problem, and whose consequences we're still navigating today.

CPython, the reference Python interpreter, manages memory using reference counting. Every Python object tracks how many references point to it, and when that count hits zero, the object gets deallocated. Reference counting is fast and simple, but it's also dangerously thread-unsafe. Without protection, two threads could simultaneously decrement the same reference count, race each other to zero, and cause a double-free — a memory corruption bug that would make your program crash in ways that are nearly impossible to debug. The GIL is the lock that prevents exactly this class of disaster. One lock to rule them all: only the thread holding the GIL can execute Python bytecode.

By the end of this article you'll understand exactly what the GIL protects and why, how to measure its impact on real code, when threading is still useful despite the GIL, when to reach for multiprocessing or asyncio instead, and — critically — how Python 3.13's experimental no-GIL build changes the picture. You'll walk away able to make informed concurrency decisions in production Python code and answer GIL questions in a senior engineering interview with confidence.

What is GIL — Global Interpreter Lock?

ForgeExample.pyPYTHON

# TheCodeForge — GIL — Global Interpreter Lock example
# Always use meaningful names, not x or n
def main():
    topic = "GIL — Global Interpreter Lock"
    print("Learning: " + topic + " 🔥")


if __name__ == "__main__":
    main()

Output

Learning: GIL — Global Interpreter Lock 🔥

🔥Forge Tip:

Type this code yourself rather than copy-pasting. The muscle memory of writing it will help it stick.

📊 Production Insight

The GIL keeps CPython safe but kills CPU-bound parallelism.

If your code spends most of its time in C extensions that release the GIL (many numpy operations, for example), you get real parallelism. Not every extension does: the standard re module holds the GIL while matching.

Rule: profile first — never assume threads will parallelize your Python code.

🎯 Key Takeaway

The GIL is a single mutex that serializes Python bytecode execution.

It's not a bug — it's a trade-off for simplicity and speed.

Know when it matters and design your concurrency accordingly.

Is the GIL Your Bottleneck?

If CPU utilization is < 30% on a multi-core machine with >4 threads → the GIL is likely the bottleneck. Switch to multiprocessing.

If CPU utilization is near 100% on all cores → the GIL is not a problem. You're already using multiprocessing or C extensions.

If context switches are high (voluntary_ctxt_switches > 10K/sec) → threads are fighting for the GIL. Reduce thread count or switch to asyncio.

thecodeforge.io

Global Interpreter Lock Python

Why the GIL Exists: Reference Counting and Thread Safety

CPython's memory management is based on reference counting: every Python object has an ob_refcnt field that tracks how many references point to it. When a reference is created, ob_refcnt is incremented; when destroyed, decremented. When it hits zero, the object is deallocated immediately.

This is fast, but it's not thread-safe, because updating a refcount is a read-modify-write sequence. Imagine two threads both hold references to the same object (refcount 2) and both drop them at the same moment. Each thread reads 2, subtracts one, and writes back 1. Two references are gone, but the count says one remains, so the object is never freed (memory leak). The mirror image is worse: if an increment is lost the same way, the count reaches zero while a reference is still live, the memory is freed, and the next access is a use-after-free crash.

The GIL prevents all of this by ensuring only one thread modifies any reference count at any moment. It's a coarse-grained lock — one lock for the entire interpreter — but it's simple and it works.
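
To make the bookkeeping concrete, here is a minimal sketch that watches a real reference count move with sys.getrefcount, then replays a lost-update interleaving by hand in a single thread (two decrements that collapse into one). The `simulated_refcount` variable and the two "thread" reads are illustrative stand-ins, not CPython internals.

```python
import sys

# Observe CPython's reference count for a fresh object.
# sys.getrefcount counts its own argument too, so only differences matter.
payload = []
base = sys.getrefcount(payload)

alias = payload                       # a new reference: count goes up by one
after_alias = sys.getrefcount(payload)

del alias                             # reference dropped: count goes back down
after_del = sys.getrefcount(payload)

print("delta after alias:", after_alias - base)
print("delta after del:  ", after_del - base)

# Replay a lost update by hand (single-threaded, deterministic).
# Both "threads" read the count before either one writes it back.
simulated_refcount = 2
read_by_a = simulated_refcount        # thread A loads 2
read_by_b = simulated_refcount        # thread B loads 2
simulated_refcount = read_by_a - 1    # thread A stores 1
simulated_refcount = read_by_b - 1    # thread B stores 1, overwriting A's update

# Two references were dropped, but the count claims one is still alive.
print("count after two decrements:", simulated_refcount)
```

With the GIL held, the load and the store of a real refcount cannot be interleaved like this, which is exactly the guarantee the lock provides.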

Alternative approaches exist: fine-grained locking per object (complex, overhead), atomic operations (limited), or garbage collection without reference counting (like PyPy or Jython). CPython chose the GIL, and it's been the default for 30+ years.

📊 Production Insight

The GIL is released during blocking I/O calls (read, write, sleep, connect).

But the moment your thread re-enters Python code, it must reacquire the GIL.

Rule: never assume a long-running C extension releases the GIL — check the docs or source.

🎯 Key Takeaway

Reference counting is simple but not thread-safe.

The GIL is CPython's solution to that problem.

You can't remove the GIL without rewriting CPython's memory management.

How the GIL Affects CPU-bound vs I/O-bound Tasks

This is the most practical distinction to understand. The GIL only protects Python bytecode execution. When a thread is waiting for I/O (disk, network, socket), it releases the GIL so another thread can run. That's why multi-threaded web servers and file readers work fine — the GIL is released during recv(), send(), read(), write(), sleep(), etc.

For CPU-bound tasks (number crunching, parsing, encryption) the thread never yields the GIL voluntarily. It keeps running until another thread is waiting and the switch interval expires (every 100 interpreter ticks in Python 2; 5 ms by default in Python 3.2+, adjustable with sys.setswitchinterval). Other threads must wait. If you have 8 CPU-bound threads on a 4-core machine, only one runs at a time, so you get effectively single-core performance.

This is not a problem in many real-world Python workloads because the hot loops are often in C extensions (numpy, pandas, lxml) that release the GIL during computation. But pure Python CPU loops will be serialized.

measure_gil_impact.pyPYTHON

import time
from concurrent.futures import ThreadPoolExecutor

def cpu_heavy():
    """Pure Python CPU-bound work."""
    count = 0
    for _ in range(10**7):
        count += 1
    return count

def io_simulated():
    """I/O-bound work that releases GIL (swap it in below to compare)."""
    time.sleep(1.0)  # sleep releases GIL
    return 1

# Test CPU-bound with increasing threads; each thread does one unit of work
baseline = None
for n in [1, 2, 4, 8]:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as ex:
        list(ex.map(lambda _: cpu_heavy(), range(n)))
    t = time.perf_counter() - start
    if baseline is None:
        baseline = t  # time for one unit of work on one thread
    # n units finished in t seconds; perfect parallelism would give n x
    speedup = n * baseline / t
    print(f"CPU-bound with {n} threads: {t:.2f}s (speedup: {speedup:.2f}x)")

Output

CPU-bound with 1 threads: 1.23s (speedup: 1.00x)

CPU-bound with 2 threads: 2.45s (speedup: 1.00x)

CPU-bound with 4 threads: 4.91s (speedup: 0.99x)

CPU-bound with 8 threads: 9.80s (speedup: 1.01x)

📊 Production Insight

CPU-bound pure Python loops get NO speedup with threads.

I/O-bound tasks get near-linear speedup because GIL is released during I/O.

Rule: classify your workload before choosing a concurrency model.

🎯 Key Takeaway

Threads work for I/O, not for Python CPU loops.

If 95% of time is in C extensions, threads may still parallelize.

Always benchmark — intuition about GIL impact is often wrong.
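
For contrast with the CPU-bound benchmark, the following sketch runs the same kind of harness against a blocking call. It assumes time.sleep is a fair stand-in for network or disk waits; `fake_io` and `timed_run` are hypothetical helper names, and exact timings will vary by machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(_):
    """Stand-in for a blocking I/O call; time.sleep releases the GIL."""
    time.sleep(0.2)
    return 1

def timed_run(workers, tasks):
    """Run `tasks` blocking calls on `workers` threads; return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as ex:
        list(ex.map(fake_io, range(tasks)))
    return time.perf_counter() - start

serial = timed_run(workers=1, tasks=4)     # four waits, one after another
threaded = timed_run(workers=4, tasks=4)   # four waits, overlapping

print(f"1 thread:  {serial:.2f}s")
print(f"4 threads: {threaded:.2f}s")
```

Because every thread spends its time waiting rather than executing bytecode, the four waits overlap and the threaded run should finish in roughly the time of a single wait.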


GIL Lock/Release Flow Sequence Diagram

Understanding exactly when the GIL is acquired and released helps you predict whether threading will benefit your workload. The sequence diagram below shows two threads competing for the GIL: one performing a CPU-bound calculation and the other performing an I/O-bound operation (e.g., a network read). The CPU-bound thread holds the GIL continuously, while the I/O-bound thread releases it during the blocking call, allowing the other thread to run.

📊 Production Insight

In production, thread starvation can occur if a CPU-bound thread holds the GIL for longer than the switch interval (default 5ms). Use sys.setswitchinterval() to tune — but lowering it increases context-switch overhead. For I/O-heavy services, threading scales well because the GIL is released during waits.

🎯 Key Takeaway

The GIL is released during blocking I/O, allowing other threads to run. CPU-bound threads keep the GIL and serialize execution.

GIL Lock/Release Flow for CPU-bound vs I/O-bound Threads

GIL Impact on I/O vs CPU Bound Tasks

The following table summarizes how the GIL affects each type of workload, what speedup you can expect from threading, and the recommended approach.

Aspect	CPU-bound Task	I/O-bound Task
GIL effect	Held continuously → serial execution	Released during blocking calls → concurrency
Threading speedup	~1x (no parallel gain)	Nearly linear up to thread count
CPU utilization	Only one core active	May use multiple cores when GIL is released
Example	Parsing HTML, mathematical loops, encryption	Reading files, making HTTP requests, waiting for DB queries
Python threading recommendation	Avoid — use multiprocessing	Good — works well
Alternative	`multiprocessing` or `asyncio` + subprocess	`asyncio` for high concurrency

The critical insight: threading in Python is not universally useless. It's excellent for I/O-bound programs (web servers, scrapers, file watchers) where the GIL is released frequently. It's useless for CPU-bound pure Python loops.

📊 Production Insight

Always measure the ratio of I/O time to CPU time. If your task spends >90% of its time waiting on I/O, threading is a good choice. If CPU time dominates, multiprocessing is safer. For mixed workloads, consider using a thread pool for I/O parts and a process pool for CPU chunks.

🎯 Key Takeaway

Threading works for I/O-bound tasks because the GIL is released; for CPU-bound tasks, serialization kills parallelism.

Measuring GIL Contention in Practice

Before optimizing around the GIL, you must measure it. Blindly switching to multiprocessing can add copy overhead (pickle serialization) that kills performance for certain workloads.
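
One way to estimate that copy overhead before committing to multiprocessing is to time a pickle round trip on a representative payload. This is a rough sketch: the one-million-integer list is a hypothetical stand-in for your real task input, and a real process pool also pays for pipe transfer on top of serialization.

```python
import pickle
import time

# Hypothetical payload standing in for the data one task would receive.
payload = list(range(1_000_000))

start = time.perf_counter()
blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)
round_trip = time.perf_counter() - start

print(f"pickled size: {len(blob) / 1e6:.1f} MB")
print(f"round trip:   {round_trip * 1000:.1f} ms")
```

If the round trip is a meaningful fraction of the task's compute time, sending that data to a worker process may cost more than the parallelism gains.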

Tools:
- perf top -p <pid> shows where CPU time is spent. A high percentage in _PyEval_EvalFrameDefault means the time is going to the bytecode interpreter, which is the part the GIL serializes.
- /proc/<pid>/status shows voluntary_ctxt_switches; a fast-growing value indicates threads blocking and handing off, which is consistent with lock contention.
- strace -f -e trace=futex -p <pid> shows futex calls; on Linux, waiting for the GIL shows up as FUTEX_WAIT when the lock is held by another thread.
- py-spy (a sampling profiler) can show the call stack of all threads and highlight GIL blocking.
- sys._current_frames() in a signal handler can dump all thread stacks at the Python level. C functions such as take_gil only appear in native stacks, so use a native-aware tool to see threads stuck there.

Native GIL detection: Python 3.2+ exposes sys.getswitchinterval() (default 5ms). You can lower it to make threads switch more often, but that increases overhead. Instead, approximate lock traffic by counting futex syscalls with perf stat -e syscalls:sys_enter_futex. This counts every futex user in the process, not only the GIL, so treat it as a trend rather than an exact figure.
Micro-benchmark pattern: Run a CPU-bound loop (pure Python) with 1 thread, then N threads. If time grows linearly with N, the GIL is fully serializing.
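
The sys._current_frames() approach from the tools list can be sketched as follows. The helper name `dump_thread_stacks` and the `busy_worker` thread are hypothetical, added only so there is something to inspect; the output shows Python-level frames, not C functions inside the interpreter.

```python
import sys
import threading
import time
import traceback

def dump_thread_stacks():
    """Return {thread name: formatted Python stack} for every live thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    stacks = {}
    for ident, frame in sys._current_frames().items():
        name = names.get(ident, f"unknown-{ident}")
        stacks[name] = "".join(traceback.format_stack(frame))
    return stacks

def busy_worker(stop):
    """Hypothetical CPU-bound worker that spins until told to stop."""
    while not stop.is_set():
        sum(range(1000))

stop = threading.Event()
worker = threading.Thread(target=busy_worker, args=(stop,), name="busy-worker")
worker.start()
time.sleep(0.05)                 # give the worker time to enter its loop
stacks = dump_thread_stacks()
stop.set()
worker.join()

for name, stack in stacks.items():
    print(f"--- {name} ---\n{stack}")
```

In a long-running service the same function could be called from a signal handler, so a stuck process can be asked to report what each thread is doing without attaching a debugger.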
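detect_gil.pyPYTHON
import os
import time

def read_voluntary_switches(pid):
    """Return the cumulative voluntary context-switch count, or None."""
    path = f'/proc/{pid}/status'
    if not os.path.exists(path):
        return None
    with open(path) as f:
        for line in f:
            # startswith avoids also matching 'nonvoluntary_ctxt_switches'
            if line.startswith('voluntary_ctxt_switches'):
                return int(line.split(':')[1].strip())
    return None

def check_gil_contention(pid, interval=1.0):
    # The file holds a running total, so sample twice to get a rate.
    # Note: these counters typically cover the main thread only;
    # per-thread values live under /proc/<pid>/task/<tid>/status.
    before = read_voluntary_switches(pid)
    if before is None:
        print("Can't check — not on Linux or no /proc")
        return
    time.sleep(interval)
    after = read_voluntary_switches(pid)
    if after is None:
        print("Process exited during sampling")
        return
    rate = (after - before) / interval
    if rate > 10000:
        print(f"HIGH context switches ({rate:.0f}/sec) — GIL contention likely")
    else:
        print(f"Low context switches ({rate:.0f}/sec) — GIL not a problem")

if __name__ == "__main__":
    check_gil_contention(os.getpid())
Output
HIGH context switches (45000/sec) — GIL contention likely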
📊 Production Insight
High context switches don't always mean GIL — but if they correlate with low CPU, it's a strong signal.
Use perf stat -e migrations to see threads moving across cores — GIL contention causes migrations.
Rule: collect baseline metrics before any concurrency change.
🎯 Key Takeaway
Measure GIL impact with system tools, not guesses.
High voluntary context switches + low CPU utilization = GIL bottleneck.
Fix: use multiprocessing or asyncio, not more threads.
Beating the GIL: Threading, Multiprocessing, asyncio
Three main strategies to work around (or avoid) the GIL:
Multiprocessing — The most common approach. Each Python process has its own GIL, so N processes give you nearly Nx speedup for CPU-bound work. Use concurrent.futures.ProcessPoolExecutor or multiprocessing.Pool. Downside: overhead of serializing data between processes via pickle. If you pass large data structures, that can dominate runtime.
asyncio: Cooperative multitasking with a single thread. No GIL contention because there's only one thread. Great for I/O-bound workloads that spend most time waiting. Use await for all I/O. Downside: the libraries on your hot path need async support, and any blocking call has to be pushed to a thread or process pool or it stalls the event loop.
C Extensions with nogil — Write performance-critical code in Cython or C and release the GIL explicitly. The with nogil: block in Cython runs without the GIL, giving true parallelism. Downside: complexity, C interop.
Which to pick?
- I/O-bound, many concurrent tasks → asyncio (single thread, no GIL fight)
- CPU-bound, pure Python → multiprocessing
- CPU-bound, mostly C extensions → threading may work (if the extension releases the GIL)
- Mixed workload → multiprocessing for CPU parts, thread pool for I/O parts
The choice also depends on overhead tolerance. For small tasks (millisecond computation), multiprocessing overhead (process spawn, pickle) often outweighs parallel speedup. Profile before committing.
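As a sketch of the asyncio option, the example below overlaps ten simulated requests on one thread and hands a blocking call to a worker thread so it does not stall the event loop. `fake_request` and `blocking_call` are hypothetical stand-ins for real I/O, and asyncio.to_thread requires Python 3.9 or newer.

```python
import asyncio
import time

async def fake_request(i):
    """Stand-in for an awaitable I/O call."""
    await asyncio.sleep(0.2)
    return i

def blocking_call():
    """Stand-in for a blocking library call that has no async API."""
    time.sleep(0.2)
    return "blocking done"

async def main():
    start = time.perf_counter()
    # The ten awaits overlap on one thread; to_thread runs the blocking
    # call in a worker thread so the event loop stays responsive.
    results = await asyncio.gather(
        *(fake_request(i) for i in range(10)),
        asyncio.to_thread(blocking_call),
    )
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(results[-1], f"{elapsed:.2f}s")
```

All eleven waits overlap, so the total should be close to one wait rather than the sum of them. A CPU-heavy coroutine in the same gather call would not overlap this way, because it never yields to the loop.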
compare_strategies.pyPYTHON
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def cpu_task(n):
    """Simulate CPU work."""
    sum([i**2 for i in range(n)])
    return n

N_WORKERS = 8
TASKS = [10_000_000] * N_WORKERS

def timed(executor_cls):
    """Run every task on the given executor type; return elapsed seconds."""
    start = time.perf_counter()
    with executor_cls(max_workers=N_WORKERS) as ex:
        list(ex.map(cpu_task, TASKS))
    return time.perf_counter() - start

# The guard is required when worker processes are started with spawn
if __name__ == "__main__":
    thread_time = timed(ThreadPoolExecutor)
    proc_time = timed(ProcessPoolExecutor)

    print(f"Threads: {thread_time:.2f}s — no speedup")
    print(f"Processes: {proc_time:.2f}s — ~{thread_time / proc_time:.1f}x speedup")
Output
Threads: 4.50s — no speedup
Processes: 0.60s — ~7.5x speedup
Mental Model
When to Use Each Concurrency Model
Think of the GIL as a single coffee machine. Threads line up for it; processes each have their own machine.
asyncio: One person makes coffee for many. Great when waiting for water to boil (I/O).
Threading: Many people share one machine. Only one can use it at a time. Fast if they spend most time away from it (I/O-bound).
Multiprocessing: Each person has their own machine. Expensive to set up, but they never wait.
Cython nogil: Hire a barista who works alone — never uses the coffee machine (releases GIL).
📊 Production Insight
Multiprocessing adds roughly 5-10ms per task for pickling and inter-process transfer, so it's a poor fit for tiny CPU tasks.
asyncio has almost no overhead per task but requires async libraries.
Rule: if your task takes <50ms, asyncio; if >500ms and CPU-bound, multiprocessing.
🎯 Key Takeaway
Threads don't parallelize Python CPU code.
Multiprocessing does, but at a cost.
asyncio avoids GIL contention by staying on one thread, but it needs async code.
Pick based on workload profile and overhead tolerance.
Choosing the Right Strategy
If the workload is I/O-bound (network, disk, sleep) → use asyncio (if code can be async) or threading (if blocking is rare)
If the workload is CPU-bound, pure Python → use multiprocessing (ProcessPoolExecutor)
If the workload is CPU-bound, mostly C extensions (numpy, pandas) → try threading first; if CPU saturates, stick with it; else switch to multiprocessing
If the workload is mixed (some CPU, some I/O) → use multiprocessing for CPU chunks, asyncio/threading for I/O, combined with queues
Python 3.13 'Free-threading' (No-GIL) Status
PEP 703 introduced an experimental free-threaded build of CPython 3.13 that can run without the GIL. Instead of a single global lock, it makes reference counting itself thread-safe (biased reference counting with atomic operations, plus immortal and deferred reference counting for some objects) and protects built-in containers with per-object locks. This allows true multi-core parallelism for pure Python CPU-bound code without switching to multiprocessing.
How to enable: Build CPython with --disable-gil or use a pre-built free-threaded package (e.g., python3.13t on conda-forge). At runtime, sys._is_gil_enabled() returns False.
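A small runtime check along these lines can tell you which interpreter you are actually on. This is a sketch: `gil_status` is a hypothetical helper name, and sys._is_gil_enabled() is a private function that only exists on Python 3.13 and newer, so it is looked up defensively.

```python
import sys
import sysconfig

def gil_status():
    """Report whether this build supports free threading and whether the GIL is on."""
    # Py_GIL_DISABLED is 1 on free-threaded builds; None or 0 otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # Older versions lack the function and always run with the GIL.
    checker = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = checker() if checker is not None else True
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}

status = gil_status()
print(status)
```

The two values can differ: a free-threaded build may still report the GIL as enabled if it was turned back on at runtime, so check both before assuming threads will run in parallel.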
Current limitations:
- Not production-ready in 3.13: many C extensions assume the GIL protects them and can crash or corrupt data.
- Single-threaded overhead of 5–15% due to atomic operations (higher on 3.13 itself, where some interpreter optimizations are disabled in the free-threaded build).
- C extensions must opt in: when a module that has not declared free-threading support (via the Py_mod_gil slot) is imported, the interpreter re-enables the GIL.
- Only a subset of popular packages are compatible (numpy, pandas, pyarrow).
When to test it: If you have CPU-bound pure Python code that cannot be moved to C or multiprocessing (e.g., dynamic code generation, complex business logic), try free-threaded Python in a staging environment. But do not deploy to production until Python 3.14 or later when the feature stabilizes.
The free-threaded build is a glimpse of Python's future — eventually the GIL will be optional by default, and you'll get parallelism for free.
⚠ Not for production
Free-threaded Python 3.13 is experimental. Many C extensions are incompatible. Always run your test suite and stress tests before considering deployment.
📊 Production Insight
If you rely on C extensions (most Python projects do), free-threaded Python is likely to break them. Even if the extension claims compatibility, test thoroughly. The performance gain from no-GIL is only visible for CPU-bound Python code — I/O-bound code sees no benefit.
🎯 Key Takeaway
Free-threaded Python (PEP 703) is a promising step toward removing the GIL, but it's not ready for production. Use multiprocessing for now.
Python 3.13: The No-GIL Build (Free-Threaded Python)
Python 3.13 introduced an experimental build configuration called "free-threaded" that removes the GIL entirely. This is the result of PEP 703 ("Making the Global Interpreter Lock Optional") and years of work to make CPython's memory management thread-safe without a global lock.
How it works: Instead of one lock for all objects, the free-threaded build makes reference counting itself thread-safe (biased reference counting with atomic operations, plus deferred reference counting for some objects) and protects built-in containers with per-object locks. The GIL is no longer needed to keep memory management correct.
Current status (2026): In Python 3.13 the build is experimental. Python 3.14 promoted it to officially supported status (PEP 779), but it remains an optional build rather than the default. Activate with --disable-gil at build time. Not all C extensions are compatible: those that assume the GIL protects them can crash or corrupt data. Reported working at the time of writing: numpy, pandas, pyarrow. Reported incompatible: many Cython extensions, lxml, some database drivers.
Performance: For pure Python CPU-bound code, free-threaded Python can achieve near-linear scaling on multi-core machines. But single-threaded performance is slightly worse (5-15% overhead) due to atomic operations in reference counting.
Production readiness: Not yet. Unless you control every C extension in your stack, stay with the default GIL build for now. But this is the future: Python will eventually make the GIL optional by default.
📊 Production Insight
Free-threaded Python 3.13 is not faster for I/O-bound workloads.
It's only useful if you have CPU-bound pure Python loops that you can't move to C.
Rule: test with your exact C extension versions before deploying no-GIL.
🎯 Key Takeaway
Python 3.13 no-GIL is a big step but not production-ready.
If you need parallelism today, use multiprocessing.
Monitor Python 3.14+ for default free-threading.
Why a Single Global Lock Instead of Per-Object Mutexes?
Reference counts aren't the only shared state at risk: two threads appending to the same list could corrupt its internal array. Any sane C developer would slap a per-object mutex on it and move on. Python didn't. Why?
Performance. Pure and simple. In the early 90s, when Guido van Rossum wrote CPython, computers had one core. Threading was for I/O concurrency, not CPU parallelism. Adding a mutex to every single object operation — every attribute access, every dict lookup, every list append — would have killed single-threaded performance dead. Each mutex acquire/release costs tens of nanoseconds. That adds up fast when you're doing millions of operations per second.
The GIL is one lock, held for the duration of a bytecode instruction or a short C call. No lock contention in single-threaded code. No cascading lock overhead on every object. It was a pragmatic trade-off: sacrifice multi-core parallelism (which didn't exist yet) for single-threaded speed (which mattered).
And it worked. CPython became the reference implementation, and the GIL baked itself into the language's DNA. By the time multi-core CPUs became standard, the GIL was a core assumption in every C extension, every internal data structure, every thread-unsafe optimization. Removing it would mean rebuilding the whole interpreter.
MutexOverhead.pyPYTHON
# io.thecodeforge — python tutorial

import threading
import time


def hammer_list(count):
    """Append to a local list in a tight loop."""
    data = []
    for i in range(count):
        data.append(i)
    return data


if __name__ == "__main__":
    N = 5_000_000
    start = time.perf_counter()
    hammer_list(N)
    single_time = time.perf_counter() - start

    # Simulate the cost: two threads holding the GIL
    t1 = threading.Thread(target=hammer_list, args=(N,))
    t2 = threading.Thread(target=hammer_list, args=(N,))
    start = time.perf_counter()
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    concurrent_time = time.perf_counter() - start

    print(f"Single-threaded:  {single_time:.3f}s")
    print(f"Two threads:     {concurrent_time:.3f}s")
    print(f"Overhead factor: {concurrent_time / single_time:.2f}x")
Output
Single-threaded:  0.184s
Two threads:      0.913s
Overhead factor:  4.96x
⚠ Production Trap: Overhead Balloons with Thread Count
Each extra thread fighting for the GIL adds context-switch costs and lock-acquire overhead. You don't get 2x work done with 2 threads. You get worse throughput than a single thread. The GIL turns threads into overhead generators for CPU-bound work.
🎯 Key Takeaway
The GIL exists because per-object locks would destroy single-threaded performance: a trade-off that made sense in the early 1990s and haunts us in today's multi-core world.
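To get a feel for the per-operation locking cost this section describes, the sketch below times the same append loop with and without an explicit lock around every append. It is only an approximation: a Python-level threading.Lock used through a with block includes interpreter overhead that a C-level mutex inside CPython would not, so read the number as an upper bound on the shape of the cost, not a measurement of CPython internals.

```python
import threading
import timeit

lock = threading.Lock()
N = 200_000

def plain_appends():
    """Baseline: append in a tight loop with no explicit locking."""
    data = []
    for i in range(N):
        data.append(i)
    return len(data)

def locked_appends():
    """Same loop, but every append pays for an acquire/release pair."""
    data = []
    for i in range(N):
        with lock:
            data.append(i)
    return len(data)

# Take the best of three runs to reduce noise from other processes.
plain = min(timeit.repeat(plain_appends, number=1, repeat=3))
locked = min(timeit.repeat(locked_appends, number=1, repeat=3))
per_op_ns = (locked - plain) / N * 1e9

print(f"plain:  {plain:.4f}s")
print(f"locked: {locked:.4f}s")
print(f"extra cost per append: {per_op_ns:.0f} ns")
```

Even uncontended, the lock adds a fixed cost to every single operation, which is the overhead a single global lock avoids in single-threaded code.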
How Python 3.13 Finally Breaks the Curse (Without Breaking Your Code)
The No-GIL build in Python 3.13 is not a flag you flip. It's a completely separate build of CPython — --disable-gil — that ships alongside the regular GIL'd interpreter. You opt in per interpreter binary, not per script. This avoids a thousand C extensions suddenly catching fire.
The trick? They didn't remove the GIL and hope for the best. They added per-object locks — exactly what the original CPython skipped. But now, those locks are fine-grained: one lock per PyObject, not one lock per interpreter. The list.append race condition? Now it's protected by a per-list mutex, acquired only when the internal state actually changes.
But here's the rub: every C extension ever written assumed the GIL protected it. numpy, pandas, scipy: they all call internal C APIs that mutate shared state without locking. To keep such code from corrupting memory, the free-threaded interpreter re-enables the GIL for the whole process (with a warning) when it imports an extension module that has not declared free-threading support. Result: those extensions run, but with zero parallelism gains. You only get the speedup if your code is pure Python or every extension you load has been built for free-threaded mode.
It's a bridge. You can compile your existing code with the No-GIL interpreter today, verify it doesn't crash, and then incrementally migrate hot paths to lock-free or per-object-locked patterns. No rewrite from scratch. That's the real engineering win.
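A per-object-locked pattern can be as simple as the sketch below. `SafeCounter` and `hammer` are hypothetical names for illustration. The explicit lock matters with or without the GIL, because `value += 1` is a read-modify-write that spans several bytecode instructions.

```python
import threading

class SafeCounter:
    """Counter whose increments are guarded by its own lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The lock turns the read-modify-write into one indivisible step.
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value

def hammer(counter, times):
    """Increment the shared counter `times` times."""
    for _ in range(times):
        counter.increment()

counter = SafeCounter()
threads = [threading.Thread(target=hammer, args=(counter, 10_000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.value)  # 40000
```

Each object carries its own lock, so threads working on different counters never wait on each other, which is the property a single global lock cannot offer.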
NoGILComparison.pyPYTHON
# io.thecodeforge — python tutorial

import sys
import time
from concurrent.futures import ThreadPoolExecutor


def digest_vector(n):
    """Pure Python CPU-bound work: float ops."""
    total = 0.0
    for i in range(n):
        total += (i * 1.0001) ** 0.5
    return total


if __name__ == "__main__":
    N = 5_000_000
    WORKERS = 4

    print(f"Python: {sys.version.split()[0]}")
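    # sys._is_gil_enabled() exists on Python 3.13+; older versions always have the GIL
    print(f"Free-threaded: {hasattr(sys, '_is_gil_enabled') and not sys._is_gil_enabled()}")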

    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        start = time.perf_counter()
        futures = [pool.submit(digest_vector, N) for _ in range(WORKERS)]
        results = [f.result() for f in futures]
        elapsed = time.perf_counter() - start

    print(f"{WORKERS} workers, {N:,} iterations each: {elapsed:.3f}s")
Output
# Regular CPython 3.13 (GIL on):
# Python: 3.13.0
# Free-threaded: False
# 4 workers, 5,000,000 iterations each: 5.241s
# (roughly the same as 1 thread — GIL bottleneck)
# No-GIL CPython 3.13 (free-threaded build):
# Python: 3.13.0
# Free-threaded: True
# 4 workers, 5,000,000 iterations each: 1.472s
# (3.5x speedup on 4 cores — real parallelism)
🔥Senior Shortcut: Use sys._is_gil_enabled() to Detect No-GIL at Runtime
On Python 3.13+, sys._is_gil_enabled() returns False when the GIL is actually off in the running process, and sysconfig.get_config_var("Py_GIL_DISABLED") tells you whether the build supports free threading at all. The leading underscore marks the first one as a private API, so guard it with hasattr. Use it to conditionally enable parallel code paths.
🎯 Key Takeaway
Python 3.13's No-GIL build is safe to evaluate in testing, since extensions that have not opted in cause the GIL to be re-enabled rather than running unprotected, but real speedup requires pure Python or explicitly free-threaded code.
Why fork() and the GIL Are a Toxic Combination
You're running a web server. You fork() to handle requests. Suddenly, your workers deadlock or crash. The root cause? The GIL doesn't protect you from POSIX fork() semantics.
When fork() executes, the child process inherits a copy of the parent's memory, including every mutex in whatever state it was in, but only the thread that called fork() survives. Any lock held by another thread at that instant (a logging lock, a connection-pool lock, a lock inside a C library) stays locked in the child with no thread left to release it, and the first acquire blocks forever. The GIL itself is the easy case, because os.fork() reinitializes it in the child. The locks around it are what produce the classic deadlock that wastes hours of debugging.
The fix is brutal and simple: avoid forking a process that has running threads. If you call fork() directly from C in an embedding application, call PyOS_AfterFork_Child() (Python 3.7+) in the child; os.fork() already does this for you. Even better: use multiprocessing with the spawn start method, which is the default on macOS and Windows. For production Python, never assume fork()+threads works. It doesn't. Check your process start method, or you'll measure a production outage.
fork_gil_deadlock.pyPYTHON
# io.thecodeforge — python tutorial

import os
import threading

def show_deadlock():
    lock = threading.Lock()
    lock.acquire()

    pid = os.fork()  # POSIX only
    if pid == 0:
        # Child: the lock was copied in its locked state,
        # and nothing in the child will ever release it
        print("Child: trying to acquire inherited lock...", flush=True)
        # acquire() returns False on timeout; it does not raise
        if not lock.acquire(timeout=1):
            print("Child: deadlocked as expected", flush=True)
        os._exit(0)  # skips normal cleanup, hence the explicit flushes
    else:
        os.wait()
        print("Parent: done")

if __name__ == "__main__":
    show_deadlock()
Output
Child: trying to acquire inherited lock...
Child: deadlocked as expected
Parent: done
⚠ Production Trap:
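If you must fork(), register an os.register_at_fork(after_in_child=...) hook that recreates your own locks in the child; os.fork() already runs CPython's internal after-fork cleanup. But better: use 'multiprocessing.set_start_method("spawn")' to avoid the entire class of bugs.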
🎯 Key Takeaway
Never fork() a multi-threaded Python process and expect inherited locks to work. Use spawn-based multiprocessing.
Mastering sys.setswitchinterval() for GIL Control
Most devs treat the GIL as a black box. But Python exposes a documented API that controls how often the interpreter asks the running thread to give up the GIL: sys.setswitchinterval(). This is your throttle for CPU-bound thread interleaving.
The switch interval (default 5ms in Python 3.2+) is how long a thread can keep running while other threads are waiting for the GIL before it is asked to release it. Lower it to 1ms for more responsive interleaving (better for UI threads). Raise it to 100ms to reduce context-switch overhead in pure CPU work. This is not a hack; it's a documented tool. But it's global. Every thread in your process pays the cost.
Why does this matter in production? If you run CPU-bound tasks with threading, a high switchinterval starves I/O threads. A low one burns CPU on context switches. Profile your workload. For async or multiprocessing, this API is irrelevant—you've already beaten the GIL. But for legacy threaded systems, it's your only lever. Use it, or your production latency charts will mock you.
switch_interval_demo.py (Python)
# io.thecodeforge - python tutorial

import sys
import time
import threading

def cpu_burn():
    start = time.perf_counter()
    for _ in range(10_000_000):
        _ = 2 ** 10
    elapsed = time.perf_counter() - start
    print(f"Thread done in {elapsed:.3f}s")

# Default is 0.005 (5 ms); drop to 1 ms for more frequent switching
sys.setswitchinterval(0.001)

threads = [threading.Thread(target=cpu_burn) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"Final switch interval: {sys.getswitchinterval()}s")
Output
Thread done in 0.482s
Thread done in 0.491s
Thread done in 0.479s
Thread done in 0.500s
Final switch interval: 0.001s
🔥Senior Shortcut:
Don't touch this in async code. For threaded CPU work, test intervals of 0.001, 0.005, 0.010. Plot your throughput vs latency. The sweet spot depends on your workload and core count, so trust the measurements over any rule of thumb.
🎯 Key Takeaway
sys.setswitchinterval() is the only live GIL tuning knob. Change it per workload, not per preference.
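Because the setting is process-wide, it can help to scope a change to one code path and restore the previous value afterwards. A minimal sketch, assuming a hypothetical helper named `switch_interval` (it is not part of the standard library):

```python
import sys
from contextlib import contextmanager

@contextmanager
def switch_interval(seconds):
    """Temporarily set the GIL switch interval, then restore the old value."""
    previous = sys.getswitchinterval()
    sys.setswitchinterval(seconds)
    try:
        yield
    finally:
        # Restore even if the body raises, since the setting is global.
        sys.setswitchinterval(previous)

if __name__ == "__main__":
    with switch_interval(0.001):
        print(f"Inside: {sys.getswitchinterval()}s")
    print(f"Restored: {sys.getswitchinterval()}s")
```

The change still affects every thread in the process while the block runs, so this only limits how long the tuned value is in effect, not which threads see it.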
Why Hasn’t the GIL Been Removed Yet?
The GIL persists because removing it puts the C extensions that dominate Python’s ecosystem at risk. Libraries like NumPy, pandas, and TensorFlow are built on the C API, and much extension code implicitly assumes that only one thread touches Python objects at a time. A no-GIL build requires every extension to be rebuilt and audited, with atomic operations or fine-grained locks added wherever that assumption was load-bearing. That is a years-long effort. Additionally, Python’s reference counting is fundamentally thread-unsafe without the GIL. Alternative garbage collectors (like tracing GC) exist, but they can introduce pauses and extra memory overhead, and they break C extensions that manipulate reference counts directly. The core dev team’s decision is pragmatic: ship stability now, chase parallelism later. Python 3.13 ships a free-threaded build as an experimental option (configured with --disable-gil), but the default build retains the GIL to protect the many users who depend on C extensions. Removing the GIL isn’t a technical impossibility; it’s an ecosystem engineering challenge.
gil_c_ext_check.py (Python)
# io.thecodeforge - python tutorial

import sys
import sysconfig

# Check if the GIL is disabled (Python 3.13+)
def check_gil_status():
    # Py_GIL_DISABLED is 1 only on free-threaded builds
    if not sysconfig.get_config_var("Py_GIL_DISABLED"):
        print("GIL: Present (default build)")
        return
    # sys._is_gil_enabled() exists on 3.13+; the GIL can be re-enabled at runtime
    if sys._is_gil_enabled():
        print("GIL: Present (free-threaded build, GIL re-enabled at runtime)")
    else:
        print("GIL: Disabled (free-threaded build active)")

check_gil_status()
Output
GIL: Present (default build)
⚠ Production Trap:
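C extensions built for the default Python 3.13 interpreter will not load in the free-threaded build, which uses a separate ABI (tagged cp313t). Extensions that are rebuilt but do not declare free-threading support cause the interpreter to re-enable the GIL at import time with a warning. Always rebuild against the free-threaded interpreter and test under it.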
🎯 Key Takeaway
The GIL stays the default because removing it today would break or destabilize much of the C extension ecosystem that production Python depends on.
Asynchronous Notifications
The GIL can add hidden latency to inter-thread notifications such as threading.Event.set() or queue hand-offs. A woken thread must reacquire the GIL before it can run any Python code. If another thread is executing CPU-bound bytecode at that moment, the waiter has to wait for that thread to drop the GIL, which can take a full switch interval (default 5 ms) or longer under heavy contention. If the notifying thread blocks right after set() (on I/O or sleep), it releases the GIL and the wake-up is typically much faster. For I/O-bound systems like web servers, one fix is to avoid threads entirely: asyncio runs all coroutines on a single thread, so tasks never contend for the GIL, although a CPU-heavy coroutine still blocks the event loop. Alternatively, keep CPU-bound work off the threads that need low latency by moving it to worker processes. Python 3.13’s free-threaded build removes the GIL hand-off from the wake-up path, but at the cost of slower reference counting operations. Measure your notification latency with time.perf_counter_ns() before optimizing.
async_notify_latency.py (Python)
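# io.thecodeforge - python tutorial

import threading
import time

ROUNDS = 5
event = threading.Event()
handled = threading.Event()
latency = []
last_set = [0]

def waiter():
    for _ in range(ROUNDS):
        event.wait()
        latency.append(time.perf_counter_ns() - last_set[0])
        event.clear()
        handled.set()

def busy(duration_s):
    # Pure-Python spin: keeps the GIL until the interpreter forces a switch
    end = time.perf_counter() + duration_s
    while time.perf_counter() < end:
        pass

t = threading.Thread(target=waiter)
t.start()

for _ in range(ROUNDS):
    handled.clear()
    last_set[0] = time.perf_counter_ns()
    event.set()
    busy(0.02)      # Setter stays CPU-bound, so the waiter must wait for the GIL
    handled.wait()  # Ensure the waiter recorded this round before the next one

t.join()
print(f"Avg notification latency: {sum(latency)//len(latency)} ns")
Output (illustrative; varies by machine)
Avg notification latency: 5230000 ns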
⚠ Production Trap:
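threading.Event notifications can be delayed by a full switch interval (about 5 ms by default) when another thread is CPU-bound and holding the GIL. For low-latency signaling, move CPU-bound work to worker processes, or run the latency-sensitive path on a single asyncio event loop with asyncio.Event.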
🎯 Key Takeaway
GIL hand-off can add about 5 ms to thread notifications when another thread is CPU-bound. Measure first, then move CPU work to processes or use asyncio for latency-sensitive signaling.
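For comparison, the same measurement can be sketched with asyncio.Event, where both sides run on one thread and no GIL hand-off is involved. This is an illustrative sketch only; `measure_event_latency` is a hypothetical helper, and the numbers it reports depend on the machine and on what else the event loop is doing.

```python
import asyncio
import time

async def measure_event_latency(rounds=5):
    """Return set()-to-wake-up latencies in nanoseconds for an asyncio.Event."""
    event = asyncio.Event()
    handled = asyncio.Event()
    sent_at = [0]
    latencies = []

    async def waiter():
        for _ in range(rounds):
            await event.wait()
            latencies.append(time.perf_counter_ns() - sent_at[0])
            event.clear()
            handled.set()

    task = asyncio.create_task(waiter())
    for _ in range(rounds):
        handled.clear()
        sent_at[0] = time.perf_counter_ns()
        event.set()
        # Yield to the event loop so the waiter can run.
        await handled.wait()
    await task
    return latencies

if __name__ == "__main__":
    samples = asyncio.run(measure_event_latency())
    print(f"Avg asyncio.Event latency: {sum(samples) // len(samples)} ns")
```

Note the trade-off: the waiter only runs when the notifying coroutine awaits. A coroutine that computes for 20 ms without awaiting delays the wake-up by those 20 ms.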
PEP 703: No-GIL Python (free-threaded builds)
PEP 703, titled "Making the Global Interpreter Lock Optional in CPython," is the foundational proposal for free-threaded Python builds. It introduces a build mode where the GIL is disabled, allowing true parallel execution of threads on multiple CPU cores. The PEP outlines key changes: biased reference counting, which keeps a fast non-atomic count for the thread that owns an object and an atomic count for all other threads; immortal objects whose reference counts are never modified; deferred reference counting for objects such as functions and modules that many threads share; and per-object locking for built-in containers such as dict and list. Free-threaded builds are available as an experimental feature in Python 3.13, enabled at build time via the --disable-gil configure flag. This section demonstrates the impact of a free-threaded build on CPU-bound tasks.
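The free_threaded_example.py listing below runs four CPU-bound threads. On a standard CPython build, the GIL limits parallelism, so the run takes about as long as executing the four calls one after another. On a free-threaded build, all four threads can execute simultaneously, potentially approaching a 4x speedup on a machine with at least four free cores.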
free_threaded_example.py (Python)
import threading
import time

def cpu_intensive(n):
    count = 0
    for i in range(n):
        count += i ** 2
    return count

threads = []
start = time.perf_counter()
for _ in range(4):
    t = threading.Thread(target=cpu_intensive, args=(10**7,))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
# On a default (GIL) build this is close to the time of four sequential calls
print(f"Time with 4 threads: {time.perf_counter() - start:.2f}s")
🔥Free-threaded builds are experimental
📊 Production Insight
Before adopting free-threaded builds in production, thoroughly test all C extensions for thread safety. Many extensions rely on the GIL for implicit synchronization and may require updates.
🎯 Key Takeaway
PEP 703 enables free-threaded Python builds that can run CPU-bound threads in parallel, offering a path to true multi-core parallelism without the GIL.
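To check whether a given interpreter actually runs threads in parallel, one option is to time the same work sequentially and in threads. The sketch below reuses this section's `cpu_intensive` function; `compare_sequential_and_threaded` is a hypothetical helper name, and the workload size is an arbitrary assumption.

```python
import threading
import time

def cpu_intensive(n):
    count = 0
    for i in range(n):
        count += i ** 2
    return count

def compare_sequential_and_threaded(n=2_000_000, workers=4):
    """Return (sequential_seconds, threaded_seconds) for the same total work."""
    start = time.perf_counter()
    for _ in range(workers):
        cpu_intensive(n)
    sequential = time.perf_counter() - start

    threads = [threading.Thread(target=cpu_intensive, args=(n,))
               for _ in range(workers)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    threaded = time.perf_counter() - start
    return sequential, threaded

if __name__ == "__main__":
    seq, thr = compare_sequential_and_threaded()
    print(f"Sequential: {seq:.2f}s, threaded: {thr:.2f}s, ratio: {seq / thr:.2f}x")
```

A ratio near 1x suggests the GIL is serializing the threads. A ratio approaching the worker count suggests the threads ran in parallel, as expected on a free-threaded build with enough idle cores.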
GIL Removal: Current Status and Migration Impact
The removal of the GIL is a multi-year, phased effort. The Python Steering Council accepted PEP 703 on the condition of a gradual rollout. Python 3.13 introduced free-threaded builds as an experimental feature, and Python 3.14 made the free-threaded build officially supported but still optional (PEP 779). The GIL remains the default, and no release has been fixed for changing that. This section covers the current status, migration impact, and how to prepare your codebase.
Key considerations
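Thread safety: Code that leaned on the GIL for accidental protection (e.g., global mutable state without locks) is far more likely to fail. Such code was never guaranteed safe under the GIL; the races simply showed up less often.
C extensions: Extensions using the Python C API must be audited for thread safety. The PyGILState_* functions are still required in free-threaded builds, because a thread must attach a thread state before calling the C API even when no GIL is taken.
Performance: Free-threaded builds may have overhead for single-threaded code due to atomic operations. Benchmark your specific workloads.
Migration steps: Start by testing with free-threaded builds in a staging environment. Heap tools like guppy3 or objgraph help with memory analysis but do not detect races; use multi-threaded stress tests for Python code and ThreadSanitizer for C extensions to find thread-unsafe patterns.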
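In the race_condition_example.py listing below, counter += 1 is a read-modify-write sequence, not an atomic operation. Under the GIL a thread switch rarely lands in the middle of it, so the bug usually stays hidden. Without the GIL, threads interleave freely and the counter will likely end up below 400000. Fix it by guarding the update with threading.Lock.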
race_condition_example.py (Python)
import threading

counter = 0  # Shared mutable state with no lock

def increment():
    global counter
    for _ in range(100000):
        counter += 1  # Read-modify-write: not atomic, with or without the GIL

threads = [threading.Thread(target=increment) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Counter: {counter} (expected 400000)")
⚠ Not all code will work without the GIL
📊 Production Insight
Start testing your codebase with free-threaded builds now. Identify and fix race conditions early. Monitor the Python release schedule for when free-threaded becomes the default.
🎯 Key Takeaway
GIL removal is in progress; Python 3.13 offers experimental free-threaded builds. Migration requires auditing thread safety and updating C extensions.
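The race in this section's counter example can be closed with a lock around the update. A minimal sketch, assuming a hypothetical helper `count_with_lock` that wraps the same loop:

```python
import threading

def count_with_lock(num_threads=4, increments=100_000):
    """Increment a shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def increment():
        nonlocal counter
        for _ in range(increments):
            # The lock makes the read-modify-write a single critical section.
            with lock:
                counter += 1

    threads = [threading.Thread(target=increment) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

if __name__ == "__main__":
    print(f"Counter: {count_with_lock()} (expected 400000)")
```

Taking the lock on every iteration is slow. Where it fits the workload, accumulating into a local variable per thread and adding it to the shared total once, under the lock, keeps correctness with far less contention.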
GIL in C Extensions: Releasing the GIL Manually
C extensions can release the GIL during long-running operations to allow other Python threads to run. This is crucial for I/O-bound or computationally intensive C code that doesn't need Python object access. The Python C API provides Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS macros to temporarily release and reacquire the GIL. This section explains how to use these macros and best practices.
When the GIL is released, the thread cannot access Python objects or call Python C API functions. Any such access must be done before releasing or after reacquiring. This pattern is used in libraries like NumPy and Pillow to improve concurrency.
```python
# Python code calling the C extension
import my_extension
import threading

def worker():
    result = my_extension.my_long_running_function()
    print(result)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
Without releasing the GIL, the four threads would execute sequentially, taking about 20 seconds in total for four 5-second calls. With manual release, the calls overlap and finish in about 5 seconds.
release_gil.c (C)
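#include <Python.h>
#include <unistd.h>  /* sleep() */

static PyObject* my_long_running_function(PyObject* self, PyObject* args) {
    /* Release the GIL: no Python objects or C API calls until it is reacquired */
    Py_BEGIN_ALLOW_THREADS
    sleep(5);  /* Simulate blocking work (file I/O, computation) */
    Py_END_ALLOW_THREADS
    /* GIL reacquired; safe to create and return Python objects */
    return PyLong_FromLong(42);
}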

static PyMethodDef methods[] = {
    {"my_long_running_function", my_long_running_function, METH_VARARGS, ""},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef module = {
    PyModuleDef_HEAD_INIT, "my_extension", NULL, -1, methods
};

PyMODINIT_FUNC PyInit_my_extension(void) {
    return PyModule_Create(&module);
}
💡Always reacquire the GIL before accessing Python objects
📊 Production Insight
When writing C extensions, identify sections that don't need Python object access and release the GIL there. This is especially beneficial for I/O-bound or compute-heavy tasks that don't involve Python data.
🎯 Key Takeaway
C extensions can manually release the GIL during long-running operations to improve concurrency, using Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS macros.
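The effect can be observed from pure Python without compiling anything, because some standard library functions implemented in C already release the GIL while they block. The sketch below uses time.sleep as a stand-in for a GIL-releasing C call; `run_blocking_calls` is a hypothetical helper name.

```python
import threading
import time

def run_blocking_calls(num_threads=4, seconds=0.2):
    """Run time.sleep in several threads and return the total wall time."""
    threads = [threading.Thread(target=time.sleep, args=(seconds,))
               for _ in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    elapsed = run_blocking_calls()
    # time.sleep releases the GIL, so the waits overlap instead of adding up.
    print(f"4 x 0.2s blocking calls took {elapsed:.2f}s")
```

If the calls held the GIL the whole time, the total would be close to the sum of the individual durations rather than close to a single one.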

● Production incident · POST-MORTEM · severity: high

The 20-Thread Scraper That Crawled Like a Snail

Symptom

CPU utilization stayed below 15% on a 16-core machine. Thread count was 20, but only one core was active at any time. Throughput was only slightly higher than single-threaded.

Assumption

More threads == more parallelism. The team assumed Python threads would spread across CPU cores like in C++ or Java.

Root cause

The scraper was CPU-bound (parsing HTML, extracting data). The GIL serialized all bytecode execution. Only one thread held the GIL at a time, so only one core was used.

Fix

Switched from threading to multiprocessing using concurrent.futures.ProcessPoolExecutor. Each process got its own GIL, allowing true parallel execution on all 16 cores. Throughput jumped 14x.

Key lesson

Threads are fine for I/O-bound work in Python, but useless for CPU-bound parallelism.
Always profile CPU utilization before scaling threads.
For CPU-bound workloads in CPython, use multiprocessing or asyncio + subprocess.
The GIL is not going away soon — design your concurrency strategy around it.

Production debug guide: How to detect if the GIL is your bottleneck and what to do about it

Symptom · 01

CPU usage stuck at 1/N of total cores (e.g., ~6% on 16 cores)

→

Fix

Run top -H and check if only one thread is in R state. Use perf top to see where time is spent.

Symptom · 02

Throughput doesn't scale with thread count (flat after 2-4 threads)

→

Fix

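Profile with cProfile or py-spy. If most time is in C functions that release the GIL (blocking I/O, compression, many NumPy operations), threads can still overlap there. If it is in Python code, or in C functions that keep the GIL held, the GIL is the bottleneck.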

Symptom · 03

High sys time and context switches

→

Fix

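Check /proc/<pid>/task/<tid>/status for voluntary_ctxt_switches and nonvoluntary_ctxt_switches on each thread. Rapidly growing counts across many threads are consistent with threads fighting for the GIL, though blocking I/O raises voluntary switches too.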

Symptom · 04

Application feels sluggish despite low CPU

→

Fix

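Use strace -f -e trace=futex -p <pid> to see futex calls. A steady stream of FUTEX_WAIT calls points to lock contention. strace does not show which lock is involved, so confirm that it is the GIL with a native stack dump such as py-spy dump --native --pid <pid>.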
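The context-switch check from Symptom 03 can be scripted. A minimal sketch, assuming Linux and the usual /proc status layout; `read_ctxt_switches` and `thread_ctxt_switches` are hypothetical helper names:

```python
from pathlib import Path

def read_ctxt_switches(status_text):
    """Extract the *_ctxt_switches counters from the text of a /proc status file."""
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key.endswith("ctxt_switches"):
            counts[key] = int(value.strip())
    return counts

def thread_ctxt_switches(pid):
    """Map each thread id of a process to its context-switch counters (Linux only)."""
    result = {}
    for task_dir in Path(f"/proc/{pid}/task").iterdir():
        status_file = task_dir / "status"
        try:
            result[int(task_dir.name)] = read_ctxt_switches(status_file.read_text())
        except OSError:
            # The thread may have exited between listing and reading.
            continue
    return result

if __name__ == "__main__":
    import os
    for tid, counts in thread_ctxt_switches(os.getpid()).items():
        print(tid, counts)
```

Sampling these counters twice, a few seconds apart, and comparing the difference per thread is more informative than a single absolute reading.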

★ GIL Contention: Quick Diagnostic Commands

Run these commands in order when you suspect the GIL is limiting performance.

Single-core CPU usage with many threads

Immediate action

Check if workload is CPU-bound by running a tight loop without I/O.

Commands

perf top -p <pid> -K

grep ctxt /proc/<pid>/status

Fix now

Switch to multiprocessing (ProcessPoolExecutor) for that code path.

Low throughput despite high thread count

Unexpected serial behavior in I/O-heavy code

Concurrency Models in Python

Model	GIL Impact	Best For	Overhead	Scaling
Threading	Serialized (GIL held during bytecode)	I/O-bound tasks	Low (thread creation)	1x CPU-bound, near Nx I/O-bound
Multiprocessing	Each process has its own GIL (none shared)	CPU-bound pure Python	Medium (process spawn, pickle)	~Nx (but diminishing with IPC)
asyncio	No contention (single thread, cooperative; the GIL still exists)	I/O-bound, many concurrent tasks	Very low (task switch)	1x for CPU, high for I/O
Cython nogil	GIL released explicitly in C code	CPU-bound numeric/scientific	Low (C call overhead)	Near Nx if tasks are parallelizable
Free-threaded Python 3.13	No GIL (experimental)	CPU-bound pure Python	Slower single-threaded code (atomic refcount overhead)	~Nx (but early, not prod-ready)

⚙ Quick Reference

13 commands from this guide

File	Command / Code	Purpose
ForgeExample.java	public class ForgeExample {	What is GIL
measure_gil_impact.py	from concurrent.futures import ThreadPoolExecutor	How the GIL Affects CPU-bound vs I/O-bound Tasks
detect_gil.py	def check_gil_contention(pid):	Measuring GIL Contention in Practice
compare_strategies.py	from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor	Beating the GIL
MutexOverhead.py	def hammer_list(count):	Why a Single Global Lock Instead of Per-Object Mutexes?
NoGILComparison.py	from concurrent.futures import ThreadPoolExecutor	How Python 3.13 Finally Breaks the Curse (Without Breaking Y
fork_gil_deadlock.py	def show_deadlock():	Why fork() and the GIL Are a Toxic Combination
switch_interval_demo.py	def cpu_burn():	Mastering the Legacy API
gil_c_ext_check.py	def check_gil_status():	Why Hasn’t the GIL Been Removed Yet?
async_notify_latency.py	event = threading.Event()	Asynchronous Notifications
free_threaded_example.py	def cpu_intensive(n):	PEP 703
race_condition_example.py	counter = 0	GIL Removal
release_gil.c	static PyObject* my_long_running_function(PyObject* self, PyObject* args) {	GIL in C Extensions

Key takeaways

The GIL is a mutex that serializes Python bytecode execution: it's not a bug, it's a trade-off.

Threads work for I/O-bound tasks; multiprocessing for CPU-bound pure Python.

Always profile first: GIL impact is workload-dependent.

Use asyncio for many concurrent I/O tasks with minimal overhead.

Python 3.13 free-threading is promising but not production-ready.

Know your C extensions: if they release the GIL, threads can parallelize.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR

Explain what the Python GIL is and why it exists.

Q02 · SENIOR

Does threading in Python ever give you parallelism? Under what condition...

Q03 · SENIOR

You have a CPU-bound Python application. How would you decide between th...

Q04 · SENIOR

What changes are coming in Python 3.13 regarding the GIL? Should we adop...

Q01 of 04 · JUNIOR

Explain what the Python GIL is and why it exists.

ANSWER

The GIL is a mutex that prevents multiple native threads from executing Python bytecode simultaneously. It exists because CPython uses reference counting for memory management, which is not thread-safe without protection. The GIL is a simple, coarse-grained lock that prevents race conditions on object reference counts. It makes single-threaded Python faster and C extension integration easier, at the cost of CPU-bound parallelism.
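The reference counts mentioned in the answer can be observed directly with sys.getrefcount. A small illustrative sketch; `Payload` and `refcount_delta` are hypothetical names, and absolute counts vary between Python versions, so only the difference is compared:

```python
import sys

class Payload:
    """Plain object used only to watch its reference count change."""

def refcount_delta():
    """Return how much the refcount grows when two new references are created."""
    obj = Payload()
    before = sys.getrefcount(obj)
    holders = [obj, obj]  # The list stores two additional references to obj
    after = sys.getrefcount(obj)
    del holders
    return after - before

if __name__ == "__main__":
    print(f"Refcount grew by {refcount_delta()}")
```

Every such increment and decrement is a read-modify-write on the object's header. Two threads performing it at the same time without protection could corrupt the count, which is the race the GIL prevents.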


