Python Performance Optimisation — O(n*m) Loop Cost $180K
87% of runtime was a linear membership check on 200K entries.
- Python performance optimisation is the discipline of measuring, understanding, and eliminating bottlenecks at the bytecode, memory, and concurrency level
- Always profile before optimising — cProfile and line_profiler reveal the real hotspots, not your intuition
- __slots__ reduces per-object memory by 40-60% by eliminating the instance __dict__
- The GIL prevents true thread parallelism — use multiprocessing for CPU-bound work, asyncio for I/O-bound work
- Vectorisation with NumPy shifts loops from Python bytecode to optimised C — 100-500x speedups are common on numeric data
- Blind optimisation is the biggest mistake: speeding up code that accounts for 2% of runtime yields zero production impact
Imagine you are running a restaurant kitchen. Your head chef is brilliant but sometimes walks across the entire kitchen to grab a single spoon instead of keeping the most-used tools on the counter right next to the stove. Python performance optimisation is the art of rearranging that kitchen — putting the right tools within arm's reach, hiring a second chef for heavy prep work like butchering (multiprocessing), and pre-chopping vegetables before the dinner rush (caching). You are not replacing the chef or the kitchen. You are making the environment smarter so the chef never wastes a step. The one thing that kills kitchens is a manager who rearranges the entire kitchen based on a hunch about where the slowdown is — then discovers the real problem was that someone was microwaving each plate one at a time. That is what blind optimisation without profiling looks like.
Python is fast enough — until it isn't. The moment your data pipeline crawls past its midnight SLA, your API starts timing out under load, or your ML preprocessing loop becomes the bottleneck before a single model even trains, 'fast enough' stops being a philosophy and becomes a liability. The language's dynamic, expressive nature comes with measurable overhead at every layer: attribute lookup, memory allocation, the Global Interpreter Lock, and bytecode interpretation all stack up in ways that bite production systems in ways that are genuinely hard to predict without measurement.
The real problem is not that Python is slow. The real problem is that most developers optimise blindly. They reach for NumPy before profiling, rewrite loops as C extensions before measuring allocations, and add workers before understanding whether their bottleneck is CPU-bound or I/O-bound. I have watched teams spend a week parallelising code that had a quadratic algorithm inside it. More workers, same fundamental problem, more cloud spend. Blind optimisation is how you spend three days speeding up code that accounts for 2% of your runtime while the real bottleneck sits untouched.
This guide moves past surface-level advice into CPython internals: how the bytecode interpreter executes your code, why attribute lookup is more expensive than it looks, how __slots__ changes memory layout at the C struct level, and exactly when to reach for multiprocessing versus asyncio versus NumPy versus Cython. These are the patterns that separate a developer who knows Python from one who can own a high-performance Python system in production and explain every decision they made.
What is Python Performance Optimisation?
Python Performance Optimisation is the discipline of systematically measuring, understanding, and eliminating bottlenecks in Python applications. It is not guesswork and it is not heroics — it is a structured process that starts with measurement and only then moves to action.
The discipline spans multiple layers, and the order matters. Algorithmic complexity is the first layer — an O(n²) algorithm running on a million records is not a Python problem, it is a correctness problem in disguise. Fixing it in NumPy instead of fixing the algorithm gives you a faster wrong answer. Memory layout and allocation patterns come next — every Python object carries overhead that adds up when you are creating millions of them. Concurrency model selection follows — the right model depends entirely on whether your bottleneck is CPU-bound or I/O-bound, and choosing the wrong one makes things worse. Native acceleration (NumPy, Cython, Numba) sits at the innermost layer and should only be reached after the outer layers are correctly addressed.
CPython's dynamic nature means that performance intuition is systematically wrong in ways that are hard to reason about from first principles. Attribute lookups involve dictionary traversal. Integer arithmetic involves reference count manipulation. Function calls involve frame allocation on the heap. These costs do not exist in statically compiled languages, and they stack up in ways that only profiling reveals. The first rule — the one rule that matters more than every optimisation technique in this article — is always profile before you optimise.
- Layer 1 — Algorithmic: O(n²) vs O(n log n) — the biggest wins come from reducing complexity class, not micro-optimising inside a bad algorithm
- Layer 2 — Memory: object allocation, __slots__, generators — reducing allocations reduces GC pressure, cache pressure, and memory footprint simultaneously
- Layer 3 — Concurrency: GIL, multiprocessing, asyncio — choosing the wrong concurrency model for the workload type makes things measurably worse, not better
- Layer 4 — Native acceleration: NumPy vectorisation, Cython, Numba — shifting hot loops from Python bytecode to C is the right final step, not the first one
- Rule: never jump to Layer 4 without profiling at Layer 1 first — the most expensive optimisation is the one you apply in the wrong place
Python Data Structure Time Complexity (Big O) — List, Dict, Set Operations
Choosing the right data structure is the most impactful performance decision you can make — it operates at the algorithmic layer and often yields 100x or 1000x speedups without any other optimisation. When you understand the time complexity of Python's built-in data structures, you can avoid the O(n*m) pattern that cost $180K in the production incident above.
The table below summarises the average-case and amortised worst-case time complexities for the most common operations on list, dict, set, and tuple. These are CPython's implementation-specific complexities — they are not language guarantees, but they are stable across Python 3.x releases and extremely unlikely to change.
``text Operation | List | Dict | Set | Tuple -----------------------------------|-------------------|-------------------|-------------------|------------------- Index lookup | O(1) | N/A | N/A | O(1) Append / insert at end | O(1) amortised | N/A | N/A | N/A Pop from end | O(1) | N/A | N/A | N/A Insert at arbitrary index | O(n) | N/A | N/A | N/A Delete by index | O(n) | N/A | N/A | N/A Membership check (in) | O(n) | O(1) average | O(1) average | O(n) Get item (key access) | N/A | O(1) average | N/A | N/A Set item (key assignment) | N/A | O(1) average | N/A | N/A Delete key | N/A | O(1) average | O(1) average | N/A Iteration | O(n) | O(n) | O(n) | O(n) Slicing [i:j] | O(k) (k = slice length) | N/A | N/A | O(k) Copy (list.copy, dict.copy) | O(n) | O(n) | O(n) | O(n) ``
The crucial takeaway: membership checks (x in list) are O(n) — the list must be scanned linearly from the start until the element is found (or the list ends). For a 200K-entry list, the worst case is 200K comparisons. When you perform that check for each of 50M records, you get the $180K incident. Switching to a set or dict for the lookup reduces membership checks to O(1) — a single hash computation and lookup regardless of container size.
This is not a subtle optimisation. It is the difference between a pipeline that runs in minutes and one that runs for 14 hours. Always profile to confirm that the membership check is the bottleneck, but in data processing pipelines with large reference tables, the fix is almost always to use a set or dict for the lookup side.
- Dict and set insert/lookup degrade to O(n) in the worst case if every key has the same hash (hash collision attack). CPython randomises hash seeding to mitigate this, but it's still possible with custom __hash__ implementations.
- For string and integer keys, hash collisions are extremely rare in practice — CPython's polynomial hash for strings and identity-based hash for ints are well-distributed.
- Avoid using mutable objects (list, dict) as dictionary keys — they are not hashable and will raise TypeError. frozenset is the immutable set alternative.
- Rule: when in doubt, profile with realistic data to confirm that O(1) holds. For most production workloads, dict and set behave as expected.
The Speed of Built-ins: map(), filter(), zip() vs List Comprehensions
Python provides several built-in functional tools — map(), filter(), zip() — that can sometimes be faster than the equivalent list comprehension, but the performance difference is rarely large enough to justify sacrificing readability. The table below compares these constructs across several common patterns, including execution time, memory behaviour, and readability.
``text Pattern | Approach | Time (10M items) | Memory | Readability ----------------------------------|------------------|------------------|---------------|---------------- Transform x -> f(x) | map(f, iter) | ~0.40s | lazy (iter) | Good with simple f | [f(x) for x in] | ~0.45s | O(n) list | Excellent | (f(x) for x in) | ~0.41s | lazy (gen) | Good Filter x -> bool(x) | filter(pred, iter)| ~0.38s | lazy (iter) | Fair | [x for x in if p]| ~0.42s | O(n) list | Excellent | (x for x in if p)| ~0.39s | lazy (gen) | Good Zip two iterables | zip(a, b) | ~0.27s | lazy (iter) | Excellent | [(a[i], b[i])..] | ~0.55s | O(n) list | Poor | ((a[i], b[i])..) | ~0.50s | lazy (gen) | Poor ``
- When the function being mapped is a built-in (like int, len, or a C-extension function),
map()is often 10-15% faster because it avoids the overhead of compiling each iteration of the comprehension's bytecode. - When the function is a lambda,
map()is usually slightly slower than a list comprehension because the lambda adds its own function call overhead — and comprehensions are optimised at the bytecode level. - filter() is typically 5-10% faster than a conditional comprehension when the predicate is a built-in or simple lambda, but again the difference is marginal.
- zip() is significantly faster than manual indexing because it works at the C level, whereas index loops involve Python-level method calls.
In modern Python (3.10+), list comprehensions are preferred for readability and debuggability. The only cases where map/filter/zip provide a meaningful performance advantage are: 1. The function is a built-in written in C (e.g., map(int, data)). 2. You need lazy evaluation without consuming memory — map and filter return iterators, list comprehensions produce full lists. 3. Combining multiple operations (zip(map(f, a), b)) where intermediate lists would waste memory.
When in doubt, use a list comprehension. If profiling shows the comprehension is a bottleneck, test the map/filter alternative, but measure with production data before committing the change.
- Use map when: the transformation function is a built-in (map(str.strip, lines)), you need a lazy iterator, or you are combining with other functional tools like filter.
- Use list comprehension when: the transformation is a simple expression (x * 2), you need a list result, or readability is more important than the 10% speed difference.
- In production, the bottleneck is almost never between map and list comprehension — it's algorithmic complexity or I/O. Don't spend time micro-optimising here unless profiling shows it's the top hotspot.
- For very large datasets where memory is a concern, prefer generator expressions ( (x * 2 for x in data) ) over both map and list comprehensions.
filter() are slightly faster than list comprehensions when used with built-in C functions, but the difference is small (10-20%). Use list comprehensions for readability by default, switch to map/filter only when profiling indicates they are a bottleneck. zip() is significantly faster than manual indexing — always use zip for pairing iterables.Memory Layout: __slots__, Object Overhead and Allocation Patterns
Every Python object carries overhead that most developers do not think about until it bites them in production. A plain class instance stores its attributes in a dynamic __dict__ — a hash table that can hold arbitrary key-value pairs. Before you store a single field of your own data, that __dict__ costs 104-232 bytes depending on the Python version and the number of entries. The object header itself adds another 48 bytes. At one million instances, you are paying 150MB in pure overhead before your data even enters the picture.
__slots__ eliminates the __dict__ entirely. When you declare __slots__ = ('field1', 'field2'), CPython allocates a fixed C struct with exactly those fields at known memory offsets. Attribute access becomes a direct struct field lookup rather than a dictionary traversal. Per-object memory drops by 40-60%, and attribute access in tight loops gets measurably faster because LOAD_ATTR can resolve to a direct C offset rather than a hash table lookup followed by a reference dereference.
The trade-off is real and you need to know it before using __slots__ in production. You cannot add arbitrary attributes at runtime — code that does obj.unexpected_field = value raises AttributeError instead of silently creating the field. Multiple inheritance becomes restricted in specific ways that require careful design. Subclasses must also declare __slots__ or they silently get a __dict__ back and lose the memory benefit. Most ORMs and some serialisation libraries expect __dict__ on instances.
Beyond __slots__, generator pipelines are the other high-impact memory pattern. A list comprehension over 10M records materialises all 10M results in heap memory simultaneously. A generator expression over the same data yields one result at a time — constant memory regardless of dataset size. For data pipelines that process records larger than available RAM, generators are not an optimisation, they are a correctness requirement.
- __slots__ prevents adding arbitrary attributes at runtime — code that does obj.new_field = value raises AttributeError rather than silently creating the attribute
- Multiple inheritance with __slots__ is restricted — if two parent classes both define non-empty __slots__, you get a TypeError on class definition; design your hierarchy accordingly
- Subclasses must explicitly define __slots__ to retain the memory benefit — a subclass with no __slots__ declaration silently gets a __dict__ back
- Some libraries expect __dict__ on instances — pydantic v1, certain ORM frameworks, and some serialisation tools; test compatibility before shipping
- Rule: use __slots__ for data-carrying classes (DTOs, response models, graph nodes) instantiated at scale — skip it for service objects, singletons, and classes that need dynamic attribute assignment
str.join() or io.StringIO — string += in a loop is O(n²) because each concatenation copies all previous content into a new objectThe GIL, Concurrency and When to Go Parallel
CPython's Global Interpreter Lock is a mutex on the Python interpreter itself. Only one thread holds the GIL at any instant, which means only one thread executes Python bytecode at any instant. This is what makes CPython's memory management (reference counting) thread-safe without fine-grained per-object locking — but it is also what makes threading useless for CPU-bound parallelism.
The critical nuance is that the GIL is not held continuously. It is released during I/O operations — network calls, file reads, database queries, socket operations. Any time Python code does os.read() or socket.recv(), the GIL is released while the OS handles the I/O, allowing other threads to execute. This is why threading works well for I/O-bound workloads and why it is entirely wrong to say 'Python threading does nothing' — it does nothing for CPU-bound work, but it helps with I/O-bound work.
The GIL is also released by native extensions that explicitly drop it during computation. NumPy releases the GIL during most array operations, which is why NumPy computations can run in a thread without blocking other threads. Cython code that uses the nogil context manager does the same. This is a meaningful distinction: a thread running a NumPy computation and a thread handling I/O can execute genuinely in parallel, even though Python bytecode execution is serialised.
The practical decision framework is straightforward. I/O-bound work — API calls, database queries, file reads — gets asyncio or threading. CPU-bound work that is pure Python gets multiprocessing. CPU-bound work that is NumPy or Cython can use threading with the GIL released. Mixed workloads use asyncio for orchestration with run_in_executor(ProcessPoolExecutor) for the CPU-heavy segments.
The most expensive mistake I see is threading for CPU-bound pure Python work. Adding threads to a CPU-bound Python workload adds GIL contention overhead without any parallelism benefit. Two threads competing for the GIL on CPU-bound work run slower than one thread with no contention. This is not a subtle degradation — it is measurable and sometimes dramatic.
- CPU-bound threads compete for the GIL — two CPU-bound threads fight over one interpreter, adding context-switch overhead without any parallelism benefit
- I/O-bound threads release the GIL during waits — one thread waits for the network, the GIL is free, another thread runs Python bytecode concurrently
- multiprocessing bypasses the GIL entirely — each process has its own Python interpreter with its own GIL; true CPU parallelism across cores
- NumPy and Cython bypass the GIL during native computation — they explicitly release it, so NumPy operations in a thread do not block other threads from running Python code
- Rule: if your threaded code is not faster than single-threaded, check whether the work is CPU-bound; if it is, you are experiencing GIL contention, not parallelism
Vectorisation: From Python Loops to Native C Speed
Vectorisation is the practice of replacing Python-level loops with operations on typed, contiguous arrays executed by optimised C or Fortran routines. The performance difference is not incremental — it is architectural. A Python loop processing 10 million floating-point numbers involves reference counting on every value, type checking on every operation, and dynamic dispatch on every arithmetic expression. NumPy pushes the entire computation into a tight C loop that operates on raw typed memory with no per-element Python overhead. The gap is typically 100-500x for simple arithmetic on large arrays.
The reason NumPy is fast is not mysterious: it is because the loop runs in C on contiguous memory that fits in CPU caches. Python loops have poor cache behaviour because Python objects are pointer-chased heap allocations scattered across memory. NumPy arrays are contiguous blocks of typed bytes — the CPU prefetcher can predict the access pattern and keep the data in L1 cache. Modern CPUs can also SIMD-vectorise operations on contiguous numeric arrays, applying the same operation to multiple elements per clock cycle. None of this is available to Python loops.
Vectorisation applies most cleanly to numeric workloads — data that is homogeneous in type and fits the array paradigm. For complex per-element logic with conditional branching, recursive structure, or string manipulation, vectorisation either does not apply directly or requires careful reformulation using np.where(), np.select(), or boolean masking. The rule of thumb: if the loop body is more than five lines with complex conditionals, consider Numba's @jit decorator before NumPy — Numba can JIT-compile arbitrary Python numeric functions to native machine code without requiring you to reformulate the logic as array operations.
For tabular data in Pandas, the vectorisation principle applies to the method selection: use built-in Pandas methods (groupby, rolling, str.contains) rather than .apply() wherever possible. DataFrame.apply() falls back to a Python-level loop, which eliminates the Pandas performance advantage. When you must use .apply(), Numba can sometimes accelerate it via the engine='numba' parameter.
- Complex branching logic with per-element conditionals often cannot be cleanly expressed as array operations —
np.where()handles simple cases but deep conditional trees resist vectorisation - Recursive algorithms (tree traversal, graph search, recursive descent parsing) do not map to the array paradigm at all
- String processing with complex regex or context-dependent parsing — NumPy is numeric-first; Pandas string methods help for tabular string data but are not universally faster
- Small datasets under roughly 10K elements — the fixed overhead of NumPy array creation and dispatch can exceed the loop time; benchmark before committing
- Rule: if the loop body has straightforward arithmetic on numeric data at scale, vectorise. If it has complex logic or recursion, consider Numba @jit which compiles arbitrary Python to native code without requiring reformulation as array operations.
Production Patterns: Caching, Lazy Evaluation and Profiling in CI
The most impactful performance optimisation in production is often not making code faster — it is making it run less. Caching eliminates redundant computation entirely. A function that takes 50ms and is called a thousand times per second with the same arguments takes 50ms once and zero milliseconds 999 times with a properly sized cache. That is the highest possible return on investment for any performance work.
functools.lru_cache is the in-process standard for pure functions with repeated inputs. It stores results in a hash table keyed by the function arguments and returns cached results in O(1). The critical production detail is maxsize — lru_cache without maxsize is an unbounded memory allocation that will grow until the process is OOM-killed in a long-running service. Always set maxsize and monitor cache_info().currsize and cache_info().misses to verify the cache is sized correctly for the key space.
Local variable caching in hot loops is the other pattern that gives disproportionate returns for negligible risk. Python bytecode uses LOAD_FAST for local variables — a direct C array index into the frame's local variable table. LOAD_ATTR for instance attribute access is a dictionary lookup: hash the attribute name, find the slot in __dict__, dereference the pointer. LOAD_GLOBAL for module-level names is similar. In a loop that iterates a million times, caching obj.method as a local variable before the loop replaces a million dictionary lookups with a million C array accesses. The speedup is typically 10-30%, the code change is one line, and the risk is zero.
Embedding profiling into your CI pipeline is the pattern that prevents incidents rather than responding to them. A cProfile report generated on every PR that modifies a data processing function, compared against a stored baseline, catches performance regressions in code review. A function that processes a million records in 0.4 seconds regressing to 4 seconds is caught before it merges — not six weeks later when the dataset has grown and the SLA is missed at midnight.
- lru_cache without maxsize is an unbounded memory allocation — in a long-running service it will grow until OOM; always set maxsize and monitor currsize
- Caching mutable objects (dicts, lists) returns the same object reference to every caller — one caller mutating the cached result corrupts the cache for all subsequent callers; cache immutable values or deep copies
- Redis and network caches add 0.5-5ms of latency per access — only worth it if the computation cost is substantially higher than the round-trip time
- Cache stampede: when a popular cache entry expires, all concurrent requests simultaneously find a miss and recompute — use probabilistic early expiration or a lock-and-recompute pattern for high-traffic entries
- Rule: monitor hit ratio continuously — a hit ratio below 80% means the maxsize is too small for the key space, or the cache key is poorly chosen and many unique keys share similar data
functools.lru_cache vs Custom Caching: When to Use Which
Caching in Python production systems falls into two broad categories: the built-in functools.lru_cache and custom caching solutions (in-memory dicts, Redis, Memcached, database-driven). The right choice depends on your requirements for invalidation, distribution, memory management, and data freshness.
The table below compares the different caching strategies across the dimensions that matter in production.
``text Requirement | functools.lru_cache | In-Memory Dict Cache | Redis/Memcached -------------------------------------|---------------------|------------------------|------------------------- Setup complexity | One decorator | Manual implementation | Requires infrastructure Eviction policy | LRU (maxsize) | Manual (TTL, size) | LRU, TTL, LFU Automatic key invalidation | Only via maxsize LRU| Must implement manually | TTL expiry, explicit delete Distributed across processes | No (per-process) | No (per-process) | Yes (shared across processes) Memory bounds | maxsize parameter | Must implement size cap | Configurable via config Cache stampede protection | No | Must implement | Some (lock patterns) Dependency invalidation | Not supported | Must implement | Must implement Serialisation overhead | None (in-memory) | None (in-memory) | Yes (pickle/json) Suitability for pure functions | Excellent | Good | Overkill if not shared ``
When to use functools.lru_cache: - The function is pure (same arguments → same result, no side effects). - The function is called repeatedly with the same arguments within a single process. - The total number of unique argument combinations fits within a reasonable memory budget (set maxsize). - You don't need time-based expiry or distributed invalidation. - The cache key is easy to derive from the function arguments (they must be hashable).
When to use a custom in-memory dict cache: - You need TTL-based expiry (e.g., cache for 5 minutes, then recompute). - You need to invalidate specific keys based on external events. - The function arguments are not hashable (e.g., lists or dicts that need custom key logic). - You need to store more than lru_cache's argument tuple can handle (very large arguments).
When to use Redis/Memcached: - The cache must be shared across multiple processes or services. - The data must survive application restarts. - You need atomic operations (increment, lock) as part of the caching pattern. - You need fine-grained TTLs on a per-key basis. - The caching logic is part of a distributed system with multiple consumers.
The most common mistake is using lru_cache when the function depends on external state (database queries, timestamps, user context) — the cache serves stale data silently. The second most common mistake is not setting maxsize and letting the cache grow unboundedly. Always start with lru_cache for pure functions with bounded key spaces, and move to custom caching only when profiling shows it's necessary.
- Functions that read from database, file system, or external APIs — the data can change between calls; use TTL-based caching instead
- Functions that depend on the current time, random values, or user session context — the result is different for every call; do not cache at all
- Functions with unhashable arguments — lru_cache requires arguments to be hashable (tuple, frozenset, not list, dict, set); convert to hashable versions or use a custom cache with string keys
- Functions that have side effects — lru_cache caches the result but also skips the side effect; this can break logging, counters, metrics incrementation
- Rule: only use lru_cache for functions that are deterministic, have no side effects, and whose arguments are hashable. For everything else, use custom caching with explicit invalidation.
threading.Lock or functools.lru_cache's thread-safe implementation. Writing a cache without locks is a source of race conditions that are extremely hard to reproduce in development but cause intermittent production issues.functools.lru_cache for pure functions with bounded, hashable arguments that are called repeatedly within the same process. Move to custom caching with TTL and distributed storage only when requirements demand cross-process sharing, time-based expiry, or manual invalidation. Always set maxsize and monitor hit ratio.The 14-Hour Pipeline: How a Single Unoptimised Loop Cost $180K in Compute
enrich_records(). This function iterated over every record and, for each one, performed a membership check against a 200K-entry reference table using a linear scan — essentially for key in ref_table: if key == record_key. On 50M records against 200K entries, this was 10 trillion comparisons in the worst case. The complexity was O(n*m) where both n and m grew as the business scaled. The team had never noticed because the function worked correctly on small development datasets where it completed in seconds.- Profile before you optimise — the bottleneck is almost never where you think it is, and the team spent two weeks on the wrong layer entirely
- O(n*m) patterns on large datasets are silent killers — they perform acceptably on development data and degrade quadratically as production data grows
- Adding workers to an algorithmic bottleneck is throwing money at the wrong layer — more parallelism on an O(nm) algorithm just runs more O(nm) operations simultaneously
- Instrument pipelines with profiling in CI — a performance regression caught in code review costs nothing; one caught three months into production costs $180K
tracemalloc.take_snapshot() at two points separated by several minutes of production traffic and compare the statistics. Look for object types that accumulate between snapshots without being freed. Check for circular references (gc.garbage), unclosed file handles, and growing caches with no eviction policy. Large lru_cache instances without maxsize are a common source.gc.set_threshold() based on actual collection frequency data, not guesses.Key takeaways
Common mistakes to avoid
6 patternsOptimising without profiling first
Using threading for CPU-bound work
Using list comprehensions where generators suffice
String concatenation in loops with the += operator
Not using __slots__ for high-volume data classes
Importing heavy modules at module level instead of lazily
importlib.import_module() for conditional imports based on runtime configuration. For CLI tools, defer all heavy imports until after argument parsing — the startup latency difference is immediately noticeable to users.Interview Questions on This Topic
Explain the Global Interpreter Lock (GIL) and its impact on Python concurrency. When does it matter, and when is it irrelevant?
Frequently Asked Questions
That's Advanced Python. Mark it forged?
15 min read · try the examples if you haven't