NumPy Performance Tips — Vectorisation vs Loops
- A Python loop over a NumPy array is 10–100x slower than a vectorised operation — each iteration pays interpreter overhead
- np.vectorize() is NOT a performance tool — it wraps a Python loop in a convenience API, same speed
- Most loop patterns have a NumPy equivalent: np.diff, np.cumsum, np.clip, broadcasting, boolean masking
- Vectorised operations run in compiled C with no per-element type checks or reference counting
- For recurrences that cannot be vectorised, Numba @njit compiles to machine code — 50–200x faster than Python loops
- Biggest mistake: reaching for np.vectorize expecting speed — it is syntax sugar, not optimisation
Python for-loop iterating over a NumPy array element by element
```python
%timeit -n 3 -r 5 [x * 2 for x in arr]
%timeit -n 3 -r 5 arr * 2
```

np.vectorize wrapping a Python function — no speedup observed, benchmark matches raw loop

```python
from numba import njit
import numpy as np

@njit
def clip_arr(arr, lo, hi):
    return np.minimum(np.maximum(arr, lo), hi)

%timeit clip_arr(arr, 0.0, 1.0)  # after first-call warmup
```

Loop computes rolling window statistics — mean, std, z-score — per element

```python
cumsum = np.cumsum(np.insert(arr, 0, 0))
rolling_sum = cumsum[window:] - cumsum[:-window]
rolling_mean = rolling_sum / window
```

Loop applies a conditional operation — if/else branch per element

```python
result = np.where(arr > threshold, arr * 2, arr / 2)

# Or with boolean masking:
mask = arr > threshold
result = np.empty_like(arr)
result[mask] = arr[mask] * 2
result[~mask] = arr[~mask] / 2
```

Production Incident
```python
for i in range(window, len(arr)):
    window_data = arr[i - window:i]
    z = (arr[i] - window_data.mean()) / window_data.std()
    result[i] = z
```
Each iteration sliced a 100-element subarray (allocating a new ndarray), called .mean() and .std() on that slice (two more C calls with Python overhead on each), computed the z-score as a Python float, and stored the result. With a 50M-row dataset and window=100, this executed 50M Python-level loop iterations with approximately 200M temporary array allocations across the full run.
The 16-core upgrade made no difference because the GIL serialises Python thread execution. Every iteration was inherently single-threaded. More cores did not accelerate a single-threaded Python loop — they just sat idle while one core did all the work.

Production Debug Guide

Symptoms that indicate a Python loop is your bottleneck — and what to verify before reaching for more hardware
The Cost of Python Loops
Every iteration of a Python for-loop over a NumPy array pays a fixed interpreter overhead that has nothing to do with the operation you're trying to compute. The runtime pays for: extracting the element from the array buffer and converting it to a Python float object (boxing), executing the loop body in Python bytecode, managing the reference count on every temporary object, and deallocating those objects at the end of each iteration.
This overhead is approximately 300 nanoseconds per element. It doesn't scale with the complexity of the operation — a simple addition and a complex trigonometric function each cost roughly the same interpreter overhead per iteration. The actual computation is a small fraction of the total cost for simple operations.
A vectorised call like arr.sum() takes a completely different path. NumPy passes a raw C pointer directly to a compiled function that processes the entire array using SIMD (Single Instruction, Multiple Data) instructions. Modern CPUs with AVX2 support process 4 double-precision floats per instruction. There is no Python overhead per element — zero boxing, zero reference counting, zero bytecode interpretation.
The overhead isn't proportional to the work per element — it's a fixed tax. For a trivial operation like addition, the Python overhead is roughly 99% of the total runtime. For a genuinely complex per-element computation, the C code dominates and the Python overhead shrinks in proportion. This is why the gap is largest for simple element-wise operations (100–150x) and smaller for complex per-element work.
In production pipelines, this overhead compounds badly. A 50M-element array processed in a Python loop at 300ns per element takes 15 seconds. The same array processed with a vectorised ufunc takes roughly 100 milliseconds. Multiply that 150x gap across the dozens of array operations in a typical feature pipeline and it becomes the difference between comfortably meeting an SLA and blowing well past it.
```python
# io.thecodeforge.numpy.loop_cost
# Demonstrates the interpreter overhead cost of Python loops vs vectorised operations
# Run with: python loop_cost.py or in a Jupyter notebook with %timeit
import numpy as np
import time

arr = np.random.randn(1_000_000)  # 1M elements, float64

# ---- Python loop: pays interpreter overhead on every element ----
start = time.perf_counter()
total_loop = 0.0
for x in arr:
    total_loop += x
elapsed_loop = time.perf_counter() - start
print(f'Python loop: {elapsed_loop:.3f}s (result={total_loop:.4f})')

# ---- Vectorised: hands a C pointer to a compiled SIMD loop ----
start = time.perf_counter()
total_vec = arr.sum()
elapsed_vec = time.perf_counter() - start
print(f'NumPy vectorised: {elapsed_vec:.4f}s (result={total_vec:.4f})')

print(f'Speedup: {elapsed_loop / elapsed_vec:.0f}x')

# Typical output on a modern CPU:
#   Python loop:      0.287s
#   NumPy vectorised: 0.0019s
#   Speedup:          ~150x

# ---- Per-element overhead breakdown ----
# Python loop: ~300ns per element
#   Boxing scalar to Python float:     ~80ns
#   Bytecode execution (BINARY_ADD):   ~100ns
#   Reference count increment:         ~40ns
#   Temporary object deallocation:     ~50ns
#   Loop bookkeeping (JUMP_ABSOLUTE):  ~30ns
#
# Vectorised: ~0.2ns per element
#   Raw memory access from C pointer:  ~0.1ns
#   SIMD add (AVX2 processes 4 doubles per instruction): ~0.1ns
#   Zero Python overhead per element

# ---- The overhead is fixed, not proportional ----
# For arr + 1 (trivial), interpreter overhead is 99% of runtime
# For np.sin(x) (complex), overhead shrinks because C computation dominates

# Simple operation: loop vs vectorised gap is largest
start = time.perf_counter()
result_loop = np.array([x + 1.0 for x in arr])
elapsed_add_loop = time.perf_counter() - start

start = time.perf_counter()
result_vec = arr + 1.0
elapsed_add_vec = time.perf_counter() - start

print(f'\nElement-wise add:')
print(f'  Loop:       {elapsed_add_loop:.3f}s')
print(f'  Vectorised: {elapsed_add_vec:.4f}s')
print(f'  Speedup:    {elapsed_add_loop / elapsed_add_vec:.0f}x')
# Speedup for addition is typically 100-150x — overhead dominates

# Complex operation: overhead shrinks in proportion
start = time.perf_counter()
result_sin_loop = np.array([float(np.sin(x)) for x in arr])
elapsed_sin_loop = time.perf_counter() - start

start = time.perf_counter()
result_sin_vec = np.sin(arr)
elapsed_sin_vec = time.perf_counter() - start

print(f'\nElement-wise np.sin:')
print(f'  Loop:       {elapsed_sin_loop:.3f}s')
print(f'  Vectorised: {elapsed_sin_vec:.4f}s')
print(f'  Speedup:    {elapsed_sin_loop / elapsed_sin_vec:.0f}x')
# Speedup for sin is typically 10-20x — C computation now significant
```
- Python loop: ~300ns per element — type check, boxing, bytecode execution, reference counting, deallocation, all for every single element
- Vectorised call: ~0.2ns per element — raw C pointer passed to compiled code, SIMD processes 4–8 elements per CPU cycle
- The tax is fixed: a trivial addition and a complex computation cost the same ~300ns in Python overhead
- The gap closes for complex per-element work where the C computation itself takes time comparable to the Python overhead
- SIMD (AVX2) processes 4 float64 values per instruction — vectorised code gets parallelism within a single core that Python loops structurally cannot
Replacing Common Loop Patterns
Most Python loops over NumPy arrays are not doing anything fundamentally sequential. They just look sequential because element-by-element thinking is the default mental model. The patterns that appear most often in production codebases all have direct vectorised replacements — once you learn to recognise the pattern, the replacement is mechanical.
Element-wise arithmetic is the simplest: any loop that computes output[i] = f(input[i]) using arithmetic operators is directly replaceable with broadcasting. NumPy broadcasts scalar operations across the entire array in a single compiled call.
np.diff handles adjacent differences: any loop computing output[i] = arr[i+1] - arr[i] is a single np.diff(arr) call.
np.cumsum and np.cumprod handle running totals and products. They also enable the rolling window trick, which is the most impactful pattern to recognise.
np.clip handles value clamping without a conditional per element. np.where handles two-branch conditionals. np.select handles multiple branches.
The rolling window pattern deserves particular attention because it appears constantly in time-series and signal processing code and is non-obvious to vectorise at first glance. The key insight is that a rolling sum of window W at position i is: sum(arr[i-W:i]) = cumsum[i] - cumsum[i-W]. If you have the prefix sum array precomputed, every window sum is a single subtraction — O(1) per element, no inner loop, no temporary array allocation. Rolling mean follows immediately. Rolling variance uses the computational formula Var(X) = E[X²] - E[X]², decomposed the same way with a second prefix sum over the squared values.
```python
# io.thecodeforge.numpy.patterns
# Direct replacements for the most common Python loop patterns over NumPy arrays
# Each pattern shown as: loop version (slow) -> vectorised version (fast)
import numpy as np

arr = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# ---- Pattern 1: Adjacent differences ----
# Loop (slow):
diffs_loop = [arr[i+1] - arr[i] for i in range(len(arr) - 1)]
# Vectorised:
diffs = np.diff(arr)
print(f'np.diff:   {diffs}')        # [1. 1. 1. 1. 1. 1. 1.]

# ---- Pattern 2: Cumulative sum ----
# Loop (slow):
cumulative_loop = [sum(arr[:i+1]) for i in range(len(arr))]
# Vectorised:
cumulative = np.cumsum(arr)
print(f'np.cumsum: {cumulative}')   # [ 1.  3.  6. 10. 15. 21. 28. 36.]

# ---- Pattern 3: Value clamping ----
# Loop (slow):
clipped_loop = [min(max(x, 2.0), 6.0) for x in arr]
# Vectorised:
clipped = np.clip(arr, 2.0, 6.0)
print(f'np.clip:   {clipped}')      # [2. 2. 3. 4. 5. 6. 6. 6.]

# ---- Pattern 4: Conditional assignment — two branches ----
# Loop (slow):
result_loop = np.empty_like(arr)
for i in range(len(arr)):
    if arr[i] > 4.0:
        result_loop[i] = arr[i] * 10
    else:
        result_loop[i] = arr[i] * 0.5

# Vectorised with np.where:
result_where = np.where(arr > 4.0, arr * 10, arr * 0.5)
print(f'np.where:  {result_where}')  # [ 0.5  1.   1.5  2.  50.  60.  70.  80. ]

# Vectorised with boolean masking (preferred when branches have very different costs):
# Boolean masking only evaluates the branch for the subset that matches
result_mask = np.empty_like(arr)
mask = arr > 4.0
result_mask[mask] = arr[mask] * 10
result_mask[~mask] = arr[~mask] * 0.5
print(f'bool mask: {result_mask}')   # same result

# Note: np.where(cond, x, y) evaluates BOTH x and y for all elements,
# then selects. Boolean masking evaluates each branch only for the matching
# subset — more efficient when one branch is expensive and rarely true.

# ---- Pattern 5: Multiple branches — np.select ----
# Extends np.where to N conditions
conditions = [arr < 2.0, (arr >= 2.0) & (arr < 5.0), arr >= 5.0]
choices = [arr * 0.0, arr * 1.0, arr * 2.0]  # zero, keep, double
result_select = np.select(conditions, choices, default=arr)
print(f'np.select: {result_select}')  # [ 0.  2.  3.  4. 10. 12. 14. 16.]

# ---- Pattern 6: Rolling window statistics (the key production pattern) ----
# Loop (slow) — creates 1 temporary array per iteration:
#   for i in range(window, len(arr)):
#       rolling_mean[i] = arr[i-window:i].mean()

def rolling_stats(arr, window):
    """Rolling mean and std using cumulative sums. O(n), no Python loop."""
    # Pad with zero at position 0 so indexing arithmetic works cleanly
    cs = np.cumsum(np.insert(arr, 0, 0.0))      # prefix sum
    cs2 = np.cumsum(np.insert(arr**2, 0, 0.0))  # prefix sum of squares
    rolling_sum = cs[window:] - cs[:-window]
    rolling_sum2 = cs2[window:] - cs2[:-window]
    mean = rolling_sum / window
    # Computational formula: Var = E[X^2] - E[X]^2
    # Clip to 0 to avoid negative values from floating-point precision loss
    var = np.maximum(rolling_sum2 / window - mean**2, 0.0)
    std = np.sqrt(var)
    return mean, std

mean, std = rolling_stats(arr, window=3)
print(f'Rolling mean (w=3): {mean}')  # [2. 3. 4. 5. 6. 7.]
print(f'Rolling std  (w=3): {std}')   # [~0.816 ~0.816 ...]

# ---- Pattern 7: Normalisation across rows of a 2D array ----
# Loop (slow):
matrix = np.random.randn(1000, 50)  # 1000 rows, 50 features
norm_loop = np.empty_like(matrix)
for i in range(len(matrix)):
    norm_loop[i] = (matrix[i] - matrix[i].mean()) / matrix[i].std()

# Vectorised with broadcasting — keepdims preserves shape for broadcasting:
means = matrix.mean(axis=1, keepdims=True)  # shape (1000, 1)
stds = matrix.std(axis=1, keepdims=True)    # shape (1000, 1)
norm_vec = (matrix - means) / stds          # broadcasts across 50 columns
print(f'Row normalisation matches: {np.allclose(norm_loop, norm_vec)}')
```
- Element-wise math: output[i] = f(arr[i]) — use arr * scale, arr + offset, np.exp(arr), etc. Broadcasting handles scalar and array operands.
- Adjacent differences: arr[i+1] - arr[i] — use np.diff(arr). For higher-order differences, np.diff(arr, n=2) gives second differences.
- Cumulative operations: running total or product — use np.cumsum(arr) or np.cumprod(arr). Both O(n), no loop.
- Value clamping: min(max(x, lo), hi) — use np.clip(arr, lo, hi). Note that np.clip propagates NaN values unchanged; mask or replace NaNs first (for example with np.nan_to_num) if they need special handling.
- Two-branch conditional: if/else per element — use np.where(condition, x, y). For expensive branches, prefer boolean masking to avoid evaluating both branches.
- Rolling window stats: sliding mean, std, variance — use the cumulative sum trick with prefix sums. One np.cumsum call replaces the entire inner loop.
- Row/column operations on 2D arrays: normalise, scale, shift per row — use axis parameter and keepdims=True for broadcasting. No explicit loop over rows.
When You Cannot Avoid a Loop — np.vectorize vs Numba
Some operations genuinely cannot be vectorised because each output depends on a previously computed output — not on the input at the same position. Exponential smoothing, Fibonacci-style recurrences, and certain signal filters all have this structure (running maximums look the same, although NumPy happens to cover that specific case with np.maximum.accumulate). These are not problems of pattern recognition — they are structurally sequential. The question then is not whether to loop, but where the loop should run.
np.vectorize is the most misunderstood function in NumPy. Engineers see 'vectorize' in the name and assume it implies compiled execution or parallelism. It implies neither. Reading the NumPy documentation directly: 'The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy.' The implementation is a Python loop with thin type-coercion wrapping on each side. It provides convenience — you can pass array arguments to a function that expects scalars — not speed. Benchmarking np.vectorize against a hand-written Python loop reliably shows the same runtime, or slightly worse due to the extra function call overhead per element.
For genuinely sequential operations that must run fast, Numba is the correct tool. The @njit decorator ('no Python, just-in-time') compiles the decorated function to LLVM machine code on the first call. The compiled function runs with zero Python overhead — the sequential logic is preserved, but the interpreter is completely bypassed. Compilation takes roughly 0.5–2 seconds on the first call, after which the compiled version is cached. For a loop over 10M elements that previously took 3 seconds in Python, the Numba-compiled version typically runs in 20–50 milliseconds — a 60–150x improvement.
The Numba constraint is that all types must be statically inferable from the function signature. This means: no Python dicts or sets inside @njit functions, no arbitrary Python objects, and no calls to Python functions that aren't themselves @njit-compiled. For data pipeline work using NumPy arrays and scalars, this is almost never a limitation in practice.
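A minimal sketch of that constraint (the function below is illustrative, not taken from any particular codebase): array and scalar arguments compile cleanly because every type is statically inferable, while passing an untyped Python dict into an @njit function fails type inference in nopython mode.

```python
import numpy as np
from numba import njit

@njit
def scale_and_clip(arr, lo, hi):
    # Array + float arguments: every type is inferable, so this compiles
    out = np.empty_like(arr)
    for i in range(arr.shape[0]):
        v = arr[i] * 2.0
        out[i] = lo if v < lo else (hi if v > hi else v)
    return out

print(scale_and_clip(np.random.randn(5), 0.0, 1.0))

# By contrast, an @njit function that takes a plain Python dict raises a
# typing error at compile time; numba.typed.Dict exists for the cases where
# a typed lookup table is genuinely needed.
```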
Cython is an alternative when the team already uses it and wants a separate build step with explicit type annotations in a .pyx file. The performance outcome is comparable to Numba. Numba requires less code change — a decorator on an existing Python function versus a rewrite in Cython syntax — which makes it the right default for most teams.
```python
# io.thecodeforge.numpy.vectorize_vs_numba
# Demonstrates:
#   1. np.vectorize — what it actually does (convenience, not speed)
#   2. Numba @njit — genuine compiled performance for sequential logic
#   3. Side-by-side benchmark showing the difference
import numpy as np
import time

rng = np.random.default_rng(42)
arr = rng.standard_normal(10_000_000).astype(np.float64)  # 10M elements

# ---- np.vectorize: the convenience wrapper that is NOT faster ----
# Under the hood: Python loop + per-element type coercion
# Use this when you want array broadcasting on a scalar function
# Do NOT use this when you want speed

def clip_scalar(x, lo, hi):
    """Works on scalars. np.vectorize will call this once per element."""
    return lo if x < lo else (hi if x > hi else x)

vclip = np.vectorize(clip_scalar)

# For a small demo array:
small = np.array([-0.5, 0.3, 0.7, 1.5, -0.1])
print(f'np.vectorize result: {vclip(small, 0.0, 1.0)}')  # [0.  0.3 0.7 1.  0. ]

# Benchmark: np.vectorize vs raw loop vs numpy built-in
start = time.perf_counter()
_ = vclip(arr, 0.0, 1.0)
time_vectorize = time.perf_counter() - start

start = time.perf_counter()
result_py = np.empty_like(arr)
for i, x in enumerate(arr):
    result_py[i] = 0.0 if x < 0.0 else (1.0 if x > 1.0 else x)
time_loop = time.perf_counter() - start

start = time.perf_counter()
_ = np.clip(arr, 0.0, 1.0)  # built-in ufunc
time_numpy = time.perf_counter() - start

print(f'np.vectorize: {time_vectorize:.2f}s')
print(f'Python loop:  {time_loop:.2f}s')
print(f'np.clip:      {time_numpy:.4f}s')

# Typical output:
#   np.vectorize: 4.87s   <- same order as the loop, sometimes worse
#   Python loop:  4.61s   <- the baseline we're trying to beat
#   np.clip:      0.031s  <- 150x faster — this is what vectorisation means

# ---- Numba @njit: compiled sequential logic ----
# Use this when the loop has genuine sequential dependencies
# (result[i] depends on result[i-1]) and cannot be vectorised
try:
    from numba import njit

    @njit
    def running_max(arr):
        """
        Running maximum: result[i] = max(arr[0], arr[1], ..., arr[i]).
        Sequential in form — result[i] depends on result[i-1].
        (NumPy's np.maximum.accumulate covers this specific case; it is used
        here as a simple, easy-to-verify benchmark of a compiled loop.)
        """
        result = np.empty_like(arr)
        result[0] = arr[0]
        for i in range(1, len(arr)):
            result[i] = result[i-1] if result[i-1] > arr[i] else arr[i]
        return result

    @njit
    def exponential_smooth(arr, alpha):
        """
        Exponential weighted moving average (EWMA).
        result[i] = alpha * arr[i] + (1 - alpha) * result[i-1]
        Classic sequential recurrence — each output depends on the prior output.
        """
        result = np.empty_like(arr)
        result[0] = arr[0]
        for i in range(1, len(arr)):
            result[i] = alpha * arr[i] + (1.0 - alpha) * result[i-1]
        return result

    # Warmup call — triggers LLVM compilation (~0.5-2s one-time cost)
    _ = running_max(arr[:100])
    _ = exponential_smooth(arr[:100], 0.1)

    # Benchmark compiled Numba vs equivalent Python loop
    start = time.perf_counter()
    result_nb = running_max(arr)
    time_numba = time.perf_counter() - start

    # Python equivalent for comparison
    start = time.perf_counter()
    result_py2 = np.empty_like(arr)
    result_py2[0] = arr[0]
    for i in range(1, len(arr)):
        result_py2[i] = max(result_py2[i-1], arr[i])
    time_python2 = time.perf_counter() - start

    print(f'\nRunning max (10M elements):')
    print(f'  Python loop: {time_python2:.2f}s')
    print(f'  Numba @njit: {time_numba:.4f}s')
    print(f'  Speedup:     {time_python2 / time_numba:.0f}x')
    print(f'  Results match: {np.allclose(result_nb, result_py2)}')

    # Typical output:
    #   Python loop: 3.84s
    #   Numba @njit: 0.022s
    #   Speedup:     ~175x

except ImportError:
    print('Numba not installed — pip install numba')
    print('For environments without Numba, Cython is an alternative')
    print('For pandas integration, consider bottleneck or scipy alternatives')

# ---- Summary: what to use when ----
# np.vectorize:  when you want to apply a scalar Python function element-wise
#                for API convenience — accept Python loop speed
# np.clip / np.where / ufuncs: when the operation has a NumPy equivalent
#                — 100-300x faster than any loop
# Numba @njit:   when the loop has genuine sequential dependencies
#                — 50-200x faster than Python loop, near-C speed
# Cython:        when team already uses it and prefers explicit type annotations
#                — similar performance to Numba, more setup overhead
```
Broadcasting and Memory Layout: The Hidden Performance Factors
Vectorisation eliminates Python interpreter overhead, but two additional factors determine how fast the resulting C-compiled operations run: whether NumPy can apply broadcasting without creating intermediate copies, and whether the memory access pattern aligns with CPU cache lines.
Broadcasting lets NumPy perform operations between arrays of different shapes without allocating expanded copies. When you subtract a per-row mean of shape (1000, 1) from a matrix of shape (1000, 50), NumPy broadcasts the subtraction without materialising a (1000, 50) copy of the mean vector. This is fast. When the operation requires an intermediate result that doesn't fit the broadcast rules, NumPy allocates a full-size temporary array — and that allocation, which happens in C but still consumes memory bandwidth, can dominate the runtime for large arrays.
Memory layout — C-order (row-major) versus Fortran-order (column-major) — matters because CPU caches load contiguous memory efficiently. A C-order NumPy array stores rows contiguously. Iterating over rows, or applying operations along axis=1, accesses contiguous memory and benefits from cache prefetching. Iterating over columns accesses every nth element in memory, causing cache misses on every access. For large arrays, cache-unfriendly access patterns can cost 5–10x in throughput versus cache-friendly access, independent of Python versus C execution.
In-place operations avoid allocation entirely. arr += 1.0 modifies the array in place and allocates no temporary. arr + 1.0 allocates a new array of the same size to hold the result, after which the original may be garbage collected. For a 1GB array, that allocation is the difference between 1GB and 2GB of peak memory usage — which may or may not matter, but it is always worth being explicit about.
```python
# io.thecodeforge.numpy.broadcasting_memory
# Broadcasting efficiency, memory layout effects, and in-place operations
# These factors affect performance after you've already eliminated the Python loop
import numpy as np
import time

# ---- Broadcasting: no copies for shape-compatible operations ----
rng = np.random.default_rng(0)
matrix = rng.standard_normal((10_000, 500))  # 10K rows, 500 columns

# Row normalisation via broadcasting — no temporary (10000, 500) copy of means
means = matrix.mean(axis=1, keepdims=True)  # shape: (10000, 1)
stds = matrix.std(axis=1, keepdims=True)    # shape: (10000, 1)

# The subtraction broadcasts (10000, 1) across 500 columns without materialising
# a full (10000, 500) copy of the means
start = time.perf_counter()
norm = (matrix - means) / stds
time_broadcast = time.perf_counter() - start
print(f'Broadcasting normalisation: {time_broadcast:.4f}s')

# Equivalent with explicit expansion — forces a copy allocation
means_expanded = np.tile(means, (1, 500))   # shape: (10000, 500) — full copy
start = time.perf_counter()
norm_tiled = (matrix - means_expanded) / stds
time_tiled = time.perf_counter() - start
print(f'Tiled (explicit copy):      {time_tiled:.4f}s')
print(f'Broadcasting speedup:       {time_tiled / time_broadcast:.1f}x')
# Broadcasting is typically 1.5-3x faster because it avoids the tile allocation

# ---- Memory layout: C-order vs Fortran-order ----
# C-order: rows are contiguous — efficient for row-wise operations
# F-order: columns are contiguous — efficient for column-wise operations
c_arr = np.ascontiguousarray(rng.standard_normal((5000, 5000)))  # C-order
f_arr = np.asfortranarray(c_arr.copy())                          # F-order

# Row sum: accessing contiguous elements in C-order
start = time.perf_counter()
for _ in range(5):
    _ = c_arr.sum(axis=1)  # sum across columns per row
time_c_row = (time.perf_counter() - start) / 5

start = time.perf_counter()
for _ in range(5):
    _ = f_arr.sum(axis=1)  # same operation, Fortran layout — cache-unfriendly
time_f_row = (time.perf_counter() - start) / 5

print(f'\nRow sum, C-order:        {time_c_row:.4f}s')
print(f'Row sum, Fortran-order:  {time_f_row:.4f}s')
print(f'Layout speedup (row op): {time_f_row / time_c_row:.1f}x')
# C-order is typically 2-5x faster for row operations due to cache locality

# Verify memory layout:
print(f'\nc_arr is C-contiguous: {c_arr.flags["C_CONTIGUOUS"]}')
print(f'f_arr is F-contiguous: {f_arr.flags["F_CONTIGUOUS"]}')

# ---- In-place operations: avoid intermediate allocations ----
large = rng.standard_normal(10_000_000).astype(np.float64)

# Out-of-place: allocates an 80MB temporary array
start = time.perf_counter()
result_oop = large + 1.0     # allocates new array
time_oop = time.perf_counter() - start

# In-place: no allocation — modifies large directly
large_copy = large.copy()    # preserve original for fair comparison
start = time.perf_counter()
large_copy += 1.0            # modifies in place, no allocation
time_inplace = time.perf_counter() - start

print(f'\nOut-of-place (large + 1.0): {time_oop:.4f}s')
print(f'In-place (large += 1.0):    {time_inplace:.4f}s')
print(f'In-place speedup:           {time_oop / time_inplace:.1f}x')
# In-place is typically 1.5-2x faster for large arrays
# because it avoids memory allocation and halves the memory bandwidth required

# ---- Practical rule: check array flags before assuming layout ----
print(f'\nCheck any array layout with arr.flags:')
print(large.flags)
```
- Unnecessary intermediate copies: np.tile, np.repeat, and explicit reshapes before broadcasting can force full-size copies that broadcasting itself would avoid. Use keepdims=True and let NumPy broadcast.
- Cache-unfriendly access patterns: column-wise operations on C-order arrays (or row-wise on Fortran-order) stride through memory and cause cache misses on every access. Match the array's layout to your primary access pattern with np.ascontiguousarray or np.asfortranarray; note that arr.T only returns a view and does not change the underlying layout.
- Out-of-place vs in-place: arr = arr + 1.0 allocates a new array; arr += 1.0 does not. For arrays larger than ~100MB, this allocation is a measurable fraction of runtime and doubles peak memory usage.
- Check array contiguity with arr.flags before optimising memory access patterns — np.ascontiguousarray(arr) forces C-order layout if the array is not already contiguous.
- np.einsum is often faster than chained operations for complex tensor contractions — it fuses operations and avoids intermediate allocations that separate np.dot and np.sum calls would create.
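As a sketch of that last point (the array names and shapes here are arbitrary): computing row-wise dot products with chained operations materialises the full element-wise product before reducing it, while np.einsum performs the same contraction in one fused call.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.standard_normal((100_000, 64))
b = rng.standard_normal((100_000, 64))

# Chained: a * b allocates a full (100000, 64) temporary before the reduction
row_dots_chained = (a * b).sum(axis=1)

# Fused: einsum contracts over j without materialising the product array
row_dots_einsum = np.einsum('ij,ij->i', a, b)

print(np.allclose(row_dots_chained, row_dots_einsum))  # True
```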
| Method | Speedup vs Python Loop | When to Use | Trade-off |
|---|---|---|---|
| Vectorised ufunc / broadcasting | 100–300x | Element-wise arithmetic, comparisons, math functions, aggregations — any operation where output[i] depends only on input[i] or the whole array | Requires expressing the logic as NumPy operations. Not all logic translates directly, but most element-wise operations do. |
| Boolean masking / np.where / np.select | 100–200x | Conditional assignment per element — if/else logic, value replacement, multi-branch selection | np.where evaluates both branches for all elements before selecting — use boolean masking when one branch is expensive and rarely true, to evaluate only the matching subset. |
| np.diff / np.cumsum / np.clip / rolling cumsum trick | 50–150x | Adjacent differences, running totals, value clamping, rolling window statistics — specific mathematical patterns with direct NumPy equivalents | Only applies when the operation fits one of these patterns. The rolling cumsum trick only works for statistics that reduce to sums (sum, mean, variance via a sum of squares), not for order statistics such as median, quantile, min/max, or mode. |
| Numba @njit | 50–200x | Sequential dependencies (result[i] uses result[i-1]), recurrences, exponential smoothing, running statistics that require prior output values | One-time LLVM compilation cost of 0.5–2s on first call. Types must be statically inferable — no arbitrary Python objects inside @njit functions. Numba must be installed separately. |
| np.vectorize | 0–5% (negligible, sometimes negative) | API convenience only — applying a scalar Python function to an array input when you do not care about performance and want to avoid writing the loop yourself | No performance benefit at all. The name is misleading. Benchmark against the raw loop before using — the runtimes will match. Never use this expecting speedup. |
| Cython | 50–200x | Large production codebases that already use Cython, or when explicit C-level type annotations are preferred over Numba's type inference | Requires separate .pyx files, explicit type annotations, and a build step integrated into the package setup. Higher setup overhead than Numba for new code. |
| In-place operations + keepdims broadcasting | 1.5–3x (additive after other optimisations) | Large arrays (>100MB) where memory allocation time is a measurable fraction of total operation time | Modifies the original array — requires explicit .copy() if the original must be preserved. Cannot always replace out-of-place operations when intermediate results are needed. |
🎯 Key Takeaways
- Python loops over NumPy arrays pay ~300ns per element in interpreter overhead — a fixed tax independent of operation complexity. Vectorised ufuncs pay on the order of a nanosecond per element in compiled C, a gap of roughly 100–150x in practice. That gap is structural, not incidental.
- np.vectorize is API convenience with zero performance benefit — it runs a Python loop. The name is misleading. Benchmark it against the raw loop to confirm they match, then use the right tool instead.
- Most loop patterns have a direct NumPy replacement: np.diff for adjacent differences, np.cumsum for running totals, np.clip for clamping, np.where for conditionals. The rolling window pattern — the most common hidden bottleneck — is vectorisable using the prefix sum trick.
- For loops with genuine sequential dependencies (result[i] depends on result[i-1]), use Numba @njit — it compiles the exact same logic to machine code with 50–200x speedup over the Python loop while preserving the sequential structure.
- Scaling hardware does not fix algorithmic inefficiency — the GIL serialises Python loops regardless of core count. Profile first, identify the actual bottleneck, then vectorise or compile before reaching for a larger instance.
⚠ Common Mistakes to Avoid
Interview Questions on This Topic
- Q: Why is iterating over a NumPy array with a Python for loop slow? (Junior)
- Q: What does np.vectorize actually do under the hood? (Junior)
- Q: A feature engineering pipeline processes a 50M-row dataset with a rolling window loop. The loop computes rolling mean and std for a window of 100. How would you optimise this? (Senior)
- Q: When would you choose Numba @njit over a vectorised NumPy approach, and what are the practical limitations? (Senior)
Frequently Asked Questions
Is np.vectorize() faster than a Python loop?
No — or so marginally that the difference is noise. The NumPy documentation states directly that np.vectorize is 'provided primarily for convenience, not for performance.' It calls your Python function once per element in a Python loop, adding thin type-coercion handling on each side. Benchmarking consistently shows np.vectorize matching or slightly underperforming a hand-written Python loop on the same operation.
Use np.vectorize when you want the convenience of applying a scalar Python function to array inputs without writing the loop yourself, and when you have no performance requirement. Do not use it when you expect speedup — you will be disappointed and confused.
What is the fastest way to apply a custom function to each element of a NumPy array?
It depends on what the function does.
If the function can be expressed using NumPy operations — arithmetic, comparisons, math functions, boolean logic — express it that way directly. This is the fastest option: the loop runs in compiled C with SIMD and there is no Python overhead per element.
If the function cannot be expressed with NumPy operations because it has sequential dependencies, use Numba: decorate with @numba.njit and it compiles to LLVM machine code on the first call. Subsequent calls run at near-C speed. This is the right answer for recurrences, state machines, and operations where output[i] depends on output[i-1].
If the function calls external Python code or libraries that Numba cannot compile, there is no way to eliminate the per-element Python overhead. In that case, focus on batching the calls to reduce the per-element fixed cost, or use multiprocessing if each element's computation is large enough to justify the serialisation overhead.
Can I parallelise a Python loop over a NumPy array using multiprocessing?
Technically yes, practically rarely the right answer. The serialisation cost of passing large NumPy arrays between processes — pickling, copying, unpickling — often consumes more time than the computation being parallelised. For arrays above ~10MB, the IPC overhead typically dominates.
For CPU-bound element-wise operations, vectorisation or Numba @njit is faster and simpler because the computation stays in a single process with no serialisation overhead.
Multiprocessing earns its place when each element requires significant independent computation that dwarfs the serialisation cost — for example, running ML model inference or a complex simulation per element, where each call takes tens of milliseconds. In that scenario, the per-element work dominates and parallelism across processes provides real throughput improvement. Use concurrent.futures.ProcessPoolExecutor for this pattern and benchmark with actual array sizes before committing to the architecture.
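A minimal sketch of that pattern, with a placeholder function standing in for the expensive per-chunk work (the chunk count and worker count are arbitrary); benchmark it against the single-process version before committing.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def expensive_per_chunk(chunk):
    # Placeholder for heavy, independent work (e.g. per-row model inference)
    return np.sort(chunk) ** 2

if __name__ == '__main__':
    arr = np.random.default_rng(0).standard_normal(1_000_000)
    chunks = np.array_split(arr, 8)        # one chunk per worker process
    with ProcessPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(expensive_per_chunk, chunks))
    combined = np.concatenate(results)
    print(combined.shape)                  # (1000000,)
```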
How do I know if my loop can be vectorised?
Ask one question: does computing result[i] require the value of result[i-1] or any earlier output?
If the answer is no — result[i] depends only on input values, not on prior outputs — the loop can be vectorised. The computation at each position is independent of all others, which is the only structural requirement for vectorisation. Find the NumPy equivalent (broadcasting, ufunc, np.where, cumsum trick) and replace the loop.
If the answer is yes — computing result[i] requires result[i-1] — the loop has a sequential dependency and cannot be vectorised with standard NumPy. Use Numba @njit to preserve the sequential structure while compiling it to machine code.
A common source of confusion: loops that read from input[i-1] but write to result[i] are NOT sequential in the relevant sense. Reading from the input array at any position is fine — the input doesn't change. The constraint is only on reading from previously computed output values.
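A small illustration of the distinction, using an arbitrary example array: the first computation reads only earlier inputs and collapses to a shifted slice; the second reads its own earlier output and therefore stays a loop (or becomes an @njit function).

```python
import numpy as np

x = np.array([1.0, 4.0, 2.0, 8.0, 5.0])

# Reads input[i] and input[i-1] only: vectorisable with shifted slices
pairwise_sum = x[1:] + x[:-1]            # [ 5.  6. 10. 13.]

# Reads result[i-1]: a genuine sequential dependency
smoothed = np.empty_like(x)
smoothed[0] = x[0]
for i in range(1, len(x)):
    smoothed[i] = 0.2 * x[i] + 0.8 * smoothed[i - 1]
```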
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.