
NumPy Performance Tips — Vectorisation vs Loops

Why NumPy loops are slow and how to vectorise them.
🔥 Advanced — solid Python foundation required
In this tutorial, you'll learn
  • Python loops over NumPy arrays pay ~300ns per element in interpreter overhead — a fixed tax independent of operation complexity. Vectorised ufuncs pay ~0.2ns per element in compiled C. The 150x gap is structural, not incidental.
  • np.vectorize is API convenience with zero performance benefit — it runs a Python loop. The name is misleading. Benchmark it against the raw loop to confirm they match, then use the right tool instead.
  • Most loop patterns have a direct NumPy replacement: np.diff for adjacent differences, np.cumsum for running totals, np.clip for clamping, np.where for conditionals. The rolling window pattern — the most common hidden bottleneck — is vectorisable using the prefix sum trick.
Quick Answer
  • A Python loop over a NumPy array is 10–100x slower than a vectorised operation — each iteration pays interpreter overhead
  • np.vectorize() is NOT a performance tool — it wraps a Python loop in a convenience API, same speed
  • Most loop patterns have a NumPy equivalent: np.diff, np.cumsum, np.clip, broadcasting, boolean masking
  • Vectorised operations run in compiled C with no per-element type checks or reference counting
  • For recurrences that cannot be vectorised, Numba @njit compiles to machine code — 50–200x faster than Python loops
  • Biggest mistake: reaching for np.vectorize expecting speed — it is syntax sugar, not optimisation
🚨 START HERE
NumPy Loop Optimisation Quick Reference
Fast diagnostic and fix guide for slow NumPy loops — benchmark first, then pick the right replacement
🟡 Python for-loop iterating over a NumPy array element by element
Immediate Action: Vectorise — express as a NumPy ufunc or broadcasting expression. The speedup is typically 100–150x for simple arithmetic.
Commands
%timeit -n 3 -r 5 [x * 2 for x in arr]
%timeit -n 3 -r 5 arr * 2
Fix Now: Replace with arr * 2 — NumPy broadcasts the scalar multiply across the entire array in compiled C. For conditional operations, use np.where(condition, arr * 2, arr). Both run with zero Python per-element overhead.
🟡 np.vectorize wrapping a Python function — no speedup observed, benchmark matches raw loop
Immediate Action: np.vectorize is not an optimisation — this is the expected result. Replace with Numba @njit for compiled performance, or express the logic as a NumPy expression using boolean masking or np.where.
Commands
from numba import njit
import numpy as np

@njit
def clip_arr(arr, lo, hi):
    return np.minimum(np.maximum(arr, lo), hi)
%timeit clip_arr(arr, 0.0, 1.0) # after first-call warmup
Fix Now: Numba compiles to LLVM machine code on first call — subsequent calls run at near-C speed. For simple clipping specifically, np.clip(arr, lo, hi) is already a C-compiled ufunc and is the right answer here.
🟡 Loop computes rolling window statistics — mean, std, z-score — per element
Immediate Action: Replace with the cumulative sum vectorisation. Rolling mean and variance decompose into O(1) lookups per element using prefix sums — no Python loop required.
Commands
cumsum = np.cumsum(np.insert(arr, 0, 0))
rolling_sum = cumsum[window:] - cumsum[:-window]
rolling_mean = rolling_sum / window
Fix Now: Extend to rolling variance with: cumsum2 = np.cumsum(np.insert(arr**2, 0, 0)); rolling_var = (cumsum2[window:] - cumsum2[:-window]) / window - rolling_mean**2. Clip to zero before sqrt to handle floating-point precision artifacts.
🟡 Loop applies a conditional operation — if/else branch per element
Immediate Action: Use np.where for simple two-branch logic, or boolean masking for more complex conditions. Both run entirely in C with no Python per-element evaluation.
Commands
result = np.where(arr > threshold, arr * 2, arr / 2)
mask = arr > threshold
result = np.empty_like(arr)
result[mask] = arr[mask] * 2
result[~mask] = arr[~mask] / 2
Fix Now: np.where evaluates both branches for all elements — if one branch is expensive and rarely true, boolean masking is more efficient because it only evaluates the relevant subset. For three or more branches, chain np.where calls or use np.select(conditions, choices, default).
Production Incident: Python Loop in Feature Pipeline Added 45 Minutes to Model Training
A machine learning feature engineering pipeline used a Python for-loop to compute rolling statistics over a 50M-row dataset, turning a 3-minute job into a 48-minute bottleneck that delayed daily model retraining and pushed the pipeline well past its SLA window.
Symptom: The daily ML retraining pipeline started consistently exceeding its 30-minute SLA window. The feature engineering step — which had historically taken 3 minutes — was now taking 48 minutes on the same data volume. The team upgraded the instance from a 4-vCPU machine to a 16-vCPU machine. Runtime barely changed. Monitoring showed exactly one core pegged at 100% while the other 15 sat idle.
Assumption: The initial hypothesis was dataset growth — the training data had grown 30% over the prior quarter, and the team assumed that linearly scaling compute would fix a linear increase in work. Two engineers spent half a day tuning instance types and memory configurations before anyone profiled the actual code.
Root cause: A feature engineering function added in a recent sprint computed a rolling z-score using a Python for-loop:
for i in range(window, len(arr)):
    window_data = arr[i-window:i]
    z = (arr[i] - window_data.mean()) / window_data.std()
    result[i] = z
Each iteration sliced a 100-element subarray (creating a new ndarray view object), called .mean() and .std() on that slice (two more C calls with Python overhead on each), computed the z-score as a Python float, and stored the result. With a 50M-row dataset and window=100, this executed 50M Python-level loop iterations with approximately 200M temporary object allocations across the full run. The 16-core upgrade made no difference because the GIL serialises Python thread execution. Every iteration was inherently single-threaded. More cores did not accelerate a single-threaded Python loop — they just sat idle while one core did all the work.
Fix: Replaced the loop with a vectorised implementation using cumulative sums. Rolling mean:
cumsum = np.cumsum(np.insert(arr, 0, 0))
rolling_mean = (cumsum[window:] - cumsum[:-window]) / window
Rolling variance using the computational formula E[X²] - E[X]²:
cumsum2 = np.cumsum(np.insert(arr**2, 0, 0))
rolling_var = (cumsum2[window:] - cumsum2[:-window]) / window - rolling_mean**2
rolling_std = np.sqrt(np.maximum(rolling_var, 0))  # clip negative floats from precision loss
rolling_z = (arr[window:] - rolling_mean[:-1]) / rolling_std[:-1]  # drop the final window so shapes align with arr[window:]
Runtime dropped from 48 minutes to 11 seconds — a 260x improvement. The pipeline runs comfortably inside its SLA window on the original 4-vCPU instance. Added a mandatory code review rule: any Python loop over more than 10K NumPy elements requires a written justification or a vectorised alternative before merge.
Key Lesson
  • Python loops over NumPy arrays do not benefit from multi-core scaling — the GIL serialises every iteration, and more cores help nothing
  • Scaling hardware does not fix algorithmic inefficiency — profile the code before provisioning larger instances
  • Rolling statistics are vectorisable using cumulative sums — no sequential dependency exists between windows, making the loop-based implementation both unnecessary and extremely costly
  • Set a code review threshold: any loop over more than 10K elements in a data or ML pipeline needs explicit justification or a vectorised alternative
  • Always check CPU core utilisation before scaling — one core at 100% with the rest idle is the signature of a GIL-bound Python loop, not a resource-constrained workload
Production Debug Guide
Symptoms that indicate a Python loop is your bottleneck — and what to verify before reaching for more hardware
Pipeline CPU usage shows one core at 100% while all others idle
This is the signature of a GIL-bound Python loop. Scaling the instance will not help. Profile with %timeit on the loop body to confirm it dominates runtime. If more than 50% of total pipeline time is in the loop, vectorise with NumPy ufuncs or compile with Numba @njit. Do not order a larger instance until the code is fixed.
Runtime scales linearly with array size and the operation should be O(n)
Linear scaling is expected — but the constant factor is the problem. A vectorised O(n) operation has a constant factor of ~0.2ns per element. A Python loop O(n) operation has a constant factor of ~300ns per element. Same complexity, 1500x different constant. Replace with a ufunc and recheck the slope.
Memory usage spikes during loop execution — RSS grows then drops per batch
The loop body is creating intermediate arrays per iteration — slices, copies, temporary results from calling .mean() or .std() on subarrays. Each allocation is a Python-managed ndarray with its own reference count and GC overhead. Vectorise to compute in a single pre-allocated output array, or restructure to avoid per-iteration allocations.
np.vectorize used but benchmarks show no speedup over the original loop
This is expected and correct. np.vectorize wraps a Python loop with thin type-coercion handling — it provides zero execution speedup. If you need actual compiled performance, use Numba @njit. If you need the convenience of applying a Python function element-wise without caring about speed, np.vectorize is fine — but benchmark it and set expectations accordingly.
Loop processes each row of a 2D array independently — vectorised attempt fails with shape errors
The operation likely vectorises along an axis. Try passing axis=0 or axis=1 to the NumPy function. For operations that don't have a direct axis argument, np.apply_along_axis is marginally more convenient than a loop but still runs at Python speed — restructure the computation to use broadcasting across the full 2D array instead, as in the sketch below.
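A minimal sketch of that last point, using an illustrative row-normalisation task (the normalise_row helper and the array shape are made up for the example): np.apply_along_axis reads like vectorisation but still calls the Python function once per row, while the broadcasting version does the same work in one compiled pass.

import numpy as np

matrix = np.random.default_rng(1).standard_normal((1_000, 50))

def normalise_row(row):
    # illustrative per-row function — mean-centre and scale one row
    return (row - row.mean()) / row.std()

# 1. Explicit Python loop over rows — pays interpreter overhead per row
looped = np.array([normalise_row(row) for row in matrix])

# 2. np.apply_along_axis — tidier syntax, but still one Python call per row
applied = np.apply_along_axis(normalise_row, 1, matrix)

# 3. Broadcasting with keepdims — one compiled pass over the whole matrix
vectorised = (matrix - matrix.mean(axis=1, keepdims=True)) / matrix.std(axis=1, keepdims=True)

print(np.allclose(looped, applied), np.allclose(looped, vectorised))  # True True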

The Cost of Python Loops

Every iteration of a Python for-loop over a NumPy array pays a fixed interpreter overhead that has nothing to do with the operation you're trying to compute. The runtime pays for: extracting the element from the array buffer and converting it to a Python float object (boxing), executing the loop body in Python bytecode, managing the reference count on every temporary object, and deallocating those objects at the end of each iteration.

This overhead is approximately 300 nanoseconds per element. It doesn't scale with the complexity of the operation — a simple addition and a complex trigonometric function each cost roughly the same interpreter overhead per iteration. The actual computation is a small fraction of the total cost for simple operations.

A vectorised call like arr.sum() takes a completely different path. NumPy passes a raw C pointer directly to a compiled function that processes the entire array using SIMD (Single Instruction, Multiple Data) instructions. Modern CPUs with AVX2 support process 4 double-precision floats per instruction. There is no Python overhead per element — zero boxing, zero reference counting, zero bytecode interpretation.

The overhead isn't proportional to the work per element — it's a fixed tax. For a trivial operation like addition, the Python overhead is roughly 99% of the total runtime. For a genuinely complex per-element computation, the C code dominates and the Python overhead shrinks in proportion. This is why the gap is largest for simple element-wise operations (100–150x) and smaller for complex per-element work.

In production pipelines, this overhead compounds badly. A 50M-element array processed in a Python loop at 300ns per element takes 15 seconds. The same array processed with a vectorised ufunc takes roughly 100 milliseconds. That's a 150x difference between a pipeline that meets its SLA and one that blows through it by 45 minutes.

io.thecodeforge.numpy.loop_cost.py · PYTHON
# io.thecodeforge.numpy.loop_cost
# Demonstrates the interpreter overhead cost of Python loops vs vectorised operations
# Run with: python loop_cost.py or in a Jupyter notebook with %timeit

import numpy as np
import time
import math

arr = np.random.randn(1_000_000)  # 1M elements, float64

# ---- Python loop: pays interpreter overhead on every element ----
start = time.perf_counter()
total_loop = 0.0
for x in arr:
    total_loop += x
elapsed_loop = time.perf_counter() - start
print(f'Python loop:     {elapsed_loop:.3f}s  (result={total_loop:.4f})')

# ---- Vectorised: hands a C pointer to a compiled SIMD loop ----
start = time.perf_counter()
total_vec = arr.sum()
elapsed_vec = time.perf_counter() - start
print(f'NumPy vectorised: {elapsed_vec:.4f}s  (result={total_vec:.4f})')

print(f'Speedup: {elapsed_loop / elapsed_vec:.0f}x')
# Typical output on a modern CPU:
# Python loop:      0.287s
# NumPy vectorised: 0.0019s
# Speedup: ~150x

# ---- Per-element overhead breakdown ----
# Python loop: ~300ns per element
#   Boxing scalar to Python float:    ~80ns
#   Bytecode execution (BINARY_ADD):  ~100ns
#   Reference count increment:         ~40ns
#   Temporary object deallocation:     ~50ns
#   Loop bookkeeping (JUMP_ABSOLUTE):  ~30ns
#
# Vectorised: ~0.2ns per element
#   Raw memory access from C pointer:  ~0.1ns
#   SIMD add (AVX2 processes 4 doubles per instruction): ~0.1ns
#   Zero Python overhead per element

# ---- The overhead is fixed, not proportional ----
# For arr + 1 (trivial), interpreter overhead is 99% of runtime
# For np.sin(x) (complex), overhead shrinks because C computation dominates

# Simple operation: loop vs vectorised gap is largest
start = time.perf_counter()
result_loop = np.array([x + 1.0 for x in arr])
elapsed_add_loop = time.perf_counter() - start

start = time.perf_counter()
result_vec = arr + 1.0
elapsed_add_vec = time.perf_counter() - start

print(f'\nElement-wise add:')
print(f'  Loop:        {elapsed_add_loop:.3f}s')
print(f'  Vectorised:  {elapsed_add_vec:.4f}s')
print(f'  Speedup:     {elapsed_add_loop / elapsed_add_vec:.0f}x')
# Speedup for addition is typically 100-150x — overhead dominates

# Complex operation: overhead shrinks in proportion
start = time.perf_counter()
result_sin_loop = np.array([math.sin(x) for x in arr])  # math.sin avoids per-scalar ufunc dispatch overhead
elapsed_sin_loop = time.perf_counter() - start

start = time.perf_counter()
result_sin_vec = np.sin(arr)
elapsed_sin_vec = time.perf_counter() - start

print(f'\nElement-wise np.sin:')
print(f'  Loop:        {elapsed_sin_loop:.3f}s')
print(f'  Vectorised:  {elapsed_sin_vec:.4f}s')
print(f'  Speedup:     {elapsed_sin_loop / elapsed_sin_vec:.0f}x')
# Speedup for sin is markedly smaller than for addition — the C computation is now a significant share of the runtime
Mental Model
The Interpreter Tax — Fixed Overhead Per Element
A Python loop pays a ~300ns interpreter tax per element regardless of how simple or complex the operation is. Vectorised calls pay zero per-element Python overhead. The gap is not about algorithmic complexity — it is about where the loop runs.
  • Python loop: ~300ns per element — type check, boxing, bytecode execution, reference counting, deallocation, all for every single element
  • Vectorised call: ~0.2ns per element — raw C pointer passed to compiled code, SIMD processes 4–8 elements per CPU cycle
  • The tax is fixed: a trivial addition and a complex computation cost the same ~300ns in Python overhead
  • The gap closes for complex per-element work where the C computation itself takes time comparable to the Python overhead
  • SIMD (AVX2) processes 4 float64 values per instruction — vectorised code gets parallelism within a single core that Python loops structurally cannot
📊 Production Insight
The 150x speedup number is not a benchmark curiosity — it is the difference between a feature engineering pipeline that runs in 11 seconds and one that runs for 48 minutes on identical hardware.
The failure mode is predictable: an engineer writes a loop because it's the mental model they have, the loop works correctly on test data of 1,000 rows, and nobody notices the O(n) constant factor problem until the dataset grows to 50M rows in production. At that point, the incorrect mental model about what 'should' be fast has already been shipped.
The single most useful habit to develop: when you write a for-loop over a NumPy array, immediately ask 'does result[i] depend on result[i-1]?' If the answer is no, the loop can be vectorised and should be. If the answer is yes, it's a recurrence and you should reach for Numba rather than accepting Python-loop speed.
🎯 Key Takeaway
Python loop overhead is ~300ns per element — a fixed tax, independent of operation complexity. Vectorised NumPy runs at ~0.2ns per element using compiled C and SIMD. The 150x gap is structural: it comes from where the loop executes, not what it computes. The first question to ask about any NumPy loop: does the output at position i depend on the output at position i-1? If not, vectorise.
Loop Cost Decision Tree
If: Loop body is a simple arithmetic operation — addition, subtraction, multiplication, division, comparison
Use: Vectorise with a ufunc immediately. The Python overhead is 99% of runtime for these operations. The fix is one line.
If: Loop body calls a complex external function per element — ML model inference, API call, database query
Use: Python overhead is negligible compared to the external call cost. Focus on batching the calls, not eliminating the loop. A batch inference call for 100 elements is far more impactful than vectorising the loop wrapper.
If: Loop has dependencies between iterations — result[i] uses result[i-1] or earlier results
Use: Cannot vectorise directly. Use Numba @njit to compile the loop to machine code. The sequential dependency is real, but the interpreter overhead is not — Numba preserves the dependency while eliminating the overhead.
If: Loop computes a rolling window statistic — mean, sum, std, variance
Use: Vectorise with cumulative sums — no sequential dependency exists between windows despite the sliding nature of the computation. Use the cumsum trick described in the next section.

Replacing Common Loop Patterns

Most Python loops over NumPy arrays are not doing anything fundamentally sequential. They just look sequential because element-by-element thinking is the default mental model. The patterns that appear most often in production codebases all have direct vectorised replacements — once you learn to recognise the pattern, the replacement is mechanical.

Element-wise arithmetic is the simplest: any loop that computes output[i] = f(input[i]) using arithmetic operators is directly replaceable with broadcasting. NumPy broadcasts scalar operations across the entire array in a single compiled call.

np.diff handles adjacent differences: any loop computing output[i] = arr[i+1] - arr[i] is an np.diff(arr) call.

np.cumsum and np.cumprod handle running totals and products. They also enable the rolling window trick, which is the most impactful pattern to recognise.

np.clip handles value clamping without a conditional per element. np.where handles two-branch conditionals. np.select handles multiple branches.

The rolling window pattern deserves particular attention because it appears constantly in time-series and signal processing code and is non-obvious to vectorise at first glance. The key insight is that a rolling sum of window W at position i is: sum(arr[i-W:i]) = cumsum[i] - cumsum[i-W]. If you have the prefix sum array precomputed, every window sum is a single subtraction — O(1) per element, no inner loop, no temporary array allocation. Rolling mean follows immediately. Rolling variance uses the computational formula Var(X) = E[X²] - E[X]², decomposed the same way with a second prefix sum over the squared values.

io.thecodeforge.numpy.patterns.py · PYTHON
# io.thecodeforge.numpy.patterns
# Direct replacements for the most common Python loop patterns over NumPy arrays
# Each pattern shown as: loop version (slow) -> vectorised version (fast)

import numpy as np

arr = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# ---- Pattern 1: Adjacent differences ----
# Loop (slow):
diffs_loop = [arr[i+1] - arr[i] for i in range(len(arr) - 1)]
# Vectorised:
diffs = np.diff(arr)
print(f'np.diff:    {diffs}')  # [1. 1. 1. 1. 1. 1. 1.]

# ---- Pattern 2: Cumulative sum ----
# Loop (slow):
cumulative_loop = [sum(arr[:i+1]) for i in range(len(arr))]
# Vectorised:
cumulative = np.cumsum(arr)
print(f'np.cumsum:  {cumulative}')  # [ 1.  3.  6. 10. 15. 21. 28. 36.]

# ---- Pattern 3: Value clamping ----
# Loop (slow):
clipped_loop = [min(max(x, 2.0), 6.0) for x in arr]
# Vectorised:
clipped = np.clip(arr, 2.0, 6.0)
print(f'np.clip:    {clipped}')  # [2. 2. 3. 4. 5. 6. 6. 6.]

# ---- Pattern 4: Conditional assignment — two branches ----
# Loop (slow):
result_loop = np.empty_like(arr)
for i in range(len(arr)):
    if arr[i] > 4.0:
        result_loop[i] = arr[i] * 10
    else:
        result_loop[i] = arr[i] * 0.5

# Vectorised with np.where:
result_where = np.where(arr > 4.0, arr * 10, arr * 0.5)
print(f'np.where:   {result_where}')  # [ 0.5  1.   1.5  2.   50.  60.  70.  80. ]

# Vectorised with boolean masking (preferred when branches have very different costs):
# Boolean masking only evaluates the branch for the subset that matches
result_mask = np.empty_like(arr)
mask = arr > 4.0
result_mask[mask] = arr[mask] * 10
result_mask[~mask] = arr[~mask] * 0.5
print(f'bool mask:  {result_mask}')   # same result

# Note: np.where(cond, x, y) evaluates BOTH x and y for all elements,
# then selects. Boolean masking evaluates each branch only for the matching
# subset — more efficient when one branch is expensive and rarely true.

# ---- Pattern 5: Multiple branches — np.select ----
# Extends np.where to N conditions
conditions = [arr < 2.0, (arr >= 2.0) & (arr < 5.0), arr >= 5.0]
choices = [arr * 0.0, arr * 1.0, arr * 2.0]  # zero, keep, double
result_select = np.select(conditions, choices, default=arr)
print(f'np.select:  {result_select}')  # [ 0.  2.  3.  4. 10. 12. 14. 16.]

# ---- Pattern 6: Rolling window statistics (the key production pattern) ----
# Loop (slow) — creates 1 temporary array per iteration:
# for i in range(window, len(arr)):
#     rolling_mean[i] = arr[i-window:i].mean()

def rolling_stats(arr, window):
    """Rolling mean and std using cumulative sums. O(n), no Python loop."""
    # Pad with zero at position 0 so indexing arithmetic works cleanly
    cs = np.cumsum(np.insert(arr, 0, 0.0))    # prefix sum
    cs2 = np.cumsum(np.insert(arr**2, 0, 0.0))  # prefix sum of squares

    rolling_sum = cs[window:] - cs[:-window]
    rolling_sum2 = cs2[window:] - cs2[:-window]

    mean = rolling_sum / window
    # Computational formula: Var = E[X^2] - E[X]^2
    # Clip to 0 to avoid negative values from floating-point precision loss
    var = np.maximum(rolling_sum2 / window - mean**2, 0.0)
    std = np.sqrt(var)

    return mean, std

mean, std = rolling_stats(arr, window=3)
print(f'Rolling mean (w=3): {mean}')  # [2. 3. 4. 5. 6. 7.]
print(f'Rolling std  (w=3): {std}')   # [~0.816 ~0.816 ...]

# ---- Pattern 7: Normalisation across rows of a 2D array ----
# Loop (slow):
matrix = np.random.randn(1000, 50)  # 1000 rows, 50 features
norm_loop = np.empty_like(matrix)
for i in range(len(matrix)):
    norm_loop[i] = (matrix[i] - matrix[i].mean()) / matrix[i].std()

# Vectorised with broadcasting — keepdims preserves shape for broadcasting:
means = matrix.mean(axis=1, keepdims=True)  # shape (1000, 1)
stds = matrix.std(axis=1, keepdims=True)    # shape (1000, 1)
norm_vec = (matrix - means) / stds          # broadcasts across 50 columns
print(f'Row normalisation matches: {np.allclose(norm_loop, norm_vec)}')
💡Pattern Recognition Cheat Sheet
  • Element-wise math: output[i] = f(arr[i]) — use arr * scale, arr + offset, np.exp(arr), etc. Broadcasting handles scalar and array operands.
  • Adjacent differences: arr[i+1] - arr[i] — use np.diff(arr). For higher-order differences, np.diff(arr, n=2) gives second differences.
  • Cumulative operations: running total or product — use np.cumsum(arr) or np.cumprod(arr). Both O(n), no loop.
  • Value clamping: min(max(x, lo), hi) — use np.clip(arr, lo, hi). Note that NaN values propagate through np.clip unchanged — clean or mask NaNs separately if they matter.
  • Two-branch conditional: if/else per element — use np.where(condition, x, y). For expensive branches, prefer boolean masking to avoid evaluating both branches.
  • Rolling window stats: sliding mean, std, variance — use the cumulative sum trick with prefix sums. One np.cumsum call replaces the entire inner loop.
  • Row/column operations on 2D arrays: normalise, scale, shift per row — use axis parameter and keepdims=True for broadcasting. No explicit loop over rows.
📊 Production Insight
Rolling window loops are the single most common hidden bottleneck in ML feature engineering pipelines. They appear everywhere — time series preprocessing, signal normalization, technical indicators, anomaly detection — and they look inherently sequential because you're sliding a window along the array. But they're not sequential. Each window is independent.
The cumulative sum trick works because addition is associative: sum(arr[i:i+W]) = cumsum[i+W] - cumsum[i]. Once you have the prefix sum, any window sum is a single subtraction. No inner loop, no temporary array allocation per window, no Python overhead at all. For a 50M-element array, this replaces 50M Python iterations with two np.cumsum calls and one np.subtract call — all in C.
The floating-point precision issue with rolling variance is real and worth handling explicitly. When computing E[X²] - E[X]², the subtraction of two large numbers can produce small negative values due to floating-point rounding. The np.maximum(var, 0.0) call before sqrt prevents domain errors. This is not a theoretical concern — it fires on real production data with near-constant windows.
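A small, hedged demonstration of that precision effect — the data below (a huge offset with a tiny spread) is deliberately constructed so the rounding is visible; exact values vary by machine:

import numpy as np

rng = np.random.default_rng(0)
arr = 1e6 + rng.standard_normal(1_000_000) * 1e-3   # near-constant data with a large offset
window = 100

cs = np.cumsum(np.insert(arr, 0, 0.0))
cs2 = np.cumsum(np.insert(arr**2, 0, 0.0))
mean = (cs[window:] - cs[:-window]) / window
var = (cs2[window:] - cs2[:-window]) / window - mean**2   # E[X^2] - E[X]^2

print('windows with negative variance:', int((var < 0).sum()))  # typically non-zero on this data
safe_std = np.sqrt(np.maximum(var, 0.0))   # clip before sqrt so no NaNs appear
print('NaNs after clipping:', int(np.isnan(safe_std).sum()))    # 0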
🎯 Key Takeaway
Most loop patterns have a one-line NumPy replacement. The five to memorize: np.diff for adjacent differences, np.cumsum for running totals, np.clip for clamping, np.where for conditionals, and the cumsum trick for rolling windows. Pattern recognition is the skill — the replacement is mechanical once you see the structure. The rolling window trick is the highest-value one to internalize because it appears constantly and the speedup is largest.

When You Cannot Avoid a Loop — np.vectorize vs Numba

Some operations genuinely cannot be vectorised because each output depends on a previously computed output — not on the input at the same position. Running maximums, exponential smoothing, Fibonacci-style recurrences, and certain signal filters all have this structure. These are not problems of pattern recognition — they are structurally sequential. The question then is not whether to loop, but where the loop should run.

np.vectorize is the most misunderstood function in NumPy. Engineers see 'vectorize' in the name and assume it implies compiled execution or parallelism. It implies neither. Reading the NumPy documentation directly: 'The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy.' The implementation is a Python loop with thin type-coercion wrapping on each side. It provides convenience — you can pass array arguments to a function that expects scalars — not speed. Benchmarking np.vectorize against a hand-written Python loop reliably shows the same runtime, or slightly worse due to the extra function call overhead per element.

For genuinely sequential operations that must run fast, Numba is the correct tool. The @njit decorator ('no Python, just-in-time') compiles the decorated function to LLVM machine code on the first call. The compiled function runs with zero Python overhead — the sequential logic is preserved, but the interpreter is completely bypassed. Compilation takes roughly 0.5–2 seconds on the first call, after which the compiled version is cached. For a loop over 10M elements that previously took 3 seconds in Python, the Numba-compiled version typically runs in 20–50 milliseconds — a 60–150x improvement.

The Numba constraint is that all types must be statically inferable from the function signature. This means: no Python dicts or sets inside @njit functions, no arbitrary Python objects, and no calls to Python functions that aren't themselves @njit-compiled. For data pipeline work using NumPy arrays and scalars, this is almost never a limitation in practice.

Cython is an alternative when the team already uses it and wants a separate build step with explicit type annotations in a .pyx file. The performance outcome is comparable to Numba. Numba requires less code change — a decorator on an existing Python function versus a rewrite in Cython syntax — which makes it the right default for most teams.

io.thecodeforge.numpy.vectorize_vs_numba.py · PYTHON
# io.thecodeforge.numpy.vectorize_vs_numba
# Demonstrates:
# 1. np.vectorize — what it actually does (convenience, not speed)
# 2. Numba @njit — genuine compiled performance for sequential logic
# 3. Side-by-side benchmark showing the difference

import numpy as np
import time

rng = np.random.default_rng(42)
arr = rng.standard_normal(10_000_000).astype(np.float64)  # 10M elements

# ---- np.vectorize: the convenience wrapper that is NOT faster ----
# Under the hood: Python loop + per-element type coercion
# Use this when you want array broadcasting on a scalar function
# Do NOT use this when you want speed

def clip_scalar(x, lo, hi):
    """Works on scalars. np.vectorize will call this once per element."""
    return lo if x < lo else (hi if x > hi else x)

vclip = np.vectorize(clip_scalar)

# For a small demo array:
small = np.array([-0.5, 0.3, 0.7, 1.5, -0.1])
print(f'np.vectorize result: {vclip(small, 0.0, 1.0)}')
# [0.  0.3 0.7 1.  0. ]

# Benchmark: np.vectorize vs raw loop vs numpy built-in
start = time.perf_counter()
_ = vclip(arr, 0.0, 1.0)
time_vectorize = time.perf_counter() - start

start = time.perf_counter()
result_py = np.empty_like(arr)
for i, x in enumerate(arr):
    result_py[i] = 0.0 if x < 0.0 else (1.0 if x > 1.0 else x)
time_loop = time.perf_counter() - start

start = time.perf_counter()
_ = np.clip(arr, 0.0, 1.0)  # built-in ufunc
time_numpy = time.perf_counter() - start

print(f'np.vectorize: {time_vectorize:.2f}s')
print(f'Python loop:  {time_loop:.2f}s')
print(f'np.clip:      {time_numpy:.4f}s')
# Typical output:
# np.vectorize: 4.87s  <- same order as the loop, sometimes worse
# Python loop:  4.61s  <- the baseline we're trying to beat
# np.clip:      0.031s <- 150x faster — this is what vectorisation means

# ---- Numba @njit: compiled sequential logic ----
# Use this when the loop has genuine sequential dependencies
# (result[i] depends on result[i-1]) and cannot be vectorised

try:
    from numba import njit

    @njit
    def running_max(arr):
        """
        Running maximum: result[i] = max(arr[0], arr[1], ..., arr[i]).
        Genuinely sequential — result[i] depends on result[i-1].
        Cannot be vectorised with standard NumPy.
        """
        result = np.empty_like(arr)
        result[0] = arr[0]
        for i in range(1, len(arr)):
            result[i] = result[i-1] if result[i-1] > arr[i] else arr[i]
        return result

    @njit
    def exponential_smooth(arr, alpha):
        """
        Exponential weighted moving average (EWMA).
        result[i] = alpha * arr[i] + (1 - alpha) * result[i-1]
        Classic sequential recurrence — each output depends on the prior output.
        """
        result = np.empty_like(arr)
        result[0] = arr[0]
        for i in range(1, len(arr)):
            result[i] = alpha * arr[i] + (1.0 - alpha) * result[i-1]
        return result

    # Warmup call — triggers LLVM compilation (~0.5-2s one-time cost)
    _ = running_max(arr[:100])
    _ = exponential_smooth(arr[:100], 0.1)

    # Benchmark compiled Numba vs equivalent Python loop
    start = time.perf_counter()
    result_nb = running_max(arr)
    time_numba = time.perf_counter() - start

    # Python equivalent for comparison
    start = time.perf_counter()
    result_py2 = np.empty_like(arr)
    result_py2[0] = arr[0]
    for i in range(1, len(arr)):
        result_py2[i] = max(result_py2[i-1], arr[i])
    time_python2 = time.perf_counter() - start

    print(f'\nRunning max (10M elements):')
    print(f'  Python loop: {time_python2:.2f}s')
    print(f'  Numba @njit: {time_numba:.4f}s')
    print(f'  Speedup:     {time_python2 / time_numba:.0f}x')
    print(f'  Results match: {np.allclose(result_nb, result_py2)}')
    # Typical output:
    # Python loop: 3.84s
    # Numba @njit: 0.022s
    # Speedup:     ~175x

except ImportError:
    print('Numba not installed — pip install numba')
    print('For environments without Numba, Cython is an alternative')
    print('For pandas integration, consider bottleneck or scipy alternatives')

# ---- Summary: what to use when ----
# np.vectorize: when you want to apply a scalar Python function element-wise
#               for API convenience — accept Python loop speed
# np.clip / np.where / ufuncs: when the operation has a NumPy equivalent
#               — 100-300x faster than any loop
# Numba @njit: when the loop has genuine sequential dependencies
#               — 50-200x faster than Python loop, near-C speed
# Cython: when team already uses it and prefers explicit type annotations
#               — similar performance to Numba, more setup overhead
⚠ np.vectorize Is Not a Performance Tool — The Name Is the Trap
📊 Production Insight
np.vectorize is the single most reliable source of false confidence in NumPy performance. The pattern is predictable: an engineer has a slow loop, learns about np.vectorize, wraps the loop body in it, sees no speedup, and concludes that the loop is 'already as fast as it can get.' The loop is not as fast as it can get. The engineer used the wrong tool.
The correct mental model for non-vectorisable loops is to think of Numba @njit as a compiler directive, not a library call. You write Python code inside an @njit function, and Numba compiles it to machine code using LLVM. The Python interpreter is not involved at runtime. The sequential logic you needed is preserved. The interpreter overhead you wanted to eliminate is gone.
One practical Numba tip: if the @njit function fails to compile, the error message identifies the specific Python construct that Numba cannot handle. The common ones are: passing a Python list (convert to np.array first), using f-strings or print inside @njit (Numba supports a limited print), and calling a non-@njit Python function from inside @njit code. Each fix is usually one line.
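A small sketch of the list-to-array fix mentioned above — convert at the call boundary rather than passing the list into compiled code (the total helper is illustrative):

import numpy as np
from numba import njit

@njit
def total(values):
    # plain summation loop — compiles cleanly because `values` arrives as a typed NumPy array
    s = 0.0
    for v in values:
        s += v
    return s

raw = [0.5, 1.5, 2.5]            # a plain Python list arriving from upstream code
print(total(np.asarray(raw)))     # 4.5 — one conversion at the boundary, then compiled speed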
🎯 Key Takeaway
np.vectorize is API convenience with zero performance benefit — it runs a Python loop. The name is the trap. For genuinely sequential loops with inter-iteration dependencies, Numba @njit compiles to machine code in under two seconds of one-time setup cost and runs 50–200x faster than the Python equivalent. Always benchmark np.vectorize against the raw loop before concluding anything about performance — they will match, which tells you nothing has improved.
Non-Vectorisable Loop Decision Tree
If: Loop has sequential dependency — result[i] depends on result[i-1] or earlier outputs
Use: Numba @njit. The sequential dependency is structural and must be preserved — but the interpreter overhead does not. Numba compiles the exact same logic to machine code.
If: Loop body calls an external function per element — HTTP API, database query, subprocess
Use: Cannot vectorise or compile — the bottleneck is the external call, not Python overhead. Batch the calls where the API supports it, or use asyncio/concurrent.futures for I/O-bound parallelism.
If: One-off script, Numba is not installed, loop runs once on a dataset small enough to tolerate Python speed
Use: np.vectorize is acceptable for convenience — it saves you from writing the loop explicitly. Accept the Python loop speed. If this becomes a production pipeline, revisit.
If: Loop looks like it needs sequential access but on inspection the output at i only depends on inputs, not prior outputs
Use: The loop IS vectorisable — it just looks sequential because it's written that way. Find the NumPy equivalent (likely broadcasting, cumsum, or a ufunc) and replace it.

Broadcasting and Memory Layout: The Hidden Performance Factors

Vectorisation eliminates Python interpreter overhead, but two additional factors determine how fast the resulting C-compiled operations run: whether NumPy can apply broadcasting without creating intermediate copies, and whether the memory access pattern aligns with CPU cache lines.

Broadcasting lets NumPy perform operations between arrays of different shapes without allocating expanded copies. When you subtract a per-row mean of shape (1000, 1) from a matrix of shape (1000, 50), NumPy broadcasts the subtraction without materialising a (1000, 50) copy of the mean vector. This is fast. When the operation requires an intermediate result that doesn't fit the broadcast rules, NumPy allocates a full-size temporary array — and that allocation, which happens in C but still consumes memory bandwidth, can dominate the runtime for large arrays.

Memory layout — C-order (row-major) versus Fortran-order (column-major) — matters because CPU caches load contiguous memory efficiently. A C-order NumPy array stores rows contiguously. Iterating over rows, or applying operations along axis=1, accesses contiguous memory and benefits from cache prefetching. Iterating over columns accesses every nth element in memory, causing cache misses on every access. For large arrays, cache-unfriendly access patterns can cost 5–10x in throughput versus cache-friendly access, independent of Python versus C execution.

In-place operations avoid allocation entirely. arr += 1.0 modifies the array in place and allocates no temporary. arr = arr + 1.0 allocates a new array of the same size to hold the result, and the original is only freed once nothing else references it. For a 1GB array, that allocation difference is the difference between 1GB and 2GB of peak memory usage — which may or may not matter, but it is always worth being explicit about.

io.thecodeforge.numpy.broadcasting_memory.py · PYTHON
# io.thecodeforge.numpy.broadcasting_memory
# Broadcasting efficiency, memory layout effects, and in-place operations
# These factors affect performance after you've already eliminated the Python loop

import numpy as np
import time

# ---- Broadcasting: no copies for shape-compatible operations ----
rng = np.random.default_rng(0)
matrix = rng.standard_normal((10_000, 500))  # 10K rows, 500 columns

# Row normalisation via broadcasting — no temporary (10000, 500) copy of means
means = matrix.mean(axis=1, keepdims=True)  # shape: (10000, 1)
stds = matrix.std(axis=1, keepdims=True)    # shape: (10000, 1)
# The subtraction broadcasts (10000, 1) across 500 columns without materialising
# a full (10000, 500) copy of the means
start = time.perf_counter()
norm = (matrix - means) / stds
time_broadcast = time.perf_counter() - start
print(f'Broadcasting normalisation: {time_broadcast:.4f}s')

# Equivalent with explicit expansion — forces a copy allocation
means_expanded = np.tile(means, (1, 500))   # shape: (10000, 500) — full copy
start = time.perf_counter()
norm_tiled = (matrix - means_expanded) / stds
time_tiled = time.perf_counter() - start
print(f'Tiled (explicit copy):      {time_tiled:.4f}s')
print(f'Broadcasting speedup:       {time_tiled / time_broadcast:.1f}x')
# Broadcasting is typically 1.5-3x faster because it avoids the tile allocation

# ---- Memory layout: C-order vs Fortran-order ----
# C-order: rows are contiguous — efficient for row-wise operations
# F-order: columns are contiguous — efficient for column-wise operations

c_arr = np.ascontiguousarray(rng.standard_normal((5000, 5000)))  # C-order
f_arr = np.asfortranarray(c_arr.copy())                          # F-order

# Row sum: accessing contiguous elements in C-order
start = time.perf_counter()
for _ in range(5):
    _ = c_arr.sum(axis=1)  # sum across columns per row
time_c_row = (time.perf_counter() - start) / 5

start = time.perf_counter()
for _ in range(5):
    _ = f_arr.sum(axis=1)  # same operation, Fortran layout — cache-unfriendly
time_f_row = (time.perf_counter() - start) / 5

print(f'\nRow sum, C-order:        {time_c_row:.4f}s')
print(f'Row sum, Fortran-order:  {time_f_row:.4f}s')
print(f'Layout speedup (row op): {time_f_row / time_c_row:.1f}x')
# C-order is typically 2-5x faster for row operations due to cache locality

# Verify memory layout:
print(f'\nc_arr is C-contiguous: {c_arr.flags["C_CONTIGUOUS"]}')
print(f'f_arr is F-contiguous: {f_arr.flags["F_CONTIGUOUS"]}')

# ---- In-place operations: avoid intermediate allocations ----
large = rng.standard_normal(10_000_000).astype(np.float64)

# Out-of-place: allocates an 80MB temporary array
start = time.perf_counter()
result_oop = large + 1.0  # allocates new array
time_oop = time.perf_counter() - start

# In-place: no allocation — modifies large directly
large_copy = large.copy()  # preserve original for fair comparison
start = time.perf_counter()
large_copy += 1.0  # modifies in place, no allocation
time_inplace = time.perf_counter() - start

print(f'\nOut-of-place (large + 1.0): {time_oop:.4f}s')
print(f'In-place (large += 1.0):    {time_inplace:.4f}s')
print(f'In-place speedup:           {time_oop / time_inplace:.1f}x')
# In-place is typically 1.5-2x faster for large arrays
# because it avoids memory allocation and halves the memory bandwidth required

# ---- Practical rule: check array flags before assuming layout ----
print(f'\nCheck any array layout with arr.flags:')
print(large.flags)
💡Three Things That Slow Down Even Vectorised Code
  • Unnecessary intermediate copies: np.tile, np.repeat, and explicit reshapes before broadcasting can force full-size copies that broadcasting itself would avoid. Use keepdims=True and let NumPy broadcast.
  • Cache-unfriendly access patterns: column-wise operations on C-order arrays (or row-wise on Fortran-order) cause cache misses on every access. Use arr.T for efficient column operations, or ensure your array layout matches your primary access pattern.
  • Out-of-place vs in-place: arr = arr + 1.0 allocates a new array; arr += 1.0 does not. For arrays larger than ~100MB, this allocation is a measurable fraction of runtime and doubles peak memory usage.
  • Check array contiguity with arr.flags before optimising memory access patterns — np.ascontiguousarray(arr) forces C-order layout if the array is not already contiguous.
  • np.einsum is often faster than chained operations for complex tensor contractions — it fuses operations and avoids intermediate allocations that separate np.dot and np.sum calls would create (see the sketch below).
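As a small illustration of the einsum point above — row-wise dot products where the chained expression materialises a full-size temporary and einsum fuses the multiply with the reduction (shapes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((100_000, 64))
b = rng.standard_normal((100_000, 64))

chained = (a * b).sum(axis=1)          # allocates a (100000, 64) temporary for a * b
fused = np.einsum('ij,ij->i', a, b)    # multiply and reduce in one pass, no intermediate

print(np.allclose(chained, fused))      # True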
📊 Production Insight
The broadcasting and memory layout issues become significant at array sizes above roughly 100MB — below that, the allocation cost and cache miss penalty are small enough that other factors dominate. In ML feature pipelines processing millions of rows with hundreds of features, these factors start mattering.
The most common production case is row normalisation on a large feature matrix. The canonical implementation is (matrix - matrix.mean(axis=1, keepdims=True)) / matrix.std(axis=1, keepdims=True). Both operations broadcast correctly, no intermediate copies, and the keepdims=True handles the shape arithmetic automatically. Engineers who haven't seen keepdims=True will often compute the mean, reshape it manually with mean.reshape(-1, 1), and achieve the same result — but the explicit reshape communicates intent less clearly and is marginally more error-prone when the array dimensionality changes.
🎯 Key Takeaway
After eliminating Python loop overhead, the remaining performance levers are memory allocation (in-place vs out-of-place), cache locality (C-order vs F-order for your access pattern), and broadcasting efficiency (keepdims=True avoids intermediate copies). For arrays above ~100MB in production pipelines, these factors produce measurable differences. Profile with memory_profiler and line_profiler after vectorising loops — the bottleneck often shifts.
Memory and Layout Optimisation Decision Tree
If: Vectorised operation is still slower than expected — profiling shows memory allocation time dominates
Use: Switch to in-place operations (+=, -=, *=, /=) for large arrays. Use the out= parameter of ufuncs where available to write directly into a pre-allocated output array (see the sketch after this decision tree).
If: Operation accesses a 2D array along the short dimension (many small rows or many small columns)
Use: Check arr.flags for memory layout. If operations are column-wise on a C-order array, consider transposing to F-order or restructuring to row-wise access.
If: Broadcasting operation requires np.tile or np.repeat before the main computation
Use: Remove the tile/repeat and use keepdims=True on the aggregation. NumPy will broadcast without materialising the full-size copy.
If: Multiple chained array operations on large arrays — each creates an intermediate
Use: Consider np.einsum for fused tensor contractions, or use numexpr for element-wise expression fusion that avoids intermediate allocation.
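A brief sketch of the pre-allocated out= pattern referenced in the first branch above (the array size and the chained steps are arbitrary):

import numpy as np

arr = np.random.default_rng(0).standard_normal(10_000_000)
out = np.empty_like(arr)               # allocate the output buffer once, outside the hot path

np.multiply(arr, 2.0, out=out)         # result written straight into `out`, no new allocation
np.add(out, 1.0, out=out)              # further steps chained in place on the same buffer
print(np.allclose(out, arr * 2.0 + 1.0))  # True — same values, fewer temporaries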
🗂 NumPy Loop Optimisation Methods: Complete Comparison
Performance, trade-offs, and use cases for each approach — choose based on whether the loop has sequential dependencies and how much setup overhead is acceptable
Method: Vectorised ufunc / broadcasting
  Speedup vs Python loop: 100–300x
  When to use: Element-wise arithmetic, comparisons, math functions, aggregations — any operation where output[i] depends only on input[i] or the whole array
  Trade-off: Requires expressing the logic as NumPy operations. Not all logic translates directly, but most element-wise operations do.
Method: Boolean masking / np.where / np.select
  Speedup vs Python loop: 100–200x
  When to use: Conditional assignment per element — if/else logic, value replacement, multi-branch selection
  Trade-off: np.where evaluates both branches for all elements before selecting — use boolean masking when one branch is expensive and rarely true, to evaluate only the matching subset.
Method: np.diff / np.cumsum / np.clip / rolling cumsum trick
  Speedup vs Python loop: 50–150x
  When to use: Adjacent differences, running totals, value clamping, rolling window statistics — specific mathematical patterns with direct NumPy equivalents
  Trade-off: Only applies when the operation fits one of these patterns. The rolling cumsum trick only works for associative operations (sum, sum of squares) — not median, quantile, or mode.
Method: Numba @njit
  Speedup vs Python loop: 50–200x
  When to use: Sequential dependencies (result[i] uses result[i-1]), recurrences, exponential smoothing, running statistics that require prior output values
  Trade-off: One-time LLVM compilation cost of 0.5–2s on first call. Types must be statically inferable — no arbitrary Python objects inside @njit functions. Numba must be installed separately.
Method: np.vectorize
  Speedup vs Python loop: 0–5% (negligible, sometimes negative)
  When to use: API convenience only — applying a scalar Python function to an array input when you do not care about performance and want to avoid writing the loop yourself
  Trade-off: No performance benefit at all. The name is misleading. Benchmark against the raw loop before using — the runtimes will match. Never use this expecting speedup.
Method: Cython
  Speedup vs Python loop: 50–200x
  When to use: Large production codebases that already use Cython, or when explicit C-level type annotations are preferred over Numba's type inference
  Trade-off: Requires separate .pyx files, explicit type annotations, and a build step integrated into the package setup. Higher setup overhead than Numba for new code.
Method: In-place operations + keepdims broadcasting
  Speedup vs Python loop: 1.5–3x (additive, after other optimisations)
  When to use: Large arrays (>100MB) where memory allocation time is a measurable fraction of total operation time
  Trade-off: Modifies the original array — requires an explicit .copy() if the original must be preserved. Cannot always replace out-of-place operations when intermediate results are needed.

🎯 Key Takeaways

  • Python loops over NumPy arrays pay ~300ns per element in interpreter overhead — a fixed tax independent of operation complexity. Vectorised ufuncs pay ~0.2ns per element in compiled C. The 150x gap is structural, not incidental.
  • np.vectorize is API convenience with zero performance benefit — it runs a Python loop. The name is misleading. Benchmark it against the raw loop to confirm they match, then use the right tool instead.
  • Most loop patterns have a direct NumPy replacement: np.diff for adjacent differences, np.cumsum for running totals, np.clip for clamping, np.where for conditionals. The rolling window pattern — the most common hidden bottleneck — is vectorisable using the prefix sum trick.
  • For loops with genuine sequential dependencies (result[i] depends on result[i-1]), use Numba @njit — it compiles the exact same logic to machine code with 50–200x speedup over the Python loop while preserving the sequential structure.
  • Scaling hardware does not fix algorithmic inefficiency — the GIL serialises Python loops regardless of core count. Profile first, identify the actual bottleneck, then vectorise or compile before reaching for a larger instance.

⚠ Common Mistakes to Avoid

    Using np.vectorize expecting a performance speedup
    Symptom

    Code is 'vectorised' with np.vectorize, benchmarks show identical runtime to the raw Python loop, engineer concludes the loop 'cannot be optimised further'

    Fix

    np.vectorize is a Python loop with type-coercion wrapping — it provides zero execution speedup. For actual performance improvement: if the operation is expressible as NumPy ufuncs or broadcasting, write it that way. If the loop has genuine sequential dependencies, use Numba @njit. The name is the trap — benchmark np.vectorize against the raw loop first to confirm they match, then choose the right tool.

    Writing a Python loop for element-wise arithmetic instead of broadcasting
    Symptom

    Simple scaling, shifting, or clipping operations run 100x slower than expected — monitoring shows one CPU core at 100%, others idle

    Fix

    Replace with broadcasting expressions: arr * scale, arr + offset, np.clip(arr, lo, hi), np.exp(arr), np.log(arr). NumPy broadcasts scalars and shape-compatible arrays across the full array in compiled C. These are single function calls with zero Python overhead per element.

    Writing a Python loop for rolling window statistics
    Symptom

    Feature engineering pipeline runs for minutes instead of seconds — profiling shows the loop creating temporary array allocations per iteration, RSS growing and shrinking in a sawtooth pattern

    Fix

    Use the cumulative sum trick: cs = np.cumsum(np.insert(arr, 0, 0)); rolling_sum = cs[window:] - cs[:-window]; rolling_mean = rolling_sum / window. Extend to variance with a second prefix sum over squared values. No Python loop, no temporary allocations, O(n) with a small constant. Clip variance to zero before sqrt to handle floating-point precision artifacts.

    Scaling hardware to fix a Python loop bottleneck
    Symptom

    Instance upgraded from 4 to 16 vCPU and runtime improves by less than 10% — monitoring confirms one core at 100% and the rest idle throughout the run

    Fix

    The GIL prevents multiple threads from executing Python bytecode simultaneously. More cores do not parallelize a Python loop. Vectorise the loop with NumPy or compile it with Numba @njit. Fix the algorithm — then evaluate whether additional cores provide further benefit through NumPy's internal BLAS/LAPACK threading.

    Optimising the wrong loop — spending time on a 2% bottleneck while ignoring the 95% bottleneck
    Symptom

    Engineer spends a day vectorising a loop that shows no measurable improvement to total pipeline runtime — the real bottleneck is I/O, memory allocation, or a different loop entirely

    Fix

    Profile first with %timeit or cProfile before writing any optimisation code. Use line_profiler for per-line timing within functions. Identify that the loop is actually the bottleneck before optimising it. A rule of thumb: if the loop accounts for less than 20% of total runtime, optimising it will not move the needle on the metric you care about.
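A minimal sketch of the profile-first habit — the pipeline_step and slow_feature names here are placeholders for your own code, and the point is only to confirm where the time actually goes before optimising:

import cProfile
import timeit
import numpy as np

arr = np.random.default_rng(0).standard_normal(1_000_000)

def slow_feature(arr):
    # suspected bottleneck — a Python loop disguised as a list comprehension
    return np.array([x * 2.0 + 1.0 for x in arr])

def pipeline_step(arr):
    features = slow_feature(arr)
    return features.sum()

# Time the suspect in isolation
print(timeit.timeit(lambda: slow_feature(arr), number=3) / 3, 'seconds per call')

# Profile the whole step — check what fraction of cumulative time slow_feature accounts for
cProfile.run('pipeline_step(arr)', sort='cumulative')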

Interview Questions on This Topic

  • Q (Junior): Why is iterating over a NumPy array with a Python for-loop slow?
    The root cause is that Python for-loops over NumPy arrays execute in the Python interpreter, paying a fixed overhead per element that has nothing to do with the computation being performed. For each iteration, the interpreter: extracts the element from the array's C buffer, boxes it as a Python float object (allocating a Python object on the heap), executes the loop body in Python bytecode, manages reference counts on every temporary object, and deallocates those objects at the end of the iteration. This overhead runs to approximately 300 nanoseconds per element — regardless of whether the operation inside the loop is a trivial addition or a complex computation. A vectorised call like arr.sum() takes a completely different path: NumPy hands a raw C pointer directly to a compiled function. The compiled function runs a C-level loop with SIMD instructions — modern CPUs process 4 float64 values per instruction with AVX2. There is no Python overhead per element at all. For a 1M-element array with simple addition, the Python loop takes roughly 300ms and the vectorised call takes roughly 2ms — a 150x difference. The gap is largest for simple operations where the Python overhead dominates the actual computation, and shrinks for complex per-element work where the C computation starts to take significant time itself.
  • Q (Junior): What does np.vectorize actually do under the hood?
    np.vectorize is a Python function that calls your provided function once per element of the input array, in a Python for-loop. It does not compile your function to C or machine code. It does not parallelise execution. It does not use SIMD instructions. The NumPy documentation says explicitly: 'The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function' and 'The vectorize function is provided primarily for convenience, not for performance.' What np.vectorize actually provides is: automatic handling of NumPy broadcasting for scalar functions (so a function that expects scalar inputs works with array inputs), automatic stacking of scalar return values into an output array, and optional support for specifying output dtype and excluded parameters. The performance is identical to — or marginally worse than — a hand-written Python loop, because it adds a thin layer of type-coercion and broadcasting handling on top of the same per-element Python function calls. The name is the problem. Engineers see 'vectorize' and assume it implies compiled execution. It doesn't. For actual performance improvement on non-vectorisable operations, use Numba's @njit decorator, which compiles the decorated function to LLVM machine code on first call and runs with no Python overhead on subsequent calls.
  • Q (Senior): A feature engineering pipeline processes a 50M-row dataset with a rolling window loop. The loop computes rolling mean and std for a window of 100. How would you optimise this?
    This is the rolling window pattern, and it's fully vectorisable using prefix sums — no sequential dependency exists between windows despite appearances. For rolling mean: compute the prefix sum with a zero prepended, then use strided indexing to get the window sum at each position.
    cumsum = np.cumsum(np.insert(arr, 0, 0.0))
    rolling_sum = cumsum[window:] - cumsum[:-window]
    rolling_mean = rolling_sum / window
    For rolling variance, use the computational formula Var(X) = E[X²] - E[X]²:
    cumsum2 = np.cumsum(np.insert(arr**2, 0, 0.0))
    rolling_sum2 = cumsum2[window:] - cumsum2[:-window]
    rolling_var = rolling_sum2 / window - rolling_mean**2
    rolling_std = np.sqrt(np.maximum(rolling_var, 0.0))  # clip negatives from float precision
    The np.maximum(rolling_var, 0.0) before sqrt is important — floating-point precision loss in the subtraction of two large numbers can produce small negative values that cause sqrt to return NaN. For a 50M-row dataset with window=100, this implementation runs in approximately 11 seconds versus 48 minutes for the Python loop version — roughly a 260x improvement on the same hardware. This approach only works for associative operations — sum, sum of squares, count. For rolling median, rolling quantile, or rolling mode, the prefix sum trick doesn't apply. For those, scipy.ndimage provides optimised implementations, or a Numba-compiled loop is the right tool.
  • Q (Senior): When would you choose Numba @njit over a vectorised NumPy approach, and what are the practical limitations?
    Numba @njit is the right choice when the operation has genuine sequential dependencies — when computing result[i] requires result[i-1] or earlier outputs — making the loop structurally impossible to vectorise with standard NumPy operations. Canonical examples: exponential weighted moving average (EWMA), running maximum or minimum, certain IIR filters, Viterbi algorithm, and any finite state machine computed over a sequence. These all have the property that the output at each position depends on the output at the prior position, not just the input. For these operations, Numba @njit compiles the decorated function to LLVM machine code on the first call (typically 0.5–2 seconds of one-time overhead). Subsequent calls run at near-C speed with no Python overhead. For a 10M-element running maximum, Numba typically achieves roughly 170x speedup over the Python loop equivalent. Practical limitations: All types used inside the @njit function must be statically inferable — Numba performs type inference at compile time. This means no Python dicts or sets (use structured arrays), no arbitrary Python objects, and no calls to Python functions that aren't themselves @njit-compiled. On first encountering a limitation, the error message from Numba is specific about what construct caused the issue, which makes debugging straightforward. For one-off analysis scripts where Numba installation adds complexity, scipy.ndimage provides optimised implementations of many rolling statistics. For pandas-integrated pipelines, bottleneck provides highly optimised replacements for common pandas/NumPy rolling operations. Numba is the right choice when you need a custom sequential algorithm that none of these libraries cover.

Frequently Asked Questions

Is np.vectorize() faster than a Python loop?

No — or so marginally that the difference is noise. The NumPy documentation states directly that np.vectorize is 'provided primarily for convenience, not for performance.' It calls your Python function once per element in a Python loop, adding thin type-coercion handling on each side. Benchmarking consistently shows np.vectorize matching or slightly underperforming a hand-written Python loop on the same operation.

Use np.vectorize when you want the convenience of applying a scalar Python function to array inputs without writing the loop yourself, and when you have no performance requirement. Do not use it when you expect speedup — you will be disappointed and confused.

What is the fastest way to apply a custom function to each element of a NumPy array?

It depends on what the function does.

If the function can be expressed using NumPy operations — arithmetic, comparisons, math functions, boolean logic — express it that way directly. This is the fastest option: the loop runs in compiled C with SIMD and there is no Python overhead per element.

If the function cannot be expressed with NumPy operations because it has sequential dependencies, use Numba: decorate with @numba.njit and it compiles to LLVM machine code on the first call. Subsequent calls run at near-C speed. This is the right answer for recurrences, state machines, and operations where output[i] depends on output[i-1].

If the function calls external Python code or libraries that Numba cannot compile, there is no way to eliminate the per-element Python overhead. In that case, focus on batching the calls to reduce the per-element fixed cost, or use multiprocessing if each element's computation is large enough to justify serialization overhead.

Can I parallelise a Python loop over a NumPy array using multiprocessing?

Technically yes, practically rarely the right answer. The serialisation cost of passing large NumPy arrays between processes — pickling, copying, unpickling — often consumes more time than the computation being parallelised. For arrays above ~10MB, the IPC overhead typically dominates.

For CPU-bound element-wise operations, vectorisation or Numba @njit is faster and simpler because the computation stays in a single process with no serialisation overhead.

Multiprocessing earns its place when each element requires significant independent computation that dwarfs the serialisation cost — for example, running ML model inference or a complex simulation per element, where each call takes tens of milliseconds. In that scenario, the per-element work dominates and parallelism across processes provides real throughput improvement. Use concurrent.futures.ProcessPoolExecutor for this pattern and benchmark with actual array sizes before committing to the architecture.
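A hedged sketch of that last scenario — expensive_simulation stands in for per-element work heavy enough (tens of milliseconds or more per call) to dwarf the cost of shipping data between processes:

from concurrent.futures import ProcessPoolExecutor
import numpy as np

def expensive_simulation(x: float) -> float:
    # stand-in for genuinely heavy per-element work (model inference, simulation, ...)
    total = x
    for _ in range(200_000):
        total = (total * 1.000001) % 1e6
    return total

if __name__ == '__main__':
    inputs = np.random.default_rng(0).standard_normal(64)
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(expensive_simulation, inputs, chunksize=8))
    print(len(results), 'results computed across worker processes')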

How do I know if my loop can be vectorised?

Ask one question: does computing result[i] require the value of result[i-1] or any earlier output?

If the answer is no — result[i] depends only on input values, not on prior outputs — the loop can be vectorised. The computation at each position is independent of all others, which is the only structural requirement for vectorisation. Find the NumPy equivalent (broadcasting, ufunc, np.where, cumsum trick) and replace the loop.

If the answer is yes — computing result[i] requires result[i-1] — the loop has a sequential dependency and cannot be vectorised with standard NumPy. Use Numba @njit to preserve the sequential structure while compiling it to machine code.

A common source of confusion: loops that read from input[i-1] but write to result[i] are NOT sequential in the relevant sense. Reading from the input array at any position is fine — the input doesn't change. The constraint is only on reading from previously computed output values.
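A minimal sketch of that distinction — the first loop only reads the input at i-1 and vectorises with a shifted slice, while the second reads its own previous output and genuinely needs a loop (or Numba):

import numpy as np

arr = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])

# Reads input[i-1] only -> vectorisable with slicing (equivalent to np.diff)
# loop form: out[i] = arr[i] - arr[i-1]
out_vectorised = arr[1:] - arr[:-1]
print(out_vectorised)   # [-2.  3. -3.  4.  4.]

# Reads its own previous output -> a true recurrence; keep the loop, compile with @njit if large
# loop form: ewma[i] = 0.3 * arr[i] + 0.7 * ewma[i-1]
ewma = np.empty_like(arr)
ewma[0] = arr[0]
for i in range(1, len(arr)):
    ewma[i] = 0.3 * arr[i] + 0.7 * ewma[i - 1]
print(ewma)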
