
NumPy Random Module — Generating and Controlling Random Data

NumPy random number generation with the modern Generator API, seeding for reproducibility, common distributions, and random sampling without replacement.
⚙️ Intermediate — basic Python knowledge assumed
In this tutorial, you'll learn
  • Use np.random.default_rng(seed) for new code — it is faster and avoids the legacy API's shared global state.
  • Seeding makes random numbers reproducible — essential for ML experiments.
  • rng.shuffle() modifies in place; rng.permutation() returns a copy.
Quick Answer
  • NumPy 1.17+ recommends the Generator API: rng = np.random.default_rng(seed=42)
  • Generator API is faster and statistically better than legacy np.random.rand() functions
  • Use rng.random(), rng.integers(), rng.normal(), rng.choice() for most tasks
  • The legacy API uses a global state; the Generator API creates independent random number generators
  • Seeding ensures reproducibility: same seed → same sequence every run
  • Performance: Generator API is ~30% faster for single-threaded random draws
🚨 START HERE
Quick Debug Cheat Sheet: NumPy Random
When randomness breaks and you need answers fast.
🟡 I can't reproduce a random number sequence
Immediate Action: Check that every random call uses the same generator object with a fixed seed.
Commands
print(rng.bit_generator.state['state']['state']) # low-level state check
np.random.get_state() # only for legacy API; check if modified elsewhere
Fix Now: Create the generator once with rng = np.random.default_rng(42) at the very top of your script and ensure no other random initialisation happens.
🟡 Legacy np.random.seed() seems to have no effect
Immediate Action: Look for hidden calls to np.random.seed() or numpy.random.seed() (note: numpy vs np alias). Also check whether a library imports numpy and sets its own seed.
Commands
grep -r 'numpy.random.seed' . --include='*.py'
Rather than trying to trace seed changes with profiling hooks, replace all legacy calls with an explicit Generator.
Fix Now: Temporarily make np.random.seed raise so unexpected calls surface: np.random.seed = lambda *a, **k: (_ for _ in ()).throw(RuntimeError('unexpected np.random.seed call'))
Production Incident: The Unreproducible ML Experiment
A team spent 3 days chasing a 0.5% accuracy improvement that turned out to be random noise. The root cause: global random state from the legacy API colliding with data augmentation code.
Symptom: Cross-validation results varied wildly between runs even with the same seed. Validation accuracy would jump ±3% across five identical runs.
Assumption: The team assumed np.random.seed(42) called at the top of the training script would make everything reproducible. They believed all random operations used the same seed.
Root cause: Two modules called np.random.seed() independently: the data loader used np.random.seed(int(time.time())) to shuffle, overwriting the global seed. The augmentation library used the legacy np.random.rand(), which reads that global state. The seed was never passed explicitly.
Fix: Refactored to use explicit Generator objects everywhere. The DataLoader received its own generator seeded from a config hash. Augmentation code switched to rng.normal() and friends. The training script now saves the full generator state (rng.bit_generator.state) for exact restarts.
Key Lesson
  • Never rely on a single global seed for a complex codebase — pass explicit generators or seeds.
  • Use a deterministic seed derived from the experiment configuration, not the current time.
  • Log the full generator state along with models to allow perfect reproducibility of failures.
Production Debug Guide — symptom → action for common NumPy random issues
Random numbers not reproducible across script runs: Check that all random operations use the same Generator. If any code calls legacy np.random.rand() or uses a different Generator, the sequence diverges. Add a logging statement that prints rng.bit_generator.state['state']['state'] after setup.
Random numbers differ when code is parallelised: In each parallel worker, create a new Generator with a unique seed (e.g., base_seed + worker_id). Verify that no two workers share the same generator object.
Random outputs change between Python environments: Check the NumPy version: the Generator API guarantees bit-identical sequences across patch versions, but not across major releases. Pin numpy>=1.17,<1.25 for reproducible builds.
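The per-worker seeding advice above is commonly implemented with np.random.SeedSequence, whose spawn() method derives statistically independent child seeds from one base seed (the worker count here is illustrative):

```python
import numpy as np

# One base SeedSequence per experiment run.
base = np.random.SeedSequence(42)

# spawn() yields independent children — one per parallel worker.
child_seeds = base.spawn(4)
rngs = [np.random.default_rng(s) for s in child_seeds]

# The streams are distinct even though they share a single base seed,
# and rebuilding SeedSequence(42) reproduces them exactly.
draws = [rng.random(3) for rng in rngs]
print(draws[0], draws[1])
```

Because spawning is deterministic, the whole set of worker streams is reproducible from the single base seed you log.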

The Modern Generator API

Create a generator with np.random.default_rng(). Pass a seed for reproducibility. The generator object is independent — you can have multiple generators with different seeds without interference. Use it for all subsequent random operations.

Example · PYTHON
import numpy as np

# Reproducible — same seed gives same numbers every run
rng = np.random.default_rng(seed=42)

print(rng.random(5))          # 5 floats in [0, 1)
print(rng.integers(0, 10, 5)) # 5 ints in [0, 10)
print(rng.normal(0, 1, 5))    # 5 standard normal samples
print(rng.uniform(2.0, 5.0, 3)) # 3 floats in [2, 5)
▶ Output (abridged — the final uniform print is omitted)
[0.773 0.438 0.858 0.697 0.094]
[0 9 5 0 2]
[-0.234 1.573 -0.462 0.241 -1.913]
📊 Production Insight
A global np.random.seed() call in one module affects random operations in unrelated modules. This breaks reproducibility when refactoring code.
Always use per-generator seeds to isolate random state across components.
Use this pattern: rng = np.random.default_rng(seed=42) and pass rng explicitly to functions.
🎯 Key Takeaway
Generator API is the standard for new code.
Each generator is independent — no shared state.
Seed your generator to guarantee identical output across runs.
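A minimal sketch of that independence — advancing one generator leaves another completely untouched:

```python
import numpy as np

rng_a = np.random.default_rng(1)
rng_b = np.random.default_rng(1)  # same seed, but separate state

rng_a.random(1000)  # advance rng_a far into its sequence

# rng_b is unaffected: it still yields the seed-1 stream from the start.
assert np.allclose(rng_b.random(3), np.random.default_rng(1).random(3))
```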

Common Distributions

The Generator API supports all standard distributions: normal, binomial, poisson, exponential, uniform, and more. Each distribution function accepts shape parameters and a size argument to produce arrays.

Example · PYTHON
import numpy as np
rng = np.random.default_rng(0)

# Normal (Gaussian)
print(rng.normal(loc=170, scale=10, size=5))  # heights in cm

# Binomial — n trials, p probability
print(rng.binomial(n=10, p=0.5, size=5))  # coin flips

# Poisson — events per interval
print(rng.poisson(lam=3, size=5))

# Exponential — time between events
print(rng.exponential(scale=1.0, size=5))
▶ Output (abridged — only the normal and binomial prints are shown)
[175.4 164.8 171.2 167.9 182.3]
[4 5 7 5 3]
📊 Production Insight
Distributions with small probabilities (e.g., binomial p=0.001) can overflow in the legacy API; the Generator API uses higher-precision algorithms.
Always specify dtype for integer distributions to avoid unnecessary memory usage.
Watch for extreme values: exponential with small scale can produce rare spikes that crash downstream logic.
🎯 Key Takeaway
Know the shape parameters for each distribution.
Use size to generate arrays, not loops.
Check edge cases: very small probabilities may cause numerical instability.
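The dtype advice from the Production Insight can be sketched as follows (the array size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)

# Default integer dtype is int64 — request a smaller one when the
# value range allows it.
small = rng.integers(0, 256, size=1_000_000, dtype=np.uint8)
big = rng.integers(0, 256, size=1_000_000)  # int64 by default

print(small.nbytes)  # 1_000_000 bytes
print(big.nbytes)    # 8_000_000 bytes — 8x the memory
```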

Shuffling and Sampling

Shuffle arrays in place with rng.shuffle() or get a permuted copy with rng.permutation(). For random sampling from an array without replacement, use rng.choice(replace=False). For bootstrap sampling, set replace=True.

Example · PYTHON
import numpy as np
rng = np.random.default_rng(42)

arr = np.arange(10)

# Shuffle in place
rng.shuffle(arr)
print(arr)  # [0 3 7 2 5 1 9 4 6 8] — deterministic given the seed

# Sample without replacement
print(rng.choice(arr, size=3, replace=False))

# Sample with replacement (bootstrap)
print(rng.choice(arr, size=5, replace=True))

# Permutation — returns a copy, does not modify original
orig = np.arange(5)
shuffled = rng.permutation(orig)
print(orig)     # [0 1 2 3 4] — unchanged
print(shuffled) # shuffled copy
▶ Output (abridged)
[0 3 7 2 5 1 9 4 6 8]  # shuffled arr
[3 7 2]                # sample without replacement
[0 1 2 3 4]            # orig, unchanged
📊 Production Insight
rng.shuffle() modifies the original array — if the array is shared across functions, unexpected mutations occur.
Sampling without replacement from a large array is O(n) — for massive datasets, consider using permutation and slicing.
Bootstrap sampling with replace=True generates repeated indices — this can double memory usage if you store the sample separately.
🎯 Key Takeaway
shuffle modifies in place; permutation returns a copy.
choice with replace=False is true random sampling.
Bootstrap with replace=True creates a new sample of same size with possible duplicates.
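The permutation-and-slice alternative mentioned in the Production Insight can be sketched like this:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(1_000_000)
k = 10

# Permute the index range once, then slice the first k positions —
# equivalent to rng.choice(data, size=k, replace=False).
idx = rng.permutation(len(data))[:k]
sample = data[idx]

assert len(np.unique(sample)) == k  # no duplicates: without replacement
```

Because the permuted index array can be reused, repeated draws of disjoint batches (e.g., train/validation splits) come from further slices of the same permutation.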

Seeding Strategies and Reproducibility

Seeding controls the initial state of the generator. For reproducibility, use a fixed integer seed. For distributed systems, ensure each process gets a unique but reproducible seed (e.g., based on process rank). For testing, consider using a seed derived from the test name to isolate test randomness.

Example · PYTHON
import numpy as np
from hashlib import sha256

# Good: fixed seed
rng = np.random.default_rng(seed=42)

# Better: unique seed per process (e.g., MPI rank)
process_id = 0  # from MPI
seed = int(sha256(b"my_experiment").hexdigest(), 16) + process_id
rng_rank = np.random.default_rng(seed)

# Testing: seed from test name
def my_test_function():
    test_seed = int(sha256(b"my_test_function").hexdigest(), 16) % 2**32
    rng_local = np.random.default_rng(test_seed)
    # ... use rng_local
📊 Production Insight
Reusing the same seed across multiple experiments can lead to accidental correlation — derive each experiment's seed from its configuration (e.g., a dataset hash combined with a fixed base seed).
In parallel processing, a generator (or the legacy global state) created before fork() is duplicated into every worker, so workers can produce identical sequences — create generators inside each worker with unique seeds.
Always log the seed used for each run to enable post-hoc debugging.
🎯 Key Takeaway
Seed once per run, not per call.
Use process-unique seeds in distributed environments.
Log seeds to reproduce production outcomes later.
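Saving and restoring the full bit-generator state, as suggested above for exact restarts, looks roughly like this:

```python
import numpy as np

rng = np.random.default_rng(42)
rng.random(100)  # ...some work consumes part of the stream...

# Snapshot the exact state: a plain dict, easy to log or pickle.
saved = rng.bit_generator.state
next_draws = rng.random(3)

# Restore into a fresh generator and replay the exact same draws.
rng2 = np.random.default_rng()
rng2.bit_generator.state = saved
assert np.allclose(rng2.random(3), next_draws)
```

Unlike re-seeding, restoring the state resumes mid-stream, which is what an exact experiment restart needs.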

Performance Considerations and Vectorisation

Generator API is vectorised — always generate arrays of samples in one call rather than looping. The performance gain is 10-100x for large sizes. Additionally, use dtype parameters for integer and float precision to control memory and speed.

Example · PYTHON
import numpy as np
from time import perf_counter

rng = np.random.default_rng(0)

# Slow: one scalar per call
start = perf_counter()
for _ in range(100_000):
    rng.random()
print("Loop time:", perf_counter() - start)

# Fast: one vectorised call
start = perf_counter()
rng.random(100_000)
print("Vectorised time:", perf_counter() - start)
📊 Production Insight
The legacy API uses a C-level lock that serialises calls from multiple threads — Generator API avoids that lock, but you still need separate generators per thread for true parallelism.
Memory for random arrays can be large; generate only what you need and discard immediately.
For floating-point precision, the default is float64 — use dtype=np.float32 to halve memory use and speed up generation.
🎯 Key Takeaway
Generate arrays, not scalars.
Use dtype control for memory and speed.
One generator per thread for thread-safe parallelism.
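A quick sketch of the dtype control mentioned above; note that normal() itself takes no dtype argument, so the usual workaround is standard_normal(dtype=...) plus scaling:

```python
import numpy as np

rng = np.random.default_rng(0)

x64 = rng.random(1_000_000)                    # float64 by default
x32 = rng.random(1_000_000, dtype=np.float32)  # half the memory

print(x64.nbytes, x32.nbytes)  # 8000000 4000000

# normal() takes no dtype argument — use standard_normal and rescale.
heights = 10 * rng.standard_normal(1000, dtype=np.float32) + 170
```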
🗂 Generator API vs Legacy API
Key differences that affect reproducibility and performance in production
Feature | Legacy API (np.random.seed) | Generator API (default_rng)
Recommended for new code | No | Yes
Global state | Single global RandomState | Independent per generator
Thread safety | Global lock serialises calls | No global lock; one generator per thread
Speed (single thread) | Baseline | ~30% faster
Seeding multiple streams | Impossible without hacks | Create multiple generators
New distributions | Limited | More algorithms, better accuracy
Reproducibility across versions | Unstable across NumPy versions | Stable within major versions

🎯 Key Takeaways

  • Use np.random.default_rng(seed) for new code — it is faster and avoids the legacy API's shared global state.
  • Seeding makes random numbers reproducible — essential for ML experiments.
  • rng.shuffle() modifies in place; rng.permutation() returns a copy.
  • rng.choice() with replace=False is sampling without replacement.
  • Each call to a Generator method advances the internal state — the same rng object produces different numbers on consecutive calls.
  • In distributed systems, give each worker a unique seed derived from a base seed to avoid identical sequences.
  • Log the generator state (seed or BitGenerator state) with every experiment for full reproducibility.

⚠ Common Mistakes to Avoid

    Using np.random.seed() with multiprocessing
    Symptom

    Each worker process produces the same random sequence because they all inherit the same random state from the parent process after fork.

    Fix

    In each worker, create a fresh Generator with a unique but reproducible seed, e.g. rng = np.random.default_rng(seed=global_seed + worker_id) (os.getpid() also works but changes between runs, breaking reproducibility)

    Relying on the legacy API without upgrading
    Symptom

    Monte Carlo simulations that worked in NumPy 1.16 produce different results after upgrading to 1.17+, breaking regression tests.

    Fix

    Pin NumPy version or migrate all random calls to the Generator API. The Generator API is forward-compatible and faster.

    Calling random functions directly without a generator
    Symptom

    Code using np.random.rand() inside a function becomes non-reproducible when called from different contexts because the global state changes.

    Fix

    Always accept an optional generator parameter: def my_func(rng=None): if rng is None: rng = np.random.default_rng(). Then use rng.random() internally.
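Spelled out as a runnable sketch (add_noise is a hypothetical example function, not from the source):

```python
import numpy as np

def add_noise(data, scale=0.1, rng=None):
    """Add Gaussian noise; the caller may inject a seeded Generator."""
    if rng is None:  # fall back to a fresh, unseeded generator
        rng = np.random.default_rng()
    return data + rng.normal(0.0, scale, size=data.shape)

data = np.zeros(4)
a = add_noise(data, rng=np.random.default_rng(42))
b = add_noise(data, rng=np.random.default_rng(42))
assert np.allclose(a, b)  # reproducible because the caller owns the seed
```

Letting the caller own the generator keeps the function deterministic in tests while remaining convenient for casual use.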

Interview Questions on This Topic

  • Q (Mid-level): Why is np.random.default_rng() preferred over np.random.seed() in modern NumPy?
    The Generator API provides independent random number streams, avoiding global state pollution. It's faster, supports more distributions, and gives better statistical quality. The legacy API's global RandomState makes it impossible to have isolated random sequences in different parts of a program, leading to reproducibility failures in large codebases.
  • Q (Senior): How do you generate reproducible random numbers in NumPy?
    Use np.random.default_rng(seed=some_integer) to create a seeded generator. Pass that generator explicitly to all functions that need randomness. To reproduce across runs, keep the seed constant. For more control, save the full generator state (rng.bit_generator.state) at critical points and restore it later. Never rely on the global np.random.seed() in production code.

Frequently Asked Questions

What is the difference between np.random.seed() and np.random.default_rng()?

np.random.seed() sets a global seed that affects all legacy numpy.random functions. np.random.default_rng() creates an independent Generator object. The Generator approach is better because it avoids shared global state — multiple generators with different seeds can run independently in the same process.

Why does my random data change every time I run my script?

You are not seeding the generator. Add seed=42 (or any integer) to np.random.default_rng(). The exact number does not matter — what matters is that you use the same number consistently.

Can I mix legacy and Generator API in the same script?

Yes, but avoid it. The legacy API uses a global RandomState that can interact unpredictably with Generator objects. Migrate all code to Generator for consistency.

How do I generate the same random numbers in Python 2 and Python 3?

The Generator API requires NumPy 1.17+, which dropped Python 2 support, so it is unavailable on Python 2. There you are limited to the legacy np.random.RandomState(seed), whose streams are stable across NumPy versions — using RandomState with the same fixed seed on both interpreters gives matching sequences. Since Python 2 is unsupported, the better answer is to migrate and use the Generator API, which is stable within major versions.

Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
