
NumPy Random Module — Generating and Controlling Random Data

NumPy random number generation with the modern Generator API, seeding for reproducibility, common distributions, and random sampling without replacement.
⚙️ Intermediate — basic Python knowledge assumed
In this tutorial, you'll learn
  • Use np.random.default_rng(seed) for new code — it is faster and avoids the legacy API's shared global state.
  • Seeding makes random numbers reproducible — essential for ML experiments.
  • rng.shuffle() modifies in place; rng.permutation() returns a copy.
Quick Answer
  • NumPy 1.17+ recommends the Generator API: rng = np.random.default_rng(seed=42)
  • Generator API is faster and statistically better than legacy np.random.rand() functions
  • Use rng.random(), rng.integers(), rng.normal(), rng.choice() for most tasks
  • The legacy API uses a global state; the Generator API creates independent random number generators
  • Seeding ensures reproducibility: same seed → same sequence every run
  • Performance: Generator API is ~30% faster for single-threaded random draws
🚨 START HERE
Quick Debug Cheat Sheet: NumPy Random
When randomness breaks and you need answers fast.
🟡 I can't reproduce a random number sequence
Immediate Action: Check that every random call uses the same generator object with a fixed seed.
Commands
print(rng.bit_generator.state['state']['state']) # low-level state check
np.random.get_state() # only for legacy API; check if modified elsewhere
Fix Now: Create the generator once with rng = np.random.default_rng(42) at the very top of your script and ensure no other random initialisation happens.
🟡 Legacy np.random.seed() seems to have no effect
Immediate Action: Look for hidden calls to np.random.seed() or numpy.random.seed() (note: numpy vs np alias). Also check whether a library imports numpy and sets its own seed.
Commands
grep -r 'numpy.random.seed' . --include='*.py'
Rather than trying to trace seed changes with profiling hooks, replace all legacy calls with an explicit Generator.
Fix Now: Temporarily make np.random.seed raise so unexpected calls surface: np.random.seed = lambda *a, **k: (_ for _ in ()).throw(RuntimeError('unexpected np.random.seed call'))
Production Incident: The Unreproducible ML Experiment
A team spent 3 days chasing a 0.5% accuracy improvement that turned out to be random noise. The root cause: global random state from the legacy API colliding with data augmentation code.
Symptom: Cross-validation results varied wildly between runs even with the same seed. Validation accuracy would jump ±3% across five identical runs.
Assumption: The team assumed np.random.seed(42) called at the top of the training script would make everything reproducible. They believed all random operations used the same seed.
Root cause: Two modules called np.random.seed() independently: the data loader used np.random.seed(int(time.time())) to shuffle, overwriting the global seed. The augmentation library used the legacy np.random.rand(), which reads that global state. The seed was never passed explicitly.
Fix: Refactored to use explicit Generator objects everywhere. The DataLoader received its own generator seeded from a config hash. Augmentation code switched to rng.normal() and friends. The training script now saves the full generator state (rng.bit_generator.state) for exact restarts.
Key Lesson
  • Never rely on a single global seed for a complex codebase — pass explicit generators or seeds.
  • Use a deterministic seed derived from the experiment configuration, not the current time.
  • Log the full generator state along with models to allow perfect reproducibility of failures.
Production Debug Guide — symptom → action for common NumPy random issues
Random numbers not reproducible across script runs: Check that all random operations use the same Generator. If any code calls legacy np.random.rand() or uses a different Generator, the sequence diverges. Add a logging statement that prints rng.bit_generator.state['state']['state'] after setup.
Random numbers differ when code is parallelised: In each parallel worker, create a new Generator with a unique seed (e.g., base_seed + worker_id). Verify that no two workers share the same generator object.
Random outputs change between Python environments: Check the NumPy version: the Generator API guarantees bit-identical sequences across patch versions, but not across major releases. Pin numpy>=1.17,<1.25 for reproducible builds.
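The per-worker seeding advice above is commonly implemented with np.random.SeedSequence, whose spawn() method derives statistically independent child seeds from one base seed (the worker count here is illustrative):

```python
import numpy as np

# One base SeedSequence per experiment run.
base = np.random.SeedSequence(42)

# spawn() yields independent children — one per parallel worker.
child_seeds = base.spawn(4)
rngs = [np.random.default_rng(s) for s in child_seeds]

# The streams are distinct even though they share a single base seed,
# and rebuilding SeedSequence(42) reproduces them exactly.
draws = [rng.random(3) for rng in rngs]
print(draws[0], draws[1])
```

Because spawning is deterministic, the whole set of worker streams is reproducible from the single base seed you log.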

The Modern Generator API

Create a generator with np.random.default_rng(). Pass a seed for reproducibility. The generator object is independent — you can have multiple generators with different seeds without interference. Use it for all subsequent random operations.

Example · PYTHON
import numpy as np

# Reproducible — same seed gives same numbers every run
rng = np.random.default_rng(seed=42)

print(rng.random(5))          # 5 floats in [0, 1)
print(rng.integers(0, 10, 5)) # 5 ints in [0, 10)
print(rng.normal(0, 1, 5))    # 5 standard normal samples
print(rng.uniform(2.0, 5.0, 3)) # 3 floats in [2, 5)
▶ Output (abridged — the final uniform print is omitted)
[0.773 0.438 0.858 0.697 0.094]
[0 9 5 0 2]
[-0.234 1.573 -0.462 0.241 -1.913]
📊 Production Insight
A global np.random.seed() call in one module affects random operations in unrelated modules. This breaks reproducibility when refactoring code.
Always use per-generator seeds to isolate random state across components.
Use this pattern: rng = np.random.default_rng(seed=42) and pass rng explicitly to functions.
🎯 Key Takeaway
Generator API is the standard for new code.
Each generator is independent — no shared state.
Seed your generator to guarantee identical output across runs.
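A minimal sketch of that independence — advancing one generator leaves another completely untouched:

```python
import numpy as np

rng_a = np.random.default_rng(1)
rng_b = np.random.default_rng(1)  # same seed, but separate state

rng_a.random(1000)  # advance rng_a far into its sequence

# rng_b is unaffected: it still yields the seed-1 stream from the start.
assert np.allclose(rng_b.random(3), np.random.default_rng(1).random(3))
```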

Common Distributions

The Generator API supports all standard distributions: normal, binomial, poisson, exponential, uniform, and more. Each distribution function accepts shape parameters and a size argument to produce arrays.

Example · PYTHON
import numpy as np
rng = np.random.default_rng(0)

# Normal (Gaussian)
print(rng.normal(loc=170, scale=10, size=5))  # heights in cm

# Binomial — n trials, p probability
print(rng.binomial(n=10, p=0.5, size=5))  # coin flips

# Poisson — events per interval
print(rng.poisson(lam=3, size=5))

# Exponential — time between events
print(rng.exponential(scale=1.0, size=5))
▶ Output (abridged — only the normal and binomial prints are shown)
[175.4 164.8 171.2 167.9 182.3]
[4 5 7 5 3]
📊 Production Insight
Distributions with small probabilities (e.g., binomial p=0.001) can overflow in the legacy API; the Generator API uses higher-precision algorithms.
Always specify dtype for integer distributions to avoid unnecessary memory usage.
Watch for extreme values: exponential with small scale can produce rare spikes that crash downstream logic.
🎯 Key Takeaway
Know the shape parameters for each distribution.
Use size to generate arrays, not loops.
Check edge cases: very small probabilities may cause numerical instability.
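The dtype advice from the Production Insight can be sketched as follows (the array size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)

# Default integer dtype is int64 — request a smaller one when the
# value range allows it.
small = rng.integers(0, 256, size=1_000_000, dtype=np.uint8)
big = rng.integers(0, 256, size=1_000_000)  # int64 by default

print(small.nbytes)  # 1_000_000 bytes
print(big.nbytes)    # 8_000_000 bytes — 8x the memory
```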

Shuffling and Sampling

Shuffle arrays in place with rng.shuffle() or get a permuted copy with rng.permutation(). For random sampling from an array without replacement, use rng.choice(replace=False). For bootstrap sampling, set replace=True.

Example · PYTHON
import numpy as np
rng = np.random.default_rng(42)

arr = np.arange(10)

# Shuffle in place
rng.shuffle(arr)
print(arr)  # [0 3 7 2 5 1 9 4 6 8] — deterministic given the seed

# Sample without replacement
print(rng.choice(arr, size=3, replace=False))

# Sample with replacement (bootstrap)
print(rng.choice(arr, size=5, replace=True))

# Permutation — returns a copy, does not modify original
orig = np.arange(5)
shuffled = rng.permutation(orig)
print(orig)     # [0 1 2 3 4] — unchanged
print(shuffled) # shuffled copy
▶ Output (abridged)
[0 3 7 2 5 1 9 4 6 8]  # shuffled arr
[3 7 2]                # sample without replacement
[0 1 2 3 4]            # orig, unchanged
📊 Production Insight
rng.shuffle() modifies the original array — if the array is shared across functions, unexpected mutations occur.
Sampling without replacement from a large array is O(n) — for massive datasets, consider using permutation and slicing.
Bootstrap sampling with replace=True generates repeated indices — this can double memory usage if you store the sample separately.
🎯 Key Takeaway
shuffle modifies in place; permutation returns a copy.
choice with replace=False is true random sampling.
Bootstrap with replace=True creates a new sample of same size with possible duplicates.
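The permutation-and-slice alternative mentioned in the Production Insight can be sketched like this:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(1_000_000)
k = 10

# Permute the index range once, then slice the first k positions —
# equivalent to rng.choice(data, size=k, replace=False).
idx = rng.permutation(len(data))[:k]
sample = data[idx]

assert len(np.unique(sample)) == k  # no duplicates: without replacement
```

Because the permuted index array can be reused, repeated draws of disjoint batches (e.g., train/validation splits) come from further slices of the same permutation.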

Seeding Strategies and Reproducibility

Seeding controls the initial state of the generator. For reproducibility, use a fixed integer seed. For distributed systems, ensure each process gets a unique but reproducible seed (e.g., based on process rank). For testing, consider using a seed derived from the test name to isolate test randomness.

Example · PYTHON
import numpy as np
from hashlib import sha256

# Good: fixed seed
rng = np.random.default_rng(seed=42)

# Better: unique seed per process (e.g., MPI rank)
process_id = 0  # from MPI
seed = int(sha256(b"my_experiment").hexdigest(), 16) + process_id
rng_rank = np.random.default_rng(seed)

# Testing: seed from test name
def my_test_function():
    test_seed = int(sha256(b"my_test_function").hexdigest(), 16) % 2**32
    rng_local = np.random.default_rng(test_seed)
    # ... use rng_local
📊 Production Insight
Reusing the same seed across multiple experiments can lead to accidental correlation — derive each experiment's seed from its configuration (e.g., a dataset hash combined with a fixed base seed).
In parallel processing, a generator (or the legacy global state) created before fork() is duplicated into every worker, so workers can produce identical sequences — create generators inside each worker with unique seeds.
Always log the seed used for each run to enable post-hoc debugging.
🎯 Key Takeaway
Seed once per run, not per call.
Use process-unique seeds in distributed environments.
Log seeds to reproduce production outcomes later.
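Saving and restoring the full bit-generator state, as suggested above for exact restarts, looks roughly like this:

```python
import numpy as np

rng = np.random.default_rng(42)
rng.random(100)  # ...some work consumes part of the stream...

# Snapshot the exact state: a plain dict, easy to log or pickle.
saved = rng.bit_generator.state
next_draws = rng.random(3)

# Restore into a fresh generator and replay the exact same draws.
rng2 = np.random.default_rng()
rng2.bit_generator.state = saved
assert np.allclose(rng2.random(3), next_draws)
```

Unlike re-seeding, restoring the state resumes mid-stream, which is what an exact experiment restart needs.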

Performance Considerations and Vectorisation

Generator API is vectorised — always generate arrays of samples in one call rather than looping. The performance gain is 10-100x for large sizes. Additionally, use dtype parameters for integer and float precision to control memory and speed.

Example · PYTHON
import numpy as np
from time import perf_counter

rng = np.random.default_rng(0)

# Slow: one scalar per call
start = perf_counter()
for _ in range(100_000):
    rng.random()
print("Loop time:", perf_counter() - start)

# Fast: one vectorised call
start = perf_counter()
rng.random(100_000)
print("Vectorised time:", perf_counter() - start)
📊 Production Insight
The legacy API uses a C-level lock that serialises calls from multiple threads — Generator API avoids that lock, but you still need separate generators per thread for true parallelism.
Memory for random arrays can be large; generate only what you need and discard immediately.
For floating-point precision, the default is float64 — use dtype=np.float32 to halve memory use and speed up generation.
🎯 Key Takeaway
Generate arrays, not scalars.
Use dtype control for memory and speed.
One generator per thread for thread-safe parallelism.
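A quick sketch of the dtype control mentioned above; note that normal() itself takes no dtype argument, so the usual workaround is standard_normal(dtype=...) plus scaling:

```python
import numpy as np

rng = np.random.default_rng(0)

x64 = rng.random(1_000_000)                    # float64 by default
x32 = rng.random(1_000_000, dtype=np.float32)  # half the memory

print(x64.nbytes, x32.nbytes)  # 8000000 4000000

# normal() takes no dtype argument — use standard_normal and rescale.
heights = 10 * rng.standard_normal(1000, dtype=np.float32) + 170
```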
🗂 Generator API vs Legacy API
Key differences that affect reproducibility and performance in production
Feature | Legacy API (np.random.seed) | Generator API (default_rng)
Recommended for new code | No | Yes
Global state | Single global RandomState | Independent per generator
Thread safety | Global lock serialises calls | No global lock; one generator per thread
Speed (single thread) | Baseline | ~30% faster
Seeding multiple streams | Impossible without hacks | Create multiple generators
New distributions | Limited | More algorithms, better accuracy
Reproducibility across versions | Unstable across NumPy versions | Stable within major versions

🎯 Key Takeaways

  • Use np.random.default_rng(seed) for new code — it is faster and avoids the legacy API's shared global state.
  • Seeding makes random numbers reproducible — essential for ML experiments.
  • rng.shuffle() modifies in place; rng.permutation() returns a copy.
  • rng.choice() with replace=False is sampling without replacement.
  • Each call to a Generator method advances the internal state — the same rng object produces different numbers on consecutive calls.
  • In distributed systems, give each worker a unique seed derived from a base seed to avoid identical sequences.
  • Log the generator state (seed or BitGenerator state) with every experiment for full reproducibility.

⚠ Common Mistakes to Avoid

    Using np.random.seed() with multiprocessing
    Symptom

    Each worker process produces the same random sequence because they all inherit the same random state from the parent process after fork.

    Fix

    In each worker, create a fresh Generator with a unique but reproducible seed, e.g. rng = np.random.default_rng(seed=global_seed + worker_id) (os.getpid() also works but changes between runs, breaking reproducibility)

    Relying on the legacy API without upgrading
    Symptom

    Monte Carlo simulations that worked in NumPy 1.16 produce different results after upgrading to 1.17+, breaking regression tests.

    Fix

    Pin NumPy version or migrate all random calls to the Generator API. The Generator API is forward-compatible and faster.

    Calling random functions directly without a generator
    Symptom

    Code using np.random.rand() inside a function becomes non-reproducible when called from different contexts because the global state changes.

    Fix

    Always accept an optional generator parameter: def my_func(rng=None): if rng is None: rng = np.random.default_rng(). Then use rng.random() internally.
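Spelled out as a runnable sketch (add_noise is a hypothetical example function, not from the source):

```python
import numpy as np

def add_noise(data, scale=0.1, rng=None):
    """Add Gaussian noise; the caller may inject a seeded Generator."""
    if rng is None:  # fall back to a fresh, unseeded generator
        rng = np.random.default_rng()
    return data + rng.normal(0.0, scale, size=data.shape)

data = np.zeros(4)
a = add_noise(data, rng=np.random.default_rng(42))
b = add_noise(data, rng=np.random.default_rng(42))
assert np.allclose(a, b)  # reproducible because the caller owns the seed
```

Letting the caller own the generator keeps the function deterministic in tests while remaining convenient for casual use.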

Interview Questions on This Topic

  • Q (Mid-level): Why is np.random.default_rng() preferred over np.random.seed() in modern NumPy?
    The Generator API provides independent random number streams, avoiding global state pollution. It's faster, supports more distributions, and gives better statistical quality. The legacy API's global RandomState makes it impossible to have isolated random sequences in different parts of a program, leading to reproducibility failures in large codebases.
  • Q (Senior): How do you generate reproducible random numbers in NumPy?
    Use np.random.default_rng(seed=some_integer) to create a seeded generator. Pass that generator explicitly to all functions that need randomness. To reproduce across runs, keep the seed constant. For more control, save the full generator state (rng.bit_generator.state) at critical points and restore it later. Never rely on the global np.random.seed() in production code.

Frequently Asked Questions

What is the difference between np.random.seed() and np.random.default_rng()?

np.random.seed() sets a global seed that affects all legacy numpy.random functions. np.random.default_rng() creates an independent Generator object. The Generator approach is better because it avoids shared global state — multiple generators with different seeds can run independently in the same process.

Why does my random data change every time I run my script?

You are not seeding the generator. Add seed=42 (or any integer) to np.random.default_rng(). The exact number does not matter — what matters is that you use the same number consistently.

Can I mix legacy and Generator API in the same script?

Yes, but avoid it. The legacy API uses a global RandomState that can interact unpredictably with Generator objects. Migrate all code to Generator for consistency.

How do I generate the same random numbers in Python 2 and Python 3?

The Generator API requires NumPy 1.17+, which dropped Python 2 support, so it is unavailable on Python 2. There you are limited to the legacy np.random.RandomState(seed), whose streams are stable across NumPy versions — using RandomState with the same fixed seed on both interpreters gives matching sequences. Since Python 2 is unsupported, the better answer is to migrate and use the Generator API, which is stable within major versions.

Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
