NumPy loadtxt and savetxt — Reading and Writing Array Data

📅 March 16, 2026 ⏱ 4 min read 🎯 Beginner

How to read and write numerical data with NumPy — loadtxt, savetxt, load, save for .

🧑‍💻 Beginner-friendly — no prior Python experience needed

In this tutorial, you'll learn

np.loadtxt() and np.savetxt() work with text files. Use delimiter and skiprows for CSV, and specify dtype= explicitly — never assume the default will match your data.
fmt='%g' in np.savetxt is lossy: it rounds to 6 significant digits. Keep the default '%.18e' (or pass it explicitly) for full float64 round-trip fidelity. This single mistake has corrupted ML pipelines in production.
np.save() and np.load() work with binary .npy files — 10 to 50x faster than CSV, exact precision, and smaller on disk. Use this format for any numerical data exchange that doesn't need human readability.

⚡Quick Answer

np.loadtxt() reads CSV/TSV text files into arrays — use delimiter and skiprows for control
np.save() writes binary .npy files — 10-50x faster than CSV for large arrays
np.savez() bundles multiple named arrays into a single .npz archive
np.savez_compressed() adds zlib compression — slower to load but smaller on disk
Binary formats preserve dtype exactly; CSV loses float precision whenever fmt is shortened (for example to '%g'), so keep the default fmt='%.18e'
For mixed-type tabular data with headers, Pandas read_csv is the more appropriate tool

🚨 START HERE

NumPy File I/O Debugging Cheat Sheet

Quick reference for the most common NumPy data loading and saving failures

🟡Wrong dtype after loading CSV

Immediate Action

Check the actual dtype and shape of what was loaded

Commands

print(loaded.dtype, loaded.shape)

np.loadtxt('data.csv', dtype=np.float64, delimiter=',')

Fix Now

Always specify dtype= explicitly in np.loadtxt. If the file has mixed types, switch to np.genfromtxt with a structured dtype or use Pandas read_csv.

🟡Precision loss between saved and loaded data

Immediate Action

Quantify the difference before deciding whether it matters

Commands

np.allclose(original, loaded, rtol=1e-15, atol=0)  # atol=0: the default atol=1e-8 hides errors in small values

np.max(np.abs(original - loaded))

Fix Now

Switch to np.save/np.load for lossless binary storage. If CSV is required for human inspection, use fmt='%.18e' in np.savetxt, not '%g'.

🟡MemoryError loading a large CSV

Immediate Action

Check file size and available system memory before retrying

Commands

ls -lh data.csv && free -h

wc -l data.csv

Fix Now

Switch to .npy format with np.save/np.load. For arrays that still exceed RAM, use np.memmap. For CSV specifically, Pandas read_csv with chunksize is the pragmatic path.

Production Incident

ML Pipeline Loaded CSV with Wrong Precision: Training Ran on Corrupted Features

A machine learning pipeline produced garbage predictions because np.savetxt silently truncated float64 features to 6 significant digits, and nobody noticed until model accuracy collapsed.

Symptom

Model accuracy dropped from 94% to 61% after the team switched from .npy to CSV for data handoff between preprocessing and training. No error was thrown anywhere in the pipeline. The features loaded without complaint; they were just wrong.

Assumption

The team assumed CSV was a lossless format for floating-point data. They called np.savetxt with fmt='%g' (overriding the default '%.18e') and np.loadtxt without specifying dtype, expecting full float64 round-trip fidelity.

Root cause

np.savetxt with fmt='%g' truncates values to 6 significant digits. Features with values like 0.000001234567 became 1.23457e-06 in the file. Across 512 features, that accumulated precision loss was enough to corrupt the feature space the model had been trained on. np.loadtxt loaded the truncated values perfectly; it had no way to know the original values were different. The bug wasn't in the loading code; it was in the export step that nobody thought to audit.

Fix

1. Switched all intermediate data exchange to .npy format: np.save('features.npy', X). No format string, no truncation, exact bytes.
2. Added dtype=np.float64 explicitly to any remaining np.loadtxt calls to prevent silent dtype promotion or demotion.
3. Adopted fmt='%.18e' in np.savetxt for the CSV outputs that genuinely needed to be human-readable for audit purposes.
4. Added a round-trip validation step in the preprocessing stage: np.allclose(original, loaded, rtol=1e-10, atol=0) before handing data to training. The atol=0 matters because the default absolute tolerance of 1e-8 would let errors in very small feature values pass. If the check fails, the pipeline stops and alerts.

Key Lesson

CSV is not lossless for floating-point data when the format string is short: 6 significant digits is far less precision than float64 carries
fmt='%g' rounds silently, whereas np.savetxt's default '%.18e' round-trips float64, so overriding fmt is a precision decision
Always specify dtype= explicitly in np.loadtxt, because the default may not match what was saved
np.allclose() validation at pipeline boundaries catches precision drift before it reaches training
The absence of an error does not mean the data is correct: CSV round-trips are quiet even when they're lossy
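The validation step from the fix can be sketched roughly as follows. This is a minimal illustration, not the team's actual code: the helper name validate_round_trip, the synthetic feature matrix, and the file names are all assumptions made for the example.

```python
import os
import tempfile

import numpy as np


def validate_round_trip(original, loaded, rtol=1e-10):
    """Hypothetical pipeline gate: raise if the reloaded data drifted."""
    if loaded.dtype != original.dtype or loaded.shape != original.shape:
        raise ValueError(f'dtype/shape changed: {loaded.dtype} {loaded.shape}')
    # atol=0 makes the comparison purely relative, so tiny values are
    # checked as strictly as large ones
    if not np.allclose(original, loaded, rtol=rtol, atol=0.0):
        worst = np.max(np.abs(original - loaded))
        raise ValueError(f'round-trip drift, max abs diff {worst:.3e}')


# Synthetic stand-in for a feature matrix with small-magnitude values
rng = np.random.default_rng(0)
features = rng.standard_normal((100, 8)) * 1e-6

with tempfile.TemporaryDirectory() as tmp:
    lossy_path = os.path.join(tmp, 'lossy.csv')
    exact_path = os.path.join(tmp, 'exact.csv')
    np.savetxt(lossy_path, features, delimiter=',', fmt='%g')
    np.savetxt(exact_path, features, delimiter=',', fmt='%.18e')
    lossy = np.loadtxt(lossy_path, delimiter=',', dtype=np.float64)
    exact = np.loadtxt(exact_path, delimiter=',', dtype=np.float64)

validate_round_trip(features, exact)  # passes silently

try:
    validate_round_trip(features, lossy)
    lossy_passed = True
except ValueError as err:
    lossy_passed = False
    print(f'Pipeline stopped: {err}')
```

Raising an exception rather than logging a warning is a deliberate choice here: a lossy hand-off should stop the pipeline before training starts.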

Production Debug Guide

When your loaded arrays don't match what you saved and the error is silent

Loaded array has wrong dtype — integers came back as floats, or float64 came back as float32→Print the dtype of the loaded array immediately after loading: print(loaded.dtype). If it's wrong, you either didn't specify dtype= in np.loadtxt (it defaults to float64), or the CSV contains mixed types that caused silent coercion. Specify dtype= explicitly, and if the file has mixed types, switch to np.genfromtxt with a structured dtype or use Pandas.

Values differ slightly between what was saved and what was loaded, and downstream calculations are subtly wrong→Run np.allclose(original, loaded, rtol=1e-10, atol=0) and check the maximum absolute difference with np.max(np.abs(original - loaded)). Pass atol=0 because the default absolute tolerance of 1e-8 can mask errors in small values. If allclose returns False, the CSV format dropped digits during the save step. Check the fmt parameter in your np.savetxt call: '%g' is the usual culprit. Switch to .npy for lossless storage, or use fmt='%.18e' if CSV is required.

np.load returns an NpzFile object instead of an ndarray — array indexing fails immediately→You loaded a .npz file, which contains multiple arrays keyed by name. Access them like a dictionary: loaded['array_name']. Check what keys are available with list(loaded.keys()). If you only have one array and want direct ndarray access, save it with np.save() as a .npy file instead.

MemoryError or system freeze when loading a large CSV with np.loadtxt→np.loadtxt reads the entire file into memory in one shot — there is no chunking. Check the file size against available RAM. For files that exceed RAM, switch to np.memmap for binary data, or use Pandas read_csv with the chunksize parameter for CSV. A 10GB CSV needs north of 10GB of working memory to load with np.loadtxt.

loadtxt and savetxt — Text Files

np.loadtxt() and np.savetxt() are the entry point most people find first, and for good reason — they work with plain text files that any editor, spreadsheet, or downstream tool can open. But they come with constraints that matter the moment you move beyond toy examples.

np.loadtxt() reads the entire file into a single ndarray. Every row must have the same number of columns, and the entire file must be a single dtype — if you mix strings and numbers in the same CSV, it will throw a ValueError. It also reads everything into memory at once, which means a 10GB file needs 10GB of RAM available before you get a single array out.

np.savetxt() writes an ndarray to a text file. The fmt parameter controls how each number is formatted, and this is where the silent precision bug lives. The default, '%.18e', writes enough digits to reconstruct every float64 exactly, but it is verbose, and overriding it with a shorter format such as '%g' is where digits get lost. '%g' rounds to 6 significant digits, which is fine for displaying numbers but lossy for float64 values that carry more precision than that. If you're using np.savetxt for anything that feeds back into a numerical pipeline, keep fmt at '%.18e'.

For files that are intended for human inspection — a small sample of predictions, a validation output you're sharing with a stakeholder — CSV is the right choice. For data exchange between pipeline stages, it almost never is.

Example · PYTHON

123456789101112131415161718192021222324252627282930313233343536373839

import numpy as np

# ── Basic save/load round-trip ───────────────────────────────────────────────
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

# Save with header and two decimal places
np.savetxt('data.csv', data, delimiter=',', header='a,b,c', fmt='%.2f')

# Load — skiprows=1 skips the header line
loaded = np.loadtxt('data.csv', delimiter=',', skiprows=1)
print(loaded)
# [[1. 2. 3.]
#  [4. 5. 6.]]

# ── Load only specific columns ───────────────────────────────────────────────
# usecols accepts a single index or a tuple of indices
col_a = np.loadtxt('data.csv', delimiter=',', skiprows=1, usecols=0)
print(col_a)  # [1. 4.]

cols_ab = np.loadtxt('data.csv', delimiter=',', skiprows=1, usecols=(0, 1))
print(cols_ab)
# [[1. 2.]
#  [4. 5.]]

# ── Precision comparison: '%g' vs '%.18e' ────────────────────────────────────
small_value = np.array([0.000001234567890123456])

np.savetxt('lossy.csv', small_value, fmt='%g')
np.savetxt('lossless.csv', small_value, fmt='%.18e')

# ndmin=1 keeps a one-value file as a 1-D array; without it loadtxt returns
# a 0-d array and indexing with [0] raises IndexError
lossy = np.loadtxt('lossy.csv', ndmin=1)
lossless = np.loadtxt('lossless.csv', ndmin=1)

print(f'Original:  {small_value[0]:.20f}')
print(f'Lossy %g:  {lossy[0]:.20f}')
print(f'Lossless:  {lossless[0]:.20f}')
# atol=0 forces a purely relative comparison; the default atol=1e-8 would
# report True for any two values this small
print(f'Lossy matches original:    {np.allclose(small_value, lossy, rtol=1e-15, atol=0)}')
print(f'Lossless matches original: {np.allclose(small_value, lossless, rtol=1e-15, atol=0)}')

▶ Output

[[1. 2. 3.]
 [4. 5. 6.]]
[1. 4.]
[[1. 2.]
 [4. 5.]]
Original:  0.00000123456789012346
Lossy %g:  0.00000123457000000000
Lossless:  0.00000123456789012346
Lossy matches original:    False
Lossless matches original: True

⚠ Watch Out: CSV Precision Loss with fmt='%g'

np.savetxt with fmt='%g' rounds floats to 6 significant digits. A value like 0.000001234567890 becomes 1.23457e-06 in the file, and the remaining digits are gone permanently. If you load that CSV back and run any precision-sensitive calculation, you're working with corrupted data and you won't get an error telling you so. Keep the default fmt='%.18e' (or pass it explicitly) whenever the CSV will feed back into a numerical pipeline. Reserve '%g' for outputs that are genuinely for human reading only.

📊 Production Insight

np.loadtxt cannot handle mixed types — strings and numbers in the same file will throw a ValueError. It also cannot handle missing values — an empty cell fails immediately.

The fmt parameter in np.savetxt is a precision decision, not a formatting preference. '%g' is lossy. '%.18e' is lossless for float64.

Rule: use CSV only when a human or external tool needs to read the file. For everything else in a numerical pipeline, .npy is the right format.

🎯 Key Takeaway

np.loadtxt reads the entire file into memory at once — simple and correct for clean files, but memory-bound and inflexible for real-world data.

fmt='%g' in np.savetxt rounds floats to 6 significant digits; keep the default '%.18e' for full float64 round-trip fidelity.

Always specify dtype= explicitly in np.loadtxt — the default is float64, which may silently convert integer data.

save and load — Fast Binary Format

np.save() and np.load() use NumPy's native binary format, .npy. The format stores three things: a magic number identifying it as NumPy data, a header containing the dtype, shape, and memory order, and the raw bytes of the array exactly as they exist in memory. There is no string conversion, no parsing, no rounding — what goes in comes back out identically.
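To make the three-part layout concrete, the sketch below writes a small array and reads the pieces back one at a time with helpers from numpy.lib.format. The file name and array are made up for the example, and it assumes np.save chose format version 1.0, which is what it does for small headers like this one.

```python
import os
import tempfile

import numpy as np
from numpy.lib import format as npy_format

arr = np.arange(6, dtype=np.float64).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.npy')
    np.save(path, arr)

    with open(path, 'rb') as fp:
        magic = fp.read(6)                      # magic string identifying .npy
        fp.seek(0)
        version = npy_format.read_magic(fp)     # (major, minor) format version
        # Assumes a version 1.0 file; returns the metadata stored in the header
        shape, fortran_order, dtype = npy_format.read_array_header_1_0(fp)
        payload = fp.read()                     # everything left is raw array bytes

print(magic, version)
print(shape, fortran_order, dtype)
print(payload == arr.tobytes())
```

The final comparison shows that the payload is the array's in-memory bytes with no text conversion in between.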

The performance difference versus CSV is substantial enough to matter in real pipelines. A 1GB float64 array saves in roughly one second with np.save() — the bottleneck is disk write speed. The same array with np.savetxt() takes 30 to 60 seconds because every number has to be converted to its string representation. On load, the difference is similar. If your pipeline is spending meaningful time reading and writing feature matrices as CSV files, switching to .npy will make that time essentially disappear.

File size follows the same pattern. A .npy file for a float64 array is exactly 8 bytes per element plus a small header — there's no overhead for decimal points, separators, or newlines. A CSV for the same data is typically two to three times larger depending on the magnitude of the values.
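A quick way to check the size claim on your own data is to write the same array both ways and compare. The array shape and file names below are arbitrary choices for illustration, and the exact ratio will depend on your values and format string.

```python
import os
import tempfile

import numpy as np

arr = np.random.default_rng(0).standard_normal((1000, 10))

with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, 'arr.npy')
    csv_path = os.path.join(tmp, 'arr.csv')
    np.save(npy_path, arr)
    np.savetxt(csv_path, arr, delimiter=',')  # default fmt='%.18e'
    npy_size = os.path.getsize(npy_path)
    csv_size = os.path.getsize(csv_path)

# Whatever is not raw data (8 bytes per float64 element) is the header
header_bytes = npy_size - arr.nbytes

print(f'.npy: {npy_size} bytes ({header_bytes} bytes of header)')
print(f'.csv: {csv_size} bytes ({csv_size / npy_size:.1f}x the .npy size)')
```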

np.savez() extends this to multiple arrays. It creates a .npz archive — which is structurally a zip file — where each array is stored under a key you provide. You load the whole archive with np.load() and access individual arrays by name. np.savez_compressed() applies zlib compression on top of that, which reduces file size further at the cost of CPU time on both save and load. Compression is worth it for archiving or transferring data; it's usually not worth it for data that gets loaded repeatedly during training.

Example · PYTHON

import numpy as np
import time

# ── Single array: save and load ──────────────────────────────────────────────
arr = np.random.randn(1000, 100).astype(np.float64)

np.save('array.npy', arr)
loaded = np.load('array.npy')

print(loaded.shape)                 # (1000, 100)
print(loaded.dtype)                 # float64
print(np.array_equal(arr, loaded))  # True: every element is identical

# ── Multiple arrays bundled into one .npz file ───────────────────────────────
X = np.random.randn(10000, 128).astype(np.float32)  # feature matrix
y = np.random.randint(0, 10, size=10000)             # integer labels
metadata = np.array([128, 10], dtype=np.int32)      # shape metadata

np.savez('dataset.npz', features=X, labels=y, meta=metadata)

bundle = np.load('dataset.npz')
print(list(bundle.keys()))            # ['features', 'labels', 'meta']
print(bundle['features'].shape)       # (10000, 128)
print(bundle['labels'].dtype)         # int64 (platform default integer; may be int32 on Windows with NumPy 1.x)
print(bundle['meta'])                 # [128  10]

# ── Compressed archive for long-term storage ─────────────────────────────────
np.savez_compressed('dataset_compressed.npz', features=X, labels=y)

# ── Rough speed comparison: .npy vs CSV for a 100k x 50 float64 array ────────
large = np.random.randn(100_000, 50)

t0 = time.perf_counter()
np.save('large.npy', large)
t1 = time.perf_counter()
np.savetxt('large.csv', large, fmt='%.18e', delimiter=',')
t2 = time.perf_counter()

print(f'np.save:    {t1 - t0:.3f}s')
print(f'np.savetxt: {t2 - t1:.3f}s')

▶ Output

(1000, 100)
float64
True
['features', 'labels', 'meta']
(10000, 128)
int64
[128  10]
np.save: 0.041s
np.savetxt: 38.7s

💡Pro Tip: Use .npy for ML Feature Exchange Between Pipeline Stages

When your preprocessing stage produces a feature matrix that your training stage consumes, .npy is the correct format. It preserves dtype and shape exactly, loads in seconds rather than minutes for large arrays, and produces smaller files than CSV. The only reason to use CSV at that handoff point is if a non-Python tool needs to consume the data — and even then, consider whether a human actually needs to read it or whether that's just an assumption worth questioning.

📊 Production Insight

.npy is 10-50x faster than CSV for save/load on arrays of any meaningful size. The gap grows with array size because string conversion overhead scales with the number of elements.

.npy preserves dtype, shape, and memory order exactly — load it back and you get an identical ndarray.

np.savez_compressed is worth using for data you'll archive or transfer. It adds CPU overhead on load, so think twice before using it for arrays that get loaded on every training run.

Rule: use .npy for data exchange between pipeline stages. Use .npz when you need to bundle multiple arrays into one file. Use CSV only when a human or external tool genuinely needs to read the output.

🎯 Key Takeaway

.npy stores raw bytes with a small metadata header — no conversion, no precision loss, no surprises.

.npz bundles multiple named arrays into a single zip archive — access them by key after loading.

For large arrays in production pipelines, the time you save switching from CSV to .npy is not marginal — it's often the difference between a pipeline that feels instant and one that has an inexplicable wait in the middle.

genfromtxt — Handling Messy Real-World CSVs

Real-world CSV files are not clean. They have missing values where a sensor didn't record, mixed types because someone exported a database table, comment lines at the top explaining the schema, and inconsistent formatting because the file went through three different tools before it reached you. np.loadtxt() handles none of this — it throws a ValueError the moment it encounters something it can't parse.

np.genfromtxt() is the answer for that middle ground: data that's numerical in intent but imperfect in practice. The key behavioural difference is that genfromtxt fills missing values with a placeholder — NaN by default for floats — rather than aborting. It also supports structured dtypes, which let you mix integer columns with float columns and string columns in the same file, and names=True which reads the header row and gives you column access by name.

The names=True output is a structured array, not a regular ndarray. It behaves differently from what most people expect — operations like data['score'] work, but standard array arithmetic does not apply across the whole structure the way it does with a 2D ndarray. It's a lightweight alternative to a Pandas DataFrame for cases where you need named columns but don't want the Pandas dependency.

For genuinely messy data with millions of rows, complex types, date columns, or anything you'll need to reshape and query heavily, Pandas read_csv is the better tool. genfromtxt sits between np.loadtxt and Pandas: it handles imperfect files, but it's not a DataFrame.

Example · PYTHON

import numpy as np
from io import StringIO

# ── Missing values and mixed types ───────────────────────────────────────────
# This is the kind of CSV that arrives from a data export or sensor logger
csv_data = '''id,score,label
1,0.85,cat
2,,dog
3,0.92,cat
4,0.78,
5,,
'''

# genfromtxt fills missing values with NaN — loadtxt would throw here
data = np.genfromtxt(
    StringIO(csv_data),
    delimiter=',',
    skip_header=1,
    dtype=[('id', 'i4'), ('score', 'f8'), ('label', 'U10')],
    missing_values='',
    filling_values={1: np.nan, 2: 'unknown'}  # per-column fill values
)

print(data['id'])     # [1 2 3 4 5]
print(data['score'])  # [0.85 nan  0.92 0.78  nan]
print(data['label'])  # ['cat' 'dog' 'cat' 'unknown' 'unknown']

# ── names=True: auto-parse headers, get column access by name ─────────────────
data2 = np.genfromtxt(
    StringIO(csv_data),
    delimiter=',',
    names=True,
    dtype=None,
    encoding='utf-8'
)

print(data2.dtype.names)   # ('id', 'score', 'label')
print(data2['score'])      # [0.85  nan  0.92  0.78  nan]

# ── Filtering out NaN rows before analysis ───────────────────────────────────
valid_mask = ~np.isnan(data2['score'])
valid_scores = data2['score'][valid_mask]
print(f'Mean score (valid rows only): {valid_scores.mean():.4f}')  # 0.8500

# ── Comment lines — genfromtxt can skip them automatically ───────────────────
csv_with_comments = '''# Exported from sensor array v2.1
# Units: voltage (V)
0.001, 0.002, 0.003
0.004, 0.005, 0.006
'''

clean = np.genfromtxt(
    StringIO(csv_with_comments),
    delimiter=',',
    comments='#'
)
print(clean)
# [[0.001 0.002 0.003]
#  [0.004 0.005 0.006]]

▶ Output

[1 2 3 4 5]
[0.85 nan 0.92 0.78 nan]
['cat' 'dog' 'cat' 'unknown' 'unknown']
('id', 'score', 'label')
[0.85 nan 0.92 0.78 nan]
Mean score (valid rows only): 0.8500
[[0.001 0.002 0.003]
 [0.004 0.005 0.006]]

🔥When to Use genfromtxt vs Pandas read_csv

Use np.genfromtxt when: your data is primarily numerical with occasional missing values, you want structured array output with named columns, or NumPy is your only dependency and adding Pandas isn't justified. Use Pandas read_csv when: you have complex mixed types including dates or categories, you need to process files in chunks because they exceed RAM, or you'll be doing DataFrame operations — filtering, groupby, merging — after loading. The performance difference between the two for clean numerical data is modest; the capability difference for complex data is significant.

📊 Production Insight

np.loadtxt throws ValueError on the first missing value it encounters — genfromtxt fills it and continues. If your data has any gaps at all, genfromtxt is the correct tool.

The filling_values parameter accepts a dictionary keyed by column index — you can specify different fill values for different columns rather than applying one value globally.

names=True produces a structured array, not a 2D ndarray. Arithmetic across the whole structure won't work the way you expect — treat each column separately or convert to a regular ndarray with np.column_stack() after loading.

Rule: loadtxt for clean, uniform numerical files. genfromtxt for numerical files with gaps or comments. Pandas for anything mixed-type or large enough to need chunked loading.
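The structured-array caveat is easier to see in code. The sketch below loads a tiny made-up CSV with names=True, then converts it to a plain 2D ndarray in two ways: np.column_stack, as mentioned above, and numpy.lib.recfunctions.structured_to_unstructured, which is an alternative not covered in this article. The column names and values are invented for the example.

```python
from io import StringIO

import numpy as np
from numpy.lib import recfunctions as rfn

csv_text = 'height,weight\n1.5,60.0\n1.8,80.0\n'

# names=True gives a 1-D structured array with one field per column
table = np.genfromtxt(StringIO(csv_text), delimiter=',', names=True)
print(table.dtype.names)   # field names taken from the header row
print(table.shape)         # one entry per data row, not (rows, cols)

# Option 1: stack the named columns into a regular 2D array
stacked = np.column_stack([table[name] for name in table.dtype.names])

# Option 2: let NumPy do the conversion (all fields share one dtype here)
plain = rfn.structured_to_unstructured(table)

# Ordinary array arithmetic now works across columns
print(plain.mean(axis=0))
```

Both routes assume every column can share one dtype; with mixed integer, float, and string fields, working column by column is usually the safer option.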

🎯 Key Takeaway

genfromtxt handles missing values by filling with NaN — loadtxt throws on the same input.

names=True parses header row and returns a structured array — column access by name, not index.

For files with millions of rows, complex types, or date columns, Pandas read_csv is the pragmatic choice — genfromtxt is not a DataFrame replacement.

🗂 NumPy File I/O Formats Compared

Choosing the right format for your data pipeline

Format	Function	Speed	Precision	Multiple Arrays	Human Readable
CSV/TSV	np.savetxt / np.loadtxt	Slow: every number converted to string and back	Lossless for float64 with the default '%.18e'; lossy with short formats such as '%g' (6 sig figs)	No: one array per file	Yes: opens in any spreadsheet or text editor
.npy	np.save / np.load	Fast — raw bytes written directly, no conversion	Exact — dtype, shape, and all bytes preserved	No — one array per file	No — binary format
.npz	np.savez / np.load	Fast — raw bytes in a zip container	Exact — no conversion applied	Yes — multiple named arrays in one file	No — binary format
.npz compressed	np.savez_compressed / np.load	Slower — zlib compression adds CPU overhead on save and load	Exact — compression is lossless	Yes — multiple named arrays in one file	No — binary format
Memory-mapped	np.memmap	Lazy — reads from disk only on access, not upfront	Exact — no conversion	No — one array per file	No — binary format

🎯 Key Takeaways

np.loadtxt() and np.savetxt() work with text files. Use delimiter and skiprows for CSV, and specify dtype= explicitly — never assume the default will match your data.
fmt='%g' in np.savetxt is lossy: it rounds to 6 significant digits. Keep the default '%.18e' (or pass it explicitly) for full float64 round-trip fidelity. This single mistake has corrupted ML pipelines in production.
np.save() and np.load() work with binary .npy files — 10 to 50x faster than CSV, exact precision, and smaller on disk. Use this format for any numerical data exchange that doesn't need human readability.
np.savez() bundles multiple named arrays into a single .npz archive. Access them after loading with loaded['name']. Use np.savez_compressed() for archiving or transfer — not for data that loads repeatedly.
np.genfromtxt() handles missing values, mixed types, and comment lines where np.loadtxt would throw. For truly complex tabular data with millions of rows, Pandas read_csv is the more appropriate tool.
np.memmap creates a memory-mapped array that reads from disk on access — use it when the array is too large to fit in RAM.
Validate loaded data with np.allclose(original, loaded) at pipeline boundaries — a successful load does not mean the values are correct.

⚠ Common Mistakes to Avoid

✕Using CSV for numerical data exchange between pipeline stages

Symptom

Precision loss accumulates silently across stages — model accuracy drops, simulation results drift. Values like 0.000001234567890 get truncated to 1.23457e-06 with fmt='%g'. No exception is thrown anywhere in the pipeline.

Fix

Switch to .npy format for any data handoff that doesn't need human readability: np.save('features.npy', X) on one side, np.load('features.npy') on the other. If CSV is genuinely required for audit or external tools, use fmt='%.18e' in np.savetxt and validate with np.allclose() after loading before proceeding.

✕Not specifying dtype in np.loadtxt

Symptom

Array loads as float64 when the original data was int32, doubling memory usage. Or integer IDs load as floats and downstream integer operations fail. The default dtype=float is applied silently — no warning.

Fix

Always specify dtype= explicitly: np.loadtxt('data.csv', dtype=np.float64) or np.loadtxt('ids.csv', dtype=np.int32). Verify immediately after loading with print(loaded.dtype) and assert loaded.dtype == expected_dtype in any pipeline that depends on a specific type.
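A minimal sketch of the difference, using an in-memory file of made-up integer IDs in place of 'ids.csv':

```python
from io import StringIO

import numpy as np

ids_text = '101\n102\n103\n'

# Without dtype=, integer data comes back as float64
default_ids = np.loadtxt(StringIO(ids_text))
print(default_ids.dtype, default_ids)

# With dtype=, the array has the type the pipeline expects
expected_dtype = np.int32
typed_ids = np.loadtxt(StringIO(ids_text), dtype=expected_dtype)
print(typed_ids.dtype, typed_ids)

# Fail fast if the type is ever wrong
assert typed_ids.dtype == expected_dtype
```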

✕Using np.loadtxt on files with missing values

Symptom

ValueError: could not convert string to float — the entire load operation fails on the first empty cell. This happens with any sensor data, exported database table, or user-generated CSV that wasn't manually cleaned first.

Fix

Use np.genfromtxt() instead — it fills missing values with NaN by default and continues loading rather than aborting. Specify filling_values if you need something other than NaN for specific columns. For truly complex files, Pandas read_csv handles missing values natively and gives you more control.

✕Loading a large CSV file with np.loadtxt on a memory-constrained machine

Symptom

MemoryError on load, or system memory usage spikes and the process is killed. np.loadtxt reads the entire file into a single ndarray before returning — there is no streaming or chunked reading.

Fix

For binary data that exceeds RAM, use np.memmap which reads lazily from disk. For CSV files specifically, Pandas read_csv with the chunksize parameter processes the file in manageable pieces. The most practical fix if you control the data format: switch to .npy — the same data is smaller on disk and loads without the string-parsing overhead.
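If Pandas is not available, one possible way to convert a large CSV to .npy without holding the whole file in memory is to parse it a block of lines at a time and write into a disk-backed .npy created with numpy.lib.format.open_memmap. This is only a sketch under some assumptions: the helper name csv_to_npy_in_chunks is made up, the file is clean comma-separated float data with no header, and the row and column counts are known in advance (for example from wc -l).

```python
import itertools
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap


def csv_to_npy_in_chunks(csv_path, npy_path, n_rows, n_cols, chunk_rows=1000):
    """Hypothetical helper: stream a clean numeric CSV into a .npy file."""
    # open_memmap creates a real .npy file whose data lives on disk
    out = open_memmap(npy_path, mode='w+', dtype=np.float64, shape=(n_rows, n_cols))
    row = 0
    with open(csv_path) as fh:
        while True:
            lines = list(itertools.islice(fh, chunk_rows))
            if not lines:
                break
            # loadtxt accepts a list of lines; ndmin=2 keeps one-row chunks 2D
            block = np.loadtxt(lines, delimiter=',', dtype=np.float64, ndmin=2)
            out[row:row + len(block)] = block
            row += len(block)
    out.flush()
    del out  # release the mapping so the file can be reopened or removed
    return row


# Small demonstration with synthetic data
data = np.random.default_rng(0).standard_normal((2500, 4))

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'big.csv')
    npy_path = os.path.join(tmp, 'big.npy')
    np.savetxt(csv_path, data, delimiter=',')  # default '%.18e' keeps full precision
    rows_written = csv_to_npy_in_chunks(csv_path, npy_path, 2500, 4)
    converted = np.load(npy_path)

print(rows_written, converted.shape)
```

Only one chunk of parsed rows is in memory at a time, so peak usage is set by chunk_rows rather than by the file size.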

✕Indexing a loaded .npz file as if it were an ndarray

Symptom

TypeError or unexpected AttributeError when trying to slice or operate on the result of np.load('file.npz'). The returned object is an NpzFile, which is a dictionary-like container — not an array.

Fix

Access arrays by their saved name: bundle = np.load('file.npz'); features = bundle['features']. Check available keys with list(bundle.keys()). If you're saving a single array and want np.load to return an ndarray directly, use np.save() to produce a .npy file instead of np.savez().
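A short sketch of the access pattern, using made-up arrays and a temporary file. The with block closes the underlying zip file once the arrays have been read.

```python
import os
import tempfile

import numpy as np

X = np.zeros((5, 3), dtype=np.float32)
y = np.arange(5)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'file.npz')
    np.savez(path, features=X, labels=y)

    # np.load returns an NpzFile; it can be used as a context manager
    with np.load(path) as bundle:
        keys = sorted(bundle.files)     # names of the stored arrays
        features = bundle['features']   # each lookup returns a real ndarray
        labels = bundle['labels']

print(keys)
print(type(features).__name__, features.shape)
print(labels)
```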

Interview Questions on This Topic

Q: What is the advantage of .npy over CSV for large NumPy arrays? (Junior)
.npy stores the raw bytes of the array with a small header containing dtype, shape, and memory order metadata, so no string conversion happens in either direction. This makes save and load 10 to 50 times faster than CSV, preserves exact floating-point precision without any rounding, and produces files that are typically two to three times smaller than an equivalent CSV. CSV requires converting every number to a string representation on save and parsing every string back to a number on load. That overhead is slow, and it is lossy for float64 whenever the format string keeps fewer than 17 significant digits.
Q: How do you save multiple arrays in a single file with NumPy? (Junior)
Use np.savez('file.npz', name1=array1, name2=array2) — each keyword argument becomes a named entry in the archive. Load the file with np.load('file.npz'), which returns an NpzFile object. Access individual arrays by key: loaded['name1']. To see all available keys, use list(loaded.keys()). If you also need to minimise disk usage for archiving or transfer, np.savez_compressed() applies zlib compression at the cost of slower save and load times.
Q: What is the difference between np.loadtxt and np.genfromtxt? (Mid-level)
np.loadtxt is strict — it requires every row to have the same number of columns, every cell to be convertible to the specified dtype, and no missing values. A single empty cell throws ValueError. np.genfromtxt is lenient — it fills missing values with NaN by default rather than aborting, supports structured dtypes that mix integer, float, and string columns, handles comment lines via the comments parameter, and can parse header rows into named columns with names=True. Use loadtxt for clean, uniform numerical files. Use genfromtxt when the file has gaps, mixed types, or comment lines.
Q: When would you use np.memmap instead of np.load? (Senior)
Use np.memmap when the array is too large to fit in RAM and you need random access to it without loading everything at once. np.memmap maps the file into virtual memory — when you access a slice, only that portion is read from disk. The array behaves like a regular ndarray for reads and writes, but the data lives on disk rather than in RAM. Typical use cases: very large simulation checkpoints, feature matrices that exceed available memory, or datasets where you need to iterate over batches without keeping the full array in memory. The file must be in binary format — memmap doesn't work with CSV.

Frequently Asked Questions

What is the difference between .npy and .npz files?

.npy stores a single array in NumPy's binary format — dtype, shape, and raw bytes, nothing else. .npz is a zip archive that stores multiple arrays by name, created with np.savez() or np.savez_compressed(). Both are loaded with np.load(). Loading a .npy file gives you an ndarray directly. Loading a .npz file gives you an NpzFile object that you index by key: loaded['array_name']. If you have one array and want the simplest load experience, use .npy.

How do I load a CSV file that has a header row?

Use skiprows=1 in np.loadtxt() to skip the first row. If you want to access columns by their header names rather than by index, use np.genfromtxt() with names=True — it parses the header and returns a structured array where each column is accessible by name: data['column_name']. For anything more complex — multiple header rows, non-standard delimiters, mixed types — Pandas read_csv handles it more cleanly.
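Both approaches side by side, using a small in-memory CSV with invented column names:

```python
from io import StringIO

import numpy as np

csv_text = 'temp,pressure\n20.5,101.3\n21.0,100.9\n'

# Option 1: skip the header and index columns by position
by_index = np.loadtxt(StringIO(csv_text), delimiter=',', skiprows=1)
print(by_index[:, 0])

# Option 2: parse the header and index columns by name
by_name = np.genfromtxt(StringIO(csv_text), delimiter=',', names=True)
print(by_name.dtype.names)
print(by_name['temp'])
```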

How do I preserve full float64 precision when saving to CSV?

Use fmt='%.18e' in np.savetxt, which is also the default if you don't pass fmt. It writes each number in scientific notation with 18 decimal places, which is sufficient to reconstruct the exact float64 value on load. Shorter formats such as '%g' write only 6 significant digits, which is enough for display but lossy for numerical pipelines. If precision matters, validate after loading with np.array_equal(original, loaded), or with np.allclose(original, loaded, rtol=1e-15, atol=0) so the default absolute tolerance doesn't mask errors in small values. Better yet, switch to .npy format and skip the precision question entirely.

Can I use np.loadtxt for files with missing values?

No — np.loadtxt throws ValueError on the first empty cell or unparseable value it encounters. Use np.genfromtxt() instead, which fills missing values with NaN by default and continues loading. You can override the fill value per column using the filling_values parameter. For CSV files with complex missing value patterns, Pandas read_csv with na_values gives you more control and handles edge cases that genfromtxt may not.
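The contrast can be reproduced with a two-row file that has one empty cell. The data and the fill value of -1.0 are arbitrary choices for the illustration.

```python
from io import StringIO

import numpy as np

text = '1.0,2.0\n3.0,\n'   # second row is missing its last value

# loadtxt refuses the empty cell
try:
    np.loadtxt(StringIO(text), delimiter=',')
    loadtxt_failed = False
except ValueError as err:
    loadtxt_failed = True
    print(f'loadtxt: {err}')

# genfromtxt fills the gap with NaN by default
filled_nan = np.genfromtxt(StringIO(text), delimiter=',')
print(filled_nan)

# filling_values replaces the default NaN with a value of your choice
filled_custom = np.genfromtxt(StringIO(text), delimiter=',', filling_values=-1.0)
print(filled_custom)
```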

What is np.memmap and when should I use it?

np.memmap creates a memory-mapped array backed by a file on disk. Rather than reading the entire file into RAM upfront, it reads only the portions you access — slices, rows, or elements — on demand from disk. The array behaves like a regular ndarray for both reads and writes, but writes go to the file rather than RAM. Use np.memmap when your array is too large to fit in available memory, when you need random access to a large binary dataset without loading it entirely, or when multiple processes need to share a large array without duplicating it in memory.
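One convenient route to a memory-mapped array, when the data is already in a .npy file, is np.load with mmap_mode='r', which returns an np.memmap instead of reading everything. The sketch below uses a deliberately small array and a temporary file so it runs anywhere; with a genuinely large file the access pattern is the same.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'big.npy')
    np.save(path, np.arange(1_000_000, dtype=np.float64).reshape(1000, 1000))

    # mmap_mode='r' maps the file read-only instead of loading it into RAM
    mapped = np.load(path, mmap_mode='r')
    is_memmap = isinstance(mapped, np.memmap)

    # Only the rows touched by this slice are read from disk
    block_sum = float(mapped[10:12].sum())

    del mapped  # release the mapping before the temporary file is removed

print(is_memmap, block_sum)
```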

Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
