NumPy loadtxt and savetxt — Reading and Writing Array Data
- np.loadtxt() reads CSV/TSV text files into arrays — use delimiter and skiprows for control
- np.save() writes binary .npy files — 10-50x faster than CSV for large arrays
- np.savez() bundles multiple named arrays into a single .npz archive
- np.savez_compressed() adds zlib compression — slower to load but smaller on disk
- Binary formats preserve dtype exactly — CSV loses precision on floats unless you use fmt='%.18e'
- For mixed-type tabular data with headers, Pandas read_csv is the more appropriate tool
Production Debug Guide — when your loaded arrays don't match what you saved, and the error is silent:

- Wrong dtype after loading CSV: inspect with print(loaded.dtype, loaded.shape), then reload with an explicit dtype — np.loadtxt('data.csv', dtype=np.float64, delimiter=',').
- Precision loss between saved and loaded data: check np.allclose(original, loaded, rtol=1e-15) and measure the worst-case drift with np.max(np.abs(original - loaded)).
- MemoryError loading a large CSV: compare the file size against available memory with ls -lh data.csv && free -h, and count rows with wc -l data.csv.

loadtxt and savetxt — Text Files
np.loadtxt() and np.savetxt() are the entry point most people find first, and for good reason — they work with plain text files that any editor, spreadsheet, or downstream tool can open. But they come with constraints that matter the moment you move beyond toy examples.
np.loadtxt() reads the entire file into a single ndarray. Every row must have the same number of columns, and the entire file must be a single dtype — if you mix strings and numbers in the same CSV, it will throw a ValueError. It also reads everything into memory at once, which means a 10GB file needs 10GB of RAM available before you get a single array out.
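Both constraints are easy to see in a minimal sketch — here using StringIO in place of real files:

```python
import numpy as np
from io import StringIO

# Homogeneous numeric rows parse fine
ok = np.loadtxt(StringIO('1.0,2.0\n3.0,4.0'), delimiter=',')
print(ok.shape)  # (2, 2)

# A string column in the same file aborts immediately
try:
    np.loadtxt(StringIO('1.0,cat\n3.0,dog'), delimiter=',')
except ValueError as e:
    print(f'ValueError: {e}')
```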
np.savetxt() writes an ndarray to a text file. The fmt parameter controls how each number is formatted — and this is where the silent precision bug lives. The default '%g' format writes at most 6 significant digits, which is fine for displaying numbers but genuinely lossy for float64 values with more precision than that. If you're using np.savetxt for anything that feeds back into a numerical pipeline, switch fmt to '%.18e', which preserves the full float64 round-trip.
For files that are intended for human inspection — a small sample of predictions, a validation output you're sharing with a stakeholder — CSV is the right choice. For data exchange between pipeline stages, it almost never is.
```python
import numpy as np

# ── Basic save/load round-trip ───────────────────────────────────────────────
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

# Save with header and two decimal places
np.savetxt('data.csv', data, delimiter=',', header='a,b,c', fmt='%.2f')

# Load — skiprows=1 skips the header line
loaded = np.loadtxt('data.csv', delimiter=',', skiprows=1)
print(loaded)
# [[1. 2. 3.]
#  [4. 5. 6.]]

# ── Load only specific columns ───────────────────────────────────────────────
# usecols accepts a single index or a tuple of indices
col_a = np.loadtxt('data.csv', delimiter=',', skiprows=1, usecols=0)
print(col_a)  # [1. 4.]

cols_ab = np.loadtxt('data.csv', delimiter=',', skiprows=1, usecols=(0, 1))
print(cols_ab)
# [[1. 2.]
#  [4. 5.]]

# ── Precision comparison: '%g' vs '%.18e' ────────────────────────────────────
small_value = np.array([0.000001234567890123456])

np.savetxt('lossy.csv', small_value, fmt='%g')
np.savetxt('lossless.csv', small_value, fmt='%.18e')

lossy = np.loadtxt('lossy.csv')
lossless = np.loadtxt('lossless.csv')

print(f'Original: {small_value[0]:.20f}')
print(f'Lossy %g: {lossy[0]:.20f}')
print(f'Lossless: {lossless[0]:.20f}')
print(f'Lossy matches original: {np.allclose(small_value, lossy, rtol=1e-15)}')
print(f'Lossless matches original: {np.allclose(small_value, lossless, rtol=1e-15)}')
```
Output:

```text
Original: 0.00000123456789012346
Lossy %g: 0.00000123457000000000
Lossless: 0.00000123456789012346
Lossy matches original: False
Lossless matches original: True
```
save and load — Fast Binary Format
np.save() and np.load() use NumPy's native binary format, .npy. The format stores three things: a magic number identifying it as NumPy data, a header containing the dtype, shape, and memory order, and the raw bytes of the array exactly as they exist in memory. There is no string conversion, no parsing, no rounding — what goes in comes back out identically.
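You can inspect that structure directly by reading the file's first bytes. A minimal sketch, assuming the default version 1.0 header that np.save writes for ordinary arrays:

```python
import numpy as np

arr = np.arange(6, dtype=np.float64).reshape(2, 3)
np.save('demo.npy', arr)

with open('demo.npy', 'rb') as f:
    print(f.read(6))                                  # b'\x93NUMPY' — the magic number
    major, minor = f.read(2)                          # format version, e.g. 1 and 0
    header_len = int.from_bytes(f.read(2), 'little')  # header size (version 1.0 layout)
    print(f.read(header_len).decode('ascii').strip())
    # {'descr': '<f8', 'fortran_order': False, 'shape': (2, 3), }
    print(np.frombuffer(f.read(), dtype=np.float64))  # the raw bytes: [0. 1. 2. 3. 4. 5.]
```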
The performance difference versus CSV is substantial enough to matter in real pipelines. A 1GB float64 array saves in roughly one second with np.save() — the bottleneck is disk write speed. The same array with np.savetxt() takes 30 to 60 seconds because every number has to be converted to its string representation. On load, the difference is similar. If your pipeline is spending meaningful time reading and writing feature matrices as CSV files, switching to .npy will make that time essentially disappear.
File size follows the same pattern. A .npy file for a float64 array is exactly 8 bytes per element plus a small header — there's no overhead for decimal points, separators, or newlines. A CSV for the same data is typically two to three times larger depending on the magnitude of the values.
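If you want to verify this on your own data, a quick size check works — a sketch; the exact CSV ratio depends on your values and format string:

```python
import numpy as np
import os

arr = np.random.randn(10_000, 10)  # 100,000 float64 values = 800,000 bytes in memory
np.save('size_check.npy', arr)
np.savetxt('size_check.csv', arr, fmt='%.18e', delimiter=',')

npy = os.path.getsize('size_check.npy')
csv = os.path.getsize('size_check.csv')
print(f'.npy: {npy:,} bytes ({npy - arr.nbytes} bytes of header overhead)')
print(f'.csv: {csv:,} bytes ({csv / npy:.1f}x larger)')
```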
np.savez() extends this to multiple arrays. It creates a .npz archive — which is structurally a zip file — where each array is stored under a key you provide. You load the whole archive with np.load() and access individual arrays by name. np.savez_compressed() applies zlib compression on top of that, which reduces file size further at the cost of CPU time on both save and load. Compression is worth it for archiving or transferring data; it's usually not worth it for data that gets loaded repeatedly during training.
```python
import numpy as np
import time

# ── Single array: save and load ──────────────────────────────────────────────
arr = np.random.randn(1000, 100).astype(np.float64)
np.save('array.npy', arr)

loaded = np.load('array.npy')
print(loaded.shape)              # (1000, 100)
print(loaded.dtype)              # float64
print(np.allclose(arr, loaded))  # True — exact round-trip

# ── Multiple arrays bundled into one .npz file ───────────────────────────────
X = np.random.randn(10000, 128).astype(np.float32)  # feature matrix
y = np.random.randint(0, 10, size=10000)            # integer labels
metadata = np.array([128, 10], dtype=np.int32)      # shape metadata

np.savez('dataset.npz', features=X, labels=y, meta=metadata)

bundle = np.load('dataset.npz')
print(list(bundle.keys()))       # ['features', 'labels', 'meta']
print(bundle['features'].shape)  # (10000, 128)
print(bundle['labels'].dtype)    # int64
print(bundle['meta'])            # [128  10]

# ── Compressed archive for long-term storage ─────────────────────────────────
np.savez_compressed('dataset_compressed.npz', features=X, labels=y)

# ── Rough speed comparison: .npy vs CSV for a 100k x 50 float64 array ────────
large = np.random.randn(100_000, 50)

t0 = time.perf_counter()
np.save('large.npy', large)
t1 = time.perf_counter()
np.savetxt('large.csv', large, fmt='%.18e', delimiter=',')
t2 = time.perf_counter()

print(f'np.save:    {t1 - t0:.3f}s')
print(f'np.savetxt: {t2 - t1:.3f}s')
```
Output:

```text
np.save:    0.041s
np.savetxt: 38.7s
```
genfromtxt — Handling Messy Real-World CSVs
Real-world CSV files are not clean. They have missing values where a sensor didn't record, mixed types because someone exported a database table, comment lines at the top explaining the schema, and inconsistent formatting because the file went through three different tools before it reached you. np.loadtxt() handles none of this — it throws a ValueError the moment it encounters something it can't parse.
np.genfromtxt() is the answer for that middle ground: data that's numerical in intent but imperfect in practice. The key behavioural difference is that genfromtxt fills missing values with a placeholder — NaN by default for floats — rather than aborting. It also supports structured dtypes, which let you mix integer, float, and string columns in the same file, and names=True, which reads the header row and gives you column access by name.
The names=True output is a structured array, not a regular ndarray. It behaves differently from what most people expect — operations like data['score'] work, but standard array arithmetic does not apply across the whole structure the way it does with a 2D ndarray. If you need a regular 2D array, combine the numeric fields with np.column_stack() after loading, as shown below. It's a lightweight alternative to a Pandas DataFrame for cases where you need named columns but don't want the Pandas dependency.
For genuinely messy data with millions of rows, complex types, date columns, or anything you'll need to reshape and query heavily, Pandas read_csv is the better tool. genfromtxt sits between np.loadtxt and Pandas: it handles imperfect files, but it's not a DataFrame.
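Before the full example, here's a small sketch of that structured-array behaviour and the np.column_stack() conversion mentioned above:

```python
import numpy as np
from io import StringIO

# A structured array, as produced by genfromtxt with names=True
data = np.genfromtxt(StringIO('a,b\n1.0,2.0\n3.0,4.0'), delimiter=',', names=True)

print(data['a'])   # [1. 3.] — field access works
print(data.shape)  # (2,)    — one record per row, not a 2D grid

# data * 2 would raise: arithmetic doesn't apply across a structured dtype.
# Stack the fields you need to get back to a regular ndarray:
matrix = np.column_stack([data['a'], data['b']])
print(matrix * 2)  # regular 2D arithmetic works again
```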
```python
import numpy as np
from io import StringIO

# ── Missing values and mixed types ───────────────────────────────────────────
# This is the kind of CSV that arrives from a data export or sensor logger
csv_data = '''id,score,label
1,0.85,cat
2,,dog
3,0.92,cat
4,0.78,
5,,
'''

# genfromtxt fills missing values with NaN — loadtxt would throw here
data = np.genfromtxt(
    StringIO(csv_data),
    delimiter=',',
    skip_header=1,
    dtype=[('id', 'i4'), ('score', 'f8'), ('label', 'U10')],
    missing_values='',
    filling_values={1: np.nan, 2: 'unknown'}  # per-column fill values
)

print(data['id'])     # [1 2 3 4 5]
print(data['score'])  # [0.85 nan 0.92 0.78 nan]
print(data['label'])  # ['cat' 'dog' 'cat' 'unknown' 'unknown']

# ── names=True: auto-parse headers, get column access by name ────────────────
data2 = np.genfromtxt(
    StringIO(csv_data),
    delimiter=',',
    names=True,
    dtype=None,
    encoding='utf-8'
)

print(data2.dtype.names)  # ('id', 'score', 'label')
print(data2['score'])     # [0.85 nan 0.92 0.78 nan]

# ── Filtering out NaN rows before analysis ───────────────────────────────────
valid_mask = ~np.isnan(data2['score'])
valid_scores = data2['score'][valid_mask]
print(f'Mean score (valid rows only): {valid_scores.mean():.4f}')  # 0.8500

# ── Comment lines — genfromtxt can skip them automatically ───────────────────
csv_with_comments = '''# Exported from sensor array v2.1
# Units: voltage (V)
0.001, 0.002, 0.003
0.004, 0.005, 0.006
'''

clean = np.genfromtxt(
    StringIO(csv_with_comments),
    delimiter=',',
    comments='#'
)
print(clean)
# [[0.001 0.002 0.003]
#  [0.004 0.005 0.006]]
```
| Format | Function | Speed | Precision | Multiple Arrays | Human Readable |
|---|---|---|---|---|---|
| CSV/TSV | np.savetxt / np.loadtxt | Slow — every number converted to string and back | Lossy with '%g' (6 sig figs), lossless only with '%.18e' | No — one array per file | Yes — opens in any spreadsheet or text editor |
| .npy | np.save / np.load | Fast — raw bytes written directly, no conversion | Exact — dtype, shape, and all bytes preserved | No — one array per file | No — binary format |
| .npz | np.savez / np.load | Fast — raw bytes in a zip container | Exact — no conversion applied | Yes — multiple named arrays in one file | No — binary format |
| .npz compressed | np.savez_compressed / np.load | Slower — zlib compression adds CPU overhead on save and load | Exact — compression is lossless | Yes — multiple named arrays in one file | No — binary format |
| Memory-mapped | np.memmap | Lazy — reads from disk only on access, not upfront | Exact — no conversion | No — one array per file | No — binary format |
🎯 Key Takeaways
- np.loadtxt() and np.savetxt() work with text files. Use delimiter and skiprows for CSV, and specify dtype= explicitly — never assume the default will match your data.
- fmt='%g' in np.savetxt is lossy — it truncates to 6 significant digits. Use '%.18e' for full float64 round-trip fidelity. This single mistake has corrupted ML pipelines in production.
- np.save() and np.load() work with binary .npy files — 10 to 50x faster than CSV, exact precision, and smaller on disk. Use this format for any numerical data exchange that doesn't need human readability.
- np.savez() bundles multiple named arrays into a single .npz archive. Access them after loading with loaded['name']. Use np.savez_compressed() for archiving or transfer — not for data that loads repeatedly.
- np.genfromtxt() handles missing values, mixed types, and comment lines where np.loadtxt would throw. For truly complex tabular data with millions of rows, Pandas read_csv is the more appropriate tool.
- np.memmap creates a memory-mapped array that reads from disk on access — use it when the array is too large to fit in RAM.
- Validate loaded data with np.allclose(original, loaded) at pipeline boundaries — a successful load does not mean the values are correct.
Interview Questions on This Topic
- Q: What is the advantage of .npy over CSV for large NumPy arrays? (Junior)
- Q: How do you save multiple arrays in a single file with NumPy? (Junior)
- Q: What is the difference between np.loadtxt and np.genfromtxt? (Mid-level)
- Q: When would you use np.memmap instead of np.load? (Senior)
Frequently Asked Questions
What is the difference between .npy and .npz files?
.npy stores a single array in NumPy's binary format — dtype, shape, and raw bytes, nothing else. .npz is a zip archive that stores multiple arrays by name, created with np.savez() or np.savez_compressed(). Both are loaded with np.load(). Loading a .npy file gives you an ndarray directly. Loading a .npz file gives you an NpzFile object that you index by key: loaded['array_name']. If you have one array and want the simplest load experience, use .npy.
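A minimal sketch of that difference at load time:

```python
import numpy as np

a = np.arange(3)
np.save('one.npy', a)                        # .npy: loads straight to ndarray
np.savez('many.npz', first=a, second=a * 2)  # .npz: loads to a keyed archive

print(type(np.load('one.npy')).__name__)     # ndarray
bundle = np.load('many.npz')
print(type(bundle).__name__)                 # NpzFile
print(bundle['second'])                      # [0 2 4]
bundle.close()                               # NpzFile holds the zip file open
```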
How do I load a CSV file that has a header row?
Use skiprows=1 in np.loadtxt() to skip the first row. If you want to access columns by their header names rather than by index, use np.genfromtxt() with names=True — it parses the header and returns a structured array where each column is accessible by name: data['column_name']. For anything more complex — multiple header rows, non-standard delimiters, mixed types — Pandas read_csv handles it more cleanly.
How do I preserve full float64 precision when saving to CSV?
Use fmt='%.18e' in np.savetxt — this writes each number in scientific notation with 18 decimal places, which is sufficient to reconstruct the exact float64 value on load. The default fmt='%g' writes only 6 significant digits, which is enough for display but lossy for numerical pipelines. If precision matters, validate after loading with np.allclose(original, loaded, rtol=1e-15). Better yet, switch to .npy format and skip the precision question entirely.
Can I use np.loadtxt for files with missing values?
No — np.loadtxt throws ValueError on the first empty cell or unparseable value it encounters. Use np.genfromtxt() instead, which fills missing values with NaN by default and continues loading. You can override the fill value per column using the filling_values parameter. For CSV files with complex missing value patterns, Pandas read_csv with na_values gives you more control and handles edge cases that genfromtxt may not.
What is np.memmap and when should I use it?
np.memmap creates a memory-mapped array backed by a file on disk. Rather than reading the entire file into RAM upfront, it reads only the portions you access — slices, rows, or elements — on demand from disk. The array behaves like a regular ndarray for both reads and writes, but writes go to the file rather than RAM. Use np.memmap when your array is too large to fit in available memory, when you need random access to a large binary dataset without loading it entirely, or when multiple processes need to share a large array without duplicating it in memory.
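A minimal sketch, with a placeholder filename and shape:

```python
import numpy as np

# Create an 80MB file-backed array without allocating 80MB of RAM upfront
mm = np.memmap('big.dat', dtype=np.float64, mode='w+', shape=(1_000_000, 10))
mm[0] = np.arange(10)  # writes land in the file via the OS page cache
mm.flush()             # force dirty pages to disk
del mm

# Re-open read-only — only the slices you touch are actually read
ro = np.memmap('big.dat', dtype=np.float64, mode='r', shape=(1_000_000, 10))
print(ro[0])                      # [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
print(ro[500_000:500_010].sum())  # a tiny window of a large file, read on demand
```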