Senior 8 min · March 16, 2026
NumPy loadtxt and savetxt — Reading and Writing Array Data

NumPy savetxt: Float64 Truncated by fmt='%g'

Model accuracy dropped from 94% to 61% because savetxt was called with fmt='%g', which truncated float64 features to 6 significant digits.

Naren Founder & Principal Engineer

20+ years shipping production Python across data and backend systems. Notes here come from systems that actually shipped.

Production
production tested
June 10, 2026
last updated
Quick Answer
  • np.loadtxt() reads CSV/TSV text files into arrays — use delimiter and skiprows for control
  • np.save() writes binary .npy files — 10-50x faster than CSV for large arrays
  • np.savez() bundles multiple named arrays into a single .npz archive
  • np.savez_compressed() adds zlib compression — slower to load but smaller on disk
  • Binary formats preserve dtype exactly; CSV loses float precision when written with a short format such as fmt='%g', so keep the savetxt default fmt='%.18e' for round-trips
  • For mixed-type tabular data with headers, Pandas read_csv is the more appropriate tool
✦ Definition~90s read
What is NumPy loadtxt and savetxt?

NumPy's savetxt is a convenience function for writing NumPy arrays to plain-text files like CSVs. It exists because you often need human-readable output or compatibility with non-Python tools (Excel, R, databases). Its default format specifier is '%.18e', which round-trips float64 exactly, but the compact '%g' specifier that is often passed to get readable files truncates float64 values to 6 significant digits, silently losing precision.

This is a footgun: if you save a 64-bit float with fmt='%g', then load it back with loadtxt, you get a float64 array whose values differ from the originals by a relative error on the order of 1e-6. For scientific computing where double precision matters, this is catastrophic.

The fix is trivial: keep the default fmt='%.18e' or use fmt='%.17g' to preserve full float64 precision. savetxt is part of NumPy's text I/O trio: loadtxt/savetxt for clean rectangular data, genfromtxt for messy real-world CSVs with missing values, and save/load for fast lossless binary (.npy format). Use binary when speed and precision matter; use savetxt only when you need text output, and never with a bare '%g' for float64 data.

Plain-English First

Imagine you have a notebook full of numbers. Writing them by hand on paper (CSV) is slow, and you might lose precision rounding decimals — especially the very small ones. Taking a photograph of the page (.npy) is instant and captures every digit exactly as it appeared. NumPy's file I/O is fundamentally that choice: do you need something a human can open in a spreadsheet, or do you need speed and precision between machines? Most production pipelines need the latter, and most teams default to the former out of habit.

NumPy's savetxt with fmt='%g' truncates float64 values to 6 significant digits, silently corrupting data when you round-trip through text files. This caused a model's accuracy to drop from 94% to 61% in one production pipeline: the precision loss from saving feature arrays as CSV propagated into training. The fix is a one-argument change to the format string (or simply keeping the default '%.18e'), but understanding when to use text vs. binary I/O prevents this class of bug entirely.

What NumPy savetxt Actually Does to Your Float64 Data

NumPy's savetxt writes array data to a text file using a printf-style format string. The default is '%.18e', which is verbose but lossless for float64. The trouble starts when that default is overridden with '%g' to make the file compact: '%g' keeps only 6 significant decimal digits and silently discards the rest. The core mechanic: savetxt converts each element to a string via the format specifier, then writes rows separated by a delimiter. It is the inverse of loadtxt, which parses text back into arrays.

'%g' is the printf-style general format: it switches between fixed-point and scientific notation based on magnitude, and caps precision at 6 significant digits. A float64 carries about 15–17 significant digits. This mismatch means that round-tripping data through savetxt with fmt='%g' introduces relative errors on the order of 1e-6, which compound in iterative algorithms or when comparing results.
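A minimal sketch of what the format specifier alone does to a value, using plain Python % formatting (the same printf-style conversion savetxt applies to each element). The sample value is arbitrary and chosen only for illustration.

```python
# Illustrative value with more than 6 significant digits (arbitrary choice).
x = 0.1234567890123456

short = '%g' % x      # general format: 6 significant digits
full = '%.17g' % x    # 17 significant digits

print(short)                 # 0.123457
print(float(short) == x)     # False: the trailing digits were discarded
print(float(full) == x)      # True: 17 digits recover the same float64
```

The same comparison applies to every element savetxt writes, which is why the choice of fmt decides whether a text round-trip is exact.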

Use savetxt when you need human-readable output or interoperability with non-NumPy tools (e.g., CSV for Excel). But do not rely on a shortened format for lossless storage of float64 data. For precision-critical pipelines such as financial calculations, sensor calibration, and simulation checkpoints, use NumPy's binary .npy format or HDF5. A bare '%g' is a trap for the unwary.

Precision Loss Is Silent
No warning or error is raised when savetxt truncates your float64 data. The file looks correct but contains less precision than your array.
Production Insight
A team stored 100,000 sensor calibration coefficients with savetxt and fmt='%g'. After reloading, the reconstructed calibration drifted by 0.01%, enough to fail a quality gate.
The symptom: unit tests comparing original vs. round-tripped arrays failed with small but consistent differences, not random noise.
Rule: use fmt='%.17g' or the default '%.18e' for float64 round-tripping, or use binary formats for lossless storage.
Key Takeaway
fmt='%g' truncates float64 to 6 significant digits, which is not enough for most scientific or financial data.
When precision matters, keep the default '%.18e' or specify an explicit 17-digit format (e.g., '%.17g').
For lossless round-tripping, prefer .npy or HDF5 over text formats.
[Figure: NumPy savetxt float64 truncation with fmt='%g'. Flow: savetxt with '%g' writes about 6 significant figures, float64 precision is lost, and loadtxt reads the truncated values; specifying fmt='%.18e' preserves full float64 precision and gives a correct savetxt + loadtxt round-trip.]

loadtxt and savetxt — Text Files

np.loadtxt() and np.savetxt() are the entry point most people find first, and for good reason — they work with plain text files that any editor, spreadsheet, or downstream tool can open. But they come with constraints that matter the moment you move beyond toy examples.

np.loadtxt() reads the entire file into a single ndarray. Every row must have the same number of columns, and the entire file must be a single dtype — if you mix strings and numbers in the same CSV, it will throw a ValueError. It also reads everything into memory at once, which means a 10GB file needs 10GB of RAM available before you get a single array out.
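A small sketch of the single-dtype constraint, using an in-memory file so nothing touches disk. The two-row file contents are hypothetical.

```python
import numpy as np
from io import StringIO

# Hypothetical two-row file where the second column is text.
mixed_text = "1.0,cat\n2.0,dog\n"

try:
    np.loadtxt(StringIO(mixed_text), delimiter=',')
    failed = False
except ValueError as exc:
    failed = True
    print(f'loadtxt rejected the file: {exc}')

# Restricting the read to the numeric column works.
numeric_only = np.loadtxt(StringIO(mixed_text), delimiter=',', usecols=0)
print(numeric_only)  # [1. 2.]
```

usecols is the cheapest workaround when the text columns are not needed; when they are, a structured dtype with genfromtxt or Pandas is the better fit.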

np.savetxt() writes an ndarray to a text file. The fmt parameter controls how each number is formatted, and this is where the silent precision bug lives. The default '%.18e' is lossless but verbose, so it is often replaced with '%g', which writes at most 6 significant digits. That is fine for displaying numbers but genuinely lossy for float64 values with more precision than that. If you're using np.savetxt for anything that feeds back into a numerical pipeline, leave fmt at '%.18e', which preserves the full float64 round-trip.

For files that are intended for human inspection — a small sample of predictions, a validation output you're sharing with a stakeholder — CSV is the right choice. For data exchange between pipeline stages, it almost never is.

ExamplePYTHON
import numpy as np

# ── Basic save/load round-trip ───────────────────────────────────────────────
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

# Save with header and two decimal places
np.savetxt('data.csv', data, delimiter=',', header='a,b,c', fmt='%.2f')

# Load — skiprows=1 skips the header line
loaded = np.loadtxt('data.csv', delimiter=',', skiprows=1)
print(loaded)
# [[1. 2. 3.]
#  [4. 5. 6.]]

# ── Load only specific columns ───────────────────────────────────────────────
# usecols accepts a single index or a tuple of indices
col_a = np.loadtxt('data.csv', delimiter=',', skiprows=1, usecols=0)
print(col_a)  # [1. 4.]

cols_ab = np.loadtxt('data.csv', delimiter=',', skiprows=1, usecols=(0, 1))
print(cols_ab)
# [[1. 2.]
#  [4. 5.]]

# ── Precision comparison: '%g' vs '%.18e' ────────────────────────────────────
small_value = np.array([0.000001234567890123456])

np.savetxt('lossy.csv', small_value, fmt='%g')
np.savetxt('lossless.csv', small_value, fmt='%.18e')

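# ndmin=1 keeps a one-value file as a 1-D array so [0] indexing works
lossy = np.loadtxt('lossy.csv', ndmin=1)
lossless = np.loadtxt('lossless.csv', ndmin=1)

print(f'Original:  {small_value[0]:.20f}')
print(f'Lossy %g:  {lossy[0]:.20f}')
print(f'Lossless:  {lossless[0]:.20f}')
# atol=0 makes the check purely relative; the default atol=1e-8 would hide errors in values this small
print(f'Lossy matches original:    {np.allclose(small_value, lossy, rtol=1e-15, atol=0)}')
print(f'Lossless matches original: {np.allclose(small_value, lossless, rtol=1e-15, atol=0)}')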
Output
[[1. 2. 3.]
[4. 5. 6.]]
[1. 4.]
[[1. 2.]
[4. 5.]]
Original: 0.00000123456789012346
Lossy %g: 0.00000123457000000000
Lossless: 0.00000123456789012346
Lossy matches original: False
Lossless matches original: True
Watch Out: CSV Precision Loss with fmt='%g'
Passing fmt='%g' to np.savetxt truncates floats to 6 significant digits. A value like 0.000001234567890 becomes 1.23457e-06 in the file, and the remaining digits are gone permanently. If you load that CSV back and run any precision-sensitive calculation, you're working with corrupted data and you won't get an error telling you so. Keep fmt='%.18e' whenever the CSV will feed back into a numerical pipeline. Reserve '%g' for outputs that are genuinely for human reading only.
Production Insight
np.loadtxt cannot handle mixed types — strings and numbers in the same file will throw a ValueError. It also cannot handle missing values — an empty cell fails immediately.
The fmt parameter in np.savetxt is a precision decision, not a formatting preference. '%g' is lossy. '%.18e' is lossless for float64.
Rule: use CSV only when a human or external tool needs to read the file. For everything else in a numerical pipeline, .npy is the right format.
Key Takeaway
np.loadtxt reads the entire file into memory at once — simple and correct for clean files, but memory-bound and inflexible for real-world data.
fmt='%g' in np.savetxt truncates floats to 6 significant digits — use '%.18e' for full float64 round-trip fidelity.
Always specify dtype= explicitly in np.loadtxt — the default is float64, which may silently convert integer data.

save and load — Fast Binary Format

np.save() and np.load() use NumPy's native binary format, .npy. The format stores three things: a magic number identifying it as NumPy data, a header containing the dtype, shape, and memory order, and the raw bytes of the array exactly as they exist in memory. There is no string conversion, no parsing, no rounding — what goes in comes back out identically.
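The three parts of the format can be inspected directly. The sketch below saves a small array to an in-memory buffer and reads the header back with helpers from numpy.lib.format; it assumes the file was written as format version 1.0, which is what np.save normally produces for a simple array like this one.

```python
import numpy as np
from io import BytesIO
from numpy.lib import format as npy_format

arr = np.arange(6, dtype=np.float64).reshape(2, 3)

buf = BytesIO()
np.save(buf, arr)
raw = buf.getvalue()

# The first six bytes are the magic string that marks a .npy file.
print(raw[:6])  # b'\x93NUMPY'

# Parse the header: format version, then shape / memory order / dtype.
buf.seek(0)
version = npy_format.read_magic(buf)
shape, fortran_order, dtype = npy_format.read_array_header_1_0(buf)  # assumes version (1, 0)
print(version, shape, fortran_order, dtype)

# Everything after the header is the raw array buffer.
overhead = len(raw) - arr.nbytes
print(f'header overhead: {overhead} bytes')
```

Because the payload is the array's own bytes, loading is a copy rather than a parse, which is where the exactness and the speed both come from.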

The performance difference versus CSV is substantial enough to matter in real pipelines. A 1GB float64 array saves in roughly one second with np.save() — the bottleneck is disk write speed. The same array with np.savetxt() takes 30 to 60 seconds because every number has to be converted to its string representation. On load, the difference is similar. If your pipeline is spending meaningful time reading and writing feature matrices as CSV files, switching to .npy will make that time essentially disappear.

File size follows the same pattern. A .npy file for a float64 array is exactly 8 bytes per element plus a small header — there's no overhead for decimal points, separators, or newlines. A CSV for the same data is typically two to three times larger depending on the magnitude of the values.

np.savez() extends this to multiple arrays. It creates a .npz archive — which is structurally a zip file — where each array is stored under a key you provide. You load the whole archive with np.load() and access individual arrays by name. np.savez_compressed() applies zlib compression on top of that, which reduces file size further at the cost of CPU time on both save and load. Compression is worth it for archiving or transferring data; it's usually not worth it for data that gets loaded repeatedly during training.
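To get a feel for the size trade-off, the sketch below writes the same array with both functions into in-memory buffers. The mostly-zero matrix is an illustrative best case for compression; real feature matrices usually compress far less.

```python
import numpy as np
from io import BytesIO

# Highly compressible illustrative data: a mostly-zero feature matrix.
X = np.zeros((1000, 100), dtype=np.float64)
X[::50, 0] = 1.0

plain, packed = BytesIO(), BytesIO()
np.savez(plain, features=X)
np.savez_compressed(packed, features=X)

plain_size = len(plain.getvalue())
packed_size = len(packed.getvalue())
print(f'savez:            {plain_size} bytes')
print(f'savez_compressed: {packed_size} bytes')

# Loading is the same call for both archives.
packed.seek(0)
with np.load(packed) as bundle:
    restored = bundle['features']
print(np.array_equal(X, restored))  # True
```

Measuring both sizes on a sample of your own data is the quickest way to decide whether the extra CPU time on load is worth paying.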

ExamplePYTHON
import numpy as np
import time

# ── Single array: save and load ──────────────────────────────────────────────
arr = np.random.randn(1000, 100).astype(np.float64)

np.save('array.npy', arr)
loaded = np.load('array.npy')

print(loaded.shape)              # (1000, 100)
print(loaded.dtype)              # float64
print(np.array_equal(arr, loaded))  # True: the round-trip is bit-for-bit exact

# ── Multiple arrays bundled into one .npz file ───────────────────────────────
X = np.random.randn(10000, 128).astype(np.float32)  # feature matrix
y = np.random.randint(0, 10, size=10000)             # integer labels
metadata = np.array([128, 10], dtype=np.int32)      # shape metadata

np.savez('dataset.npz', features=X, labels=y, meta=metadata)

bundle = np.load('dataset.npz')
print(list(bundle.keys()))            # ['features', 'labels', 'meta']
print(bundle['features'].shape)       # (10000, 128)
print(bundle['labels'].dtype)         # int64 on most platforms
print(bundle['meta'])                 # [128  10]

# ── Compressed archive for long-term storage ─────────────────────────────────
np.savez_compressed('dataset_compressed.npz', features=X, labels=y)

# ── Rough speed comparison: .npy vs CSV for a 100k x 50 float64 array ────────
large = np.random.randn(100_000, 50)

t0 = time.perf_counter()
np.save('large.npy', large)
t1 = time.perf_counter()
np.savetxt('large.csv', large, fmt='%.18e', delimiter=',')
t2 = time.perf_counter()

print(f'np.save:    {t1 - t0:.3f}s')
print(f'np.savetxt: {t2 - t1:.3f}s')
Output
(1000, 100)
float64
True
['features', 'labels', 'meta']
(10000, 128)
int64
[128 10]
np.save: 0.041s
np.savetxt: 38.7s
Pro Tip: Use .npy for ML Feature Exchange Between Pipeline Stages
When your preprocessing stage produces a feature matrix that your training stage consumes, .npy is the correct format. It preserves dtype and shape exactly, loads in seconds rather than minutes for large arrays, and produces smaller files than CSV. The only reason to use CSV at that handoff point is if a non-Python tool needs to consume the data — and even then, consider whether a human actually needs to read it or whether that's just an assumption worth questioning.
Production Insight
.npy is 10-50x faster than CSV for save/load on arrays of any meaningful size. The gap grows with array size because string conversion overhead scales with the number of elements.
.npy preserves dtype, shape, and memory order exactly — load it back and you get an identical ndarray.
np.savez_compressed is worth using for data you'll archive or transfer. It adds CPU overhead on load, so think twice before using it for arrays that get loaded on every training run.
Rule: use .npy for data exchange between pipeline stages. Use .npz when you need to bundle multiple arrays into one file. Use CSV only when a human or external tool genuinely needs to read the output.
Key Takeaway
.npy stores raw bytes with a small metadata header — no conversion, no precision loss, no surprises.
.npz bundles multiple named arrays into a single zip archive — access them by key after loading.
For large arrays in production pipelines, the time you save switching from CSV to .npy is not marginal — it's often the difference between a pipeline that feels instant and one that has an inexplicable wait in the middle.

genfromtxt — Handling Messy Real-World CSVs

Real-world CSV files are not clean. They have missing values where a sensor didn't record, mixed types because someone exported a database table, comment lines at the top explaining the schema, and inconsistent formatting because the file went through three different tools before it reached you. np.loadtxt() handles none of this — it throws a ValueError the moment it encounters something it can't parse.

np.genfromtxt() is the answer for that middle ground: data that's numerical in intent but imperfect in practice. The key behavioural difference is that genfromtxt fills missing values with a placeholder — NaN by default for floats — rather than aborting. It also supports structured dtypes, which let you mix integer columns with float columns and string columns in the same file, and names=True which reads the header row and gives you column access by name.

The names=True output is a structured array, not a regular ndarray. It behaves differently from what most people expect — operations like data['score'] work, but standard array arithmetic does not apply across the whole structure the way it does with a 2D ndarray. It's a lightweight alternative to a Pandas DataFrame for cases where you need named columns but don't want the Pandas dependency.
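A short sketch of the difference, using a hypothetical all-numeric file. Column access by name works on the structured array; for whole-table arithmetic, one option is numpy.lib.recfunctions.structured_to_unstructured, which returns a plain 2D array.

```python
import numpy as np
from io import StringIO
from numpy.lib import recfunctions as rfn

# Hypothetical all-numeric file with a header row.
text = "height,weight\n1.5,60.0\n1.8,82.5\n"
table = np.genfromtxt(StringIO(text), delimiter=',', names=True)

print(table.dtype.names)        # ('height', 'weight')
print(table['weight'].mean())   # 71.25

# Whole-table arithmetic needs a plain 2D float array.
matrix = rfn.structured_to_unstructured(table)
print(matrix.shape)             # (2, 2)
print(matrix * 2)
```

This conversion only makes sense when every field is numeric; with string fields in the mix, keep working column by column.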

For genuinely messy data with millions of rows, complex types, date columns, or anything you'll need to reshape and query heavily, Pandas read_csv is the better tool. genfromtxt sits between np.loadtxt and Pandas: it handles imperfect files, but it's not a DataFrame.

ExamplePYTHON
import numpy as np
from io import StringIO

# ── Missing values and mixed types ───────────────────────────────────────────
# This is the kind of CSV that arrives from a data export or sensor logger
csv_data = '''id,score,label
1,0.85,cat
2,,dog
3,0.92,cat
4,0.78,
5,,
'''

# genfromtxt fills missing values with NaN — loadtxt would throw here
data = np.genfromtxt(
    StringIO(csv_data),
    delimiter=',',
    skip_header=1,
    dtype=[('id', 'i4'), ('score', 'f8'), ('label', 'U10')],
    missing_values='',
    filling_values={1: np.nan, 2: 'unknown'}  # per-column fill values
)

print(data['id'])     # [1 2 3 4 5]
print(data['score'])  # [0.85 nan  0.92 0.78  nan]
print(data['label'])  # ['cat' 'dog' 'cat' 'unknown' 'unknown']

# ── names=True: auto-parse headers, get column access by name ─────────────────
data2 = np.genfromtxt(
    StringIO(csv_data),
    delimiter=',',
    names=True,
    dtype=None,
    encoding='utf-8'
)

print(data2.dtype.names)   # ('id', 'score', 'label')
print(data2['score'])      # [0.85  nan  0.92  0.78  nan]

# ── Filtering out NaN rows before analysis ───────────────────────────────────
valid_mask = ~np.isnan(data2['score'])
valid_scores = data2['score'][valid_mask]
print(f'Mean score (valid rows only): {valid_scores.mean():.4f}')  # 0.8500

# ── Comment lines — genfromtxt can skip them automatically ───────────────────
csv_with_comments = '''# Exported from sensor array v2.1
# Units: voltage (V)
0.001, 0.002, 0.003
0.004, 0.005, 0.006
'''

clean = np.genfromtxt(
    StringIO(csv_with_comments),
    delimiter=',',
    comments='#'
)
print(clean)
# [[0.001 0.002 0.003]
#  [0.004 0.005 0.006]]
Output
[1 2 3 4 5]
[0.85 nan 0.92 0.78 nan]
['cat' 'dog' 'cat' 'unknown' 'unknown']
('id', 'score', 'label')
[0.85 nan 0.92 0.78 nan]
Mean score (valid rows only): 0.8500
[[0.001 0.002 0.003]
[0.004 0.005 0.006]]
When to Use genfromtxt vs Pandas read_csv
Use np.genfromtxt when: your data is primarily numerical with occasional missing values, you want structured array output with named columns, or NumPy is your only dependency and adding Pandas isn't justified. Use Pandas read_csv when: you have complex mixed types including dates or categories, you need to process files in chunks because they exceed RAM, or you'll be doing DataFrame operations — filtering, groupby, merging — after loading. The performance difference between the two for clean numerical data is modest; the capability difference for complex data is significant.
Production Insight
np.loadtxt throws ValueError on the first missing value it encounters — genfromtxt fills it and continues. If your data has any gaps at all, genfromtxt is the correct tool.
The filling_values parameter accepts a dictionary keyed by column index — you can specify different fill values for different columns rather than applying one value globally.
names=True produces a structured array, not a 2D ndarray. Arithmetic across the whole structure won't work the way you expect — treat each column separately or convert to a regular ndarray with np.column_stack() after loading.
Rule: loadtxt for clean, uniform numerical files. genfromtxt for numerical files with gaps or comments. Pandas for anything mixed-type or large enough to need chunked loading.
Key Takeaway
genfromtxt handles missing values by filling with NaN — loadtxt throws on the same input.
names=True parses header row and returns a structured array — column access by name, not index.
For files with millions of rows, complex types, or date columns, Pandas read_csv is the pragmatic choice — genfromtxt is not a DataFrame replacement.

Syntax, Parameters, and the Silent Footguns

Every dev skims syntax. Then they ship a file that breaks in prod at 3 AM. Here's what matters.

savetxt signature: numpy.savetxt(fname, X, fmt='%.18e', delimiter=' ', newline='\n', header='', footer='', comments='# ', encoding=None)

fname is your output path. If it ends with .gz, NumPy transparently gzips, which is great for log dumps. X is your array (1D or 2D). Got a 3D tensor? savetxt refuses it with a ValueError, so reshape it to 2D yourself and record the original shape somewhere if you need to restore it after loading.
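In current NumPy releases, handing savetxt an array with more than two dimensions raises a ValueError. The sketch below shows that failure and one way around it: collapse the leading axes before writing and restore the shape after loading. The array and the choice of which axes to collapse are illustrative.

```python
import numpy as np
from io import StringIO

cube = np.arange(24, dtype=np.float64).reshape(2, 3, 4)

try:
    np.savetxt(StringIO(), cube)
    rejected = False
except ValueError as exc:
    rejected = True
    print(f'savetxt refused the 3D array: {exc}')

# Collapse the leading axes so each row holds one length-4 vector.
flat = cube.reshape(-1, cube.shape[-1])
buf = StringIO()
np.savetxt(buf, flat, fmt='%.17g')

# Restore the original shape after loading (the shape must be stored separately).
buf.seek(0)
restored = np.loadtxt(buf).reshape(cube.shape)
print(np.array_equal(cube, restored))  # True
```

A text file has no place to record the original shape, so for multi-dimensional data .npy is usually the simpler choice.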

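fmt defaults to '%.18e', which prints every double with 18 digits after the decimal point in scientific notation. That is more than a human reader needs, but it is what makes the default lossless. A shorter format such as '%.6f' is fine for files meant only for reading, as long as you accept that the rounded values are what will be loaded back. delimiter defaults to space. CSV? Pass ','. Tab? '\t'. Simple.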

header and footer write raw strings. comments is the prefix — defaults to '# '. So header='timestamp=2024-04-01' writes # timestamp=2024-04-01. Screw up comments and your header won't be parseable by loadtxt without tweaking comments there too.
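A quick sketch of how header, comments, and loadtxt interact, written against in-memory buffers. The '% ' prefix in the second half is an arbitrary example of a non-default comment marker.

```python
import numpy as np
from io import StringIO

values = np.array([[1.0, 2.0], [3.0, 4.0]])

# Default comments='# ' prefixes the header line.
buf = StringIO()
np.savetxt(buf, values, fmt='%.1f', header='timestamp=2024-04-01')
first_line = buf.getvalue().splitlines()[0]
print(first_line)                 # # timestamp=2024-04-01

# loadtxt treats '#' lines as comments by default, so the file reads cleanly.
buf.seek(0)
print(np.loadtxt(buf).shape)      # (2, 2)

# A custom prefix on write must be mirrored on read.
buf2 = StringIO()
np.savetxt(buf2, values, fmt='%.1f', header='timestamp=2024-04-01', comments='% ')
buf2.seek(0)
reloaded = np.loadtxt(buf2, comments='%')
print(reloaded.shape)             # (2, 2)
```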

encoding defaults to 'latin1'. If you write UTF-8 headers, loading fails silently on older NumPy (<1.14). Always set encoding='utf-8' in modern workflows.

CheckParamDefaults.pyPYTHON
# io.thecodeforge - python tutorial

import numpy as np

# Simulate a sensor log with timestamps
sensor_vals = np.arange(100.0)
timestamps = np.linspace(0, 10, 100)

# Don't rely on defaults — specify explicitly
np.savetxt(
    'sensor_log.tsv',
    np.column_stack((timestamps, sensor_vals)),
    fmt='%10.4f',
    delimiter='\t',
    header='time_s\tvalue',
    encoding='utf-8'
)

# Verify load works
loaded = np.loadtxt('sensor_log.tsv', delimiter='\t')
print(f'Loaded shape: {loaded.shape}')
print(f'First row: {loaded[0]}')
Output
Loaded shape: (100, 2)
First row: [0. 0.]
Production Trap: Default Encoding Bites You in CI/CD
If your team writes headers with emoji, timestamps, or anything beyond ASCII, set encoding='utf-8' explicitly. The latin1 default will silently encode your header bytes wrong, and loadtxt will fail or produce garbage. Seen it happen in three different CI pipelines.
Key Takeaway
Always specify fmt, delimiter, and encoding explicitly — never trust defaults in production.

Example: Saving Multiple Arrays the Right Way

You can't pass more than one array to savetxt. It expects a single 1D or 2D array. If you try np.savetxt('out.csv', arr1, arr2), the second array is taken as the fmt argument and the call fails with an error and a facepalm. Stack them first with np.column_stack or np.vstack.

Got two arrays with the same length along the first dimension? column_stack merges them column-wise. Data and labels? They need the same number of rows; if they differ, pad or slice deliberately before stacking. This isn't academic; it's what happens when you log sensor readings with timestamps, coordinates with measurements, or any multi-source data.

You could write a loop — but that's how you end up with misaligned columns and a 5 AM pager. Stack once, write once. Use fmt tuples for mixed precision: fmt=['%d', '%10.2f'] for integer IDs and floating readings. savetxt applies them column-by-column.

If the arrays have incompatible shapes, nothing will magically fix it: np.column_stack raises a ValueError when the lengths differ. Validate before you write so the failure is explicit and carries a useful message.
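One way to make that validation explicit is a small wrapper around np.column_stack. The helper name stack_columns and the sample readings are hypothetical; this is a sketch of the idea, not a library function.

```python
import numpy as np
from io import StringIO

def stack_columns(*columns):
    """Hypothetical helper: validate 1D columns, then build one 2D array."""
    arrays = [np.asarray(c) for c in columns]
    if any(a.ndim != 1 for a in arrays):
        raise ValueError('every column must be 1D')
    lengths = sorted({a.shape[0] for a in arrays})
    if len(lengths) != 1:
        raise ValueError(f'column lengths differ: {lengths}')
    return np.column_stack(arrays)

ids = np.array([101, 102, 103])
temps = np.array([68.5, 72.1, 65.9])

table = stack_columns(ids, temps)
buf = StringIO()
np.savetxt(buf, table, fmt=['%d', '%.2f'], delimiter=',')
print(buf.getvalue())

try:
    stack_columns(ids, temps[:2])   # one reading missing
    mismatch_caught = False
except ValueError as exc:
    mismatch_caught = True
    print(exc)
```

Failing with the offending lengths in the message is more useful at 5 AM than a generic stacking error from deep inside NumPy.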

MultiArraySave.pyPYTHON
# io.thecodeforge - python tutorial

import numpy as np

# Real-world: logging system metrics
machine_ids = np.array([101, 102, 103], dtype=np.int32)
temperatures = np.array([68.5, 72.1, 65.9], dtype=np.float64)
pressures = np.array([1.02, 1.01, 0.99], dtype=np.float64)

# Stack into a single 2D array — column order matters
data = np.column_stack((machine_ids, temperatures, pressures))

# Mixed format: int for ID, float for readings
np.savetxt(
    'machine_metrics.csv',
    data,
    fmt=['%d', '%.2f', '%.3f'],
    delimiter=',',
    header='id,temp_C,press_atm',
    comments='',
    encoding='utf-8'
)

# Confirm
with open('machine_metrics.csv') as f:
    print(f.read())
Output
id,temp_C,press_atm
101,68.50,1.020
102,72.10,1.010
103,65.90,0.990
Senior Shortcut: Mixed Formats Without Stacking Headaches
For more than 3 columns or complex dtypes, use pandas.DataFrame.to_csv with float_format. It handles mixed types, missing data, and arbitrary column orders. savetxt is fine for uniform numeric arrays — pandas wins for any real CSV.
Key Takeaway
You can't pass multiple arrays to savetxt. Always combine them into one 2D array first: column_stack for column-wise data, vstack for row-wise.

Why loadtxt Chokes on Headers — and the One-Liner Fix

You've got a clean CSV with a header row. You call numpy.loadtxt and it throws a ValueError about string conversion. Junior devs blame NumPy. You already know the culprit: loadtxt does no type inference. It tries to convert every field to its dtype (float by default), hits the string header on the first line, and gives up.

The fix is trivial once you know the trap: pass skiprows=1. But that only skips the header. It doesn't save you from mixed types, missing values, or trailing commas. That's when you reach for genfromtxt (covered earlier). But for production pipelines where your data is already validated and clean, loadtxt with skiprows is the simpler and typically faster choice.

Real play: if your header contains column names you actually need, use names=True in genfromtxt or roll your own dict mapping with skiprows. Never let a header row be the reason your Monday morning batch job burns down.
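The "roll your own dict mapping" option can be as small as reading the header line yourself and letting loadtxt consume the rest of the handle. The file contents below are hypothetical and assume every data column is numeric.

```python
import numpy as np
from io import StringIO

text = "high,low\n95,72\n88,55\n"   # hypothetical all-numeric file

# Read the header line yourself, then hand the rest to loadtxt.
handle = StringIO(text)
names = handle.readline().strip().split(',')
data = np.loadtxt(handle, delimiter=',')

# Map each column name to its column of values.
columns = {name: data[:, i] for i, name in enumerate(names)}
print(columns['high'])   # [95. 88.]
print(columns['low'])    # [72. 55.]
```

With a real file, the same pattern works inside a with open(...) block, since loadtxt continues from wherever the handle currently is.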

skip_header_demo.pyPYTHON
# io.thecodeforge - python tutorial

import numpy as np

# Sample CSV with header
with open('temps.csv', 'w') as f:
    f.write("city,high,low\n")
    f.write("Austin,95,72\n")
    f.write("Denver,88,55\n")

# This FAILS: the header row and the city column are strings
# data = np.loadtxt('temps.csv', delimiter=',')

# This WORKS: skip the header and read only the numeric columns
data = np.loadtxt('temps.csv', delimiter=',', skiprows=1, usecols=(1, 2))
print("Loaded array:")
print(data)
print(f"dtype: {data.dtype}")
Output
Loaded array:
[[95. 72.]
[88. 55.]]
dtype: float64
Production Trap:
skiprows=1 skips exactly one row. If your file has multiple header lines (think column metadata), raise the count, e.g. skiprows=2, or mark the extra lines with a comment prefix and pass comments='#'. Either way, verify the count against your actual data contract.
Key Takeaway
When loading CSV headers with loadtxt, skiprows is your cheapest fix, but only if you don't need the column names.

savetxt Precision: Why Your 64-Bit Floats Get Truncated at 1e-6

You saved a double-precision array with 16 significant digits. When you load it back, everything past the sixth decimal place is gone. This isn't a bug in loadtxt; it's what happens when the default fmt='%.18e' is replaced with something shorter. The default writes 18 digits after the decimal point in scientific notation, more than enough to reconstruct any float64 exactly. But if you use fmt='%f' (6 decimal places), you keep roughly float32-level precision for values near 1 and far less for small values.

Why this matters: in production, where your data is sensor readings or financial tick data, rounding at 1e-6 means cumulative error. We saw a machine learning pipeline silently drift because coordinate data got written with fmt='%f' and reloaded: the offset was 0.000001 per row, and 1 million rows later you're 1 meter off.

Rule of thumb: never shorten fmt for critical data. Keep the default, or explicitly set fmt='%.16e' or fmt='%.17g': 17 significant digits are what it takes to guarantee a lossless round-trip for every float64. For integer data, fmt='%d' is fine. For mixed types, you're back to genfromtxt territory, but that's another war story.
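A sketch for checking which formats actually survive a round-trip. The helper name round_trips is hypothetical; the test values are chosen because 0.1 + 0.2 is a float64 that needs all 17 significant digits, so 16 digits are not enough to recover it.

```python
import numpy as np
from io import StringIO

def round_trips(arr, fmt):
    """Write arr with savetxt, read it back, and report exact equality."""
    buf = StringIO()
    np.savetxt(buf, arr, fmt=fmt)
    buf.seek(0)
    return np.array_equal(arr, np.loadtxt(buf, ndmin=1))

# 0.1 + 0.2 is a float64 that needs all 17 significant digits.
tricky = np.array([0.1 + 0.2, 1.0 / 3.0])

for fmt in ('%g', '%.16g', '%.17g', '%.18e'):
    print(f'{fmt:>6}: {round_trips(tricky, fmt)}')
```

Running a check like this once against a sample of real data is a cheap way to confirm a format choice before it goes into a pipeline.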

precision_trap.pyPYTHON
# io.thecodeforge - python tutorial

import numpy as np

data = np.array([3.141592653589793, 2.718281828459045])

# Default fmt='%.18e' — lossless
np.savetxt('full_precision.csv', data, delimiter=',')
loaded_full = np.loadtxt('full_precision.csv', delimiter=',')
print(f"Full precision diff: {data - loaded_full}")

# fmt='%f': values are rounded to 6 decimal places
np.savetxt('truncated.csv', data, delimiter=',', fmt='%f')
loaded_trunc = np.loadtxt('truncated.csv', delimiter=',')
print(f"Truncated diff: {data - loaded_trunc}")
Output
Full precision diff: [0. 0.]
Truncated diff: [-3.46410207e-07 -1.71540955e-07]
Senior Shortcut:
For lossless float64 round-trip, use fmt='%.16e' with savetxt: it writes 17 significant digits, which is enough to reconstruct any float64, and works with any delimiter. Test once, then forget it.
Key Takeaway
A shortened fmt is fine for display, deadly for data. Keep the lossless default or specify an explicit 17-digit fmt for production saves.
● Production incident · POST-MORTEM · severity: high

ML Pipeline Loaded CSV with Wrong Precision — Training Ran on Corrupted Features

Symptom
Model accuracy dropped from 94% to 61% after the team switched from .npy to CSV for data handoff between preprocessing and training. No error was thrown anywhere in the pipeline. The features loaded without complaint — they were just wrong.
Assumption
The team assumed CSV was a lossless format for floating-point data. They used np.savetxt with fmt='%g' for compact files and np.loadtxt without specifying dtype, expecting full float64 round-trip fidelity.
Root cause
np.savetxt with fmt='%g' truncates values to 6 significant digits. Features with values like 0.000001234567 became 1.23457e-06 in the file. Across 512 features, that accumulated precision loss was enough to corrupt the feature space the model had been trained on. np.loadtxt loaded the truncated values perfectly — it had no way to know the original values were different. The bug wasn't in the loading code; it was in the export step that nobody thought to audit.
Fix
1. Switched all intermediate data exchange to .npy format: np.save('features.npy', X). No format string, no truncation, exact bytes.
2. Added dtype=np.float64 explicitly to any remaining np.loadtxt calls to prevent silent dtype promotion or demotion.
3. Adopted fmt='%.18e' in np.savetxt for the CSV outputs that genuinely needed to be human-readable for audit purposes.
4. Added a round-trip validation step in the preprocessing stage: np.allclose(original, loaded, rtol=1e-10, atol=0) before handing data to training. If this fails, the pipeline stops and alerts. Setting atol=0 matters here: the default atol=1e-8 would wave through errors in small-magnitude features.
Key lesson
  • CSV is not lossless for floating-point data — 6 significant digits is a lot less precision than float64 carries
  • fmt='%g' is not the savetxt default ('%.18e' is), but it is a common choice for compact output and it silently truncates, so any non-default fmt deserves a review
  • Always specify dtype= explicitly in np.loadtxt — the default may not match what was saved
  • np.allclose() validation at pipeline boundaries catches precision drift before it reaches training
  • The absence of an error does not mean the data is correct — CSV round-trips are quiet even when they're lossy
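A minimal sketch of the validation gate described in the fix, assuming a hypothetical helper name assert_round_trip and made-up feature values. It is not the team's actual code, only an illustration of the check.

```python
import numpy as np

def assert_round_trip(original, loaded, rtol=1e-10):
    """Hypothetical pipeline gate: stop if reloaded data drifted."""
    if original.shape != loaded.shape:
        raise ValueError(f'shape changed: {original.shape} -> {loaded.shape}')
    # atol=0 keeps the check relative so small-magnitude features are not waved through
    if not np.allclose(original, loaded, rtol=rtol, atol=0.0):
        worst = np.max(np.abs(original - loaded))
        raise ValueError(f'round-trip drift detected, max abs diff {worst:.3e}')

# Made-up features; the truncated copy simulates a '%g' text round-trip.
features = np.array([1.234567890123e-06, 0.5, 42.0])
truncated = np.array([float('%g' % v) for v in features])

assert_round_trip(features, features.copy())   # passes silently
try:
    assert_round_trip(features, truncated)
    caught = False
except ValueError as exc:
    caught = True
    print(exc)
```

Raising instead of logging is deliberate: a gate that only warns tends to be ignored until the model metrics move.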
Production debug guide: when your loaded arrays don't match what you saved and the error is silent (4 entries)
Symptom · 01
Loaded array has wrong dtype — integers came back as floats, or float64 came back as float32
Fix
Print the dtype of the loaded array immediately after loading: print(loaded.dtype). If it's wrong, you either didn't specify dtype= in np.loadtxt (it defaults to float64), or the CSV contains mixed types that caused silent coercion. Specify dtype= explicitly, and if the file has mixed types, switch to np.genfromtxt with a structured dtype or use Pandas.
Symptom · 02
Values differ slightly between what was saved and what was loaded — downstream calculations are subtly wrong
Fix
Run np.allclose(original, loaded, rtol=1e-10, atol=0) and check the maximum absolute difference with np.max(np.abs(original - loaded)). Pass atol=0 explicitly: the default atol=1e-8 masks errors in small-magnitude values. If allclose returns False, the CSV format dropped digits during the save step. Check the fmt parameter in your np.savetxt call; a short format such as '%g' is the culprit. Switch to .npy for lossless storage, or use fmt='%.18e' if CSV is required.
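One detail worth seeing in code: np.allclose tests |a - b| <= atol + rtol * |b|, and atol defaults to 1e-8. For features near 1e-6, that absolute tolerance swallows the rounding error entirely. The arrays below are illustrative; loaded mimics what a 6-digit format would have stored.

```python
import numpy as np

original = np.array([1.234567890123e-06, 2.5e-06])
loaded = np.array([1.23457e-06, 2.5e-06])   # 6-significant-digit version

# Default atol=1e-8 is far larger than the error on these small values.
print(np.allclose(original, loaded, rtol=1e-10))             # True
print(np.allclose(original, loaded, rtol=1e-10, atol=0.0))   # False

# Quantify the drift in both absolute and relative terms.
abs_err = np.max(np.abs(original - loaded))
rel_err = np.max(np.abs(original - loaded) / np.abs(original))
print(abs_err, rel_err)
```

The relative error is the more useful number when features span several orders of magnitude, since the absolute error of a tiny value always looks harmless.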
Symptom · 03
np.load returns an NpzFile object instead of an ndarray — array indexing fails immediately
Fix
You loaded a .npz file, which contains multiple arrays keyed by name. Access them like a dictionary: loaded['array_name']. Check what keys are available with list(loaded.keys()). If you only have one array and want direct ndarray access, save it with np.save() as a .npy file instead.
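A short sketch of the dictionary-style access, using an in-memory buffer as a stand-in for a file on disk. The array names features and labels are illustrative.

```python
import io
import numpy as np

features = np.arange(6, dtype=np.float64).reshape(2, 3)
labels = np.array([0, 1], dtype=np.int64)

buf = io.BytesIO()                        # stands in for 'file.npz'
np.savez(buf, features=features, labels=labels)
buf.seek(0)

with np.load(buf) as bundle:              # NpzFile, not an ndarray
    keys = sorted(bundle.keys())
    loaded_features = bundle["features"]  # read while the archive is open

print(keys)
print(loaded_features.shape)
```

Using the with block closes the archive once the arrays you need have been read out of it.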
Symptom · 04
MemoryError or system freeze when loading a large CSV with np.loadtxt
Fix
np.loadtxt reads the entire file into memory in one shot — there is no chunking. Check the file size against available RAM. For files that exceed RAM, switch to np.memmap for binary data, or use Pandas read_csv with the chunksize parameter for CSV. A 10GB CSV needs north of 10GB of working memory to load with np.loadtxt.
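For the binary path, a minimal np.memmap sketch follows. The file name and the small shape are assumptions chosen so the example runs anywhere; the same pattern is what you would apply to a file larger than RAM. A raw memmap file has no header, so dtype and shape must be tracked by the caller.

```python
import os
import tempfile
import numpy as np

shape = (1000, 8)   # small for the example

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "big.dat")

    # Write through a memory map, then flush to disk and release it.
    writer = np.memmap(path, dtype=np.float64, mode="w+", shape=shape)
    writer[:] = np.arange(shape[0] * shape[1], dtype=np.float64).reshape(shape)
    writer.flush()
    del writer

    # Re-open read-only; nothing is read until a slice is accessed.
    reader = np.memmap(path, dtype=np.float64, mode="r", shape=shape)
    window = np.array(reader[100:110])   # copy just 10 rows into RAM
    del reader                           # release the mapping before cleanup

print(window.shape, window[0, 0])
```

Copying the slice with np.array gives an ordinary in-memory array, so the rest of the code does not need to know the source was memory-mapped.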
NumPy File I/O Debugging Cheat Sheet
Quick reference for the most common NumPy data loading and saving failures
Wrong dtype after loading CSV
Immediate action
Check the actual dtype and shape of what was loaded
Commands
print(loaded.dtype, loaded.shape)
np.loadtxt('data.csv', dtype=np.float64, delimiter=',')
Fix now
Always specify dtype= explicitly in np.loadtxt. If the file has mixed types, switch to np.genfromtxt with a structured dtype or use Pandas read_csv.
Precision loss between saved and loaded data
Immediate action
Quantify the difference before deciding whether it matters
Commands
np.allclose(original, loaded, rtol=1e-15, atol=0)
np.max(np.abs(original - loaded))
Fix now
Switch to np.save/np.load for lossless binary storage. If CSV is required for human inspection, use fmt='%.18e' in np.savetxt — not '%g'.
MemoryError loading a large CSV
Immediate action
Check file size and available system memory before retrying
Commands
ls -lh data.csv && free -h
wc -l data.csv
Fix now
Switch to .npy format with np.save/np.load. For arrays that still exceed RAM, use np.memmap. For CSV specifically, Pandas read_csv with chunksize is the pragmatic path.
NumPy File I/O Formats Compared
Format | Function | Speed | Precision | Multiple Arrays | Human Readable
CSV/TSV | np.savetxt / np.loadtxt | Slow: every number converted to string and back | Lossy with '%g' (6 sig figs), lossless only with '%.18e' | No: one array per file | Yes: opens in any spreadsheet or text editor
.npy | np.save / np.load | Fast: raw bytes written directly, no conversion | Exact: dtype, shape, and all bytes preserved | No: one array per file | No: binary format
.npz | np.savez / np.load | Fast: raw bytes in a zip container | Exact: no conversion applied | Yes: multiple named arrays in one file | No: binary format
.npz compressed | np.savez_compressed / np.load | Slower: zlib compression adds CPU overhead on save and load | Exact: compression is lossless | Yes: multiple named arrays in one file | No: binary format
Memory-mapped | np.memmap | Lazy: reads from disk only on access, not upfront | Exact: no conversion | No: one array per file | No: binary format
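The Precision column can be checked directly. The sketch below writes one synthetic array in four of the formats and reports whether each round-trip is exact, along with file size. The file names and array size are assumptions, and timings are deliberately left out because they depend on hardware and array shape.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 50))   # synthetic data for the comparison

with tempfile.TemporaryDirectory() as tmp:
    paths = {
        "csv_g": os.path.join(tmp, "x_g.csv"),
        "csv_18e": os.path.join(tmp, "x_18e.csv"),
        "npy": os.path.join(tmp, "x.npy"),
        "npz_compressed": os.path.join(tmp, "x.npz"),
    }
    np.savetxt(paths["csv_g"], X, delimiter=",", fmt="%g")
    np.savetxt(paths["csv_18e"], X, delimiter=",", fmt="%.18e")
    np.save(paths["npy"], X)
    np.savez_compressed(paths["npz_compressed"], X=X)

    with np.load(paths["npz_compressed"]) as bundle:
        from_npz = bundle["X"]

    # True only if every value survived the round-trip unchanged.
    exact = {
        "csv_g": np.array_equal(X, np.loadtxt(paths["csv_g"], delimiter=",")),
        "csv_18e": np.array_equal(X, np.loadtxt(paths["csv_18e"], delimiter=",")),
        "npy": np.array_equal(X, np.load(paths["npy"])),
        "npz_compressed": np.array_equal(X, from_npz),
    }
    sizes = {name: os.path.getsize(p) for name, p in paths.items()}

for name in paths:
    print(f"{name:15s} exact={exact[name]!s:5s} bytes={sizes[name]}")
```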

Key takeaways

1. np.loadtxt() and np.savetxt() work with text files. Use delimiter and skiprows for CSV, and specify dtype= explicitly: never assume the default will match your data.
2. fmt='%g' in np.savetxt is lossy: it rounds to 6 significant digits. Use '%.18e' for full float64 round-trip fidelity. This single mistake has corrupted ML pipelines in production.
3. np.save() and np.load() work with binary .npy files: 10 to 50x faster than CSV, exact precision, and smaller on disk. Use this format for any numerical data exchange that doesn't need human readability.
4. np.savez() bundles multiple named arrays into a single .npz archive. Access them after loading with loaded['name']. Use np.savez_compressed() for archiving or transfer, not for data that loads repeatedly.
5. np.genfromtxt() handles missing values and mixed types where np.loadtxt would throw. For truly complex tabular data with millions of rows, Pandas read_csv is the more appropriate tool.
6. np.memmap creates a memory-mapped array that reads from disk on access: use it when the array is too large to fit in RAM.
7. Validate loaded data with np.allclose(original, loaded) at pipeline boundaries: a successful load does not mean the values are correct.

Common mistakes to avoid


Using CSV for numerical data exchange between pipeline stages

Symptom
Precision loss accumulates silently across stages — model accuracy drops, simulation results drift. Values like 0.000001234567890 get truncated to 1.23457e-06 with fmt='%g'. No exception is thrown anywhere in the pipeline.
Fix
Switch to .npy format for any data handoff that doesn't need human readability: np.save('features.npy', X) on one side, np.load('features.npy') on the other. If CSV is genuinely required for audit or external tools, use fmt='%.18e' in np.savetxt and validate with np.allclose() after loading before proceeding.

Not specifying dtype in np.loadtxt

Symptom
Array loads as float64 when the original data was int32, doubling memory usage. Or integer IDs load as floats and downstream integer operations fail. The default dtype=float is applied silently — no warning.
Fix
Always specify dtype= explicitly: np.loadtxt('data.csv', dtype=np.float64) or np.loadtxt('ids.csv', dtype=np.int32). Verify immediately after loading with print(loaded.dtype) and assert loaded.dtype == expected_dtype in any pipeline that depends on a specific type.

Using np.loadtxt on files with missing values

Symptom
ValueError: could not convert string to float — the entire load operation fails on the first empty cell. This happens with any sensor data, exported database table, or user-generated CSV that wasn't manually cleaned first.
Fix
Use np.genfromtxt() instead — it fills missing values with NaN by default and continues loading rather than aborting. Specify filling_values if you need something other than NaN for specific columns. For truly complex files, Pandas read_csv handles missing values natively and gives you more control.
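A small sketch of that behaviour, using an in-memory CSV with two empty cells. The data is made up for the example.

```python
import io
import numpy as np

csv_text = "1.0,2.0,3.0\n4.0,,6.0\n7.0,8.0,\n"   # two empty cells

# Empty cells become NaN instead of aborting the load.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
print(np.isnan(data).sum())   # number of missing cells

# filling_values substitutes a chosen value for missing cells.
filled = np.genfromtxt(io.StringIO(csv_text), delimiter=",", filling_values=0.0)
print(filled)
```

Counting the NaNs right after loading is a cheap way to find out how much of the file was actually missing before any imputation hides it.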

Loading a large CSV file with np.loadtxt on a memory-constrained machine

Symptom
MemoryError on load, or system memory usage spikes and the process is killed. np.loadtxt reads the entire file into a single ndarray before returning — there is no streaming or chunked reading.
Fix
For binary data that exceeds RAM, use np.memmap which reads lazily from disk. For CSV files specifically, Pandas read_csv with the chunksize parameter processes the file in manageable pieces. The most practical fix if you control the data format: switch to .npy — the same data is smaller on disk and loads without the string-parsing overhead.
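If Pandas is not available, a NumPy-only alternative is to parse the file a few thousand lines at a time. The sketch below assumes a headerless, purely numeric CSV; iter_csv_chunks is a hypothetical helper, not part of NumPy. It relies on np.loadtxt accepting a list of strings as input.

```python
import io
import itertools
import numpy as np

def iter_csv_chunks(handle, rows_per_chunk, delimiter=","):
    """Yield 2-D float64 arrays of at most rows_per_chunk rows each.

    Hypothetical helper, not part of NumPy.
    """
    while True:
        lines = list(itertools.islice(handle, rows_per_chunk))
        if not lines:
            return
        # ndmin=2 keeps a single-row final chunk two-dimensional.
        yield np.loadtxt(lines, delimiter=delimiter, dtype=np.float64, ndmin=2)

# Small in-memory stand-in for a large file opened with open(path).
handle = io.StringIO("\n".join(f"{i},{i * 2}" for i in range(10)) + "\n")

total = np.zeros(2)
n_rows = 0
for chunk in iter_csv_chunks(handle, rows_per_chunk=4):
    total += chunk.sum(axis=0)    # running column sums
    n_rows += chunk.shape[0]

print(n_rows, total / n_rows)     # row count and column means
```

Only one chunk is held in memory at a time, so peak usage is set by rows_per_chunk rather than by the size of the file.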

Indexing a loaded .npz file as if it were an ndarray

Symptom
TypeError or unexpected AttributeError when trying to slice or operate on the result of np.load('file.npz'). The returned object is an NpzFile, which is a dictionary-like container — not an array.
Fix
Access arrays by their saved name: bundle = np.load('file.npz'); features = bundle['features']. Check available keys with list(bundle.keys()). If you're saving a single array and want np.load to return an ndarray directly, use np.save() to produce a .npy file instead of np.savez().
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR
What is the advantage of .npy over CSV for large NumPy arrays?
Q02 · JUNIOR
How do you save multiple arrays in a single file with NumPy?
Q03 · SENIOR
What is the difference between np.loadtxt and np.genfromtxt?
Q04 · SENIOR
When would you use np.memmap instead of np.load?
Q01 of 04 · JUNIOR

What is the advantage of .npy over CSV for large NumPy arrays?

ANSWER
.npy stores the raw bytes of the array with a small header containing dtype, shape, and memory order metadata — no string conversion happens in either direction. This makes save and load 10 to 50 times faster than CSV, preserves exact floating-point precision without any rounding, and produces files that are typically two to three times smaller than an equivalent CSV. CSV requires converting every number to a string representation on save and parsing every string back to a number on load — that overhead is both slow and lossy for float64 values.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What is the difference between .npy and .npz files?
02
How do I load a CSV file that has a header row?
03
How do I preserve full float64 precision when saving to CSV?
04
Can I use np.loadtxt for files with missing values?
05
What is np.memmap and when should I use it?