NumPy savetxt: Float64 Truncated by fmt='%g'
Model accuracy dropped from 94% to 61% because savetxt was called with fmt='%g', which truncated float64 values to 6 significant digits.
- np.loadtxt() reads CSV/TSV text files into arrays — use delimiter and skiprows for control
- np.save() writes binary .npy files — 10-50x faster than CSV for large arrays
- np.savez() bundles multiple named arrays into a single .npz archive
- np.savez_compressed() adds zlib compression — slower to load but smaller on disk
- Binary formats preserve dtype exactly — CSV loses precision on floats unless you use fmt='%.18e'
- For mixed-type tabular data with headers, Pandas read_csv is the more appropriate tool
Imagine you have a notebook full of numbers. Writing them by hand on paper (CSV) is slow, and you might lose precision rounding decimals — especially the very small ones. Taking a photograph of the page (.npy) is instant and captures every digit exactly as it appeared. NumPy's file I/O is fundamentally that choice: do you need something a human can open in a spreadsheet, or do you need speed and precision between machines? Most production pipelines need the latter, and most teams default to the former out of habit.
loadtxt and savetxt — Text Files
np.loadtxt() and np.savetxt() are the entry point most people find first, and for good reason — they work with plain text files that any editor, spreadsheet, or downstream tool can open. But they come with constraints that matter the moment you move beyond toy examples.
np.loadtxt() reads the entire file into a single ndarray. Every row must have the same number of columns, and the entire file must be a single dtype — if you mix strings and numbers in the same CSV, it will throw a ValueError. It also reads everything into memory at once, which means a 10GB file needs 10GB of RAM available before you get a single array out.
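A minimal sketch of the constraints above, using an in-memory file so it runs without touching disk. The column names and values are made up for illustration.

```python
import io
import numpy as np

# Hypothetical CSV content: one header row, then three numeric columns.
csv_text = io.StringIO(
    "x,y,z\n"
    "1.0,2.0,3.0\n"
    "4.0,5.0,6.0\n"
)

# delimiter splits on commas; skiprows=1 skips the header line.
data = np.loadtxt(csv_text, delimiter=",", skiprows=1, dtype=np.float64)
print(data.shape)  # (2, 3)
print(data.dtype)  # float64

# A string in a numeric column cannot be parsed as a float.
mixed = io.StringIO("1.0,abc\n2.0,def\n")
try:
    np.loadtxt(mixed, delimiter=",")
    mixed_failed = False
except ValueError as exc:
    mixed_failed = True
    print("loadtxt refused mixed types:", exc)
```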
np.savetxt() writes an ndarray to a text file. The fmt parameter controls how each number is formatted, and this is where the silent precision bug lives. A '%g' format writes at most 6 significant digits, which is fine for displaying numbers but lossy for float64 values, which carry 15 to 17 significant digits. Note that in the documented np.savetxt signature the default is fmt='%.18e', which round-trips float64 exactly; the truncation appears when fmt is overridden with something shorter such as '%g' or '%.6f', often to make files smaller or easier to read. If you're using np.savetxt for anything that feeds back into a numerical pipeline, keep fmt at '%.18e' or pass it explicitly.
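The loss is easy to reproduce. The sketch below passes fmt explicitly in both cases and compares the arrays after a text round trip; the sample values and the round_trip helper are illustrative, not taken from any real pipeline.

```python
import io
import numpy as np

original = np.array([0.1234567890123456, 1234567.890123456, 1e-10 / 3])

def round_trip(arr, fmt):
    """Write arr as text with the given fmt, then read it back as float64."""
    buf = io.StringIO()
    np.savetxt(buf, arr, fmt=fmt)
    buf.seek(0)
    return np.loadtxt(buf, dtype=np.float64)

lossy = round_trip(original, "%g")      # at most 6 significant digits
exact = round_trip(original, "%.18e")   # 19 significant digits

print(np.array_equal(original, lossy))  # False
print(np.array_equal(original, exact))  # True
```

Neither call raises or warns, which is why the truncation tends to go unnoticed until something downstream behaves differently.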
For files that are intended for human inspection — a small sample of predictions, a validation output you're sharing with a stakeholder — CSV is the right choice. For data exchange between pipeline stages, it almost never is.
save and load — Fast Binary Format
np.save() and np.load() use NumPy's native binary format, .npy. The format stores three things: a magic number identifying it as NumPy data, a header containing the dtype, shape, and memory order, and the raw bytes of the array exactly as they exist in memory. There is no string conversion, no parsing, no rounding — what goes in comes back out identically.
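As a rough illustration, the sketch below saves a float32 array to a temporary directory, peeks at the magic bytes, and checks that dtype, shape, and values come back unchanged. The file name is hypothetical.

```python
import os
import tempfile
import numpy as np

arr = np.random.default_rng(0).standard_normal((100, 4)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.npy")  # hypothetical file name
    np.save(path, arr)

    # The first six bytes identify the file as NumPy data.
    with open(path, "rb") as fh:
        magic = fh.read(6)

    restored = np.load(path)

print(magic)                           # b'\x93NUMPY'
print(restored.dtype, restored.shape)  # float32 (100, 4)
print(np.array_equal(arr, restored))   # True
```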
The performance difference versus CSV is substantial enough to matter in real pipelines. A 1GB float64 array saves in roughly one second with np.save() — the bottleneck is disk write speed. The same array with np.savetxt() takes 30 to 60 seconds because every number has to be converted to its string representation. On load, the difference is similar. If your pipeline is spending meaningful time reading and writing feature matrices as CSV files, switching to .npy will make that time essentially disappear.
File size follows the same pattern. A .npy file for a float64 array is exactly 8 bytes per element plus a small header — there's no overhead for decimal points, separators, or newlines. A CSV for the same data is typically two to three times larger depending on the magnitude of the values.
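The size arithmetic can be checked in memory. This sketch writes the same array once as .npy and once as full-precision text, then compares byte counts; the exact header size and ratio depend on the NumPy version and the data, so treat the printed numbers as indicative only.

```python
import io
import numpy as np

arr = np.random.default_rng(0).standard_normal(10_000)  # float64, 8 bytes each

# Binary: raw bytes plus a short header.
npy_buf = io.BytesIO()
np.save(npy_buf, arr)
npy_bytes = npy_buf.getbuffer().nbytes
header_bytes = npy_bytes - arr.nbytes

# Text: every value spelled out with enough digits to round-trip.
csv_buf = io.StringIO()
np.savetxt(csv_buf, arr, fmt="%.18e")
csv_bytes = len(csv_buf.getvalue().encode("ascii"))

print("payload bytes:", arr.nbytes)
print("npy header bytes:", header_bytes)
print("text / binary size ratio:", round(csv_bytes / npy_bytes, 2))
```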
np.savez() extends this to multiple arrays. It creates a .npz archive, which is structurally a zip file, where each array is stored under a key you provide. np.load() on a .npz file returns a dict-like NpzFile object rather than an ndarray; you access individual arrays by name, and each one is read from the archive when you request it. np.savez_compressed() applies zlib compression on top of that, which reduces file size further at the cost of CPU time on both save and load. Compression is worth it for archiving or transferring data; it's usually not worth it for data that gets loaded repeatedly during training.
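A short sketch of the archive workflow, using in-memory buffers in place of files. The key names features and labels are arbitrary choices for this example, and the all-zeros array is picked only because it compresses well.

```python
import io
import numpy as np

X = np.arange(12, dtype=np.float64).reshape(3, 4)
y = np.array([0, 1, 1], dtype=np.int64)

# Bundle two arrays under names of our choosing.
buf = io.BytesIO()
np.savez(buf, features=X, labels=y)
buf.seek(0)

# np.load returns a dict-like NpzFile; index it by key, not by position.
with np.load(buf) as bundle:
    keys = sorted(bundle.files)
    X2 = bundle["features"]
    y2 = bundle["labels"]

print(keys)                 # ['features', 'labels']
print(X2.dtype, y2.dtype)   # float64 int64

# Compression pays off most on highly repetitive data.
zeros = np.zeros(100_000)
plain, packed = io.BytesIO(), io.BytesIO()
np.savez(plain, zeros=zeros)
np.savez_compressed(packed, zeros=zeros)
plain_size = plain.getbuffer().nbytes
packed_size = packed.getbuffer().nbytes
print(packed_size < plain_size)  # True
```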
genfromtxt — Handling Messy Real-World CSVs
Real-world CSV files are not clean. They have missing values where a sensor didn't record, mixed types because someone exported a database table, comment lines at the top explaining the schema, and inconsistent formatting because the file went through three different tools before it reached you. np.loadtxt() copes with only part of this: it can skip comment lines (the comments parameter) and leading rows (skiprows), but it raises a ValueError the moment it encounters a missing value or a field it can't parse.
np.genfromtxt() is the answer for that middle ground: data that's numerical in intent but imperfect in practice. The key behavioural difference is that genfromtxt fills missing values with a placeholder — NaN by default for floats — rather than aborting. It also supports structured dtypes, which let you mix integer columns with float columns and string columns in the same file, and names=True which reads the header row and gives you column access by name.
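To make the difference concrete, the sketch below feeds the same text, with two empty fields, to both readers. The data is invented for the example.

```python
import io
import numpy as np

# Two fields are empty: row 2 column 2, and row 3 column 3.
raw = "1.0,2.0,3.0\n4.0,,6.0\n7.0,8.0,\n"

# loadtxt aborts on the first field it cannot convert.
try:
    np.loadtxt(io.StringIO(raw), delimiter=",")
    loadtxt_failed = False
except ValueError:
    loadtxt_failed = True

# genfromtxt keeps going and marks the gaps with NaN.
data = np.genfromtxt(io.StringIO(raw), delimiter=",")
missing = int(np.isnan(data).sum())

print(loadtxt_failed)  # True
print(data.shape)      # (3, 3)
print(missing)         # 2
```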
The names=True output is a structured array, not a regular ndarray. It behaves differently from what most people expect — operations like data['score'] work, but standard array arithmetic does not apply across the whole structure the way it does with a 2D ndarray. It's a lightweight alternative to a Pandas DataFrame for cases where you need named columns but don't want the Pandas dependency.
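A small sketch of what names=True returns and one way to get back to a plain 2D array. The header names and values are hypothetical; np.column_stack is just one option for the conversion.

```python
import io
import numpy as np

raw = io.StringIO(
    "id,score,weight\n"
    "1,0.5,10.0\n"
    "2,0.75,12.5\n"
)

# names=True takes field names from the header row.
data = np.genfromtxt(raw, delimiter=",", names=True)

print(data.dtype.names)  # ('id', 'score', 'weight')
print(data.shape)        # (2,)  one record per row, not a 2D grid

# Column access by name returns an ordinary 1D ndarray.
scores = data["score"]
print(scores.mean())     # 0.625

# Rebuild a regular 2D float array when you need whole-matrix arithmetic.
matrix = np.column_stack([data[name] for name in data.dtype.names])
print(matrix.shape)      # (2, 3)
```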
For genuinely messy data with millions of rows, complex types, date columns, or anything you'll need to reshape and query heavily, Pandas read_csv is the better tool. genfromtxt sits between np.loadtxt and Pandas: it handles imperfect files, but it's not a DataFrame.
[...] np.column_stack() after loading.
ML Pipeline Loaded CSV with Wrong Precision: Training Ran on Corrupted Features
- CSV is not lossless for floating-point data unless enough digits are written: 6 significant digits is far fewer than the 15 to 17 that float64 carries
- fmt='%g' silently truncates to 6 significant digits, and savetxt raises no warning; leave fmt at the documented default '%.18e' or pass it explicitly
- Always specify dtype= explicitly in np.loadtxt — the default may not match what was saved
- np.allclose() validation at pipeline boundaries catches precision drift before it reaches training
- The absence of an error does not mean the data is correct — CSV round-trips are quiet even when they're lossy
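One possible shape for the boundary check mentioned above is sketched here. The helper name assert_round_trip and the tolerance values are assumptions made for this example; pick tolerances that match what your pipeline actually needs.

```python
import numpy as np

def assert_round_trip(reference, loaded, rtol=1e-12, atol=0.0):
    """Raise ValueError if loaded data drifted from the reference array."""
    if reference.shape != loaded.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {loaded.shape}")
    if not np.allclose(reference, loaded, rtol=rtol, atol=atol):
        worst = np.max(np.abs(reference - loaded))
        raise ValueError(f"precision drift detected, max abs error {worst:.3e}")

reference = np.array([0.1234567890123456, 2.718281828459045])
# What a 6-significant-digit text format would hand back for those values.
truncated = np.array([0.123457, 2.71828])

assert_round_trip(reference, reference.copy())  # passes silently

try:
    assert_round_trip(reference, truncated)
    drift_caught = False
except ValueError as exc:
    drift_caught = True
    print(exc)
```

Raising at the boundary turns a quiet data problem into a loud one before any training time is spent on it.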
[...] loaded.keys()). If you only have one array and want direct ndarray access, save it with np.save() as a .npy file instead.
Key takeaways
- np.loadtxt() and np.savetxt() work with text files. Use delimiter and skiprows for CSV, and specify dtype= explicitly
- np.save() and np.load() work with binary .npy files
- Use np.savez_compressed() for archiving or transfer
Common mistakes to avoid
Using CSV for numerical data exchange between pipeline stages
[...] np.allclose() after loading before proceeding.
Not specifying dtype in np.loadtxt
Using np.loadtxt on files with missing values
Use np.genfromtxt() instead: it fills missing values with NaN by default and continues loading rather than aborting. Specify filling_values if you need something other than NaN for specific columns. For truly complex files, Pandas read_csv handles missing values natively and gives you more control.
Loading a large CSV file with np.loadtxt on a memory-constrained machine
Indexing a loaded .npz file as if it were an ndarray
[...] bundle.keys()). If you're saving a single array and want np.load to return an ndarray directly, use np.save() to produce a .npy file instead of np.savez().
Interview Questions on This Topic
What is the advantage of .npy over CSV for large NumPy arrays?