NumPy savetxt: Float64 Truncated by fmt='%g'
Model accuracy dropped from 94% to 61% because feature arrays were saved with savetxt fmt='%g', which truncates float64 to 6 significant digits.
20+ years shipping production Python across data and backend systems. Notes here come from systems that actually shipped.
- np.loadtxt() reads CSV/TSV text files into arrays — use delimiter and skiprows for control
- np.save() writes binary .npy files — 10-50x faster than CSV for large arrays
- np.savez() bundles multiple named arrays into a single .npz archive
- np.savez_compressed() adds zlib compression — slower to load but smaller on disk
- Binary formats preserve dtype exactly; CSV loses precision on floats unless you write them with a full-precision format such as fmt='%.18e'
- For mixed-type tabular data with headers, Pandas read_csv is the more appropriate tool
Imagine you have a notebook full of numbers. Writing them by hand on paper (CSV) is slow, and you might lose precision rounding decimals — especially the very small ones. Taking a photograph of the page (.npy) is instant and captures every digit exactly as it appeared. NumPy's file I/O is fundamentally that choice: do you need something a human can open in a spreadsheet, or do you need speed and precision between machines? Most production pipelines need the latter, and most teams default to the former out of habit.
NumPy's savetxt with fmt='%g' truncates float64 values to 6 significant digits, silently corrupting data when you round-trip through text files. '%g' is not the signature default (that is '%.18e', which is lossless), but it is a common choice when people want compact, readable output. This caused a model's accuracy to drop from 94% to 61% in one production pipeline: the precision loss from saving feature arrays as CSV propagated into training. The fix is a small change to the format string, but understanding when to use text vs. binary I/O prevents this class of bug entirely.
What NumPy savetxt Actually Does to Your Float64 Data
NumPy's savetxt writes array data to a text file using a printf-style format string. The signature default is '%.18e', which is lossless for float64, but a short format such as '%g' truncates float64 values to 6 significant decimal digits and silently discards the rest. The core mechanic: savetxt converts each element to a string via the format specifier, then writes rows separated by a delimiter. It is the inverse of loadtxt, which parses text back into arrays.
'%g' is the general format: it switches between fixed-point and scientific notation based on magnitude, and without an explicit precision it caps output at 6 significant digits. A float64 can represent about 15–17 significant digits. This mismatch means that round-tripping data through savetxt with fmt='%g' introduces relative errors on the order of 1e-6, which compound in iterative algorithms or when comparing results.
Use savetxt when you need human-readable output or interoperability with non-NumPy tools (e.g., CSV for Excel). But never rely on a short format for lossless storage of float64 data. For precision-critical pipelines (financial calculations, sensor calibration, simulation checkpoints) use NumPy's binary .npy format or HDF5. A bare '%g' is a trap for the unwary.
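To make the size of the loss concrete, here is a minimal sketch that writes a small array with fmt='%g' and with fmt='%.18e', reads both back, and compares them with the original. The sample values and the helper name round_trip are made up for illustration, and an in-memory buffer stands in for a real file.

```python
import io
import numpy as np

# Values chosen to carry more than 6 significant digits.
original = np.array([0.123456789012345, 1234.56789012345, 1.0 / 3.0])

def round_trip(arr, fmt):
    """Write arr as text with the given fmt, then parse it back."""
    buf = io.StringIO()
    np.savetxt(buf, arr, fmt=fmt)
    buf.seek(0)
    return np.loadtxt(buf, dtype=np.float64)

lossy = round_trip(original, "%g")       # 6 significant digits
exact = round_trip(original, "%.18e")    # more digits than float64 needs

rel_err = np.abs(lossy - original) / np.abs(original)
print("max relative error with %g:", rel_err.max())
print("bit-exact with %.18e:", np.array_equal(exact, original))
```

The relative error with '%g' should land around 1e-6, while the '%.18e' round-trip compares equal element for element.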
loadtxt and savetxt — Text Files
np.loadtxt() and np.savetxt() are the entry point most people find first, and for good reason — they work with plain text files that any editor, spreadsheet, or downstream tool can open. But they come with constraints that matter the moment you move beyond toy examples.
np.loadtxt() reads the entire file into a single ndarray. Every row must have the same number of columns, and the entire file must be a single dtype — if you mix strings and numbers in the same CSV, it will throw a ValueError. It also reads everything into memory at once, which means a 10GB file needs 10GB of RAM available before you get a single array out.
np.savetxt() writes an ndarray to a text file. The fmt parameter controls how each number is formatted, and this is where the silent precision bug lives. A compact format such as '%g' writes at most 6 significant digits, which is fine for displaying numbers but genuinely lossy for float64 values with more precision than that. If you're using np.savetxt for anything that feeds back into a numerical pipeline, keep fmt at '%.18e' (the signature default) or use '%.17g'; both preserve the full float64 round-trip.
For files that are intended for human inspection — a small sample of predictions, a validation output you're sharing with a stakeholder — CSV is the right choice. For data exchange between pipeline stages, it almost never is.
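As a rough illustration of the constraints above, the sketch below writes a small 2D array as CSV with an explicit format and delimiter, reads it back with an explicit dtype, and then shows loadtxt rejecting a file that mixes strings and numbers. The sample values are arbitrary and an in-memory buffer stands in for a file on disk.

```python
import io
import numpy as np

data = np.array([[0.1 + 0.2, 1.0 / 3.0],
                 [2.0 ** 0.5, 1e-12]])

# Write with an explicit high-precision format and an explicit delimiter.
buf = io.StringIO()
np.savetxt(buf, data, fmt="%.18e", delimiter=",")
buf.seek(0)

# Read back with the same delimiter and an explicit dtype.
loaded = np.loadtxt(buf, delimiter=",", dtype=np.float64)
print("round-trip exact:", np.array_equal(loaded, data))

# A column of strings cannot be parsed as float: loadtxt raises ValueError.
mixed = io.StringIO("1.0,abc\n2.0,def\n")
try:
    np.loadtxt(mixed, delimiter=",")
    mixed_failed = False
except ValueError as exc:
    mixed_failed = True
    print("loadtxt refused the mixed file:", exc)
```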
save and load — Fast Binary Format
np.save() and np.load() use NumPy's native binary format, .npy. The format stores three things: a magic number identifying it as NumPy data, a header containing the dtype, shape, and memory order, and the raw bytes of the array exactly as they exist in memory. There is no string conversion, no parsing, no rounding — what goes in comes back out identically.
The performance difference versus CSV is substantial enough to matter in real pipelines. A 1GB float64 array saves in roughly one second with np.save() — the bottleneck is disk write speed. The same array with np.savetxt() takes 30 to 60 seconds because every number has to be converted to its string representation. On load, the difference is similar. If your pipeline is spending meaningful time reading and writing feature matrices as CSV files, switching to .npy will make that time essentially disappear.
File size follows the same pattern. A .npy file for a float64 array is exactly 8 bytes per element plus a small header — there's no overhead for decimal points, separators, or newlines. A CSV for the same data is typically two to three times larger depending on the magnitude of the values.
np.savez() extends this to multiple arrays. It creates a .npz archive — which is structurally a zip file — where each array is stored under a key you provide. You load the whole archive with np.load() and access individual arrays by name. np.savez_compressed() applies zlib compression on top of that, which reduces file size further at the cost of CPU time on both save and load. Compression is worth it for archiving or transferring data; it's usually not worth it for data that gets loaded repeatedly during training.
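A small sketch of the binary path, assuming a throwaway temporary directory and made-up array names (features, labels). It saves one array as .npy, bundles two arrays into a compressed .npz, and checks that dtype, shape, and values survive.

```python
import os
import tempfile
import numpy as np

features = np.random.default_rng(0).random((1000, 8))   # float64
labels = np.arange(1000, dtype=np.int32)

with tempfile.TemporaryDirectory() as tmp:
    # Single array: .npy keeps dtype, shape, and the raw bytes.
    npy_path = os.path.join(tmp, "features.npy")
    np.save(npy_path, features)
    restored = np.load(npy_path)
    header_bytes = os.path.getsize(npy_path) - features.nbytes

    # Several arrays: .npz is a zip archive keyed by the names you pass.
    npz_path = os.path.join(tmp, "bundle.npz")
    np.savez_compressed(npz_path, features=features, labels=labels)
    with np.load(npz_path) as bundle:
        keys = sorted(bundle.keys())
        restored_labels = bundle["labels"]

print(restored.dtype, restored.shape, np.array_equal(restored, features))
print("header size in bytes:", header_bytes)
print(keys, restored_labels.dtype)
```

Swapping np.savez_compressed for np.savez gives the same archive layout without the compression cost.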
genfromtxt — Handling Messy Real-World CSVs
Real-world CSV files are not clean. They have missing values where a sensor didn't record, mixed types because someone exported a database table, comment lines at the top explaining the schema, and inconsistent formatting because the file went through three different tools before it reached you. np.loadtxt() can skip '#' comment lines, but it handles none of the rest: it throws a ValueError the moment it encounters something it can't parse.
np.genfromtxt() is the answer for that middle ground: data that's numerical in intent but imperfect in practice. The key behavioural difference is that genfromtxt fills missing values with a placeholder — NaN by default for floats — rather than aborting. It also supports structured dtypes, which let you mix integer columns with float columns and string columns in the same file, and names=True which reads the header row and gives you column access by name.
The names=True output is a structured array, not a regular ndarray. It behaves differently from what most people expect — operations like data['score'] work, but standard array arithmetic does not apply across the whole structure the way it does with a 2D ndarray. It's a lightweight alternative to a Pandas DataFrame for cases where you need named columns but don't want the Pandas dependency.
For genuinely messy data with millions of rows, complex types, date columns, or anything you'll need to reshape and query heavily, Pandas read_csv is the better tool. genfromtxt sits between np.loadtxt and Pandas: it handles imperfect files, but it's not a DataFrame.
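The behaviour described in this section can be sketched on a tiny, hypothetical sensor export (the column names id, score, and temp are invented for the example). Two fields are left empty on purpose to show the NaN fill.

```python
import io
import numpy as np

# Hypothetical export: header row, one missing score, one missing temp.
raw = io.StringIO(
    "id,score,temp\n"
    "1,0.91,20.5\n"
    "2,,21.0\n"
    "3,0.78,\n"
)

# names=True reads the header; missing float fields become NaN by default.
table = np.genfromtxt(raw, delimiter=",", names=True, dtype=float)

print(table.dtype.names)          # column names taken from the header
print(table["score"])             # the second entry is nan
print(np.isnan(table["temp"]))    # mask of missing temperatures
```

The result is a 1D structured array with one record per row, so column access goes through field names rather than a second axis.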
Syntax, Parameters, and the Silent Footguns
Every dev skims syntax. Then they ship a file that breaks in prod at 3 AM. Here's what matters.
savetxt signature: numpy.savetxt(fname, X, fmt='%.18e', delimiter=' ', newline='\n', header='', footer='', comments='# ', encoding=None)
fname is your output path. If it ends with .gz, NumPy transparently gzips, which is great for log dumps. X is your array (1D or 2D). Got a 3D tensor? savetxt raises a ValueError rather than flattening it, so reshape to 2D yourself and record the original shape somewhere. That's bitten teams in latency-critical pipelines.
fmt defaults to '%.18e', which prints every double with 18 digits after the decimal point in scientific notation. That's more than a human reader needs, but it is what makes the default round-trip exact. Use '%.6f' only for output meant for display, such as sensor data you don't need at sub-micron precision. delimiter defaults to space. CSV? Pass ','. Tab? '\t'. Simple.
header and footer write raw strings. comments is the prefix — defaults to '# '. So header='timestamp=2024-04-01' writes # timestamp=2024-04-01. Screw up comments and your header won't be parseable by loadtxt without tweaking comments there too.
encoding=None falls back to 'latin1'. Files written with any other encoding cannot be loaded by older NumPy (<1.14). Always set encoding='utf-8' in modern workflows, especially when headers contain non-ASCII text.
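Here is a hedged sketch of a savetxt call with the parameters spelled out. The readings array and header text are placeholders; encoding is omitted because it only applies when fname is a path, and this example writes to an in-memory buffer. The last part checks what happens with a 3D input.

```python
import io
import numpy as np

readings = np.array([[1.5, 2.25], [3.125, 4.0]])

buf = io.StringIO()
np.savetxt(
    buf,
    readings,
    fmt="%.6f",                       # explicit precision
    delimiter=",",                    # CSV instead of the space default
    header="timestamp=2024-04-01",    # written as "# timestamp=2024-04-01"
    comments="# ",
)
text = buf.getvalue()
print(text)

# loadtxt skips lines starting with '#', so this header needs no skiprows.
buf.seek(0)
back = np.loadtxt(buf, delimiter=",")

# Arrays with more than two dimensions are rejected, not flattened.
try:
    np.savetxt(io.StringIO(), np.zeros((2, 2, 2)))
    rejected_3d = False
except ValueError:
    rejected_3d = True
print("3D input rejected:", rejected_3d)
```

When writing to a real path, adding encoding='utf-8' to the same call is the only change needed.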
Always set encoding='utf-8' explicitly. The latin1 default will silently encode your header bytes wrong, and loadtxt will fail or produce garbage. Seen it happen in three different CI pipelines.
Set fmt, delimiter, and encoding explicitly; never trust defaults in production.
Example: Saving Multiple Arrays the Right Way
You can't pass more than one array to savetxt. It expects a single 1D or 2D array. If you try np.savetxt('out.csv', arr1, arr2), the second array is taken as the fmt argument, and you get an error and a facepalm. Stack them first with np.column_stack or np.vstack.
Got two arrays with the same first dimension? column_stack merges them column-wise. Data and labels of different lengths? Pad or slice them to the same length first. This isn't academic; it's what happens when you log sensor readings with timestamps, coordinates with measurements, or any multi-source data.
You could write a loop — but that's how you end up with misaligned columns and a 5 AM pager. Stack once, write once. Use fmt tuples for mixed precision: fmt=['%d', '%10.2f'] for integer IDs and floating readings. savetxt applies them column-by-column.
If the arrays have incompatible shapes, savetxt won't magically fix it: np.column_stack raises a ValueError on mismatched lengths, and any padding or slicing you do to make them fit changes the data. Validate before you write.
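The stack-once, write-once pattern might look like the sketch below. The arrays sensor_ids and values are invented for the example, and the per-column fmt list uses an integer format for the IDs and two decimals for the readings.

```python
import io
import numpy as np

sensor_ids = np.array([101, 102, 103])
values = np.array([20.456, 21.0, 19.5])

# Validate lengths before stacking; column_stack would raise on a mismatch.
assert sensor_ids.shape[0] == values.shape[0]

table = np.column_stack([sensor_ids, values])   # shape (3, 2), dtype float64

buf = io.StringIO()
# One format per column: integer IDs, two-decimal readings.
np.savetxt(buf, table, fmt=["%d", "%.2f"], delimiter=",")
print(buf.getvalue())
```

Note that column_stack promotes the integer IDs to float64, so the '%d' format is what keeps them looking like integers in the file.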
For mixed-type output, use pandas.DataFrame.to_csv with float_format. It handles mixed types, missing data, and arbitrary column orders. savetxt is fine for uniform numeric arrays; pandas wins for any real CSV.
Never pass multiple arrays to savetxt: always column_stack or hstack them into one 2D array first.
Why loadtxt Chokes on Headers, and the One-Liner Fix
You've got a clean CSV with a header row. You call numpy.loadtxt and it throws a ValueError about string conversion. Junior devs blame NumPy. You already know the culprit: loadtxt does not infer column types. It converts every field to a single dtype (float by default), so it sees a string header and panics.
The fix is trivial once you know the trap: pass skiprows=1. But that only skips the header; it doesn't save you from mixed types, missing values, or trailing commas. That's when you reach for genfromtxt (covered earlier). For production pipelines where your data is already validated and clean, loadtxt with skiprows is the simpler and usually faster choice; measure on your own files before counting on a specific speedup.
Real play: if your header contains column names you actually need, use names=True in genfromtxt or roll your own dict mapping with skiprows. Never let a header row be the reason your Monday morning batch job burns down.
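A minimal sketch of both options, using a made-up two-column CSV held in a string: the failure without skiprows, the fix with skiprows=1, and a hand-rolled dict mapping for the column names.

```python
import io
import numpy as np

csv_text = "x,y\n1.0,2.0\n3.0,4.0\n"

# Without skiprows, the header row cannot be converted to float.
try:
    np.loadtxt(io.StringIO(csv_text), delimiter=",")
    header_failed = False
except ValueError:
    header_failed = True

# skiprows=1 drops the header line; the rest parses as float64.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

# If the column names are needed, read them separately from the first line.
names = csv_text.splitlines()[0].split(",")
columns = dict(zip(names, data.T))
print(header_failed, data.shape, columns["y"])
```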
For multi-line headers, use skiprows=2 or combine with comments='#', but verify the count in your actual data contract.
savetxt Precision: Why Your 64-Bit Floats Get Truncated at 1e-6
You saved a double-precision array with 16 significant digits. When you load it back, the trailing digits are gone. This isn't a bug in loadtxt; it's the fmt that was passed to savetxt. The default fmt='%.18e' writes 18 digits after the decimal point in scientific notation, which is more than enough to reproduce any 64-bit float. But if you use fmt='%f' (6 decimal places) or fmt='%g' (6 significant digits), you're keeping roughly float32-level precision or worse.
Why this matters: in production, where your data is sensor readings or financial tick data, truncating to 1e-6 means cumulative error. We saw a machine learning pipeline silently drift because coordinate data got written with fmt='%f' and reloaded — the offset was 0.000001 per row, 1 million rows later you're 1 meter off.
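Rule of thumb: never use a short fmt for critical data. Keep the default '%.18e', or explicitly set fmt='%.16e' or fmt='%.17g'; both write 17 significant digits, which is what a lossless float64 round-trip requires. Formats with 16 or fewer significant digits, such as '%.15e' or '%.16g', can still lose the last bit. For integer data, fmt='%d' is fine. For mixed types, you're back to genfromtxt territory, but that's another war story.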
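One way to check which formats survive a round-trip is to format a single awkward value and parse it back. The sketch below uses 0.1 + 0.2 and plain printf-style formatting, which is the same format-string syntax savetxt accepts; the list of formats is just a sample.

```python
import numpy as np

# 0.1 + 0.2 is the float64 value 0.30000000000000004, which needs
# 17 significant digits to be written unambiguously.
value = np.float64(0.1) + np.float64(0.2)

formats = ["%f", "%g", "%.15e", "%.16g", "%.16e", "%.17g", "%.18e"]
lossless = {fmt: bool(float(fmt % value) == value) for fmt in formats}

for fmt, ok in lossless.items():
    text = fmt % value
    print(f"{fmt:>6}  ->  {text:<26} round-trips: {ok}")
```

A single value is not a proof, but it is a cheap way to rule out formats that are too short before trusting them with real data.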
Use fmt='%.16e' with savetxt: it writes 17 significant digits, enough for a lossless float64 round-trip, and works with any delimiter. Test once, then forget it.
ML Pipeline Loaded CSV with Wrong Precision: Training Ran on Corrupted Features
- CSV is not lossless for floating-point data — 6 significant digits is a lot less precision than float64 carries
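- fmt='%g' silently truncates to 6 significant digits. It is not the savetxt default ('%.18e' is), but it is easy to reach for when you want compact output, and the truncation arguably deserves a warning in the NumPy docs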
- Always specify dtype= explicitly in np.loadtxt — the default may not match what was saved
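- np.allclose() validation at pipeline boundaries catches precision drift before it reaches training, provided the tolerances are tighter than the drift (the default rtol=1e-5 is looser than a 6-digit truncation)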
- The absence of an error does not mean the data is correct — CSV round-trips are quiet even when they're lossy
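A boundary check along these lines could look like the sketch below. The helper name check_round_trip is hypothetical, and the tolerances default to zero on purpose: np.allclose with its default rtol=1e-5 would accept the roughly 1e-6 relative error that '%g' introduces.

```python
import io
import numpy as np

def check_round_trip(original, loaded, rtol=0.0, atol=0.0):
    """Raise if loaded differs from original beyond the given tolerances.

    With rtol=0 and atol=0 this demands exact equality of every element.
    """
    if original.shape != loaded.shape:
        raise ValueError(f"shape changed: {original.shape} -> {loaded.shape}")
    if not np.allclose(loaded, original, rtol=rtol, atol=atol):
        worst = np.max(np.abs(loaded - original))
        raise ValueError(f"precision drift detected, max abs error {worst:.3e}")

features = np.array([0.123456789012345, 0.987654321098765])

buf = io.StringIO()
np.savetxt(buf, features, fmt="%g")      # lossy on purpose
buf.seek(0)
reloaded = np.loadtxt(buf, dtype=np.float64)

try:
    check_round_trip(features, reloaded)
    drift_caught = False
except ValueError as exc:
    drift_caught = True
    print(exc)
```

Loosen rtol and atol only when the pipeline stage genuinely tolerates that much error, and write the chosen values down next to the check.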
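np.load() on a .npz file returns a dict-like NpzFile, not an ndarray; index it by array name (list the names with loaded.keys()). If you only have one array and want direct ndarray access, save it with np.save() as a .npy file instead.
After loading, check what you got: print(loaded.dtype, loaded.shape)
When reading text, pin the dtype: np.loadtxt('data.csv', dtype=np.float64, delimiter=',')
Key takeaways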
- np.loadtxt() and np.savetxt() work with text files. Use delimiter and skiprows for CSV, and specify dtype= explicitly
- np.save() and np.load() work with binary .npy files
- Use np.savez_compressed() for archiving or transfer
Common mistakes to avoid
- Using CSV for numerical data exchange between pipeline stages. Fix: validate with np.allclose() after loading before proceeding.
- Not specifying dtype in np.loadtxt.
- Using np.loadtxt on files with missing values. Fix: use np.genfromtxt() instead; it fills missing values with NaN by default and continues loading rather than aborting. Specify filling_values if you need something other than NaN for specific columns. For truly complex files, Pandas read_csv handles missing values natively and gives you more control.
- Loading a large CSV file with np.loadtxt on a memory-constrained machine.
- Indexing a loaded .npz file as if it were an ndarray. Fix: np.load() on a .npz returns a dict-like archive, so access arrays by name (list them with bundle.keys()). If you're saving a single array and want np.load to return an ndarray directly, use np.save() to produce a .npy file instead of np.savez().
Interview Questions on This Topic
What is the advantage of .npy over CSV for large NumPy arrays?