
Python File Handling — 'w' Mode Truncates on Open

📍 Part of: File Handling → Topic 1 of 6
Using 'w' mode truncates files on open(), not write().
⚙️ Intermediate — basic Python knowledge assumed
In this tutorial, you'll learn
  • Always use 'with open(...) as f:' — it's not just style, it's a resource safety guarantee that prevents file descriptor leaks and ensures buffers are flushed to disk.
  • 'w' mode truncates the file to zero bytes the instant open() is called, before any write() happens — if you need existing content to survive, you want 'a' mode instead.
  • Iterate the file object directly ('for line in file:') rather than calling read() — this keeps memory usage constant for files of any size, which is the difference between a script that scales and one that doesn't.
Quick Answer
  • The with statement guarantees file close even on exceptions
  • 'r' mode reads; 'w' truncates existing file on open; 'a' appends
  • Use encoding='utf-8' to avoid UnicodeDecodeError on Windows
  • Iterating the file object reads one line at a time — O(1) memory per line
  • For large files, avoid read() — it loads the whole file into RAM
  • Biggest trap: opening with 'w' and crashing — file is already empty
Quick Debug Cheat Sheet: File Handling

Run these commands when file operations behave unexpectedly in production
File descriptor limit exceeded

Immediate Action: Find the PID of the Python process, then list its open files.
Commands:
lsof -p $(pgrep -f your_script.py) | grep -c 'REG'
ulimit -n
Fix Now: Wrap all open() calls in a `with` statement. For long-running apps, use `contextlib.closing()` for resources that have close() but don't support `with`.

UnicodeDecodeError crash

Immediate Action: Find the file's encoding and open the file with that encoding.
Commands:
python -c 'import chardet; print(chardet.detect(open("problematic.txt","rb").read()))'
file problematic.txt
Fix Now: Pass `encoding='utf-8'` as your default. If the file might have mixed encodings, read in `'rb'` mode and use `.decode()` with `errors='backslashreplace'`.

Write didn't appear in file

Immediate Action: Check whether the file was flushed, and confirm the `with` block completed.
Commands:
python -c 'import os; print(os.stat("output.txt").st_size)'
cat output.txt | wc -l
Fix Now: Wrap write logic in try/finally or use `with` to guarantee a flush. For high-reliability writes, call `file.flush()` and `os.fsync(file.fileno())` after critical writes.

`file.read()` uses 2 GB RAM for a 2 GB file

Immediate Action: Stop calling `read()` without an argument. Switch to line iteration or chunked reading.
Commands:
wc -l hugefile.txt
python -c 'print(sum(1 for _ in open("hugefile.txt", "rb")))'
Fix Now: Replace `data = file.read()` with `for chunk in iter(lambda: file.read(65536), b''): process(chunk)` for binary files (the sentinel is `''` in text mode), or iterate lines for text.
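
The chunked-read idiom from the last entry can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper name `sha256_of_file` and the 64 KiB chunk size are hypothetical choices, and hashing stands in for whatever per-chunk processing you need. The sentinel must match the mode: `b""` for binary files, `""` for text files.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Hash a file while holding at most one chunk in memory (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as binary_file:
        # iter(callable, sentinel) keeps calling read() until it returns b"",
        # which is what a binary file returns at end-of-file.
        for chunk in iter(lambda: binary_file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway 200 KB file so the sketch is self-contained.
payload = b"x" * 200_000
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.bin")
    with open(sample_path, "wb") as sample_file:
        sample_file.write(payload)
    chunked_hash = sha256_of_file(sample_path)

# The chunked hash matches hashing the whole payload at once.
print(chunked_hash == hashlib.sha256(payload).hexdigest())
```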
Production Incident

Silent Log Loss: The Production File That Vanished

A multi-threaded microservice wrote audit logs to a file, but after a hotfix rollback, three hours of logs disappeared. The team assumed logs were in append mode — they weren't.
Symptom: After a hotfix deploy and immediate rollback, the audit log file from the previous three hours was empty. New logs started appearing, but the prior window's data was gone.
Assumption: The audit logger used 'a' mode, so log lines should accumulate. The rollback should only affect new code, not historical data.
Root cause: During the hotfix, the dev mistakenly changed the file open mode to 'w' to regenerate a test file. The change was deployed, the file was truncated on the first open() call after deploy, and the rollback didn't recover the old content because 'w' destroyed it before any write completed.
Fix: Implemented an audit log rotation with date-based filenames and a dedicated logging module handler that explicitly uses 'a' with no chance of mode override via config. Added a lint rule banning 'w' mode in audit paths.
Key Lesson
  • 'w' mode truncates the file at open(), not at write().
  • If your application has any local file output that must survive across runs, use 'a' mode or date-stamped filenames.
  • Better yet, centralize logging to a service; never trust file mode conventions.
Production Debug Guide

Symptom → Action guide for real problems you'll face in production

Symptom: Too many open files error (OSError 24)
Action: Count open file handles: lsof -p <pid> | wc -l. Check for missing with statements: every open() without with is a potential leak. Use ulimit -n to see the system limit. Add a monitoring alert on file descriptor usage.

Symptom: UnicodeDecodeError when reading a file that looks like text
Action: Print the file's likely encoding: chardet.detect(open('file','rb').read(10000)). If detection fails, fall back to binary mode 'rb' and decode manually with error handling. Never assume UTF-8; always specify encoding in production.

Symptom: File exists but open() raises FileNotFoundError
Action: Check whether it's a symlink pointing to a non-existent path, or the file is inside a directory mounted as a filesystem that's not available. Also check whether the file path contains trailing whitespace or invisible characters; print repr(path) to verify.

Symptom: File written but content is empty or truncated
Action: Check whether you're using 'w' mode by accident by logging the mode string at open time. Use file.tell() before writing to confirm cursor position. For safety, write to a .tmp path in the same directory and then os.replace() it onto the target. The replace is atomic on POSIX filesystems, and unlike os.rename() it also overwrites an existing target on Windows.

Every meaningful program eventually needs to talk to the outside world — not just the screen, but actual storage. Log files, configuration files, CSV exports, user-uploaded text, cached API responses — all of it lives on disk as files. If your Python script can't read or write files, it's essentially a calculator that resets every time you switch it off. File handling is the bridge between your running program and data that survives after the program exits.

The problem Python file handling solves is deceptively simple: how do you safely open a resource, use it, and guarantee it gets released — even when something goes wrong? Languages that don't enforce this leave files locked open, causing data corruption, permission errors, and crashes that only appear under load. Python's answer is the context manager (the with statement), which makes safe file handling the path of least resistance rather than an afterthought.

By the end of this article you'll know how to open files for reading, writing, and appending; understand exactly what happens behind the scenes when you do; handle errors like a professional; work with both text and binary files; and recognise the two or three patterns that cover 95% of real-world file work. You'll also know the mistakes that trip up even experienced developers — so you can skip straight past them.

Opening and Reading Files — and Why the 'with' Statement Exists

Before you can do anything with a file, Python needs a file object — a live connection to that file on disk. You create one with the built-in open() function. The two most important arguments are the file path and the mode: 'r' for read, 'w' for write, 'a' for append, and 'b' tacked on for binary (e.g., 'rb').

Here's the thing most tutorials gloss over: open() acquires an operating-system resource. The OS gives your process a file descriptor — a numbered slot in a limited table. If you open files without closing them, you eventually exhaust that table and get a Too many open files error. Worse, unflushed writes may never reach disk.

The with statement solves this by acting as a guaranteed cleanup mechanism. It calls file.close() the instant the indented block exits — whether that exit is normal, via return, or via a crashing exception. You should use with open(...) every single time, with no exceptions. The only reason the two-line open() / close() pattern still exists in docs is historical; treat it as legacy.
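
To make that guarantee concrete, here is a small sketch that raises an exception inside a with block and then inspects the file object. The file name and the simulated ValueError are assumptions made purely for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")

    try:
        with open(demo_path, "w", encoding="utf-8") as demo_file:
            demo_file.write("partial data\n")
            raise ValueError("simulated crash inside the block")
    except ValueError:
        pass  # the with block has already closed the file by this point

    closed_after_exception = demo_file.closed

    # Closing also flushed the buffer, so the write made it into the file.
    with open(demo_path, "r", encoding="utf-8") as demo_file:
        content_on_disk = demo_file.read()

print(closed_after_exception)
print(repr(content_on_disk))
```

The file object still exists as a name after the block, but it is closed, and the data written before the exception was flushed as part of that close.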

Reading has three flavours: read() loads the entire file into one string (fine for small files, dangerous for large ones), readline() fetches one line at a time, and readlines() returns a list of all lines. For most real work, iterating the file object directly is the cleanest and most memory-efficient approach — Python streams one line at a time without loading everything.

read_server_log.py · PYTHON
# Scenario: parse a server log file and count how many lines are ERROR level

log_file_path = "server.log"

# Create a sample log file to work with so this script is self-contained
with open(log_file_path, "w", encoding="utf-8") as log_file:
    log_file.write("INFO  2024-06-01 08:00:01 Server started\n")
    log_file.write("INFO  2024-06-01 08:01:15 Request received from 192.168.1.10\n")
    log_file.write("ERROR 2024-06-01 08:01:16 Database connection timeout\n")
    log_file.write("INFO  2024-06-01 08:02:00 Retrying connection\n")
    log_file.write("ERROR 2024-06-01 08:02:05 Max retries exceeded\n")
    log_file.write("INFO  2024-06-01 08:02:06 Falling back to cache\n")

error_count = 0
error_lines = []

# The 'with' block guarantees the file is closed when we're done,
# even if an exception is raised inside the block.
with open(log_file_path, "r", encoding="utf-8") as log_file:
    # Iterating the file object directly reads ONE line at a time —
    # this works correctly even for a 10 GB log file because Python
    # never loads the whole thing into memory at once.
    for line in log_file:
        stripped_line = line.strip()  # remove trailing newline characters
        if stripped_line.startswith("ERROR"):
            error_count += 1
            error_lines.append(stripped_line)

print(f"Total ERROR lines found: {error_count}")
print("\nError details:")
for error in error_lines:
    print(f"  -> {error}")
▶ Output
Total ERROR lines found: 2

Error details:
  -> ERROR 2024-06-01 08:01:16 Database connection timeout
  -> ERROR 2024-06-01 08:02:05 Max retries exceeded
💡Pro Tip: Always Specify Encoding
Always pass encoding='utf-8' to open(). Without it, Python uses the platform's default encoding — which is UTF-8 on Mac/Linux but often CP1252 on Windows. That mismatch is the cause of countless 'UnicodeDecodeError' bugs that only appear on certain machines. Make UTF-8 your default and move on.
📊 Production Insight
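A production batch job reading CSV files on a Windows server crashed daily at 2 PM. The input file was created on a Mac and contained en-dash characters (U+2013). Python's default encoding on that Windows server was CP1252, so the file's UTF-8 bytes were decoded with the wrong codec, which yields mojibake for some byte sequences and a UnicodeDecodeError for any sequence containing a byte CP1252 leaves undefined. The fix was adding encoding='utf-8'. The file was valid UTF-8 all along; the default assumption broke everything.
Rule: if you don't control file creation, run chardet.detect() (a third-party package) on the first bytes and pass the detected encoding to open(). Detection is a statistical guess, so log what it returned. Never trust platform defaults.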
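
The failure mode is easy to reproduce without any files. The sketch below decodes UTF-8 bytes with the cp1252 codec; the two sample characters are chosen for illustration and are not taken from the incident above.

```python
# UTF-8 encodings of two common typographic characters.
en_dash_utf8 = "\u2013".encode("utf-8")      # b'\xe2\x80\x93'
right_quote_utf8 = "\u201d".encode("utf-8")  # b'\xe2\x80\x9d'

# Wrong codec, no exception: every byte maps to some cp1252 character,
# so the result is three junk characters instead of one en-dash.
mojibake = en_dash_utf8.decode("cp1252")

# Wrong codec, hard failure: byte 0x9D has no meaning in cp1252.
try:
    right_quote_utf8.decode("cp1252")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Right codec: the original character comes back intact.
round_trip = en_dash_utf8.decode("utf-8")

print(len(mojibake), decode_failed, round_trip == "\u2013")
```

Silent mojibake is arguably the worse outcome of the two, because nothing crashes and the corrupted text flows downstream.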
🎯 Key Takeaway
Use with open(...) every single time.
Specify encoding='utf-8' explicitly.
Iterate lines, don't call read() on large files.

Writing and Appending — Understanding Why 'w' Mode Is a Trap

Writing to a file sounds simple — and it is, once you understand the critical difference between 'w' (write) and 'a' (append) mode, because confusing them is one of the most common ways to destroy your own data.

Opening a file in 'w' mode does two things: it creates the file if it doesn't exist, and it truncates the file to zero bytes if it does exist — immediately, before you write a single character. That means opening an existing file with 'w' and then crashing before writing anything leaves you with an empty file. The original content is gone.
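
You can watch the truncation happen by checking the file size between open() and the first write. This sketch deliberately skips the with statement for the second open so the size can be measured while the file is still open; the file name and contents are made up for the demo.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    data_path = os.path.join(tmp_dir, "important.txt")

    with open(data_path, "w", encoding="utf-8") as data_file:
        data_file.write("three hours of audit data\n")
    size_before = os.path.getsize(data_path)

    # Open in 'w' mode and write nothing at all.
    handle = open(data_path, "w", encoding="utf-8")
    size_after_open = os.path.getsize(data_path)
    handle.close()

print(size_before > 0)   # the file had content
print(size_after_open)   # already empty, and write() was never called
```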

Append mode ('a') is safer for ongoing data: the file is created if absent, but if it exists, the write cursor starts at the very end. Existing content is untouched. This is exactly what you want for log files, audit trails, or any data you're accumulating over time.

For writing structured data — think generating a report or saving configuration — 'w' is correct because you intentionally want a fresh file each run. For recording events as they happen over time, 'a' is correct. Choosing wrong corrupts data silently, which is why understanding the WHY matters more than memorising the letters.

write() takes a single string. Unlike print(), it does not add a newline automatically, so you must include '\n' yourself. writelines() accepts any iterable of strings and also skips automatic newlines, so each string must already end with '\n' if you want line breaks.
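
A quick way to see the newline behaviour is to write into an in-memory text buffer. io.StringIO is used here only so the sketch needs no file on disk; a file opened in text mode behaves the same way for these calls.

```python
import io

buffer = io.StringIO()  # in-memory stand-in for a text-mode file

buffer.write("first")                      # no newline is added
buffer.write("second\n")                   # newline only because we wrote it
buffer.writelines(["third", "fourth\n"])   # strings are joined as-is
buffer.writelines(f"{word}\n" for word in ["fifth", "sixth"])  # any iterable works

written = buffer.getvalue()
print(repr(written))
```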

user_activity_logger.py · PYTHON
import datetime

activity_log_path = "user_activity.log"

def log_activity(username: str, action: str) -> None:
    """Append a timestamped activity record to the log file.

    We use 'a' mode so that every call adds to the existing log
    rather than overwriting it. This function is safe to call
    from multiple places in a long-running application.
    """
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    # Build the full log entry as a single string with its own newline
    log_entry = f"[{timestamp}] USER={username} ACTION={action}\n"

    with open(activity_log_path, "a", encoding="utf-8") as activity_log:
        activity_log.write(log_entry)  # 'a' mode: cursor is always at end of file

def generate_daily_report(report_date: str, summary_lines: list[str]) -> None:
    """Write a fresh daily report, replacing any previous report for the same name.

    We intentionally use 'w' mode here because the report is always
    regenerated from scratch — old content should not carry over.
    """
    report_path = f"report_{report_date}.txt"
    header = f"=== Daily Report for {report_date} ===\n\n"

    with open(report_path, "w", encoding="utf-8") as report_file:
        report_file.write(header)
        # writelines() writes each string as-is — no automatic newlines added
        # so we ensure each summary line already ends with '\n'
        report_file.writelines(line + "\n" for line in summary_lines)

    print(f"Report saved to {report_path}")

# Simulate three user events arriving over time
log_activity("alice", "LOGIN")
log_activity("bob", "VIEWED_DASHBOARD")
log_activity("alice", "EXPORTED_CSV")

# Read back the log to confirm all three entries were preserved
print("--- Current activity log ---")
with open(activity_log_path, "r", encoding="utf-8") as activity_log:
    print(activity_log.read())

# Generate a report (uses 'w' mode — intentionally fresh each time)
generate_daily_report(
    report_date="2024-06-01",
    summary_lines=[
        "Total logins: 1",
        "Total page views: 1",
        "Total exports: 1"
    ]
)
▶ Output
--- Current activity log ---
[2024-06-01 09:15:01] USER=alice ACTION=LOGIN
[2024-06-01 09:15:02] USER=bob ACTION=VIEWED_DASHBOARD
[2024-06-01 09:15:03] USER=alice ACTION=EXPORTED_CSV

Report saved to report_2024-06-01.txt
⚠ Watch Out: 'w' Mode Deletes First, Asks Questions Later
The file truncation in 'w' mode happens the instant open() is called — not when you call write(). If your next line crashes, the file is already empty. For any file you care about, consider writing to a temporary file first and renaming it over the original only after a successful write. This is the atomic write pattern and it's how professional tools like text editors protect your data.
📊 Production Insight
An ETL pipeline that wrote daily CSV exports used 'w' mode by default. A connection timeout hit between open() and write(): the output file was truncated, and the downstream system ingested an empty file, overwriting the previous day's correct data. Recovery required a database restore.
Rule: use atomic writes for any file that another system consumes. Write to a .tmp path in the same directory, then os.replace() it onto the final name (os.rename() raises an error on Windows when the target already exists).
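
One way to sketch the atomic write pattern with only the standard library is shown below. The helper name `atomic_write_text` is hypothetical, and the approach assumes the temporary file and the target live on the same filesystem, which is why the temp file is created next to the target.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(target, text):
    """Replace target's content so readers see the old file or the new one."""
    target = Path(target)
    # Same directory as the target, so the final swap stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(target.parent), prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp_file:
            tmp_file.write(text)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())  # push the data to disk before the swap
        os.replace(tmp_name, target)  # overwrites an existing target, also on Windows
    except BaseException:
        # Something failed before the swap: drop the temp file, keep the original.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


with tempfile.TemporaryDirectory() as tmp_dir:
    export_path = Path(tmp_dir) / "export.csv"
    export_path.write_text("old,data\n", encoding="utf-8")

    atomic_write_text(export_path, "new,data\n")

    final_content = export_path.read_text(encoding="utf-8")
    leftover_files = sorted(p.name for p in Path(tmp_dir).iterdir())

print(repr(final_content), leftover_files)
```

If the process dies while writing, the target still holds yesterday's data, because the only step that touches it is the final os.replace().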
🎯 Key Takeaway
'w' truncates on open, not on write.
'a' preserves existing content.
For critical files, use atomic write pattern with rename.

Error Handling and File Existence — Writing Code That Doesn't Embarrass You in Production

A file operation in production will eventually fail. The path doesn't exist, the disk is full, the process doesn't have permission, the file is locked by another process. Ignoring this is fine in a throwaway script and a fireable offence in production code.

Python raises specific exceptions for file errors. FileNotFoundError fires when you try to read a file that doesn't exist. PermissionError fires when the OS blocks access. IsADirectoryError fires when you accidentally pass a directory path where a file path was expected. All three are subclasses of OSError, so catching OSError covers all of them — but catching the specific type gives you better error messages.
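
The hierarchy can be checked directly. The sketch below confirms the subclass relationship and shows that a broad except OSError still hands you the specific exception type and errno; the missing file path is invented for the demo.

```python
import errno
import os
import tempfile

file_errors = (FileNotFoundError, PermissionError, IsADirectoryError)
all_are_os_errors = all(issubclass(exc, OSError) for exc in file_errors)

with tempfile.TemporaryDirectory() as tmp_dir:
    missing_path = os.path.join(tmp_dir, "does_not_exist.txt")
    try:
        with open(missing_path, "r", encoding="utf-8") as missing_file:
            missing_file.read()
    except OSError as error:
        # The broad handler still sees the concrete subclass and its errno.
        caught_type = type(error).__name__
        caught_errno = error.errno

print(all_are_os_errors, caught_type, caught_errno == errno.ENOENT)
```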

A common anti-pattern is checking os.path.exists() before opening a file. This looks safe but it's a race condition: between your check and your open(), another process can delete or create that file. The correct pattern is EAFP (Easier to Ask Forgiveness than Permission) — just try to open it and handle the exception. Python was designed around this idiom.

For reading a file that might not exist yet (like a config file on first run), a clean pattern is to catch FileNotFoundError and return a sensible default rather than crashing. For writing, you often want the opposite: verify the destination directory exists and create it if not, using pathlib.Path.mkdir(parents=True, exist_ok=True) before writing.

config_manager.py · PYTHON
import json
from pathlib import Path

DEFAULT_CONFIG = {
    "theme": "dark",
    "language": "en",
    "auto_save_interval_seconds": 30
}

CONFIG_PATH = Path("app_data") / "config.json"

def load_config() -> dict:
    """Load config from disk. Returns defaults if the file doesn't exist yet.

    Uses EAFP style: try the operation, handle the specific failure.
    This avoids the race condition that os.path.exists() creates.
    """
    try:
        with open(CONFIG_PATH, "r", encoding="utf-8") as config_file:
            loaded_config = json.load(config_file)
            print(f"Config loaded from {CONFIG_PATH}")
            return loaded_config

    except FileNotFoundError:
        # This is expected on first run — not an error, just a first boot
        print(f"No config file found at {CONFIG_PATH}. Using defaults.")
        return DEFAULT_CONFIG.copy()

    except json.JSONDecodeError as parse_error:
        # The file exists but is malformed — this IS an error worth reporting
        print(f"WARNING: Config file is corrupted ({parse_error}). Using defaults.")
        return DEFAULT_CONFIG.copy()

    except PermissionError:
        # We can't read the file — fail loudly, don't silently use defaults
        raise RuntimeError(
            f"Cannot read config at {CONFIG_PATH}. Check file permissions."
        )

def save_config(config: dict) -> None:
    """Save config to disk, creating the directory structure if it doesn't exist."""
    # mkdir with parents=True creates 'app_data/' if it's missing
    # exist_ok=True means no error if the directory is already there
    CONFIG_PATH.parent.mkdir(parents=True, exist_ok=True)

    with open(CONFIG_PATH, "w", encoding="utf-8") as config_file:
        # indent=2 makes the JSON human-readable — important for config files
        json.dump(config, config_file, indent=2)
        print(f"Config saved to {CONFIG_PATH}")

# First load — file doesn't exist yet
current_config = load_config()
print(f"Theme: {current_config['theme']}")

# User changes a setting
current_config["theme"] = "light"
save_config(current_config)

# Second load — now reads from disk
current_config = load_config()
print(f"Theme after reload: {current_config['theme']}")
▶ Output
No config file found at app_data/config.json. Using defaults.
Theme: dark
Config saved to app_data/config.json
Config loaded from app_data/config.json
Theme after reload: light
🔥Interview Gold: EAFP vs LBYL
Python has two philosophies for handling uncertain operations. LBYL (Look Before You Leap) checks conditions first: if os.path.exists(path): open(path). EAFP (Easier to Ask Forgiveness than Permission) just tries it: try: open(path) except FileNotFoundError. Idiomatic Python generally prefers EAFP (the official glossary describes it as a common Python coding style) because it skips the extra existence check, closes the race window between check and use, and keeps the happy path readable. Interviewers love this distinction, and knowing the names and reasons will set you apart.
📊 Production Insight
A Django app used os.path.exists() before opening a user-uploaded file. Under high load, a race condition occurred: two requests processed the same upload simultaneously — one deleted the file after the existence check but before open(). The second request crashed with a 500 error. The user got an opaque error page.
Rule: never check-then-open. Always use try/except with the specific exception. For uploads, copy the file to a temp location before processing.
🎯 Key Takeaway
Use EAFP: try to open, catch exceptions.
os.path.exists() is a TOCTOU race condition.
Always create parent directories before writing.

Binary Files and pathlib — Handling Images, PDFs and Modern Path Management

Not all files are text. Images, PDFs, audio, compiled code, and serialised data are binary — they contain bytes that aren't valid UTF-8 text. Open them in text mode and you'll get a UnicodeDecodeError at best, or silently corrupted data at worst. Binary mode ('rb', 'wb') tells Python to skip all encoding/decoding and work with raw bytes.
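
Here is a small sketch of what goes wrong. It writes four bytes that look like the start of a JPEG, then reads them back in both modes; the file name and byte values are chosen only for illustration.

```python
import os
import tempfile

jpeg_header = b"\xff\xd8\xff\xe0"  # not valid UTF-8

with tempfile.TemporaryDirectory() as tmp_dir:
    image_path = os.path.join(tmp_dir, "fake.jpg")
    with open(image_path, "wb") as image_file:
        image_file.write(jpeg_header)

    # Text mode tries to decode the bytes and fails on the first one.
    try:
        with open(image_path, "r", encoding="utf-8") as text_file:
            text_file.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

    # Binary mode hands the bytes back untouched.
    with open(image_path, "rb") as binary_file:
        raw_bytes = binary_file.read()

print(text_mode_failed, raw_bytes == jpeg_header)
```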

A very common real-world task is copying or processing binary files — resizing images, attaching files to emails, or storing uploaded files from a web form. The pattern is identical to text mode but you read bytes objects instead of strings.

For path manipulation, the modern way in Python 3.4+ is pathlib.Path. Forget string concatenation with os.path.join(): pathlib lets you build paths with the / operator, check existence with .exists(), get the file extension with .suffix, list directory contents with .iterdir(), and open files directly via path.open(). It's more readable, cross-platform by default, and object-oriented in a way that makes your intent clear.

When you're processing a directory full of files — a common automation task — pathlib with a generator expression is significantly cleaner than os.listdir() combined with string filtering. The pattern Path('data').glob('*.csv') gives you an iterator of all CSV files in the directory, ready to open.
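
As a rough sketch of that glob pattern, the snippet below builds a throwaway data directory and counts the lines in each CSV file. The directory layout and file names are assumptions made for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    data_dir = Path(tmp_dir) / "data"
    data_dir.mkdir()
    (data_dir / "jan.csv").write_text("month,total\njan,10\n", encoding="utf-8")
    (data_dir / "feb.csv").write_text("month,total\nfeb,12\n", encoding="utf-8")
    (data_dir / "notes.txt").write_text("not a csv", encoding="utf-8")

    # glob() yields Path objects in arbitrary order, so sort for stable output.
    csv_names = sorted(path.name for path in data_dir.glob("*.csv"))

    line_counts = {}
    for csv_path in data_dir.glob("*.csv"):
        with csv_path.open("r", encoding="utf-8") as csv_file:
            line_counts[csv_path.stem] = sum(1 for _ in csv_file)

print(csv_names)
print(line_counts)
```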

file_organiser.py · PYTHON
from pathlib import Path
import shutil

# Scenario: scan a 'downloads' folder and copy image files into an 'images' archive.
# This pattern works identically on Windows, macOS and Linux because pathlib
# handles the slash vs backslash difference for you automatically.

DOWNLOADS_DIR = Path("sample_downloads")
IMAGES_ARCHIVE_DIR = Path("organised") / "images"

# Create sample directory with mixed file types so the script is self-contained
DOWNLOADS_DIR.mkdir(exist_ok=True)
(DOWNLOADS_DIR / "holiday_photo.jpg").write_bytes(b"\xff\xd8\xff" + b"\x00" * 10)
(DOWNLOADS_DIR / "budget.xlsx").write_bytes(b"PK\x03\x04" + b"\x00" * 10)
(DOWNLOADS_DIR / "profile_pic.png").write_bytes(b"\x89PNG" + b"\x00" * 10)
(DOWNLOADS_DIR / "notes.txt").write_text("Some text notes", encoding="utf-8")
(DOWNLOADS_DIR / "logo.jpg").write_bytes(b"\xff\xd8\xff" + b"\x00" * 10)

IMAGES_ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)

image_extensions = {".jpg", ".jpeg", ".png", ".gif", ".webp"}
copy_count = 0

# Path.iterdir() yields Path objects for every item in the directory —
# no string path manipulation needed, no os.path.join() required.
for file_path in DOWNLOADS_DIR.iterdir():
    # .is_file() filters out subdirectories
    # .suffix gives the file extension including the dot, e.g. '.jpg'
    if file_path.is_file() and file_path.suffix.lower() in image_extensions:
        destination = IMAGES_ARCHIVE_DIR / file_path.name

        # shutil.copy2 copies file content AND metadata (timestamps etc.)
        shutil.copy2(file_path, destination)
        copy_count += 1
        print(f"  Copied: {file_path.name} -> {destination}")

print(f"\nDone. {copy_count} image file(s) archived to {IMAGES_ARCHIVE_DIR}")

# Demonstrate reading binary content back
for image_path in IMAGES_ARCHIVE_DIR.glob("*.jpg"):
    # 'rb' mode returns raw bytes — no encoding involved
    with image_path.open("rb") as image_file:
        first_bytes = image_file.read(3)  # JPEG magic bytes are FF D8 FF
        print(f"  {image_path.name} magic bytes: {first_bytes.hex().upper()}")
▶ Output
  Copied: holiday_photo.jpg -> organised/images/holiday_photo.jpg
  Copied: profile_pic.png -> organised/images/profile_pic.png
  Copied: logo.jpg -> organised/images/logo.jpg

Done. 3 image file(s) archived to organised/images
  holiday_photo.jpg magic bytes: FFD8FF
  logo.jpg magic bytes: FFD8FF
💡Pro Tip: Use pathlib for Everything Path-Related
If you're still writing os.path.join(base_dir, 'subfolder', filename), switch to pathlib today. Path(base_dir) / 'subfolder' / filename does exactly the same thing and is immediately readable. pathlib objects also work directly with open(), shutil, and most standard library functions, so there's no conversion overhead — it's a straight upgrade.
📊 Production Insight
A team's deployment script built a Windows path with os.path.join and hardcoded backslashes. When they moved to Linux Docker containers, the path broke, because on Linux a backslash is an ordinary filename character rather than a separator. Building the path from its components with pathlib.Path and the / operator fixed it, and the code now runs identically on both platforms.
Rule: use pathlib for any path that might cross operating system boundaries. It's not just cleaner — it's portable.
🎯 Key Takeaway
Binary files need 'rb'/'wb' — text mode corrupts them.
pathlib is the modern, cross-platform way to handle paths.
Use .glob() and .iterdir() for bulk file operations.

Working with CSV Files — The Most Common Production File Format

CSV files are everywhere. Exports from databases, spreadsheets, logs, API responses — CSV is the lingua franca of data exchange. But CSV handling has traps: quoting, encoding, newlines inside fields, and missing headers.

Python's csv module handles most of this correctly if you use it right. The common beginner mistake is reading a CSV file manually with for line in file: and splitting on commas — this breaks the moment a field contains a comma or a quoted string. Always use csv.reader or csv.DictReader.

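For production, the default quoting=csv.QUOTE_MINIMAL is usually right for writing. Pass quoting=csv.QUOTE_NONNUMERIC to a reader only when every unquoted field is numeric, because it converts unquoted fields to float and leaves quoted fields as strings. Use newline='' when opening the file; otherwise newlines embedded in quoted fields are not interpreted correctly, and on Windows every written row gets an extra \r.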
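
Both points can be seen on a single line of input. The sketch below compares a naive comma split with csv.reader, then shows what QUOTE_NONNUMERIC does on the reading side; the sample rows are invented, and io.StringIO stands in for a file opened with newline=''.

```python
import csv
import io

raw_line = 'Widget A,"1,250.00",3\r\n'

# Naive split: the comma inside the quoted price breaks the field in two.
naive_fields = raw_line.strip().split(",")

# csv.reader understands the quoting and returns three fields.
parsed_fields = next(csv.reader(io.StringIO(raw_line, newline="")))

# QUOTE_NONNUMERIC on a reader: unquoted fields become floats,
# quoted fields stay strings.
numeric_source = io.StringIO('"Widget B",24.99,1\r\n', newline="")
converted_fields = next(csv.reader(numeric_source, quoting=csv.QUOTE_NONNUMERIC))

print(naive_fields)
print(parsed_fields)
print(converted_fields)
```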

Encoding is the biggest silent killer. A CSV file from a French office might be in cp1252 or latin-1. Always pass encoding explicitly, and if you don't know the origin, use encoding='utf-8-sig' to handle the BOM that Excel loves to add.
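
What utf-8-sig does and does not do can be sketched with plain bytes, no file needed. The header text is illustrative. Note the last case: utf-8-sig is still UTF-8, so it does not rescue a file that is really cp1252.

```python
# Encoding with utf-8-sig prepends the three-byte UTF-8 BOM.
bom_bytes = "Product,Price\n".encode("utf-8-sig")
has_bom = bom_bytes.startswith(b"\xef\xbb\xbf")

# Plain utf-8 keeps the BOM as an invisible character in the first header.
decoded_plain = bom_bytes.decode("utf-8")

# utf-8-sig strips the BOM when present...
decoded_sig = bom_bytes.decode("utf-8-sig")

# ...and is harmless when there is no BOM at all.
no_bom_decoded = "Product,Price\n".encode("utf-8").decode("utf-8-sig")

# It does not help with cp1252 content: an accented character still fails.
try:
    "caf\u00e9".encode("cp1252").decode("utf-8-sig")
    cp1252_decoded = True
except UnicodeDecodeError:
    cp1252_decoded = False

print(has_bom, repr(decoded_plain[0]), decoded_sig == no_bom_decoded, cp1252_decoded)
```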

clean_csv.py · PYTHON
import csv
from pathlib import Path

# Scenario: read a messy CSV exported from Excel (has BOM, inconsistent quotes)
# and write a clean UTF-8 CSV with standard formatting.

input_path = Path("sales_export.csv")
output_path = Path("sales_cleaned.csv")

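# Create a sample 'messy' CSV file for demonstration.
# encoding="utf-8-sig" already writes the BOM, so "\ufeff" must not be written by hand
# (a second BOM would end up inside the first header name).
# newline='' stops Python from turning the explicit "\r\n" into "\r\r\n" on Windows.
with open(input_path, "w", encoding="utf-8-sig", newline='') as f:
    f.write("Product,Price,Quantity\r\n")
    f.write('Widget A,"$12.50",3\r\n')
    f.write('Widget B,"$24.99",1\r\n')
    f.write('Widget C,"$5.00",10\r\n')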

def clean_price(price_str: str) -> float:
    """Remove currency symbols and whitespace, return float."""
    return float(price_str.replace("$", "").replace(",", "").strip())

# Open with newline='' — critical for csv module portability
with open(input_path, "r", encoding="utf-8-sig", newline='') as infile, \
     open(output_path, "w", encoding="utf-8", newline='') as outfile:

    reader = csv.DictReader(infile)
    fieldnames = reader.fieldnames
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()

    for row in reader:
        # Clean price field
        row["Price"] = f'{clean_price(row["Price"]):.2f}'
        writer.writerow(row)

print(f"Cleaned CSV written to {output_path}")

# Verify output
with open(output_path, "r", encoding="utf-8") as f:
    print(f.read())
▶ Output
Cleaned CSV written to sales_cleaned.csv
Product,Price,Quantity
Widget A,12.50,3
Widget B,24.99,1
Widget C,5.00,10
⚠ CSV Pitfall: newline='' is Not Optional
When opening a CSV file for reading or writing, always pass newline=''. Without it, the file object's newline translation interferes with the csv module's own handling: on Windows each written row ends in \r\r\n, and newlines embedded inside quoted fields are not read back correctly. This bug is subtle because it only shows up on Windows or when fields contain line breaks.
📊 Production Insight
A data pipeline ingested CSV files from a customer's ERP system, generated on a Windows server. The pipeline assumed plain UTF-8 and mishandled the files. Two different problems hide behind this symptom, and they need different fixes. A UTF-8 file with a BOM is fixed by encoding='utf-8-sig', which strips the BOM and decodes the rest as UTF-8. A file that is genuinely cp1252 fails on accented characters under either UTF-8 codec and needs encoding='cp1252'. Probe the encoding first, and use utf-8-sig for UTF-8 files that may carry a BOM.
Rule: never assume the encoding of externally sourced CSV files. Use chardet for detection, and prefer utf-8-sig over utf-8 when a Windows tool may have added a BOM.
🎯 Key Takeaway
Use csv.DictReader — never split manually.
Open CSV files with newline=''.
Handle encoding with utf-8-sig for Windows exports.
🗂 Text Mode vs Binary Mode
When to use each and what changes
Aspect | Text Mode ('r', 'w', 'a') | Binary Mode ('rb', 'wb', 'ab')
Data type returned | str | bytes
Encoding applied | Yes: uses the specified or platform encoding | No: raw bytes only
Newline translation | Yes: '\r\n' becomes '\n' on read by default | No: bytes are untouched
Use case | Logs, config, CSV, JSON, source code | Images, PDFs, audio, executables, pickled data
Error on bad bytes | UnicodeDecodeError if the file isn't valid text in that encoding | Never: all byte sequences are valid
In-memory representation | Decoded str; its size can differ from the file's size on disk | bytes object holding exactly the bytes read

🎯 Key Takeaways

  • Always use 'with open(...) as f:' — it's not just style, it's a resource safety guarantee that prevents file descriptor leaks and ensures buffers are flushed to disk.
  • 'w' mode truncates the file to zero bytes the instant open() is called, before any write() happens — if you need existing content to survive, you want 'a' mode instead.
  • Iterate the file object directly ('for line in file:') rather than calling read() — this keeps memory usage constant for files of any size, which is the difference between a script that scales and one that doesn't.
  • Use pathlib.Path instead of os.path string operations — it's cross-platform, reads like English, and integrates cleanly with open(), glob(), mkdir(), and the rest of the standard library.
  • For CSV files, always open with newline='' and specify encoding explicitly — the defaults cause subtle cross-platform corruption.

⚠ Common Mistakes to Avoid

    Opening a file without 'with' and forgetting to call close()
    Symptom

    Intermittent 'Too many open files' OSError in long-running apps, or writes that never appear on disk because the buffer was never flushed.

    Fix

    Always use with open(...) as f: with no exceptions. If you're in a class and must store the file object, implement __enter__ and __exit__ properly, or use contextlib.closing().

    Using 'w' mode on a file you meant to append to
    Symptom

    The file exists after your script runs but contains only the most recent run's data; all previous data is silently gone.

    Fix

    Ask yourself: should old content survive this write? If yes, use 'a'. If no (you're regenerating the file intentionally), use 'w'. Never default to 'w' without thinking about it.
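A quick way to see the difference for yourself is a sketch like this one, which uses a throwaway `results.txt` (a made-up name) and checks the file size right after `open()` in 'w' mode, before anything is written.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "results.txt"
    data_file.write_text("run 1\n", encoding="utf-8")

    # 'a' keeps what is already there and writes at the end.
    with open(data_file, "a", encoding="utf-8") as f:
        f.write("run 2\n")
    after_append = data_file.read_text(encoding="utf-8")

    # 'w' empties the file as soon as open() returns;
    # no write() call is needed for the old data to vanish.
    with open(data_file, "w", encoding="utf-8"):
        size_after_open = data_file.stat().st_size

print(repr(after_append), size_after_open)
```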

    Reading an entire large file into memory with read()
    Symptom

    Your script works fine on a 10 KB test file, then crashes with MemoryError or causes the server to swap when pointed at a 2 GB production log.

    Fix

    Iterate the file object line by line (for line in file_object:) for text files, or use file.read(chunk_size) in a loop for binary files. This keeps memory usage flat regardless of file size.
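Both patterns might look like the sketch below. The helper names `count_matching_lines` and `sha256_of_file`, the sample log content, and the chunk sizes are assumptions chosen for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def count_matching_lines(path, needle):
    """Stream a text file line by line; only one line is held at a time."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if needle in line:
                count += 1
    return count


def sha256_of_file(path, chunk_size=64 * 1024):
    """Hash a binary file in fixed-size chunks instead of one big read()."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "app.log"
    log_path.write_text(
        "INFO ok\nERROR boom\nINFO ok\nERROR again\n", encoding="utf-8"
    )

    error_count = count_matching_lines(log_path, "ERROR")
    # A tiny chunk size forces several loop iterations on this small file.
    streamed = sha256_of_file(log_path, chunk_size=8)
    whole = hashlib.sha256(log_path.read_bytes()).hexdigest()

print(error_count, streamed == whole)
```

The chunked hash matches the hash of the whole file, so streaming changes the memory profile without changing the result.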

    Not specifying encoding and getting UnicodeDecodeError on a different platform
    Symptom

    A script works on your Mac but throws UnicodeDecodeError when run on a Windows server or vice versa.

    Fix

    Always pass encoding='utf-8' explicitly. For files from external sources, detect encoding with chardet or use encoding='utf-8-sig' to handle BOM.
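The BOM case is easy to reproduce with the standard library alone. In this sketch the file `export.txt` is a stand-in for a file saved by a tool that prepends a UTF-8 byte order mark.

```python
import codecs
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    bom_file = Path(tmp) / "export.txt"
    # Simulate an external export: UTF-8 BOM followed by the text.
    bom_file.write_bytes(codecs.BOM_UTF8 + "naïve".encode("utf-8"))

    # Plain utf-8 decodes fine but leaves the BOM as a stray character.
    with open(bom_file, "r", encoding="utf-8") as f:
        with_bom = f.read()

    # utf-8-sig strips the BOM if it is present.
    with open(bom_file, "r", encoding="utf-8-sig") as f:
        without_bom = f.read()

print(repr(with_bom), repr(without_bom))
```

Reading with `utf-8-sig` is also safe when the file has no BOM, so it is a reasonable choice when you are unsure how an external file was saved.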

    Using `os.path.exists()` before opening a file (TOCTOU race)
    Symptom

    Intermittent crash when a file is deleted or created between the existence check and open(). Especially common in concurrent applications.

    Fix

    Replace the check-then-open pattern (if os.path.exists(path): open(path)) with EAFP style: call open(path) inside a try block and handle FileNotFoundError in the except clause.
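A minimal sketch of the EAFP version, assuming a JSON config file and a made-up `DEFAULTS` dictionary to fall back on:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical fallback settings for the first boot.
DEFAULTS = {"debug": False}


def load_config(path):
    """Try to open the file directly; fall back only if it is missing."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return dict(DEFAULTS)


with tempfile.TemporaryDirectory() as tmp:
    # No file yet: the except branch supplies the defaults.
    missing = load_config(Path(tmp) / "config.json")

    present_path = Path(tmp) / "present.json"
    present_path.write_text('{"debug": true}', encoding="utf-8")
    present = load_config(present_path)

print(missing, present)
```

There is no window between a check and the open() call, because there is no separate check.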

Interview Questions on This Topic

  • Q: What is the difference between opening a file in 'w' and 'a' mode, and what happens to an existing file's content the moment you call open() with 'w'? (Junior)
    'w' mode truncates the file to zero bytes immediately when open() is called — before any write. 'a' mode preserves existing content and positions the write cursor at the end. The key distinction is that truncation happens at open time, not at write time. So if you open in 'w' and then crash, the file is already empty.
  • Q: Why should you use a 'with' statement when working with files in Python, and what specifically happens under the hood when the 'with' block exits, even if an exception is raised? (Mid-level)
    The with statement calls file.__enter__() on entry and file.__exit__() on exit. __exit__ calls file.close(), which flushes buffers and releases the OS file descriptor. Even if an exception occurs, __exit__ is called because with acts as a finally block. This prevents resource leaks and ensures data is flushed to disk. Without with, you rely on the developer remembering to call close() in all code paths — which is error-prone.
  • Q: A production script reads a config file that sometimes doesn't exist on first boot. A junior dev wrote 'if os.path.exists(path): open(path)' to handle this. What is wrong with that approach, and how would you rewrite it correctly? (Senior)
    The problem is a TOCTOU (time of check, time of use) race condition. Between os.path.exists() and open(), another process could delete or rename the file. The correct approach is EAFP: try: open(path) except FileNotFoundError: use_defaults(). This is atomic, faster (no extra stat call), and race-condition-free.
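For the 'with' question, it can help to show the manual try/finally form next to the with form. This is a simplified sketch of the equivalence rather than the exact expansion Python performs, and `notes.txt` is a made-up file name.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "notes.txt"

    # Manual version: close() has to sit in a finally block.
    f = open(path, "w", encoding="utf-8")
    try:
        f.write("first\n")
    finally:
        f.close()
    closed_manually = f.closed

    # with version: __exit__ closes the file even when the block raises.
    try:
        with open(path, "a", encoding="utf-8") as g:
            g.write("second\n")
            raise ValueError("simulated failure")
    except ValueError:
        pass
    closed_by_with = g.closed

    contents = path.read_text(encoding="utf-8")

print(closed_manually, closed_by_with, repr(contents))
```

The second line still reaches the file because close() flushes the buffer on the way out of the block.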

Frequently Asked Questions

What is the difference between read(), readline() and readlines() in Python?

read() loads the entire file into a single string — convenient for small files but dangerous for large ones. readline() fetches exactly one line and moves the cursor forward, useful when you need manual control. readlines() returns a list of all lines as strings. In practice, iterating the file object directly ('for line in file:') is better than all three for large files because it reads one line at a time without loading everything into memory.
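Here is a side-by-side sketch of the four approaches on a three-line scratch file (the file name and contents are invented for the example):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "lines.txt"
    path.write_text("alpha\nbeta\ngamma\n", encoding="utf-8")

    with open(path, encoding="utf-8") as f:
        whole = f.read()          # one string with everything

    with open(path, encoding="utf-8") as f:
        first = f.readline()      # one line, cursor moves forward
        second = f.readline()     # the next line

    with open(path, encoding="utf-8") as f:
        as_list = f.readlines()   # list of all lines, newlines kept

    with open(path, encoding="utf-8") as f:
        # Direct iteration: one line at a time, stripped as we go.
        stripped = [line.rstrip("\n") for line in f]

print(repr(whole))
print(repr(first), repr(second))
print(as_list)
print(stripped)
```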

How do I check if a file exists before opening it in Python?

The Pythonic way is not to check first — just try to open it and catch FileNotFoundError. This avoids a race condition where the file could be deleted between your check and your open(). Use 'try: open(path) except FileNotFoundError: handle_it()'. If you only need to check without opening, Path('yourfile.txt').exists() from pathlib is the cleanest syntax.

Why do I get a UnicodeDecodeError when opening a file that looks like a text file?

This happens when the file's actual encoding doesn't match what Python is using to decode it. Python defaults to the platform encoding (often Windows-1252 on Windows, UTF-8 elsewhere). The fix is to specify the encoding explicitly: open('file.txt', 'r', encoding='utf-8'). If you don't know the file's encoding, install the 'chardet' library and use it to detect the encoding before opening. For files that might contain arbitrary bytes, open in binary mode ('rb') and handle decoding yourself.

Is 'a+' mode safe for reading and appending to the same file?

'a+' opens a file for both reading and appending. The caveat: the file position is at the end for writing, and reading also starts at the end unless you seek. If you need to read first then append, use 'r+' and seek. 'a+' is rarely the right choice — prefer separate open calls for reading and appending.
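The seek caveat is easy to miss, so here is a small sketch of it using a hypothetical `journal.txt`:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "journal.txt"
    path.write_text("old entry\n", encoding="utf-8")

    with open(path, "a+", encoding="utf-8") as f:
        # The position starts at the end, so there is nothing to read yet.
        read_at_open = f.read()

        # Rewind to see the existing content.
        f.seek(0)
        read_after_seek = f.read()

        # The read left the position at the end; the write is appended.
        f.write("new entry\n")

    final = path.read_text(encoding="utf-8")

print(repr(read_at_open), repr(read_after_seek), repr(final))
```

Forgetting the seek(0) call is the usual reason an 'a+' read appears to return an empty file.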

How do I handle binary files like images with Python?

Open them in binary mode with 'rb' for reading or 'wb' for writing. In binary mode, read() returns bytes objects, not strings. Use shutil.copy() for simple copies. For image processing, libraries like Pillow handle the binary format internally.
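As a sketch of the binary workflow without needing a real image, the example below writes the eight-byte PNG signature to a scratch file, then reads and copies it. The file names are assumptions, and the payload is only a stand-in for real image data.

```python
import shutil
import tempfile
from pathlib import Path

# The 8-byte PNG file signature, used here as stand-in binary data.
payload = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "pixel.bin"
    src.write_bytes(payload)

    # 'rb' returns bytes; reading 4 bytes gives the start of the signature.
    with open(src, "rb") as f:
        header = f.read(4)

    # Manual copy with 'rb' and 'wb'.
    manual_copy = Path(tmp) / "manual.bin"
    with open(src, "rb") as fin, open(manual_copy, "wb") as fout:
        fout.write(fin.read())

    # shutil.copy does the same job in one call.
    shutil_copy = Path(tmp) / "shutil.bin"
    shutil.copy(src, shutil_copy)

    copies_match = (
        manual_copy.read_bytes() == payload
        and shutil_copy.read_bytes() == payload
    )

print(header, copies_match)
```

For large files, the manual copy would read in chunks rather than calling read() once; shutil.copy already avoids loading the whole file.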

Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
