Python File I/O — Descriptor Leaks Without `with`
A missing 'with' leaked 2 file descriptors per rotation cycle; after 42 hours the server hit the OS limit of 1,024 and crashed.
20+ years shipping production Python across data and backend systems. Everything here is grounded in real deployments.
- Python file I/O uses open() with a mode string: 'r' (read), 'w' (write/destroy existing content), 'a' (append), 'r+' (read+write without truncation)
- Always use the 'with' statement — it guarantees file closure even when exceptions fire, preventing OS-level file descriptor leaks that silently degrade production systems
- Iterate line-by-line with 'for line in file' for O(1) memory usage — never use read() or readlines() on files larger than your available RAM
- Mode 'w' destroys existing content the instant open() is called with no confirmation and no recovery — use 'a' for appending unless you explicitly need a clean slate
- writelines() does NOT add newlines between items — you must include '\n' in each string yourself or all output merges into one unreadable line
- Biggest production mistake: using 'w' mode when you meant 'a' — months of log data vanishes in a single open() call with no error, no warning, and no undo
Think of a file on your computer like a physical notebook. Opening a file in Python is like picking up that notebook from the shelf. Reading it is like flipping through the pages. Writing to it is like picking up a pen and adding content. Closing it is like putting the notebook back on the shelf so nothing is lost and someone else can use it. The file mode you choose is like choosing what kind of pen you use — one mode adds to what's already written, another erases the entire notebook before you start, and another lets you read without touching anything. Python's 'with' statement is like having an assistant who always puts the notebook back on the shelf for you, even if you get distracted or something goes wrong mid-way through.
Every real-world application eventually needs to talk to the file system. Whether you're saving user preferences, processing a CSV of sales data, writing application logs, reading a configuration file on startup, or building a data pipeline — file I/O is the plumbing that holds software together. Skip this skill and you're building programs that forget everything the moment they stop running.
Before Python's modern file handling existed, developers had to manually track whether files were open, remember to close them after every operation, and write error-handling boilerplate just to read a single line of text safely. One missed close() call could lock a file for the entire session, corrupt data, or exhaust the operating system's limit on open file descriptors — often after hours of flawless operation, which is the worst possible time to discover the bug.
Python solved this with the context manager pattern — the 'with' statement — which handles cleanup automatically regardless of what goes wrong inside the block. It is not a stylistic preference. It is the difference between code that works in a demo environment and code that survives a production server running for weeks.
By the end of this guide you will know how to confidently open, read, write, and append to files. You will understand exactly which file mode to reach for in each situation, why the 'with' statement is non-negotiable in any code you ship, how to process files that are larger than your available RAM, and how to avoid the silent mistakes that corrupt data and confuse experienced developers who should have known better.
Why Python File I/O Demands a Context Manager
Python file I/O is the mechanism for reading from and writing to files on disk via built-in functions like open(), which returns a file object. The core mechanic is that the file object holds a system-level file descriptor — a limited OS resource. Without explicit closure, the descriptor remains open until the object is garbage-collected, which is non-deterministic and can exhaust the process's file descriptor limit (often 1024 per process on Linux).
When you call open(), the OS allocates a file descriptor from a per-process pool. Each read() or write() operation advances an internal cursor. The critical property is that file objects are buffered by default — data may not be flushed to disk until the buffer fills or the file is closed. In CPython, reference counting usually triggers immediate cleanup, but relying on it is fragile: exceptions, early returns, or circular references can delay or prevent closure.
Use the with statement to guarantee deterministic cleanup, even if an exception occurs. In production systems, failing to do so causes 'Too many open files' errors, silent data loss from unwritten buffers, and hard-to-debug resource leaks. The rule is simple: any open() call must be paired with a with block or an explicit close() in a finally clause.
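The rule above can be sketched as follows. This is a minimal illustration, not code from the incident: the function names and the temporary demo file are hypothetical. It shows the two acceptable shapes (close() in a finally clause, and a with block) and checks that a handle is closed even when the block raises.

```python
import os
import tempfile


def count_lines_manual(path):
    """Explicit close() in a finally clause: correct, but easy to get wrong."""
    f = open(path, encoding="utf-8")
    try:
        return sum(1 for _ in f)
    finally:
        f.close()  # runs on normal return and on exceptions


def count_lines_with(path):
    """Same behaviour with a context manager; the file closes when the block exits."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


# Hypothetical demo file in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("alpha\nbeta\ngamma\n")

# The handle is closed even though the block raises before finishing.
try:
    with open(demo_path, encoding="utf-8") as handle:
        raise ValueError("simulated failure mid-processing")
except ValueError:
    pass

print(count_lines_manual(demo_path), count_lines_with(demo_path), handle.closed)
```

Both functions are safe; the with version simply leaves less room for a close() that ends up on the wrong side of a return.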
- Every open() must be inside a with statement, with no exceptions, even for quick scripts.
- Treat every bare open() as a liability.
- Relying on a manual close() is error-prone.
- Buffered data can be lost if the process dies before close(): always flush or close explicitly.
File Modes Explained — Picking the Right Tool Before You Touch the File
Every time you open a file in Python, you are making a contract with the operating system. That contract is defined by the mode string you pass to open(). Get it wrong and you will either overwrite data you meant to keep, get a FileNotFoundError you did not expect, or silently append garbage to a file you thought was clean. There is no undo button and no confirmation prompt.
The four modes you will use in 90% of real work are: 'r' (read only — the file must already exist and you cannot modify it), 'w' (write — creates the file if it does not exist, but destroys all existing content if it does, immediately, with no warning), 'a' (append — adds new content to the end without touching what is already there, and creates the file if it does not exist), and 'r+' (read and write — the file must exist, the cursor starts at position 0, and writing does not truncate existing content — it overwrites bytes at the current cursor position).
There is also a binary variant for each mode: 'rb', 'wb', 'ab', 'r+b'. Use binary mode when working with images, PDFs, audio files, pickled Python objects, or any data that is not human-readable text. Text mode ('r', 'w') automatically handles newline translation across operating systems: on Windows, '\n' in your Python string becomes '\r\n' on disk. Binary mode bypasses all of that, which is exactly what you need when you are working with raw bytes that must be preserved exactly as-is.
There is also 'x' mode — exclusive creation — which creates a new file but raises FileExistsError if the file already exists. This is useful when you need to guarantee you are creating a fresh file and want the operation to fail rather than silently overwrite something. It is the safe alternative to 'w' in situations where overwriting would be a bug rather than an intended operation.
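A short sketch of how the modes behave on one file may help here. The file name and scratch directory are hypothetical; the calls are standard open() usage.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()  # hypothetical scratch directory
path = os.path.join(workdir, "notes.txt")

with open(path, "w", encoding="utf-8") as f:   # creates or truncates
    f.write("first\n")

with open(path, "a", encoding="utf-8") as f:   # appends, never truncates
    f.write("second\n")

with open(path, "r+", encoding="utf-8") as f:  # read+write, cursor at 0
    f.write("FIRST")                           # overwrites 5 characters in place

with open(path, "r", encoding="utf-8") as f:
    after_r_plus = f.read()

try:
    with open(path, "x", encoding="utf-8") as f:  # exclusive creation
        f.write("never written\n")
    x_mode_refused = False
except FileExistsError:
    x_mode_refused = True                         # file already exists

with open(path, "w", encoding="utf-8"):        # truncates on open, nothing written
    pass
size_after_w = os.path.getsize(path)

print(repr(after_r_plus), x_mode_refused, size_after_w)
```

Note that the final 'w' open empties the file without a single write() call, which is the behaviour the warnings in this section are about.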
The single most destructive mistake in Python file I/O is opening a file in 'w' mode when you meant 'a'. Your log file from the last three months? Gone in one open() call. Understand the modes before you write a single open() call — everything else in file I/O builds on top of this foundation.
Mode 'w' truncates the file the instant open() is called, before you write a single character. Python does not ask for confirmation, does not back up the existing content, and does not raise any exception. If you opened the wrong file or used the wrong mode, the data is gone. The only safe reflex: audit every open() call before you ship it. If the intent is to add new content without destroying old content (logs, audit trails, accumulated data), the mode must be 'a', not 'w'. Reserve 'w' exclusively for situations where you deliberately want a clean file: regenerating a report from scratch, creating a new configuration on first run, or rewriting a file whose previous state is no longer relevant.
Mode 'w' truncates the moment open() is called: there is no recovery path and no warning. A single wrong mode string in a log rotation script, a daily report generator, or an API response writer can silently wipe data that took months to accumulate.
Whenever you write a 'w' mode open() call, immediately ask yourself 'do I want to destroy existing content on every run?' If no, change 'w' to 'a'. If yes, add a comment documenting why 'w' is intentional. That comment serves as a speed bump for the next developer who sees it and instinctively wonders if it is a bug.
The 'with' Statement — Why Every Production File Open Uses It
Here is a scenario that breaks real applications: your code opens a file, starts processing its contents, and then raises an unexpected exception halfway through — maybe a network timeout, maybe a malformed record, maybe a KeyError on a dictionary lookup. If you opened the file with a plain open() call and relied on a manual file.close() at the end of the function, that close() never runs. The file handle stays open, the OS-level resource stays allocated, and the process accumulates open file descriptors with every subsequent error.
On many Linux systems, the default soft limit for open file descriptors per process is 1,024, and the macOS default is lower still (commonly 256). That number sounds large until you have a web server handling 200 requests per minute, each of which opens a file without properly closing it on the error path. If every one of those requests leaks a descriptor, you hit a 1,024 limit in roughly five minutes and every subsequent file operation in the entire process starts failing.
The 'with' statement solves this with the context manager protocol. When you enter a 'with' block, Python calls the object's __enter__ method. When the block exits — regardless of whether it exits normally, through a return statement, or because an exception was raised — Python calls the object's __exit__ method, which closes the file handle. Guaranteed. Every time.
This is not a stylistic nicety or a PEP 8 preference. It is the mechanism that makes the difference between code that works in development and code that survives weeks of continuous operation in production. The production incident at the top of this guide happened because one engineer replaced a 'with' statement with a bare open() and a manual close(), the close() was on the wrong side of an early return, and the server degraded silently over 42 hours before crashing hard.
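The early-return failure described above can be sketched like this. The function and file names are hypothetical, and the list of handles is returned only so the demo can inspect them afterwards. One with statement opens both files, and the early return still closes them.

```python
import os
import tempfile


def copy_if_not_empty(src_path, dst_path):
    """Copy src to dst, returning early when src is empty.

    Both handles are closed on every exit path, including the early return.
    """
    opened = []  # kept only so the demo can check .closed afterwards
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        opened += [src, dst]
        first_line = src.readline()
        if not first_line:
            return False, opened   # early return: __exit__ still runs
        dst.write(first_line)
        dst.writelines(src)        # stream the remaining lines
    return True, opened


workdir = tempfile.mkdtemp()
old_log = os.path.join(workdir, "old.log")
new_log = os.path.join(workdir, "new.log")

with open(old_log, "w", encoding="utf-8"):
    pass                           # create an empty log
copied_empty, handles_empty = copy_if_not_empty(old_log, new_log)

with open(old_log, "w", encoding="utf-8") as f:
    f.write("GET /health 200\nGET /login 500\n")
copied_full, handles_full = copy_if_not_empty(old_log, new_log)

print(copied_empty, copied_full, all(h.closed for h in handles_empty + handles_full))
```

With a bare open() and a close() placed after the `if not first_line` check, the empty-file path would skip both close() calls, which is the shape of the leak in the incident.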
When a single 'with' statement opens several files, the files already opened are closed even if a later open() raises an exception. The files close in reverse order of opening (dst closes first, then src), which is the safe order for copy and merge operations.
A bare open() without 'with' leaks file descriptors on every early return and every unhandled exception, not just on catastrophic failures. A function that returns early when input is invalid, skips close() on that path, and gets called a thousand times per hour will exhaust the OS file descriptor limit in under two hours with no error until the cliff.
Every bare open() call in production code is a potential file descriptor leak. Leaks are silent until the OS hard limit is hit, at which point the entire process fails simultaneously with no grace period.
- Avoid bare open() calls paired with a manual close().
- A bare open() is a latent leak that only manifests under error conditions or high load.
Reading Strategies — read vs readline vs readlines vs Iteration
Python gives you four distinct ways to read file content, and picking the wrong one for your data size is one of the most common and most avoidable performance mistakes in Python scripts. The good news: the selection rule is simple once you understand what each method actually does.
file.read() pulls the entire file into a single string in memory. It is convenient for small configuration files, templates, and hash calculations, but loading a 2GB log file into a string will consume at least 2GB of RAM and potentially kill the process. read() is correct for files you know are bounded in size, a few megabytes at most.
file.readlines() reads the entire file and returns a list of strings, one string per line, each with its trailing newline character included. It has the same total memory cost as read() because everything loads at once. The advantage is that you get random line access: all_lines[47] gives you the 48th line (indexes start at 0) without reading anything else. Use it when you genuinely need index-based line access. Rarely needed in practice.
file.readline() reads exactly one line and advances the cursor. Each call returns the next line. Useful for reading a header row separately, implementing state machines over file content, or when you need fine-grained control over which lines you process. Low overhead per call but verbose for processing entire files.
Iterating over the file object directly, with for line in file, is the correct default for almost everything. Python reads the file in buffer-sized chunks (typically 8KB) and yields one line at a time. As long as individual lines are of bounded length, your memory usage stays flat regardless of whether the file is 10MB or 10GB. This is how you process large files without ever thinking about RAM.
For writing, file.write() takes a single string and writes it exactly as given, with no automatic newlines added. file.writelines() takes an iterable of strings and writes each one in sequence, also with no automatic newlines added. The writelines() trap is subtle: if you forget to include '\n' in your strings, all your lines are concatenated into one continuous stream with no separators, and the output looks nothing like what you intended.
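The four reading styles and the writelines() behaviour can be compared side by side. This is a small sketch on a hypothetical four-line file; the record format is made up for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "events.log")  # hypothetical file
records = ["id,status", "1,ok", "2,failed", "3,ok"]

with open(path, "w", encoding="utf-8") as f:
    # writelines() adds no separators, so append "\n" to each item.
    f.writelines(record + "\n" for record in records)

with open(path, encoding="utf-8") as f:
    header = f.readline().rstrip("\n")    # one line, cursor advances
    failed_ids = []
    for line in f:                         # streams the remaining lines
        record_id, status = line.rstrip("\n").split(",")
        if status == "failed":
            failed_ids.append(record_id)

with open(path, encoding="utf-8") as f:
    whole_text = f.read()                  # entire file as one string

with open(path, encoding="utf-8") as f:
    all_lines = f.readlines()              # entire file as a list of lines

print(header, failed_ids, len(whole_text), all_lines[1])
```

Only the middle block scales to files larger than memory; the last two are shown for contrast and are fine here because the file is tiny.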
When asked how to process a file larger than RAM, the answer is to iterate with 'for line in file' rather than call file.read() and file.readlines(), which load the entire file into RAM before you can process any of it. This answer demonstrates that you understand the distinction between loading data and streaming data, which is fundamental to building production data pipelines.
file.read() and file.readlines() are O(n) in memory where n is file size: a 5GB file needs 5GB of RAM before you process a single record. Line-by-line iteration is O(1) in memory regardless of file size.
For files that may exceed available RAM, never use read() or readlines(). Use line-by-line iteration. This is not a premature optimization; it is the difference between a script that works on your laptop and one that works in a container with 512MB of memory.
file.read() and file.readlines() are convenience methods for small, bounded files only. They load the entire file into RAM before you process any of it.
- file.read(): simple, fast, appropriate when the file size is bounded and known
- file.readlines(): returns a list you can index into, but loads the entire file into RAM
- file.readline() for the header, then switch to 'for line in file' for the body
- file.writelines() for writing many lines, but include '\n' in each string explicitly; writelines() adds nothing between items
Real-World Pattern — Building a Persistent Task Manager with File I/O
Reading and writing individual lines is one thing. Putting it together into a coherent application that correctly handles all the edge cases is what separates tutorial knowledge from practical production skill. Let's build a minimal persistent task manager — one that saves tasks to a file, loads them correctly on startup, marks them complete, and never loses data between runs or between failures.
This exact pattern appears throughout production codebases: shopping cart persistence, user preference files, application state caches, CI pipeline checkpoint files, and configuration management tools. The core loop is always the same — load state from disk at startup, modify in memory, write back to disk when state changes.
Two deliberate design decisions in this implementation are worth understanding. First, we use 'a' mode for adding tasks: it is non-destructive and safe to call repeatedly (concurrent writers from several processes still need the locking covered later in this guide). Second, we use 'w' mode when marking a task complete, because there is no efficient way to delete or modify a line in the middle of a file without rewriting it. The read-modify-write pattern (load all records into memory, change what needs changing, write everything back) is the standard approach for file-based persistence with small-to-medium datasets.
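A minimal sketch of such a task manager, under stated assumptions: the "[ ] " / "[x] " line format, the function names, and the file location are all hypothetical choices for illustration, and task descriptions are assumed to contain no newlines.

```python
import os
import tempfile

# Hypothetical location; a real application would use a stable path.
TASK_FILE = os.path.join(tempfile.mkdtemp(), "tasks.txt")


def add_task(description, path=TASK_FILE):
    """Append one pending task; existing lines are untouched ('a' mode)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"[ ] {description}\n")


def load_tasks(path=TASK_FILE):
    """Return all task lines without trailing newlines; [] if no file yet."""
    try:
        with open(path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f]
    except FileNotFoundError:
        return []


def complete_task(index, path=TASK_FILE):
    """Read-modify-write: load everything, change one line, rewrite the file."""
    tasks = load_tasks(path)
    tasks[index] = tasks[index].replace("[ ]", "[x]", 1)
    with open(path, "w", encoding="utf-8") as f:   # intentional truncation
        f.writelines(task + "\n" for task in tasks)


tasks_before = load_tasks()          # no file yet
add_task("write report")
add_task("review pull request")
complete_task(0)
tasks_after = load_tasks()
print(tasks_after)
```

The 'w' open in complete_task() is the one place where truncation is intended, which is why it carries a comment.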
For production deployments where the file could be large or where a crash mid-write would be unacceptable, the safe extension of this pattern is to write to a temporary file first, verify the write succeeded, and then use os.replace() to atomically swap the temporary file into place. This guarantees you never end up with a half-written, corrupted file — the swap is atomic at the OS level.
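The temporary-file-then-swap approach can be sketched as below. The helper name atomic_write_text is hypothetical; the sketch assumes the temporary file lives in the same directory as the target, since the rename is only atomic within one filesystem.

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Write text to a temp file next to path, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())    # push the bytes to the storage device
        os.replace(tmp_path, path)    # atomic rename onto the target
    except BaseException:
        # The target is untouched; remove the leftover temp file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "state.txt")
atomic_write_text(target, "version 1\n")
atomic_write_text(target, "version 2\n")

with open(target, encoding="utf-8") as f:
    final_text = f.read()
print(final_text, os.listdir(workdir))
```

One caveat with this sketch: mkstemp() creates the file with owner-only permissions, so the replaced target inherits those. Adjust with os.chmod() if the original permissions matter.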
The swap performed by os.replace() is atomic: the target is guaranteed to be either the old version or the new version, never a partial write. This is not over-engineering; it is the standard practice for any file that contains data you cannot afford to lose.
The temp-file-plus-os.replace() pattern adds crash safety with three lines of additional code. Write to a .tmp file. If that succeeds, call os.replace(). If the process dies before os.replace() runs, the target file is untouched and you still have the old data. If it dies after os.replace() runs, the target has the new data. There is no window where the file is partially written. For any file you care about in production, this is the correct implementation.
Write to a temporary file, then call os.replace() for an atomic swap. This eliminates the risk of half-written corrupt files from process crashes during the write step.
Buffering: The Silent Performance Killer in Production File Writes
Most tutorials treat write() as if it hits the disk instantly. It does not. Python buffers writes in memory and flushes them in chunks. This is great for batch throughput and terrible when you need durability: if your process crashes between buffer flushes, that data is gone. By default, binary files use a fixed-size buffer (io.DEFAULT_BUFFER_SIZE, typically 8192 bytes), and text files follow the same chunked policy unless they are attached to an interactive terminal, in which case they are line-buffered. That means a single write('hello') can sit in your process's memory until the buffer fills, flush() is called, or the file is closed. You can control this with the buffering parameter in open(), but don't just set it to 0 for every file: buffering=0 is only allowed in binary mode, and it tanks performance because every write becomes a system call. The real trick: call flush() after critical writes, and follow it with os.fsync(file.fileno()) if you need the OS to commit the data to the storage device. Tradeoffs everywhere. Know your failure domain before choosing.
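To make the buffering behaviour concrete, here is a small sketch. The helper name append_durably and the audit log path are hypothetical. It writes one line, measures the file size before and after flush(), and shows the flush-then-fsync sequence for a critical write.

```python
import os
import tempfile


def append_durably(path, line):
    """Append one line and ask the OS to commit it to stable storage."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
        f.flush()               # Python's buffer -> OS page cache
        os.fsync(f.fileno())    # OS page cache -> storage device


audit_path = os.path.join(tempfile.mkdtemp(), "audit.log")  # hypothetical path
append_durably(audit_path, "user=42 action=login")

# A small write normally stays in Python's buffer until flush() or close().
with open(audit_path, "a", encoding="utf-8") as f:
    f.write("user=42 action=logout\n")
    size_before_flush = os.path.getsize(audit_path)
    f.flush()
    size_after_flush = os.path.getsize(audit_path)

print(size_before_flush, size_after_flush)
```

fsync() on every line is expensive, so a common compromise is to reserve it for records whose loss would be unacceptable and let ordinary log lines rely on flush() alone.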
Never assume a write() call persists data to disk. Always flush() before a long computation that might crash, or use unbuffered binary mode (buffering=0) or line buffering (buffering=1 in text mode) for mission-critical audit trails, but measure the performance cost first.
Encoding Errors Will Corrupt Your Data — Handle Them or Get Paged at 3 AM
Opening a text file without an explicit encoding is playing Russian roulette with your data. On Linux, most files are UTF-8, but eventually someone will pipe a Latin-1 log or a Windows-1252 document into your pipeline. When you omit the encoding argument, Python 3 falls back to the platform's locale encoding, which is usually UTF-8 on Linux and macOS but is often a legacy code page such as cp1252 on Windows, so pass encoding='utf-8' explicitly. With the default errors='strict', reading raises UnicodeDecodeError on bytes it can't decode. Your process dies. The file is half-read. Production goes down. The fix: choose an error handler deliberately. errors='replace' swaps undecodable bytes with the Unicode replacement character (U+FFFD), preserving the rest of your data. errors='surrogateescape' keeps the raw bytes as lone surrogates so you can reconstruct them later, which is useful if you're copying files without caring about content. My rule: use errors='replace' for log parsing, errors='strict' (the default) for data you can validate, and errors='surrogateescape' for mostly-text data whose original bytes must round-trip unchanged. Never leave encoding to chance.
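The three error handlers can be compared on one deliberately mis-encoded file. This is a sketch with a hypothetical file name: the bytes are written as Latin-1 and then read back as UTF-8.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "legacy.log")  # hypothetical file
with open(path, "wb") as f:
    f.write("caf\xe9 ok\n".encode("latin-1"))   # byte 0xE9 is invalid UTF-8 here

try:
    with open(path, encoding="utf-8") as f:      # errors="strict" is the default
        f.read()
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

with open(path, encoding="utf-8", errors="replace") as f:
    replaced = f.read()                          # bad byte becomes U+FFFD

with open(path, encoding="utf-8", errors="surrogateescape") as f:
    escaped = f.read()                           # bad byte kept as a lone surrogate

# surrogateescape lets the original bytes be recovered exactly.
round_tripped = escaped.encode("utf-8", errors="surrogateescape")

print(strict_failed, repr(replaced), round_tripped)
```

errors='replace' is lossy (the original byte cannot be recovered from U+FFFD), which is acceptable for log parsing but not for data you intend to write back.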
Always pass encoding and errors explicitly to open() unless you want your file processing to explode on corrupt input.
File Locking — Why Multiple Processes Writing the Same File Is a Disaster
Python's open() does not lock files. Two processes can write to the same file simultaneously, and you'll get interleaved lines. Log files become garbage. Configuration files overwrite each other. If you're building a multi-process system — and you are, whether you know it or not — you need explicit file locking. The fcntl module gives you flock() on Unix. It's advisory, so all writers must cooperate. Windows uses msvcrt.locking(), which is mandatory. But here's the kicker: not all filesystems support flock(), and on NFS, it's a coin flip. The real-world pattern is a lock file: a separate file whose mere existence signals 'busy'. Create it with os.open() and O_CREAT | O_EXCL for atomic creation. If the file exists, your process waits or fails. Clean up the lock file with a try/finally so it doesn't orphan. Locks are boring. But they're the difference between a system that works and one that silently corrupts data.
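The lock-file pattern described above might look like the following sketch. The context manager name lock_file and the lock path are hypothetical. It fails immediately instead of waiting, and it does not handle stale locks left behind by a process that was killed before its finally clause ran.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def lock_file(lock_path):
    """Hold an exclusive lock represented by the existence of lock_path.

    Raises FileExistsError if another holder already created the file.
    """
    # O_CREAT | O_EXCL makes "create only if absent" a single atomic step.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode("ascii"))  # record the owner
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)   # always release, even if the body raised


lock_path = os.path.join(tempfile.mkdtemp(), "report.lock")  # hypothetical

with lock_file(lock_path):
    held = os.path.exists(lock_path)
    try:
        with lock_file(lock_path):       # second acquisition must fail
            second_acquired = True
    except FileExistsError:
        second_acquired = False

released = not os.path.exists(lock_path)
print(held, second_acquired, released)
```

Writing the PID into the lock file is a common convention that lets an operator see who holds the lock; a production version would usually add a retry loop with a timeout.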
Check File Properties Before You Touch Them — Avoid Silent Failures
You don't open a connection without checking credentials. Same rule applies to files. Production code must verify existence, size, and permissions before reading or writing. Otherwise, you get cryptic stack traces at 2 AM.
os.path and pathlib give you the tools. os.path.exists() tells you if the file is there. os.path.getsize() checks if it's empty; a zero-byte file will break parsers silently. os.access() validates read/write permissions before you commit.
Never assume the file is there just because your config says so. Cron jobs delete logs. Mounts fail. Permissions change. Validate upfront or get paged for a file-not-found error that should have been caught at startup.
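A startup check along these lines could be sketched as follows. The function name preflight and the problem labels are hypothetical; the os.path and os.access calls are the ones named above.

```python
import os
import tempfile


def preflight(path):
    """Return a list of problems found with path; an empty list means it looks usable."""
    if not os.path.exists(path):
        return ["missing"]
    problems = []
    if not os.path.isfile(path):
        problems.append("not a regular file")
    if os.path.getsize(path) == 0:
        problems.append("empty")
    if not os.access(path, os.R_OK):
        problems.append("not readable")
    return problems


workdir = tempfile.mkdtemp()
missing_cfg = os.path.join(workdir, "missing.cfg")
empty_cfg = os.path.join(workdir, "empty.cfg")
good_cfg = os.path.join(workdir, "good.cfg")

with open(empty_cfg, "w", encoding="utf-8"):
    pass
with open(good_cfg, "w", encoding="utf-8") as f:
    f.write("debug=false\n")

results = [preflight(p) for p in (missing_cfg, empty_cfg, good_cfg)]
print(results)
```

These checks are a snapshot: the file can still vanish or change between the check and the open(), so the code that actually opens it should keep handling FileNotFoundError and PermissionError.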
Check os.path.islink() first if you support symlinks.
Pick the Right File Mode — Or Watch Your Data Get Truncated
File modes aren't optional decoration. They define the contract between your code and the OS. Mix them up and you overwrite production logs, corrupt binary files, or crash on Windows line endings.
'r' for reading text. 'rb' for reading bytes — use this for images, archives, anything non-UTF-8. 'w' truncates the file on open — if that file is critical, you just lost it. 'a' appends without destroying existing data. 'x' fails if the file exists — perfect for lock files or run-once logs.
Don't default to 'w' because it's easy. Default to 'a' for logs, 'x' for safety, and 'rb' for any binary payload. Your future self — and your on-call rotation — will thank you.
Use 'x' for anything that should only be written once, like PID files or initial config. It avoids accidental overwrites.
Tips and Tricks — Avoid Common Pitfalls in File I/O
Most file I/O bugs stem from forgetting that files are iterators, not lists. When you call read() on a large file, Python loads it entirely into memory, which means a fast OOM crash on a 10GB log file. Instead, iterate over the file object line by line: for line in file:. This streams data from disk, keeping memory constant. Another killer: mixing read calls on the same handle. After read() moves the cursor to the end of the file, readline() returns empty strings. Always rewind with file.seek(0) if you must reread. For writing, flush strategically. flush() guarantees data reaches the OS buffer, but not the disk; call os.fsync(file.fileno()) for durability. When joining paths, never concatenate strings; use pathlib.Path for cross-platform correctness. Finally, don't ignore the return value of file.write(). It returns the number of characters written in text mode and the number of bytes in binary mode; with an unbuffered binary file (buffering=0), a short write means partial output you'll debug at 3 AM.
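A few of these tips in one short sketch, using a hypothetical log path: joining paths with pathlib, checking the value returned by write(), and rewinding with seek(0) after a read has moved the cursor to the end.

```python
import tempfile
from pathlib import Path

# Join path components with "/" instead of concatenating strings.
log_path = Path(tempfile.mkdtemp()) / "logs" / "app.log"   # hypothetical path
log_path.parent.mkdir(parents=True, exist_ok=True)

with open(log_path, "w+", encoding="utf-8") as f:
    written = f.write("line one\n")   # text mode: number of characters written
    f.seek(0)
    first_pass = f.read()             # cursor is now at end of file
    exhausted = f.readline()          # "" because nothing is left to read
    f.seek(0)                         # rewind before reading again
    second_pass = f.readline()

print(written, repr(first_pass), repr(exhausted), repr(second_pass))
```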
Avoid read() for large files; always check write returns and use pathlib.
Don’t Re-Invent the Snake — Use Python’s Built-in File Utilities
Python ships with battle-tested file utilities that developers blindly rewrite. shutil.copy2(src, dst) preserves metadata in one call — not five lines of open/read/write. Need to walk directories? os.walk() yields (root, dirs, files) tuples; don't build recursive finders from scratch. Temporary files require the tempfile module for atomic cleanup: tempfile.NamedTemporaryFile() auto-deletes when closed, preventing orphaned temp data. For config files, use json.dump() with indent=2 for readability or configparser for INI formats — never hand-parse. The filecmp module compares files byte-by-byte or shallowly without opening them yourself. Most critically, use io.StringIO and io.BytesIO for in-memory file-like objects. This lets you test I/O logic without touching disk — your unit tests stay fast and deterministic. Each of these tools solves a real production problem that reimplementing will get wrong.
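As an illustration of the in-memory testing idea, the sketch below passes an io.StringIO object to a function that only iterates over its argument. The function name count_errors and the log lines are hypothetical.

```python
import io


def count_errors(stream):
    """Count lines containing 'ERROR' in any iterable text stream."""
    return sum(1 for line in stream if "ERROR" in line)


# No file on disk: StringIO behaves like an open text file.
fake_log = io.StringIO("INFO start\nERROR disk full\nERROR retry failed\n")
error_count = count_errors(fake_log)
print(error_count)
```

Because count_errors() accepts any object that yields lines, the same function works unchanged on a real file opened with 'with open(...) as f'.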
Don’t Re-Invent the Snake
Python's standard library is packed with file utilities that handle edge cases you haven't imagined. Instead of writing brittle loops to read a configuration file or parse CSV data, reach for configparser, csv, or json. These modules are battle-tested and take care of quoting, escaping, and error reporting for you. Rolling your own parser for a structured file format is not just wasted effort, it's a reliability risk. For text processing, pathlib offers methods like read_text() and write_text() that open, read or write, and close the file in a single call and accept an explicit encoding argument. When dealing with compressed logs, gzip.open() or bz2.open() work transparently; don't compress manually. The rule is simple: if Python ships with a module for your file format, use it. Your code becomes shorter, more robust, and harder for bugs to hide in.
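A brief sketch of three of these modules working together, with hypothetical file names and data: csv handles a field containing commas and quotes, and gzip.open() in text mode lets json write straight into a compressed file.

```python
import csv
import gzip
import json
import os
import tempfile

workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "people.csv")      # hypothetical files
gz_path = os.path.join(workdir, "summary.json.gz")

# newline="" is what the csv module expects for files it reads or writes.
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "note"])
    writer.writerow(["Ada", 'likes commas, and "quotes"'])

with open(csv_path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

with gzip.open(gz_path, "wt", encoding="utf-8") as f:
    json.dump({"rows": len(rows)}, f)

with gzip.open(gz_path, "rt", encoding="utf-8") as f:
    summary = json.load(f)

print(rows[0]["note"], summary)
```

A hand-rolled line.split(",") would have broken on the embedded comma in the note field, which is the kind of edge case the paragraph above refers to.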
You’re a File Wizard Harry!
Mastering file I/O means wielding Python's magic tools (seek(), tell(), and memory-mapped files) to control exactly how and where data is read or written. seek() lets you jump to any byte position in a file, giving you random access instead of a front-to-back scan. Use tell() to bookmark positions for later resumption. For massive files, mmap exposes file contents as a mutable byte array, enabling in-place edits without copying whole files, which is ideal for record-based binary formats or writing a custom index. This wizardry reduces memory pressure and speeds up operations like searching sorted logs or patching headers. But caution: one wrong seek can corrupt structured data. Always test with known offsets, and use context managers so the file and any memory map are closed cleanly; the cursor stays wherever the last operation left it until you seek() again. With these spells, you transcend basic line-by-line reading.
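Here is a small sketch of random access on a fixed-width binary file. The 8-byte record size, the padding character, and the file name are assumptions made for the example.

```python
import mmap
import os
import tempfile

RECORD_SIZE = 8   # hypothetical fixed-width record length in bytes
path = os.path.join(tempfile.mkdtemp(), "records.bin")

with open(path, "wb") as f:
    for name in (b"alpha", b"bravo", b"delta"):
        f.write(name.ljust(RECORD_SIZE, b"."))   # pad each record to 8 bytes

with open(path, "r+b") as f:
    f.seek(1 * RECORD_SIZE)            # jump straight to the second record
    second = f.read(RECORD_SIZE)
    position = f.tell()                # byte offset just after that record

    # Memory-map the whole file and patch the third record in place.
    with mmap.mmap(f.fileno(), 0) as mapped:
        mapped[2 * RECORD_SIZE:3 * RECORD_SIZE] = b"gamma".ljust(RECORD_SIZE, b".")
        mapped.flush()

with open(path, "rb") as f:
    contents = f.read()

print(second, position, contents)
```

Slice assignment on a memory map must keep the length unchanged, which is why the replacement record is padded to RECORD_SIZE before it is written.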
Production Web Server Crashes After Exhausting OS File Descriptor Limit
The log rotation code used a bare open() call without a 'with' statement. Each rotation cycle, triggered every five minutes by a cron job, opened the old log file for reading and a new log file for writing. The manual close() calls were placed after a conditional return statement that fired when the old log file was detected as empty. When the rotation correctly identified an empty log, it returned early and skipped both close() calls. Two file handles leaked every five minutes. After approximately 500 rotation cycles, roughly 42 hours of cumulative uptime, the process hit the OS hard limit of 1,024 open file descriptors and all subsequent file operations failed simultaneously. The server had been leaking silently the entire time with no warning.
The fix replaced all bare open() calls in the log rotation module with 'with' statements. The context manager guarantees __exit__ is called and the file is closed regardless of whether the function returns early, raises an exception, or completes normally. The team also raised the process ulimit to 4,096 so that future leaks can be caught before they cascade: ulimit -n 4096. Finally, they added a Prometheus gauge monitoring the open file descriptor count at the process level, compared against the limit reported by os.sysconf('SC_OPEN_MAX'), with an alert threshold at 80% of the limit so the team gets warning long before the next hard failure.
- Every open() call without a 'with' statement is a potential file descriptor leak. Even when close() exists, early returns and exceptions can bypass it entirely, and the OS will not warn you until the hard limit is hit
- File descriptor leaks are silent by design. The OS does not throttle or warn you as you approach the limit; it simply fails all at once when you cross it, at which point every file operation in the process fails simultaneously
- Monitor open file descriptors in production as a first-class metric: on Linux, ls /proc/<pid>/fd | wc -l gives you the count, and lsof -p <pid> | wc -l gives a rough count (it includes a header line and non-descriptor entries); a count that grows monotonically over hours is a leak
- Raising ulimits proactively for file-heavy services buys time for alerting to catch leaks before they become incidents, but it is not a substitute for fixing the leak
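The descriptor monitoring described above can also be done from inside the process. The sketch below is an assumption-laden illustration, not the team's actual gauge: the helper name fd_usage is hypothetical, it relies on /proc/self/fd (Linux) and os.sysconf (Unix-like systems), and it returns None for the count where /proc is unavailable.

```python
import os


def fd_usage():
    """Return (open_fd_count, limit) for the current process.

    The count comes from /proc/self/fd, which exists on Linux; it is None
    on systems without that directory. The listing itself briefly uses one
    descriptor, so treat the number as approximate.
    """
    limit = os.sysconf("SC_OPEN_MAX")
    try:
        open_count = len(os.listdir("/proc/self/fd"))
    except FileNotFoundError:
        open_count = None
    return open_count, limit


before, limit = fd_usage()
leaked = [open(os.devnull) for _ in range(5)]   # deliberately held open
during, _ = fd_usage()
for handle in leaked:
    handle.close()
after, _ = fd_usage()

print(before, during, after, limit)
```

Exporting open_count / limit as a gauge and alerting at 0.8 matches the threshold mentioned in the incident write-up; how the value is exported depends on the metrics library in use.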
'Too many open files' errors: replace every bare open() call with a 'with' statement. The pattern is almost always a code path that returns early or raises an exception before reaching the manual close() call.
Existing data disappears after a write: mode 'w' truncates the file the moment open() is called, and there is no confirmation and no recovery. Search your codebase for open(filepath, 'w') or open(filepath, 'w+') in any context where you intend to preserve existing content. Change those to 'a' for append-only access.
Memory grows while processing a large file: the code is probably calling file.read() or file.readlines(), both of which load the entire file into RAM before you can process any of it. Switch to line-by-line iteration: for line in file. Python reads the file in OS-level buffer chunks and yields one line at a time. Memory usage stays constant at a few kilobytes regardless of whether the file is 50MB or 50GB.
Suspected file descriptor leak: count the open descriptors with ls -la /proc/<pid>/fd/ | wc -l, list what they point to with lsof -p <pid> | sort -k9 | head -50, then replace bare open() calls with 'with' statements in the module that opens those files. Verify the fix by watching the fd count over several minutes: watch -n 5 'ls /proc/<pid>/fd | wc -l'
Key takeaways
- Every bare open() call in production code is a potential file descriptor leak that accumulates silently until the OS hard limit is hit and everything fails simultaneously.
- Only use read() or readlines() when the file size is bounded, known, and genuinely small.
Common mistakes to avoid
Using 'w' mode instead of 'a' mode when the intent is to add data to an existing file
Review every open() call with 'w' mode and ask explicitly: 'Do I want to destroy all existing content on every run?' If no, change 'w' to 'a'. A useful codebase-wide check: grep -rn "open(.*'w'" to find all write-mode calls and verify each one is intentionally destructive.
Calling file.read() or file.readlines() on a file that can grow beyond available RAM
Only use read() or readlines() when the file size is bounded, known, and small.
Forgetting to strip newline characters after reading lines, causing silent string comparison failures
Strip each line as soon as you read it, with raw_line.strip(). Make it reflexive rather than something you remember case-by-case.
Using bare open() without 'with' in any code path that runs in production
Manual close() calls exist but are placed after conditional returns or exception handlers that skip them. Replace every bare open() call with 'with open(...) as f'. The context manager guarantees __exit__ is called on all exit paths: normal return, exception, and early return all trigger file closure. This is not optional refactoring; it is the minimum standard for production Python file code.
Using writelines() without including '\n' in each string, expecting automatic line separation
Include '\n' in every string you pass to writelines(): summary_lines = ['First line\n', 'Second line\n']. Alternatively, use a generator expression that adds '\n': file.writelines(line + '\n' for line in data). Never rely on writelines() to add separators. It does not and will not.
Interview Questions on This Topic
What is the difference between opening a file in 'r+' mode and 'w+' mode, and when would you choose one over the other?
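'r+' opens an existing file for both reading and writing without truncating it. The cursor starts at position 0 and the existing content is left untouched by the open() call. Writing overwrites bytes starting at the current cursor position rather than erasing everything.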
'w+' opens for both reading and writing but truncates the file to zero bytes on open — all existing content is destroyed immediately, exactly like 'w' mode. The file is created if it does not exist. You can read from it, but only content that you write in the current session is available to read back.
Choose 'r+' when you need to read an existing file and selectively update it — for example, reading a configuration file, modifying a specific field, and writing the updated value back while preserving surrounding content. The file must exist, and you get a FileNotFoundError as a safety net if it does not.
Choose 'w+' when you need a scratch space: you write data, then read it back before doing something with it, such as generating a report and verifying it before sending it to a downstream system. The file starts empty every time, which is intentional.
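The difference between the two modes can be checked with a short sketch. The config file name and its contents are hypothetical.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "config.txt")   # hypothetical file

with open(path, "w", encoding="utf-8") as f:
    f.write("mode=dev\nport=8080\n")

with open(path, "r+", encoding="utf-8") as f:
    f.write("mode=prd")          # same length as "mode=dev": overwrites in place
    f.seek(0)
    after_r_plus = f.read()      # the port line is still there

with open(path, "w+", encoding="utf-8") as f:
    seen_on_open = f.read()      # "" because w+ truncated the file on open
    f.write("fresh\n")
    f.seek(0)
    after_w_plus = f.read()

try:
    with open(os.path.join(workdir, "missing.txt"), "r+", encoding="utf-8"):
        r_plus_created_file = True
except FileNotFoundError:
    r_plus_created_file = False  # r+ never creates a file

print(repr(after_r_plus), repr(seen_on_open), repr(after_w_plus), r_plus_created_file)
```

Writing a replacement of a different length in 'r+' mode would leave stray bytes or clobber the next field, which is why in-place edits are usually limited to fixed-width data.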