Python Iterator Exhaustion — Silent Data Drop in ETL
Generator exhaustion silently dropped 50% of ETL records.
20+ years shipping production Python across data and backend systems. Lessons pulled from things that broke in production.
- An iterable has __iter__() that returns an iterator
- An iterator has __next__() that yields items and raises StopIteration when done
- Every Python for-loop calls iter() once then next() in a loop until StopIteration
- Lists, tuples, strings are iterables; file objects are iterators
- Generator functions are iterator factories written with yield: same protocol, less code
- Reusing an iterator causes silent empty loops — always pass the iterable
Imagine a deck of playing cards. The deck itself is the iterable — it's the thing that holds all the cards. The dealer who picks up one card at a time, remembers where they left off, and hands each card to you one by one? That's the iterator. You never grab the whole deck at once — you get one card, use it, then ask for the next. Python's for-loops work exactly this way, quietly using an invisible dealer every single time.
Every Python developer uses for-loops from day one, but almost nobody stops to ask: how does Python actually know what to do next on each loop cycle? The answer lives inside a two-method protocol — __iter__ and __next__ — and once you understand it, you'll see it everywhere: in file reading, database cursors, API pagination, and streaming data pipelines. This isn't just academic knowledge; it's the engine under the hood of the language itself.
The problem this solves is memory and control. If Python loaded every item from a collection into memory before you could loop over it, working with a 10-million-row CSV file or an infinite sequence of sensor readings would be impossible. Iterators let you process one item at a time, on demand, without ever needing to know how many items exist in total. This lazy evaluation is what makes Python practical for real data-engineering work.
By the end of this article you'll be able to explain exactly what happens when Python executes a for-loop, write your own custom iterator class from scratch, spot the difference between an iterable and an iterator in a code review, and avoid the subtle bugs that trip up even experienced developers when they assume an iterator can be reused.
Why Iterator Exhaustion Is a Silent Data Bug
An iterator in Python is an object that implements __iter__() and __next__(), producing values one at a time and raising StopIteration when exhausted. An iterable is any object that can return an iterator via iter(). The core mechanic: iterators are stateful. They track position and can only be traversed once. This single-pass design is memory-efficient (O(1) space) but introduces a hidden failure mode when code assumes reusability.
In practice, passing an iterator to multiple consumers (e.g., two list() calls, or a for loop followed by sum()) silently yields empty results after the first consumption. The iterator doesn't reset; it stays exhausted. This contrasts with iterables like lists, which produce fresh iterators each time. The distinction matters because many built-in functions (map, filter, zip) and all generator functions return one-shot iterators rather than reusable collections.
Use iterators when processing large or infinite streams where memory is constrained. But never assume an iterator can be reused. In ETL pipelines, this mistake drops entire datasets without errors — the second consumer simply sees zero rows. Always convert to a concrete collection (list, tuple) if you need multiple passes, or restructure to a single-pass pattern.
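The failure mode described above can be reproduced in a few lines. This is a minimal sketch; the records list is hypothetical sample data, not taken from any real pipeline.

```python
# Hypothetical sample data standing in for ETL records.
records = [3, 1, 4, 1, 5]

# map() returns a one-shot iterator, not a reusable collection.
doubled = map(lambda value: value * 2, records)

first_pass = list(doubled)   # consumes every item
second_pass = list(doubled)  # already exhausted: no error, no data

print(first_pass)   # [6, 2, 8, 2, 10]
print(second_pass)  # []

# Materializing once makes multiple passes safe.
materialized = list(map(lambda value: value * 2, records))
total = sum(materialized)
count = len(materialized)
print(total, count)  # 28 5
```

Note that the second list() call raises nothing, which is exactly why this kind of bug tends to surface only as missing rows downstream.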
iter() on an iterator returns the iterator itself, so calling it on an exhausted iterator does not give you a fresh start.
The Two-Protocol System: Iterable vs Iterator
Python draws a firm line between two roles. An iterable is any object that knows how to produce an iterator — it has an __iter__ method that returns one. A list, a string, a tuple, a dict — all iterables. An iterator is the stateful worker that actually does the traversal. It has both __iter__ (which just returns itself) and __next__ (which delivers the next item or raises StopIteration when it's done).
This separation exists for a good reason: you want to be able to loop over the same list a hundred times without it 'running out'. The list (iterable) stays neutral. Each time you start a new loop, Python silently calls iter(your_list) to create a fresh iterator — a brand-new dealer for your deck of cards.
You can manually step through this process with Python's built-in iter() and next() functions, which is exactly what a for-loop does internally on every single iteration. Understanding this unlocks everything else in this article.
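As a rough illustration of that manual stepping, the sketch below pulls items by hand and then spells out approximately what a for-loop does. The colors list is made-up sample data.

```python
colors = ["red", "green", "blue"]  # hypothetical sample data

# Manual stepping: ask the iterable for an iterator, then pull items.
stepper = iter(colors)
first = next(stepper)   # "red"
second = next(stepper)  # "green"

# Roughly what `for color in colors:` does behind the scenes.
seen = []
_iter = iter(colors)          # a fresh iterator, independent of `stepper`
while True:
    try:
        color = next(_iter)
    except StopIteration:
        break                 # the for-loop swallows this exception
    seen.append(color)

print(first, second)  # red green
print(seen)           # ['red', 'green', 'blue']
```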
Python translates for item in collection into roughly: _iter = iter(collection), then a while-loop that calls next(_iter) and catches StopIteration to break. There is no magic, just these two protocol methods called repeatedly through iter() and next(). Python calls iter() once per loop.
Building a Custom Iterator: A Real-World File Chunker
Here's where things get genuinely useful. Let's say you're processing a large log file and you want to read it in fixed-size chunks rather than line by line or all at once. You can't do this elegantly with a plain list. This is the exact scenario custom iterators were made for.
To make an object an iterator, you implement two methods: __iter__ returns self (because the iterator is its own iterable), and __next__ returns the next value or raises StopIteration. That's the entire contract.
The power here is state. Your iterator class can carry any state it needs between calls to __next__ — a file handle, a counter, a buffer, a database cursor position. This is what separates a custom iterator from a simple function: it pauses between calls and picks up exactly where it left off, making it perfect for streaming, pagination, and lazy computation.
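One possible shape for such a chunker is sketched below. The class name FileChunker, its parameters, and the temp-file demo are all assumptions made for illustration; a real implementation would likely add error handling and context-manager support.

```python
import os
import tempfile


class FileChunker:
    """One-shot iterator yielding fixed-size byte chunks from a file."""

    def __init__(self, path, chunk_size=1024):
        self._file = open(path, "rb")
        self._chunk_size = chunk_size

    def __iter__(self):
        return self  # an iterator is its own iterable

    def __next__(self):
        if self._file.closed:
            raise StopIteration  # stay exhausted on repeated calls
        chunk = self._file.read(self._chunk_size)
        if not chunk:
            self._file.close()  # release the handle at exhaustion
            raise StopIteration
        return chunk


# Demo against a throwaway temp file (hypothetical log content).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
    demo_path = tmp.name

chunks = list(FileChunker(demo_path, chunk_size=4))
print(chunks)  # [b'0123', b'4567', b'89']
os.remove(demo_path)
```

The state carried between calls here is just the open file handle and its read position. One limitation of this sketch: if a caller breaks out of the loop early, the file stays open until the object is garbage collected.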
If you rely on a finally clause in the calling code, you're assuming callers will be responsible, and they often aren't. Encapsulate cleanup inside __next__ where you raise StopIteration, or implement a __del__ method as a safety net.
Custom Iterator Class Implementation: Step-by-Step Template
While the file chunker above is a practical example, you'll often need a generic blueprint for building your own custom iterators. The pattern is always the same, whether you're wrapping a database cursor, a paginated API, or a live data stream.
Step 1: Define the class and __init__ – Accept your data source and any configuration. Store everything you need to start fresh. Do not open external resources here yet — delay that to __iter__ to allow restartability.
Step 2: Implement __iter__ – This method returns the iterator object. For one-shot iterators, return self. If you want the iterator to be restartable (i.e., you can call iter() again and get a fresh start), reset all state here, re-open resources, and return self. Restartability is optional but powerful.
Step 3: Implement __next__ – This is the core. Check if there's a next item available. If yes, return it after advancing internal state. If no, raise StopIteration. Every code path must end with either a return or a StopIteration — no fall-through.
Step 4: Handle cleanup – Close files, connections, or release locks when StopIteration is raised. Alternatively, implement __del__ as a safety net, but don't rely on it solely because garbage collection timing is unpredictable.
Step 5: Test with edge cases – Empty data, single item, partial iteration (breaking out early), and multiple iterations (if restartable).
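The five steps above can be sketched as a small restartable iterator. BatchIterator and its parameters are hypothetical names chosen for this example, and it wraps an in-memory sequence rather than an external resource to keep the sketch self-contained.

```python
class BatchIterator:
    """Restartable iterator over fixed-size batches of an in-memory sequence."""

    def __init__(self, items, batch_size):
        # Step 1: store the source and configuration only.
        self._items = items
        self._batch_size = batch_size
        self._position = 0

    def __iter__(self):
        # Step 2: reset state so each iter() call starts fresh.
        self._position = 0
        return self

    def __next__(self):
        # Step 3: every path either returns a value or raises StopIteration.
        if self._position >= len(self._items):
            # Step 4: cleanup for real resources would go here.
            raise StopIteration
        batch = self._items[self._position:self._position + self._batch_size]
        self._position += self._batch_size
        return batch


# Step 5: edge cases.
batches = BatchIterator([1, 2, 3, 4, 5], batch_size=2)
first_run = list(batches)    # [[1, 2], [3, 4], [5]]
second_run = list(batches)   # restartable: same result again
empty_run = list(BatchIterator([], batch_size=2))  # []
```

A trade-off worth noting: because __iter__ resets shared state and returns self, two nested loops over the same BatchIterator would interfere with each other. If that matters, returning a new iterator object from __iter__ is the safer design.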
Generator Functions — The Shortcut Python Gives You
Writing a full iterator class is powerful, but verbose. Python gives you a shortcut: generator functions. Any function that contains a yield statement automatically becomes a factory for iterator objects called generators. Python handles all the __iter__ and __next__ plumbing for you.
Under the hood, calling a generator function doesn't execute the body at all — it returns a generator object. Each call to next() on that object resumes execution from the last yield, suspending again at the next one. This is exactly the same pause-and-resume behaviour as our custom iterator, but expressed in a fraction of the code.
The real-world sweet spot for generators is producing sequences that are either very large or computationally expensive — think paginated API responses, mathematical series, or streaming transformations. If your data source is 'pull-based' (you ask for the next item when you're ready), a generator is almost always the right tool.
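To make the pause-and-resume behaviour concrete, here is a small sketch of a pull-based paginator. read_pages, fake_fetch, and the three-page data set are invented for the demo and do not correspond to any real API.

```python
def read_pages(fetch_page):
    """Yield items from a paginated source until a page comes back empty.

    `fetch_page` is a hypothetical callable: page number -> list of items.
    """
    page_number = 1
    while True:
        items = fetch_page(page_number)
        if not items:
            return  # ends the generator; the consumer sees StopIteration
        yield from items
        page_number += 1


# Fake three-page API used only for this demo.
_fake_pages = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}
calls = []


def fake_fetch(page_number):
    calls.append(page_number)
    return _fake_pages.get(page_number, [])


pager = read_pages(fake_fetch)
print(calls)        # [] because the body has not started running yet
print(next(pager))  # a
print(calls)        # [1] because only the first page was fetched
rest = list(pager)
print(rest)         # ['b', 'c', 'd', 'e']
```

Pages are fetched only when the consumer asks for more items, which is the property that makes generators a good fit for pull-based sources.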
itertools Quick Reference — Lazy Iterator Building Blocks
Python's itertools module is a collection of fast, memory-efficient iterator building blocks. Every function in itertools returns a lazy iterator — nothing is evaluated until you loop over it. This makes them ideal for chaining transformations without blowing up memory.
Here's a quick reference table of the most commonly used itertools functions. Use this as a cheat sheet during development:
| Function | Purpose | Example Usage |
|---|---|---|
| count(start=0, step=1) | Infinite arithmetic progression | for i in itertools.count(10, 2): yields 10, 12, 14,... |
| cycle(iterable) | Infinite repetition of an iterable | for c in itertools.cycle('AB'): yields A, B, A, B,... |
| repeat(element, times=None) | Repeat a single value | itertools.repeat(3, 4) yields 3, 3, 3, 3 |
| accumulate(iterable, func=operator.add) | Running total (or any binary function) | itertools.accumulate([1,2,3]) yields 1, 3, 6 |
| chain(*iterables) | Treat multiple iterables as one | itertools.chain([1,2], [3,4]) yields 1, 2, 3, 4 |
| compress(data, selectors) | Filter data using a selector iterable | itertools.compress('ABCD', [1,0,1,0]) yields A, C |
| dropwhile(predicate, iterable) | Drop items while predicate is true, then yield all | itertools.dropwhile(lambda x: x<5, [1,4,6,2]) yields 6, 2 |
| takewhile(predicate, iterable) | Yield items while predicate is true, stop on first false | itertools.takewhile(lambda x: x<5, [1,4,6,2]) yields 1, 4 |
| filterfalse(predicate, iterable) | Yield items where predicate is false | itertools.filterfalse(lambda x: x%2, [1,2,3]) yields 2 |
| groupby(iterable, key=None) | Consecutive keys and groups (sort first!) | for key, group in itertools.groupby('AAABBC'): yields groups A, B, C |
| product(*iterables, repeat=1) | Cartesian product | itertools.product([0,1], repeat=2) yields (0,0), (0,1), (1,0), (1,1) |
| permutations(iterable, r=None) | All r-length permutations | itertools.permutations('AB', 2) yields ('A','B'), ('B','A') |
| combinations(iterable, r) | All r-length combinations (order doesn't matter) | itertools.combinations('AB', 2) yields ('A','B') |
When to use itertools? Any time you're writing a custom loop that involves skipping, grouping, or combining sequences. These functions are implemented in C and are significantly faster than equivalent pure-Python code.
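As an illustration of combining these building blocks, the sketch below chains two batches, stops at the first error, and groups what remains. The log entries are hypothetical sample data.

```python
import itertools

# Hypothetical log batches: (level, message) tuples.
batch_one = [("INFO", "start"), ("INFO", "load"), ("WARN", "slow")]
batch_two = [("WARN", "retry"), ("ERROR", "failed"), ("INFO", "after")]

# chain: treat both batches as one stream without copying them.
stream = itertools.chain(batch_one, batch_two)

# takewhile: stop at the first ERROR entry.
before_error = itertools.takewhile(lambda entry: entry[0] != "ERROR", stream)

# groupby: collapse consecutive entries sharing a level.
summary = [
    (level, [message for _, message in group])
    for level, group in itertools.groupby(before_error, key=lambda entry: entry[0])
]
print(summary)
# [('INFO', ['start', 'load']), ('WARN', ['slow', 'retry'])]
```

Nothing is evaluated until the list comprehension starts pulling from groupby, and each group is consumed before the next one is requested, which groupby requires.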
Replacing a hand-written loop with itertools.chain or itertools.groupby can give you a 2x-5x speed improvement. Additionally, since they are lazy, memory stays constant regardless of input size.
groupby, chain, and takewhile are often used to process streaming data without materializing intermediate lists. Combine them with generator functions to build complex lazy pipelines that handle gigabytes of data with minimal memory footprint.
Lazy Evaluation: How Iterators Enable Streaming and Large Data Processing
The real superpower of iterators isn't just the protocol — it's that they evaluate values only when asked. This is lazy evaluation. Instead of building a whole list in memory, an iterator produces one element at a time. That means you can process data streams that would never fit in RAM: reading a 100 GB log file, iterating over an infinite mathematical sequence, or consuming a real-time sensor feed.
Python's standard library is full of lazy iterators: map(), filter(), zip(), enumerate(), and reversed() (on sequences) all return iterators. Even range() returns an iterable that produces numbers on demand, not a list of all numbers. This design is intentional: Python defaults to lazy unless you force it with list(), tuple(), or a comprehension with brackets.
Understanding lazy evaluation helps you design systems that are memory-efficient by default. If you find yourself calling list() on a generator just to pass it to a function, stop and ask: does that function truly need random access, or can it work with a stream? In many cases, the function itself can be refactored to iterate lazily.
- Lazy: only compute/load what the consumer asks for, one step at a time.
- Eager: compute/load everything up front, storing it all in memory.
- Python's built-in functions like map, filter, zip are lazy by default.
- Converting a lazy sequence to a list forces eager evaluation — use sparingly.
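The difference between the lazy and eager modes listed above shows up clearly when you record when work actually happens. In this sketch, parse and the trace list are hypothetical helpers added purely to observe evaluation order.

```python
trace = []


def parse(value):
    """Hypothetical parser that records each call so we can see when it runs."""
    trace.append(f"parse {value}")
    return int(value)


raw = ["1", "2", "3", "4"]  # hypothetical raw input

lazy = map(parse, raw)                       # nothing parsed yet
evens = filter(lambda n: n % 2 == 0, lazy)   # still nothing parsed
trace_before = list(trace)                   # []

first_even = next(evens)                     # pulls only as far as needed
trace_after_first = list(trace)              # ['parse 1', 'parse 2']

eager = [parse(value) for value in raw]      # parses everything immediately
```

Asking for one even number parsed only the first two inputs, while the list comprehension parsed all four before returning.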
next() call only goes one level deep.
Memory Efficiency Comparison: List vs Generator (Eager vs Lazy)
The single most important practical difference between lists and generators is memory consumption. For large datasets, a list stores all elements in memory simultaneously, while a generator produces each element on demand and discards it after use. This difference can mean the difference between a pipeline that runs on a laptop and one that crashes with MemoryError.
Below is a comparison table for a sequence of n integers (assuming Python 3.12 on a 64-bit system). Actual numbers vary by Python version and system, but the ratios hold.
| Number of Integers | List Memory (approx.) | Generator Memory (approx.) |
|---|---|---|
| 1,000 | ~28 KB | ~112 bytes (generator object) |
| 100,000 | ~2.8 MB | ~112 bytes |
| 10,000,000 | ~280 MB | ~112 bytes |
| 100,000,000 | ~2.8 GB | ~112 bytes |
As you can see, the list's memory grows linearly with n, whereas the generator object's size is constant because it doesn't store the data — only a reference to the generating function and current state.
The same applies to the lazy counterparts of list operations: map and filter versus the equivalent list comprehensions (zip is already lazy). Converting a lazy iterable to a list with list() forces eager evaluation and consumes memory proportional to the entire sequence.
When to use a generator (lazy): When you only need to iterate once, and the sequence is large or expensive to compute.
When to use a list (eager): When you need random access (indexing), multiple passes over the data, or when the dataset is small enough that memory is not a concern.
The rule of thumb: If you can avoid storing the whole dataset in memory, do it. Start with a generator; only switch to a list if you run into a use case that genuinely requires it.
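A quick way to observe the gap on your own machine is sys.getsizeof. This is only a rough sketch: getsizeof reports the container alone (not the integer objects a list references), and the exact byte counts depend on the Python version and platform, so no specific numbers are claimed here.

```python
import sys

n = 100_000

as_list = [i * i for i in range(n)]       # eager: every element stored
as_generator = (i * i for i in range(n))  # lazy: only the iteration state

list_size = sys.getsizeof(as_list)        # container only, not the int objects
generator_size = sys.getsizeof(as_generator)

print(f"list container: {list_size:,} bytes")
print(f"generator:      {generator_size:,} bytes")

# Both produce the same values when consumed once.
assert sum(as_generator) == sum(as_list)
```

For whole-program measurements, tracemalloc from the standard library gives a fuller picture than getsizeof.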
A common mistake is calling list() on a generator inside a loop or a function, not realizing the data size. This can cause OOM errors. Always profile memory usage before deploying. Use tools like memory_profiler or tracemalloc to detect accidental eager materialization.
The Exhaustion Trap and the iter() Sentinel Form
Here's the behaviour that catches almost everyone at some point: iterators are one-shot. Once an iterator is exhausted, it stays exhausted. Calling iter() on an already-exhausted iterator just returns the same dead object — it does not reset. This is different from calling iter() on an iterable like a list, which creates a brand-new iterator.
This distinction has a practical consequence: if you pass an iterator (not an iterable) to two functions, the second one will silently get nothing. No error. Just an empty loop. These bugs are genuinely hard to track down.
Python also has a lesser-known second form of iter() — iter(callable, sentinel) — which wraps any zero-argument callable into an iterator that keeps calling it until the return value equals the sentinel. This is incredibly useful for reading data in fixed-size blocks, processing queue items, or any situation where you have a 'pull until done' data source.
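A minimal sketch of the sentinel form, reading fixed-size blocks from an in-memory stream. The BytesIO payload stands in for a real binary file, and functools.partial is used to turn read(4) into the zero-argument callable that iter() expects.

```python
import io
from functools import partial

# In-memory stand-in for a binary file (hypothetical payload).
source = io.BytesIO(b"abcdefghij")

# iter(callable, sentinel): call source.read(4) until it returns b"".
blocks = list(iter(partial(source.read, 4), b""))
print(blocks)  # [b'abcd', b'efgh', b'ij']
```

The sentinel is compared by equality, so the loop ends the first time read() returns an empty bytes object at end of stream.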
If you write items = iter(some_list) and pass items to two different functions, the second function will see an exhausted iterator and loop over nothing. Always pass the original iterable (the list/set/etc.) unless you deliberately want to share position state. When in doubt, check: does this object have __next__? If yes, it's an iterator, so treat it as one-shot.
When to Reach for an Iterator Instead of a List (And When to Run Away)
Here's where most devs get it wrong: they use iterators because they heard they're "memory-efficient" without asking if their data actually benefits from laziness. The decision isn't philosophical—it's about access patterns.
Use iterators when you're processing data one element at a time, never needing random access, and the dataset is larger than available RAM. Streaming CSV files, parsing network packets, generating sequences on the fly. These are iterator territory.
Avoid iterators when you need to index into the data, iterate over it multiple times, or modify elements in place during iteration. A list is not your enemy—it's the right tool when you need random access or multiple passes without re-initializing the iterator.
The rule is brutal but simple: if your data fits in memory and you access it more than once, use a list. If your data doesn't fit in memory or you only traverse it once, use an iterator. Don't cargo-cult memory efficiency.
If you wrap a generator in list() just to index into it once, you've paid the full memory cost with zero benefit. That's not lazy evaluation; that's a performance lie.
Creating Different Types of Iterators: Yield Original, Transform, or Generate New Data
Not all iterators are created equal. Once you understand the iterator protocol, you need to know which flavor solves your problem. There are three distinct patterns you'll see in production—and mixing them up causes subtle bugs.
Yielding original data means your iterator doesn't modify the source—it just exposes it lazily. Think reading a file line by line without stripping or parsing. This is the simplest, safest pattern because the consumer controls transformation.
Transforming input data is where most pipeline code lives. Your iterator yields a modified version of each element—parsing raw bytes into structs, converting log timestamps, normalizing text. The key: every element transforms independently.
Generating new data means you're producing values that have no direct mapping to input. An iterator that produces Fibonacci numbers, a counter, or a sliding window over a stream. No external source, just logic.
Each pattern demands different testing strategies. Yielding original data is trivial to unit test. Transforming requires input/output pairs. Generating needs convergence checks to avoid infinite loops.
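The three patterns might look like this as generator functions. The function names and sample lines are illustrative assumptions, and the Fibonacci generator takes an explicit limit so the example terminates.

```python
def yield_original(lines):
    """Pattern 1: expose the source lazily, untouched."""
    for line in lines:
        yield line


def transform(lines):
    """Pattern 2: each element is converted independently."""
    for line in lines:
        yield line.strip().upper()


def fibonacci(limit):
    """Pattern 3: values come from logic, not from an input source."""
    a, b = 0, 1
    while a < limit:   # explicit bound keeps the generator finite
        yield a
        a, b = b, a + b


raw_lines = [" alpha\n", " beta\n"]  # hypothetical input
originals = list(yield_original(raw_lines))
cleaned = list(transform(raw_lines))
fib = list(fibonacci(30))

print(originals)  # [' alpha\n', ' beta\n']
print(cleaned)    # ['ALPHA', 'BETA']
print(fib)        # [0, 1, 1, 2, 3, 5, 8, 13, 21]
```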
Coding Potentially Infinite Iterators — The Pattern That Breaks Beginners
Infinite iterators aren't a gimmick—they're how you model real-time data streams, retry loops, or sensor feeds. But infinite means you never get a StopIteration. If you write a for loop over one, you hang. Forever.
The pattern is simple: write an iterator that never raises StopIteration, and control consumption from the caller side. Use itertools.islice, takewhile, or explicit break conditions. The generator function with yield is your cleanest tool here.
The danger? Forgetting to add a break condition in a production loop. I've seen a batch processing job run for 14 hours because an infinite iterator fed into a for loop with no termination logic. The code looked correct until you traced the data flow.
Best practice: always wrap infinite iterators with a bounded consumer. Either pass a count limit or use itertools.takewhile with a predicate. If someone else uses your iterator, they won't expect it to hang—make the infinite nature explicit in the function name.
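Here is a small sketch of that practice: an intentionally infinite generator paired with two bounded consumers. infinite_sensor_readings is a hypothetical stand-in for a real feed; the name carries the warning.

```python
import itertools


def infinite_sensor_readings(start=20.0, step=0.5):
    """Never raises StopIteration; the name flags that on purpose.

    Hypothetical stand-in for a real sensor feed.
    """
    value = start
    while True:
        yield value
        value += step


# Bound by count.
first_five = list(itertools.islice(infinite_sensor_readings(), 5))
print(first_five)  # [20.0, 20.5, 21.0, 21.5, 22.0]

# Bound by predicate.
below_22 = list(
    itertools.takewhile(lambda reading: reading < 22.0, infinite_sensor_readings())
)
print(below_22)  # [20.0, 20.5, 21.0, 21.5]
```

Calling list() directly on infinite_sensor_readings() would never return, so the bound always lives in the consumer.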
Stop Writing Boilerplate: Subclass collections.abc.Iterator Instead
Every time you hand-roll __iter__ and __next__ on a class, you're writing code that Python already gave you. The collections.abc module ships with Iterator — an abstract base class that automatically provides __iter__ for you. You just implement __next__. That's it.
Why does this matter? Because __iter__ returning self is boilerplate you will forget, and when you forget it, your iterator won't work in for loops. Iterator.__subclasshook__ also recognizes classes that implement both __iter__ and __next__ without explicit inheritance, so isinstance checks stay duck-typing friendly.
In production, this pattern matters when you're building streaming data pipelines, file parsers, or any component that processes chunks. Subclassing Iterator signals intent — every dev on your team immediately knows this class is meant to be exhausted. No guesswork, no hidden state bugs.
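A short sketch of the pattern follows. Countdown and DuckIterator are hypothetical classes written for this example; only collections.abc.Iterator comes from the standard library.

```python
from collections.abc import Iterator


class Countdown(Iterator):
    """Counts down from `start` to 1; __iter__ comes from the ABC."""

    def __init__(self, start):
        self._current = start

    def __next__(self):
        if self._current <= 0:
            raise StopIteration
        value = self._current
        self._current -= 1
        return value


countdown = Countdown(3)
is_own_iterator = iter(countdown) is countdown  # inherited __iter__ returns self
values = list(countdown)
print(values)  # [3, 2, 1]


class DuckIterator:
    """No inheritance, but defines both protocol methods."""

    def __iter__(self):
        return self

    def __next__(self):
        raise StopIteration


recognized = isinstance(DuckIterator(), Iterator)
print(recognized)  # True
```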
Why You Should Inherit From collections.abc.Iterator (And Not Just Wing It)
Hand-rolled iterators break silently in subtle ways. Your custom class has __next__ but some copy-paste rookie forgets __iter__? Now it fails in list() and for loops with a TypeError: 'YourClass' object is not iterable. That's a 30-minute debugging session where you stare at code that clearly has __next__ and scream at your monitor.
Subclassing Iterator eliminates that entire class of bug. The ABC provides __iter__ returning self as a mixin method and declares __next__ abstract, so a subclass that forgets __next__ fails with a TypeError at instantiation instead of somewhere downstream. Your iterator becomes a first-class citizen: it plays nice with itertools, multiprocessing, and any function that expects an iterable.
When you're building production data pipelines, this isn't about elegance — it's about consistency. Every team member follows the same contract. Your code review notes go from 'add __iter__' to 'approved', and you get back to shipping features.
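The two failure modes can be compared side by side. MissingIter and MissingNext are deliberately broken hypothetical classes; the exact wording of the abstract-class error varies between Python versions, so it is captured rather than quoted.

```python
from collections.abc import Iterator


class MissingIter:
    """Hand-rolled class that forgot __iter__."""

    def __next__(self):
        raise StopIteration


try:
    list(MissingIter())
    hand_rolled_error = None
except TypeError as exc:
    hand_rolled_error = str(exc)   # "... object is not iterable"


class MissingNext(Iterator):
    """Subclass that forgot the abstract __next__."""


try:
    MissingNext()
    abc_error = None
except TypeError as exc:
    abc_error = str(exc)           # raised at instantiation, before any loop

print(hand_rolled_error)
print(abc_error)
```

The hand-rolled class fails only when something tries to iterate it, while the ABC subclass fails as soon as it is constructed, which usually puts the error much closer to the mistake.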
Exhausted Iterator Causes Silent Data Drop in ETL Pipeline
An earlier step called list(my_generator()) internally, which exhausted the generator. The original generator object was passed downstream, but it was already exhausted: calling __next__ raised StopIteration immediately, so the processing loop never executed.
- Generators are one-shot: never pass a generator object to more than one consumer.
- If you need to iterate twice, either call the generator function twice or materialize the data into a list.
- Always audit function signatures: if a function expects an iterable, it may exhaust the iterator. Prefer passing the factory (callable) over the instance.
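The lessons above can be sketched as a tiny reproduction. extract_rows, validate, load, and run_pipeline are hypothetical stand-ins and not the code from the actual incident; they only show the shape of the bug and two possible fixes.

```python
def extract_rows():
    """Hypothetical extract step yielding raw records."""
    for row_id in range(1, 5):
        yield {"id": row_id}


def validate(rows):
    """Counts rows; materializes its input, exhausting any iterator."""
    return len(list(rows))


def load(rows):
    """Stand-in load step returning the ids it received."""
    return [row["id"] for row in rows]


# Buggy: one generator object shared by two consumers.
shared = extract_rows()
validated_count = validate(shared)   # 4
loaded_buggy = load(shared)          # [] because the rows were silently dropped


# Fix 1: pass the factory and let each consumer build its own generator.
def run_pipeline(make_rows):
    return validate(make_rows()), load(make_rows())


fixed_count, loaded_fixed = run_pipeline(extract_rows)

# Fix 2: materialize once when the data fits in memory.
rows = list(extract_rows())
loaded_from_list = load(rows)

print(validated_count, loaded_buggy)  # 4 []
print(fixed_count, loaded_fixed)      # 4 [1, 2, 3, 4]
```

Fix 1 runs the extract step twice, which matters if extraction is expensive or not repeatable; Fix 2 trades that for memory.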
If you wrote gen = my_generator without parentheses, you have the function itself, not a generator. Add parentheses.
For file objects, call file.seek(0) to reset the pointer, or re-open the file.
print(type(obj).__name__, hasattr(obj, '__next__'))
print(id(obj))  # Compare to original variable to confirm identity
Change for item in my_iterator: to for item in list(my_iterable): as a temporary fix, then refactor.
Key takeaways
A for-loop calls iter() once and then next() on every cycle until StopIteration is raised
Common mistakes to avoid
Iterating an exhausted iterator and expecting results
Never store the result of iter() in a variable you plan to loop over more than once. Keep a reference to the original iterable (the list, tuple, or custom class) and call iter() fresh each time you need a new traversal. If you must reuse, convert to a list first: data = list(iterator).
Forgetting to raise StopIteration in a custom __next__
__next__ recursively.
Every code path in __next__ must either return a value or raise StopIteration. Add a guard at the top: if self._position >= len(self._data): raise StopIteration before any return statement. Test with an empty collection.
Treating a generator object as if it's reusable
A generator function (one that uses yield) returns a new generator object each time it's called. If you need to iterate the same data twice, call the generator function again to get a fresh generator, or convert the first pass to a list with list(my_generator()). Do not store the generator object and reuse it.
Interview Questions on This Topic
What's the difference between an iterable and an iterator in Python, and how does a for-loop use both of them internally?
An iterable is any object that implements __iter__(), which returns an iterator. An iterator implements both __iter__() (returning self) and __next__(), which returns the next element or raises StopIteration. A for-loop works by calling iter() on the target object to get an iterator, then repeatedly calling next() on that iterator, catching StopIteration to break. This means the for-loop creates a fresh iterator for iterables like lists, so you can loop multiple times. If you pass an iterator, it reuses the same exhausted one, which is why the second loop is silent.
Frequently Asked Questions