Run-Length Encoding — 255-Run Limit Bug in Fax
Fax images corrupted by white bands every 256 pixels because RLE's byte-sized count maxed at 255.
20+ years shipping performance-critical code where algorithms decide the bill. Everything here is grounded in real deployments.
- RLE replaces consecutive identical symbols with a (count, symbol) pair.
- Encode and decode both run in O(n) time, O(r) space (r = number of runs).
- Best-case compression: ~120x for fax lines with long white runs.
- Worst-case expansion: 2x for alternating data like 'ABAB'.
- Production insight: RLE is never used alone in modern compressors — it's a building block in bzip2 (BWT+MTF+RLE+Huffman).
- Biggest mistake: applying RLE to random or already-compressed data — you'll expand the file, not shrink it.
Run-Length Encoding is the simplest compression idea: instead of storing AAABBBBBCCCC, store 3A5B4C. Consecutive runs of the same character are replaced by a count and the character. It works brilliantly on data with long runs (fax images, simple graphics) and fails badly on random data (English text) where it can make things larger.
Run-Length Encoding is the first compression algorithm most people encounter, and it teaches the fundamental tension in all compression: the algorithm's effectiveness depends entirely on the structure of the data. For a fax transmission of a printed page — mostly white pixels with sparse black text — RLE achieves 50:1 compression. For a natural photograph, RLE can make the file larger.
This data-dependency is not a bug; it is the nature of compression. Shannon's source coding theorem says you can only compress data if it contains redundancy. RLE exploits one specific type of redundancy: runs of identical symbols. Understanding when your data has this structure — and when it doesn't — is the practical skill.
Run-Length Encoding — The Compression That Fax Machines Trusted Too Much
Run-length encoding (RLE) is a lossless compression scheme that replaces consecutive identical values (runs) with a count and the value. For example, 'AAAAABBBCC' becomes '5A3B2C'. It's the simplest form of compression — O(n) time, O(1) extra space — and works well when data has long runs of repetition, like black-and-white fax images or simple bitmap graphics.
RLE's critical property is that it compresses in a single pass with no dictionary or entropy model. But it also has a hard limit baked into many implementations: runs are typically capped at 255 because the count is stored in a single byte. This means a run of 300 identical pixels must be split into two runs (255 + 45), adding overhead and creating a subtle bug surface when software assumes runs never exceed 255.
Use RLE when you need fast, predictable compression on data with long repeated sequences — think telemetry logs with many identical readings, or monochrome image transmission. It's not suitable for general text or binary data with high entropy. The real-world relevance? RLE is the foundation of fax protocol T.4 (Group 3), where the 255-run limit directly causes corrupted pages when a scanner produces runs longer than expected.
Basic RLE Implementation
Here's the standard RLE encoder and decoder for strings. The encoder scans left to right, counting consecutive identical characters. When the character changes, it appends the (count, char) tuple. The decoder uses multiplication to expand each tuple.
Notice the edge case handling: empty input returns an empty list, and the final run is appended after the loop. You'll see that the compression ratio varies wildly — the fax line (720 characters, mostly long runs) compresses 120x, while alternating 'ABAB' expands by a factor of 2.
Binary RLE and Escape Codes
For raw byte data, we can pack run-length pairs as [count][byte]. The count is a single byte (0–255), limiting run length to 255. To handle longer runs, you split them into multiple pairs. This format is used in early image formats like PCX and in some raw bitmap encodings.
The example shows a solid block of 100 bytes of 0xFF, followed by 200 of 0x00, then 50 of 0x80. The byte-level RLE compresses this from 350 bytes to just 6 — a 58x ratio. But if the data is random (short runs), each byte pair uses 2 bytes, so random data doubles in size.
RLE in Image Formats: BMP, PCX, and TIFF
RLE was widely adopted in early image formats. BMP supports 4-bit and 8-bit RLE encoding (BI_RLE4, BI_RLE8) for indexed color images. PCX used RLE per scanline: if the two high bits of a byte are set, the lower 6 bits are a repeat count, otherwise the byte is a literal. TIFF has an RLE option in its compression tag.
In practice, these formats are legacy. Modern versions of BMP rarely use RLE because photographs produce short runs. However, for pixel art, icons, or diagrams with large uniform areas, RLE remains effective.
- A solid blue sky in a photo: runs are long, RLE shrinks data.
- A grass texture with alternating green shades: runs are short, RLE expands.
- BMP RLE writes a series of commands: (repeat, color) or (literal count, literals...).
- PCX RLE distinguishes between literal and run bytes by checking the top bits.
RLE as a Building Block in bzip2
The bzip2 compression pipeline uses RLE in two places: 1. Pre-RLE: a first pass that replaces runs of 4–255 identical bytes with a special marker and a count. This is applied to the raw input to reduce space before the Burrows-Wheeler Transform (BWT). 2. Post-MTF RLE: after BWT and Move-to-Front (MTF) coding, the output contains many zero runs. RLE encodes these zero runs (run-length encoding of zeros) before final Huffman coding.
Together with BWT (which groups similar symbols together) and MTF (which outputs small numbers for frequent symbols), RLE turns a structured but not yet compressed intermediate into a highly compressible sequence. This is why bzip2 can beat gzip on certain data (e.g., text files with repetitive content).
Performance Considerations and Edge Cases
RLE is O(n) in time and O(r) in space (r = number of runs). Decode is O(n) in the uncompressed size. The algorithm is trivial, but practical implementations must handle:
- Empty input: return empty output.
- Single character: the loop must still append the final run.
- Maximum run length: if using fixed-size count field, cap and split.
- Non-printable symbols: binary data must encode counts and bytes correctly, not confuse them with delimiters.
- Large datasets: RLE is not a bottleneck — even on 1 GB files, a naive Python implementation takes a few seconds.
Memory wise: the encoder stores the list of pairs, which can be large if the number of runs is high (e.g., random data). For worst-case (every character differs), the compressed output is 2x the input size. That's a 100% overhead. Always check the ratio before committing to RLE as a delivery format.
Why Counting Runs Is Harder Than It Looks
Most tutorials hand you the iteration approach like it's the only option. It isn't. It's the beginner option. But you need to understand why it works before you start sticking counts between characters. The core idea is dead simple: walk the input, count consecutive identical characters, and emit the character plus its count. That's it. No recursion, no state machine, no magic.
The trap is assuming this works for every encoding scenario. It doesn't. Binary RLE (which we covered earlier) needs escape codes. Image formats use run-length pairs. But for plain text compression in memory, this is your workhorse. The runtime is O(n) because you visit each character exactly once. The space is O(n) for the output string, which in the worst case—no runs—doubles the input size. That's a feature, not a bug. If you're worried about worst-case blowup, you're using the wrong algorithm for your data.
When Character Counts Get Ugly — Multi-Digit Runs
Your first implementation probably handles 'aaa' → 'a3' just fine. But what about 100 consecutive 'x' characters? Your naive code dumps 'x100' into the output. That's three characters for one run. The decoder has no idea if 'x100' means character 'x', count 100, or character 'x1', count 0, followed by '0'. This ambiguity is why real RLE decoders use either fixed-width counts or escape codes.
Most production implementations cheat by using a delimiter or limiting run length to 255 (which fits in one byte). If you're building something that talks to legacy systems, like fax machines or BMP decoders, you must respect their count formats. Multi-digit counts explode the output size. A run of 999 takes three bytes instead of two. In a world where compression ratio matters, that's a sin.
The fix? Either store counts as bytes (0-255) and split long runs, or use a special marker to indicate that the following number is a count. The latter approach is what Binary RLE with escape codes does. Pick one, document it, and stick to it. Ambiguity is a bug.
The 255-Run Limit That Corrupted a Fax Archive
- Never assume maximum run length fits in a single byte — real-world data always surprises you.
- Always test edge cases: empty input, single character, maximum runs, alternating patterns.
- Document the count field size in the format spec — every implementer needs to know.
python -c "import rle; print(rle.decode_from_file('output.bin') == open('original.txt').read())"python -c "print([(c,s) for c,s in rle.encode(open('faulty.txt').read()) if c>255 or c<1])"Key takeaways
Common mistakes to avoid
4 patternsUsing RLE on random or already-compressed data
Not handling runs longer than the count field can store
Forgetting to append the last run after the encoding loop
encoded.append((count, data[-1])) after the loop, or use a state machine that appends on character change and at end.Using RLE for natural language text
Practice These on LeetCode
Interview Questions on This Topic
What type of data does RLE compress well, and what data does it expand?
Frequently Asked Questions
20+ years shipping performance-critical code where algorithms decide the bill. Everything here is grounded in real deployments.
That's Data Compression. Mark it forged?
5 min read · try the examples if you haven't