Run-Length Encoding — 255-Run Limit Bug in Fax
Fax images corrupted by white bands every 256 pixels because RLE's byte-sized count maxed at 255.
- RLE replaces consecutive identical symbols with a (count, symbol) pair.
- Encode and decode both run in O(n) time, O(r) space (r = number of runs).
- Best-case compression: ~120x for fax lines with long white runs.
- Worst-case expansion: 2x for alternating data like 'ABAB'.
- Production insight: RLE is never used alone in modern compressors — it's a building block in bzip2 (BWT+MTF+RLE+Huffman).
- Biggest mistake: applying RLE to random or already-compressed data — you'll expand the file, not shrink it.
Run-Length Encoding is the simplest compression idea: instead of storing AAABBBBBCCCC, store 3A5B4C. Consecutive runs of the same character are replaced by a count and the character. It works brilliantly on data with long runs (fax images, simple graphics) and fails badly on random data (English text) where it can make things larger.
Run-Length Encoding is the first compression algorithm most people encounter, and it teaches the fundamental tension in all compression: the algorithm's effectiveness depends entirely on the structure of the data. For a fax transmission of a printed page — mostly white pixels with sparse black text — RLE achieves 50:1 compression. For a natural photograph, RLE can make the file larger.
This data-dependency is not a bug; it is the nature of compression. Shannon's source coding theorem says you can only compress data if it contains redundancy. RLE exploits one specific type of redundancy: runs of identical symbols. Understanding when your data has this structure — and when it doesn't — is the practical skill.
Basic RLE Implementation
Here's the standard RLE encoder and decoder for strings. The encoder scans left to right, counting consecutive identical characters. When the character changes, it appends the (count, char) tuple. The decoder uses multiplication to expand each tuple.
Notice the edge case handling: empty input returns an empty list, and the final run is appended after the loop. You'll see that the compression ratio varies wildly — the fax line (720 characters, mostly long runs) compresses 120x, while alternating 'ABAB' expands by a factor of 2.
Binary RLE and Escape Codes
For raw byte data, we can pack run-length pairs as [count][byte]. The count is a single byte (0–255), limiting run length to 255. To handle longer runs, you split them into multiple pairs. This format is used in early image formats like PCX and in some raw bitmap encodings.
The example shows a solid block of 100 bytes of 0xFF, followed by 200 of 0x00, then 50 of 0x80. The byte-level RLE compresses this from 350 bytes to just 6 — a 58x ratio. But if the data is random (short runs), each byte pair uses 2 bytes, so random data doubles in size.
RLE in Image Formats: BMP, PCX, and TIFF
RLE was widely adopted in early image formats. BMP supports 4-bit and 8-bit RLE encoding (BI_RLE4, BI_RLE8) for indexed color images. PCX used RLE per scanline: if the two high bits of a byte are set, the lower 6 bits are a repeat count, otherwise the byte is a literal. TIFF has an RLE option in its compression tag.
In practice, these formats are legacy. Modern versions of BMP rarely use RLE because photographs produce short runs. However, for pixel art, icons, or diagrams with large uniform areas, RLE remains effective.
- A solid blue sky in a photo: runs are long, RLE shrinks data.
- A grass texture with alternating green shades: runs are short, RLE expands.
- BMP RLE writes a series of commands: (repeat, color) or (literal count, literals...).
- PCX RLE distinguishes between literal and run bytes by checking the top bits.
RLE as a Building Block in bzip2
The bzip2 compression pipeline uses RLE in two places: 1. Pre-RLE: a first pass that replaces runs of 4–255 identical bytes with a special marker and a count. This is applied to the raw input to reduce space before the Burrows-Wheeler Transform (BWT). 2. Post-MTF RLE: after BWT and Move-to-Front (MTF) coding, the output contains many zero runs. RLE encodes these zero runs (run-length encoding of zeros) before final Huffman coding.
Together with BWT (which groups similar symbols together) and MTF (which outputs small numbers for frequent symbols), RLE turns a structured but not yet compressed intermediate into a highly compressible sequence. This is why bzip2 can beat gzip on certain data (e.g., text files with repetitive content).
Performance Considerations and Edge Cases
RLE is O(n) in time and O(r) in space (r = number of runs). Decode is O(n) in the uncompressed size. The algorithm is trivial, but practical implementations must handle:
- Empty input: return empty output.
- Single character: the loop must still append the final run.
- Maximum run length: if using fixed-size count field, cap and split.
- Non-printable symbols: binary data must encode counts and bytes correctly, not confuse them with delimiters.
- Large datasets: RLE is not a bottleneck — even on 1 GB files, a naive Python implementation takes a few seconds.
Memory wise: the encoder stores the list of pairs, which can be large if the number of runs is high (e.g., random data). For worst-case (every character differs), the compressed output is 2x the input size. That's a 100% overhead. Always check the ratio before committing to RLE as a delivery format.
The 255-Run Limit That Corrupted a Fax Archive
- Never assume maximum run length fits in a single byte — real-world data always surprises you.
- Always test edge cases: empty input, single character, maximum runs, alternating patterns.
- Document the count field size in the format spec — every implementer needs to know.
Key takeaways
Common mistakes to avoid
4 patternsUsing RLE on random or already-compressed data
Not handling runs longer than the count field can store
Forgetting to append the last run after the encoding loop
encoded.append((count, data[-1])) after the loop, or use a state machine that appends on character change and at end.Using RLE for natural language text
Interview Questions on This Topic
What type of data does RLE compress well, and what data does it expand?
Frequently Asked Questions
That's Data Compression. Mark it forged?
3 min read · try the examples if you haven't