Run-Length Encoding — 255-Run Limit Bug in Fax
- RLE replaces consecutive identical symbols with (count, symbol) — O(n) encode and decode.
- Compression ratio depends entirely on run length — 720 bytes of fax → 6 bytes (120x).
- Fails on alternating or random data — can expand the input by up to 2x.
- RLE replaces consecutive identical symbols with a (count, symbol) pair.
- Encode and decode both run in O(n) time, O(r) space (r = number of runs).
- Best-case compression: ~120x for fax lines with long white runs.
- Worst-case expansion: 2x for alternating data like 'ABAB'.
- Production insight: RLE is never used alone in modern compressors — it's a building block in bzip2 (BWT+MTF+RLE+Huffman).
- Biggest mistake: applying RLE to random or already-compressed data — you'll expand the file, not shrink it.
RLE Debug Cheat Sheet
Decoded text has extra or missing characters
python -c "import rle; print(rle.decode_from_file('output.bin') == open('original.txt').read())"python -c "print([(c,s) for c,s in rle.encode(open('faulty.txt').read()) if c>255 or c<1])"Compressed file is larger than original
python -c "data=open('input.bin','rb').read(); runs=1+sum(1 for i in range(1,len(data)) if data[i]!=data[i-1]); print('avg run:', len(data)/runs)"python -c "from itertools import groupby; data=open('input.bin','rb').read(); lens=[len(list(g)) for _,g in groupby(data)]; print('min',min(lens),'max',max(lens))"Production Incident
Production Debug GuideSymptom → Action guide for common RLE failures in production
Run-Length Encoding is the first compression algorithm most people encounter, and it teaches the fundamental tension in all compression: the algorithm's effectiveness depends entirely on the structure of the data. For a fax transmission of a printed page — mostly white pixels with sparse black text — RLE achieves 50:1 compression. For a natural photograph, RLE can make the file larger.
This data-dependency is not a bug; it is the nature of compression. Shannon's source coding theorem says you can only compress data if it contains redundancy. RLE exploits one specific type of redundancy: runs of identical symbols. Understanding when your data has this structure — and when it doesn't — is the practical skill.
Basic RLE Implementation
Here's the standard RLE encoder and decoder for strings. The encoder scans left to right, counting consecutive identical characters. When the character changes, it appends the (count, char) tuple. The decoder uses multiplication to expand each tuple.
Notice the edge case handling: empty input returns an empty list, and the final run is appended after the loop. You'll see that the compression ratio varies wildly — the fax line (720 characters, mostly long runs) compresses 120x, while alternating 'ABAB' expands by a factor of 2.
def rle_encode(data: str) -> list[tuple]: """Returns list of (count, char) pairs.""" if not data: return [] encoded = [] count = 1 for i in range(1, len(data)): if data[i] == data[i-1]: count += 1 else: encoded.append((count, data[i-1])) count = 1 encoded.append((count, data[-1])) return encoded def rle_decode(encoded: list[tuple]) -> str: return ''.join(ch * count for count, ch in encoded) def rle_ratio(data: str) -> float: encoded = rle_encode(data) # Each pair needs 1 byte count + 1 byte char = 2 bytes compressed_size = len(encoded) * 2 return len(data) / compressed_size # Good case: long runs fax_line = 'W' * 400 + 'B' * 20 + 'W' * 300 # typical fax line print(f'Fax: {len(fax_line)} → {len(rle_encode(fax_line))*2} bytes ({rle_ratio(fax_line):.1f}x)') # Bad case: alternating alternating = 'ABABABABABABABAB' print(f'Alternating: {len(alternating)} → {len(rle_encode(alternating))*2} bytes ({rle_ratio(alternating):.2f}x)') # Typical bitmap bitmap = 'W'*50+'B'*3+'W'*47 enc = rle_encode(bitmap) print(f'Encoded: {enc}')
Alternating: 16 → 32 bytes (0.50x)
Encoded: [(50, 'W'), (3, 'B'), (47, 'W')]
Binary RLE and Escape Codes
For raw byte data, we can pack run-length pairs as [count][byte]. The count is a single byte (0–255), limiting run length to 255. To handle longer runs, you split them into multiple pairs. This format is used in early image formats like PCX and in some raw bitmap encodings.
The example shows a solid block of 100 bytes of 0xFF, followed by 200 of 0x00, then 50 of 0x80. The byte-level RLE compresses this from 350 bytes to just 6 — a 58x ratio. But if the data is random (short runs), each byte pair uses 2 bytes, so random data doubles in size.
def rle_encode_bytes(data: bytes) -> bytes: """Compact binary RLE: [count][byte] pairs, count in 1 byte (max 255).""" if not data: return b'' result = bytearray() i = 0 while i < len(data): byte = data[i] count = 1 while i + count < len(data) and data[i+count] == byte and count < 255: count += 1 result.extend([count, byte]) i += count return bytes(result) raw = bytes([255]*100 + [0]*200 + [128]*50) compressed = rle_encode_bytes(raw) print(f'Raw: {len(raw)} bytes → Compressed: {len(compressed)} bytes ({len(raw)/len(compressed):.1f}x)')
RLE in Image Formats: BMP, PCX, and TIFF
RLE was widely adopted in early image formats. BMP supports 4-bit and 8-bit RLE encoding (BI_RLE4, BI_RLE8) for indexed color images. PCX used RLE per scanline: if the two high bits of a byte are set, the lower 6 bits are a repeat count, otherwise the byte is a literal. TIFF has an RLE option in its compression tag.
In practice, these formats are legacy. Modern versions of BMP rarely use RLE because photographs produce short runs. However, for pixel art, icons, or diagrams with large uniform areas, RLE remains effective.
- A solid blue sky in a photo: runs are long, RLE shrinks data.
- A grass texture with alternating green shades: runs are short, RLE expands.
- BMP RLE writes a series of commands: (repeat, color) or (literal count, literals...).
- PCX RLE distinguishes between literal and run bytes by checking the top bits.
RLE as a Building Block in bzip2
The bzip2 compression pipeline uses RLE in two places: 1. Pre-RLE: a first pass that replaces runs of 4–255 identical bytes with a special marker and a count. This is applied to the raw input to reduce space before the Burrows-Wheeler Transform (BWT). 2. Post-MTF RLE: after BWT and Move-to-Front (MTF) coding, the output contains many zero runs. RLE encodes these zero runs (run-length encoding of zeros) before final Huffman coding.
Together with BWT (which groups similar symbols together) and MTF (which outputs small numbers for frequent symbols), RLE turns a structured but not yet compressed intermediate into a highly compressible sequence. This is why bzip2 can beat gzip on certain data (e.g., text files with repetitive content).
# Simplified pre-RLE step from bzip2 (run-length encoding of repeated bytes) def bzip2_pre_rle(data: bytes) -> bytes: result = bytearray() i = 0 while i < len(data): byte = data[i] count = 1 while i + count < len(data) and data[i+count] == byte and count < 256: count += 1 if count >= 4: # Emit special symbol: (byte repeated 4 times) + (count-4 as byte) result.extend([byte]*4) result.append(count - 4) else: result.extend([byte]*count) i += count return bytes(result) # Example: 'AAAAA' -> 'AAAA' + 0x01 (since 5-4=1) print(bzip2_pre_rle(b'AAAAABBCCCDDDD'))
Performance Considerations and Edge Cases
RLE is O(n) in time and O(r) in space (r = number of runs). Decode is O(n) in the uncompressed size. The algorithm is trivial, but practical implementations must handle:
- Empty input: return empty output.
- Single character: the loop must still append the final run.
- Maximum run length: if using fixed-size count field, cap and split.
- Non-printable symbols: binary data must encode counts and bytes correctly, not confuse them with delimiters.
- Large datasets: RLE is not a bottleneck — even on 1 GB files, a naive Python implementation takes a few seconds.
Memory wise: the encoder stores the list of pairs, which can be large if the number of runs is high (e.g., random data). For worst-case (every character differs), the compressed output is 2x the input size. That's a 100% overhead. Always check the ratio before committing to RLE as a delivery format.
| Criterion | RLE | Huffman | LZ77 (Deflate) |
|---|---|---|---|
| Time complexity | O(n) encode, O(n) decode | O(n) with frequency table | O(n) typical, O(n^2) worst-case |
| Compression ratio (text) | 0.5x – 2x | 2x – 3x | 3x – 5x |
| Compression ratio (fax) | 50x – 120x | ~2x | ~10x |
| Memory usage | O(runs) ~ O(n) worst | O(alphabet size) | Sliding window + dictionary |
| Patent issues | None | None | Patented in early days, now free |
| Streaming support | Trivial | Needs Huffman tree first | Yes, with window |
| Friendly for teaching | Yes | Yes | Moderate |
🎯 Key Takeaways
- RLE replaces consecutive identical symbols with (count, symbol) — O(n) encode and decode.
- Compression ratio depends entirely on run length — 720 bytes of fax → 6 bytes (120x).
- Fails on alternating or random data — can expand the input by up to 2x.
- Used in fax (T4 standard), BMP, PCX, TIFF, and as a building block in bzip2.
- Simplest possible compressor — good teaching introduction to the compression/entropy tradeoff.
- Production pitfalls: fixed-size count fields, forgetting final run, applying RLE to wrong data.
⚠ Common Mistakes to Avoid
Interview Questions on This Topic
- QWhat type of data does RLE compress well, and what data does it expand?JuniorReveal
- QImplement RLE encode and decode for a string.JuniorReveal
- QHow is RLE used in the bzip2 compression pipeline?Mid-levelReveal
- QWhat is the worst-case expansion ratio for RLE?SeniorReveal
Frequently Asked Questions
Does JPEG use RLE?
JPEG uses RLE for zero-coefficient runs in its zig-zag scan step (encoding runs of zeros after quantisation), combined with Huffman coding. It is not pure RLE — the DCT and quantisation stages do the heavy lifting.
Is RLE used in modern formats like PNG?
PNG uses DEFLATE (LZ77 + Huffman), not standalone RLE. However, some webP and lossless JPEG variants include RLE-like steps. For modern lossless compression, RLE appears only as one part of a pipeline, never alone.
Can RLE compress encrypted data?
No. Encrypted data is indistinguishable from random data. RLE will expand it by roughly 2x. Encrypted data must be compressed before encryption, not after.
What is the difference between RLE and LZ77?
RLE only encodes runs of identical symbols. LZ77 encodes arbitrary repeated sequences (not necessarily repeated characters). RLE is a special case of LZ77 where the match is always length N of the same symbol. LZ77 is more general and achieves higher compression on text.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.