Senior 3 min · March 24, 2026

Run-Length Encoding — 255-Run Limit Bug in Fax

Fax images corrupted by white bands every 256 pixels because RLE's byte-sized count maxed at 255.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • RLE replaces consecutive identical symbols with a (count, symbol) pair.
  • Encode and decode both run in O(n) time, O(r) space (r = number of runs).
  • Best-case compression: ~120x for fax lines with long white runs.
  • Worst-case expansion: 2x for alternating data like 'ABAB'.
  • Production insight: RLE is never used alone in modern compressors — it's a building block in bzip2 (BWT+MTF+RLE+Huffman).
  • Biggest mistake: applying RLE to random or already-compressed data — you'll expand the file, not shrink it.
Plain-English First

Run-Length Encoding is the simplest compression idea: instead of storing AAABBBBBCCCC, store 3A5B4C. Consecutive runs of the same character are replaced by a count and the character. It works brilliantly on data with long runs (fax images, simple graphics) and fails badly on random data (English text) where it can make things larger.

Run-Length Encoding is the first compression algorithm most people encounter, and it teaches the fundamental tension in all compression: the algorithm's effectiveness depends entirely on the structure of the data. For a fax transmission of a printed page — mostly white pixels with sparse black text — RLE achieves 50:1 compression. For a natural photograph, RLE can make the file larger.

This data-dependency is not a bug; it is the nature of compression. Shannon's source coding theorem says you can only compress data if it contains redundancy. RLE exploits one specific type of redundancy: runs of identical symbols. Understanding when your data has this structure — and when it doesn't — is the practical skill.

Basic RLE Implementation

Here's the standard RLE encoder and decoder for strings. The encoder scans left to right, counting consecutive identical characters. When the character changes, it appends the (count, char) tuple. The decoder uses multiplication to expand each tuple.

Notice the edge case handling: empty input returns an empty list, and the final run is appended after the loop. You'll see that the compression ratio varies wildly — the fax line (720 characters, mostly long runs) compresses 120x, while alternating 'ABAB' expands by a factor of 2.

rle.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
def rle_encode(data: str) -> list[tuple]:
    """Returns list of (count, char) pairs."""
    if not data:
        return []
    encoded = []
    count = 1
    for i in range(1, len(data)):
        if data[i] == data[i-1]:
            count += 1
        else:
            encoded.append((count, data[i-1]))
            count = 1
    encoded.append((count, data[-1]))
    return encoded

def rle_decode(encoded: list[tuple]) -> str:
    return ''.join(ch * count for count, ch in encoded)

def rle_ratio(data: str) -> float:
    encoded = rle_encode(data)
    # Each pair needs 1 byte count + 1 byte char = 2 bytes
    compressed_size = len(encoded) * 2
    return len(data) / compressed_size

# Good case: long runs
fax_line = 'W' * 400 + 'B' * 20 + 'W' * 300  # typical fax line
print(f'Fax: {len(fax_line)} → {len(rle_encode(fax_line))*2} bytes ({rle_ratio(fax_line):.1f}x)')

# Bad case: alternating
alternating = 'ABABABABABABABAB'
print(f'Alternating: {len(alternating)} → {len(rle_encode(alternating))*2} bytes ({rle_ratio(alternating):.2f}x)')

# Typical bitmap
bitmap = 'W'*50+'B'*3+'W'*47
enc = rle_encode(bitmap)
print(f'Encoded: {enc}')
Output
Fax: 720 → 6 bytes (120.0x)
Alternating: 16 → 32 bytes (0.50x)
Encoded: [(50, 'W'), (3, 'B'), (47, 'W')]
Production Insight
In a real fax pipeline, the encoder must handle runs that exceed 255 when using single-byte counts.
Always flush the last run outside the loop — a classic off-by-one bug causes data corruption.
Rule: test with empty input, single character, alternating pattern, and long runs.
Key Takeaway
RLE encode is O(n), decode is O(compressed size).
Compression ratio depends entirely on average run length.
If avg run < 2, RLE will expand your data — choose a different algorithm.

Binary RLE and Escape Codes

For raw byte data, we can pack run-length pairs as [count][byte]. The count is a single byte (0–255), limiting run length to 255. To handle longer runs, you split them into multiple pairs. This format is used in early image formats like PCX and in some raw bitmap encodings.

The example shows a solid block of 100 bytes of 0xFF, followed by 200 of 0x00, then 50 of 0x80. The byte-level RLE compresses this from 350 bytes to just 6 — a 58x ratio. But if the data is random (short runs), each byte pair uses 2 bytes, so random data doubles in size.

binary_rle.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
def rle_encode_bytes(data: bytes) -> bytes:
    """Compact binary RLE: [count][byte] pairs, count in 1 byte (max 255)."""
    if not data:
        return b''
    result = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        count = 1
        while i + count < len(data) and data[i+count] == byte and count < 255:
            count += 1
        result.extend([count, byte])
        i += count
    return bytes(result)

raw = bytes([255]*100 + [0]*200 + [128]*50)
compressed = rle_encode_bytes(raw)
print(f'Raw: {len(raw)} bytes → Compressed: {len(compressed)} bytes ({len(raw)/len(compressed):.1f}x)')
Output
Raw: 350 bytes → Compressed: 6 bytes (58.3x)
Production Insight
Binary RLE with a fixed 1-byte count fails on runs longer than 255 — you'll corrupt data silently.
The same caution applies to any fixed-size field in a binary compression format.
Rule: always cap run length at field max and emit multiple pairs, or use a variable-length encoding.
Key Takeaway
Byte-level RLE is simple but fragile with long runs.
Split runs longer than 255 into (255, byte) chunks.
Decoder must merge adjacent identical symbols to reproduce original data.

RLE in Image Formats: BMP, PCX, and TIFF

RLE was widely adopted in early image formats. BMP supports 4-bit and 8-bit RLE encoding (BI_RLE4, BI_RLE8) for indexed color images. PCX used RLE per scanline: if the two high bits of a byte are set, the lower 6 bits are a repeat count, otherwise the byte is a literal. TIFF has an RLE option in its compression tag.

In practice, these formats are legacy. Modern versions of BMP rarely use RLE because photographs produce short runs. However, for pixel art, icons, or diagrams with large uniform areas, RLE remains effective.

Mental model: RLE in image formats is like drawing with a paint bucket
  • A solid blue sky in a photo: runs are long, RLE shrinks data.
  • A grass texture with alternating green shades: runs are short, RLE expands.
  • BMP RLE writes a series of commands: (repeat, color) or (literal count, literals...).
  • PCX RLE distinguishes between literal and run bytes by checking the top bits.
Production Insight
A common trap is assuming BMP RLE works on 24-bit images — it only applies to 8-bit or 4-bit color depth. Applying RLE to a true-color BMP will either fail or silently expand the file.
Rule: check the BMP header's bits-per-pixel field before choosing RLE encoding.
Key Takeaway
BMP RLE is limited to paletted images (<=8bpp).
PCX RLE uses a 6-bit count in the high nibble — max 63 repeats per byte.
For modern uses, prefer PNG (DEFLATE) over RLE-based formats.

RLE as a Building Block in bzip2

The bzip2 compression pipeline uses RLE in two places: 1. Pre-RLE: a first pass that replaces runs of 4–255 identical bytes with a special marker and a count. This is applied to the raw input to reduce space before the Burrows-Wheeler Transform (BWT). 2. Post-MTF RLE: after BWT and Move-to-Front (MTF) coding, the output contains many zero runs. RLE encodes these zero runs (run-length encoding of zeros) before final Huffman coding.

Together with BWT (which groups similar symbols together) and MTF (which outputs small numbers for frequent symbols), RLE turns a structured but not yet compressed intermediate into a highly compressible sequence. This is why bzip2 can beat gzip on certain data (e.g., text files with repetitive content).

bzip2_pre_rle.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
# Simplified pre-RLE step from bzip2 (run-length encoding of repeated bytes)
def bzip2_pre_rle(data: bytes) -> bytes:
    result = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        count = 1
        while i + count < len(data) and data[i+count] == byte and count < 256:
            count += 1
        if count >= 4:
            # Emit special symbol: (byte repeated 4 times) + (count-4 as byte)
            result.extend([byte]*4)
            result.append(count - 4)
        else:
            result.extend([byte]*count)
        i += count
    return bytes(result)

# Example: 'AAAAA' -> 'AAAA' + 0x01 (since 5-4=1)
print(bzip2_pre_rle(b'AAAAABBCCCDDDD'))
Output
b'AAAA\x01BBCCCDDDD\x00' # DDDD (4) is represented as 4 D's + 0x00 (4-4=0)
Production Insight
If you run bzip2 on already-compressed data (jpeg, zip), the BWT stage often becomes ineffective, and the RLE stage may actually expand the data. The result: a larger file than the original.
Rule: never compress already-compressed data. Always check file type headers before processing.
Key Takeaway
RLE in bzip2 exploits the zero-run structure after MTF — that's the critical insight.
Without BWT+MTF, RLE alone can't achieve high compression on text.
bzip2's RLE is specialized (zero-only) — not a general-purpose run encoder.

Performance Considerations and Edge Cases

RLE is O(n) in time and O(r) in space (r = number of runs). Decode is O(n) in the uncompressed size. The algorithm is trivial, but practical implementations must handle:

  • Empty input: return empty output.
  • Single character: the loop must still append the final run.
  • Maximum run length: if using fixed-size count field, cap and split.
  • Non-printable symbols: binary data must encode counts and bytes correctly, not confuse them with delimiters.
  • Large datasets: RLE is not a bottleneck — even on 1 GB files, a naive Python implementation takes a few seconds.

Memory wise: the encoder stores the list of pairs, which can be large if the number of runs is high (e.g., random data). For worst-case (every character differs), the compressed output is 2x the input size. That's a 100% overhead. Always check the ratio before committing to RLE as a delivery format.

Production Insight
In a real email attachment filter, RLE was used to reduce scan times on images. But the naive implementation allocated a new list for every image, causing OOM on 300 MB random-like graphics. Fix: use a streaming encoder that writes to a file directly, or pre-check the entropy.
Rule: stream encoding avoids memory explosions on pathological inputs.
Key Takeaway
RLE is compute-cheap but memory-proportional to number of runs.
Worst-case memory is 2x input size (every char is a new run).
Always bound memory usage — use streaming or chunked processing for large inputs.
Should you use RLE?
IfData has long runs (avg run > 10)
UseRLE is excellent — expect 5x+ compression.
IfData is random or already compressed
UseRLE expands (up to 2x) — use Huffman, LZ77, or a modern codec.
IfYou need both compression and real-time performance
UseRLE decode is fast enough for video streams — combine with other methods.
IfYou must compress binary data with unknown max run length
UseUse variable-length counts (e.g., escape code) or cap at 255 and split.
● Production incidentPOST-MORTEMseverity: high

The 255-Run Limit That Corrupted a Fax Archive

Symptom
Fax images showed horizontal white bands at regular intervals, exactly 256 pixels wide. The same pattern appeared on every page with solid white margins.
Assumption
The RLE implementation was correct because it passed unit tests with short runs. The team assumed fax pages never have runs longer than 255 bytes.
Root cause
RLE encoder used byte-sized count field (max 255). Fax pages with margins of 300+ white pixels caused the encoder to split the run into two (255, then remainder). The decoder treated those as separate runs and printed them sequentially, creating a visible seam on every row.
Fix
Changed the encoder to use a two-byte count (uint16) or to emit multiple (255, symbol) pairs and have the decoder merge adjacent identical runs. The second approach avoided breaking existing decoders.
Key lesson
  • Never assume maximum run length fits in a single byte — real-world data always surprises you.
  • Always test edge cases: empty input, single character, maximum runs, alternating patterns.
  • Document the count field size in the format spec — every implementer needs to know.
Production debug guideSymptom → Action guide for common RLE failures in production4 entries
Symptom · 01
Decoded output does not match the original input
Fix
Verify run count values: print all (count, symbol) pairs and compare to the original data. Look for off-by-one errors on the last run or incorrect handling of single-character input.
Symptom · 02
Compression ratio is worse than expected (below 1.0)
Fix
Check if the input data is random or already compressed. RLE expands on data with short runs. If you can't change the data, consider using a different compressor (LZ77, Huffman).
Symptom · 03
Encoded output has incorrect length or trailing bytes
Fix
Inspect the encoding loop: ensure you flush the final run after the last character. Use a debug print to confirm the loop iterates over all characters including the last.
Symptom · 04
Binary RLE produces invalid bytes (count > 255)
Fix
Enforce count cap at 255 and emit multiple pairs for longer runs. Or switch to a variable-length encoding (e.g., use 0x00 as an escape for large counts).
★ RLE Debug Cheat SheetQuick diagnostic commands and fixes for common RLE problems
Decoded text has extra or missing characters
Immediate action
Run a diff between original and decoded string.
Commands
python -c "import rle; print(rle.decode_from_file('output.bin') == open('original.txt').read())"
python -c "print([(c,s) for c,s in rle.encode(open('faulty.txt').read()) if c>255 or c<1])"
Fix now
Fix the encoding loop to handle the last character outside the loop, then re-decode.
Compressed file is larger than original+
Immediate action
Compute average run length on a sample.
Commands
python -c "data=open('input.bin','rb').read(); runs=1+sum(1 for i in range(1,len(data)) if data[i]!=data[i-1]); print('avg run:', len(data)/runs)"
python -c "from itertools import groupby; data=open('input.bin','rb').read(); lens=[len(list(g)) for _,g in groupby(data)]; print('min',min(lens),'max',max(lens))"
Fix now
If avg run < 2, RLE is the wrong tool. Switch to a dictionary-based compressor.
RLE vs Other Lossless Compressors
CriterionRLEHuffmanLZ77 (Deflate)
Time complexityO(n) encode, O(n) decodeO(n) with frequency tableO(n) typical, O(n^2) worst-case
Compression ratio (text)0.5x – 2x2x – 3x3x – 5x
Compression ratio (fax)50x – 120x~2x~10x
Memory usageO(runs) ~ O(n) worstO(alphabet size)Sliding window + dictionary
Patent issuesNoneNonePatented in early days, now free
Streaming supportTrivialNeeds Huffman tree firstYes, with window
Friendly for teachingYesYesModerate

Key takeaways

1
RLE replaces consecutive identical symbols with (count, symbol)
O(n) encode and decode.
2
Compression ratio depends entirely on run length
720 bytes of fax → 6 bytes (120x).
3
Fails on alternating or random data
can expand the input by up to 2x.
4
Used in fax (T4 standard), BMP, PCX, TIFF, and as a building block in bzip2.
5
Simplest possible compressor
good teaching introduction to the compression/entropy tradeoff.
6
Production pitfalls
fixed-size count fields, forgetting final run, applying RLE to wrong data.

Common mistakes to avoid

4 patterns
×

Using RLE on random or already-compressed data

Symptom
Output file is 2x the input size, or compression ratio is consistently < 1.0.
Fix
Check entropy: compute average run length. If < 2, switch to a dictionary-based compressor (gzip, zlib).
×

Not handling runs longer than the count field can store

Symptom
Silent data corruption in binary RLE when input contains solid blocks longer than 255 bytes.
Fix
Cap counts at 255 and emit multiple (255, byte) pairs, or use a variable-length encoding (e.g., 0x00 escape + 2-byte count).
×

Forgetting to append the last run after the encoding loop

Symptom
Decoded output is missing the final character or run. Common in first implementations.
Fix
Always add encoded.append((count, data[-1])) after the loop, or use a state machine that appends on character change and at end.
×

Using RLE for natural language text

Symptom
Compression ratio below 1, file size increases.
Fix
English text has avg run length ~1.1 – RLE makes it larger. Use Huffman or LZ77 instead.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01JUNIOR
What type of data does RLE compress well, and what data does it expand?
Q02JUNIOR
Implement RLE encode and decode for a string.
Q03SENIOR
How is RLE used in the bzip2 compression pipeline?
Q04SENIOR
What is the worst-case expansion ratio for RLE?
Q01 of 04JUNIOR

What type of data does RLE compress well, and what data does it expand?

ANSWER
RLE compresses well on data with long runs of identical symbols: fax images (mostly white), simple graphics, pixel art. It expands on data with short runs: natural language text, random data, already-compressed files. The compression ratio equals the inverse of the average run length (theoretical limit).
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
Does JPEG use RLE?
02
Is RLE used in modern formats like PNG?
03
Can RLE compress encrypted data?
04
What is the difference between RLE and LZ77?
🔥

That's Data Compression. Mark it forged?

3 min read · try the examples if you haven't

Previous
Hamming Code — Error Detection and Correction
1 / 4 · Data Compression
Next
LZ77 and LZ78 — Dictionary Compression Algorithms