DSA Beginner

Run-Length Encoding — 255-Run Limit Bug in Fax

📅 March 24, 2026 ⏱ 3 min read 🎯 Beginner

Where developers are forged. · Structured learning · Free forever.

📍 Part of: Data Compression → Topic 1 of 4

Fax images corrupted by white bands every 256 pixels because RLE's byte-sized count maxed at 255.

🧑‍💻 Beginner-friendly — no prior DSA experience needed

In this tutorial, you'll learn

Fax images corrupted by white bands every 256 pixels because RLE's byte-sized count maxed at 255.

RLE replaces consecutive identical symbols with (count, symbol) — O(n) encode and decode.
Compression ratio depends entirely on run length — 720 bytes of fax → 6 bytes (120x).
Fails on alternating or random data — can expand the input by up to 2x.

✦ Plain-English analogy ✦ Real code with output ✦ Interview questions

⚡Quick Answer

RLE replaces consecutive identical symbols with a (count, symbol) pair.
Encode and decode both run in O(n) time, O(r) space (r = number of runs).
Best-case compression: ~120x for fax lines with long white runs.
Worst-case expansion: 2x for alternating data like 'ABAB'.
Production insight: RLE is never used alone in modern compressors — it's a building block in bzip2 (BWT+MTF+RLE+Huffman).
Biggest mistake: applying RLE to random or already-compressed data — you'll expand the file, not shrink it.

🚨 START HERE

RLE Debug Cheat Sheet

Quick diagnostic commands and fixes for common RLE problems

🟡

Decoded text has extra or missing characters

Immediate ActionRun a diff between original and decoded string.

Commands

python -c "import rle; print(rle.decode_from_file('output.bin') == open('original.txt').read())"

python -c "print([(c,s) for c,s in rle.encode(open('faulty.txt').read()) if c>255 or c<1])"

Fix NowFix the encoding loop to handle the last character outside the loop, then re-decode.

🟡

Compressed file is larger than original

Immediate ActionCompute average run length on a sample.

Commands

python -c "data=open('input.bin','rb').read(); runs=1+sum(1 for i in range(1,len(data)) if data[i]!=data[i-1]); print('avg run:', len(data)/runs)"

python -c "from itertools import groupby; data=open('input.bin','rb').read(); lens=[len(list(g)) for _,g in groupby(data)]; print('min',min(lens),'max',max(lens))"

Fix NowIf avg run < 2, RLE is the wrong tool. Switch to a dictionary-based compressor.

Production Incident

The 255-Run Limit That Corrupted a Fax Archive

A production fax server silently corrupted images because the RLE encoder used a single byte for run counts, capping at 255. Runs longer than that were split, but the decoder assumed contiguous runs were wrapped — producing distorted pages.

SymptomFax images showed horizontal white bands at regular intervals, exactly 256 pixels wide. The same pattern appeared on every page with solid white margins.

AssumptionThe RLE implementation was correct because it passed unit tests with short runs. The team assumed fax pages never have runs longer than 255 bytes.

Root causeRLE encoder used byte-sized count field (max 255). Fax pages with margins of 300+ white pixels caused the encoder to split the run into two (255, then remainder). The decoder treated those as separate runs and printed them sequentially, creating a visible seam on every row.

FixChanged the encoder to use a two-byte count (uint16) or to emit multiple (255, symbol) pairs and have the decoder merge adjacent identical runs. The second approach avoided breaking existing decoders.

Key Lesson

Never assume maximum run length fits in a single byte — real-world data always surprises you.Always test edge cases: empty input, single character, maximum runs, alternating patterns.Document the count field size in the format spec — every implementer needs to know.

Production Debug Guide

Symptom → Action guide for common RLE failures in production

Decoded output does not match the original input→Verify run count values: print all (count, symbol) pairs and compare to the original data. Look for off-by-one errors on the last run or incorrect handling of single-character input.

Compression ratio is worse than expected (below 1.0)→Check if the input data is random or already compressed. RLE expands on data with short runs. If you can't change the data, consider using a different compressor (LZ77, Huffman).

Encoded output has incorrect length or trailing bytes→Inspect the encoding loop: ensure you flush the final run after the last character. Use a debug print to confirm the loop iterates over all characters including the last.

Binary RLE produces invalid bytes (count > 255)→Enforce count cap at 255 and emit multiple pairs for longer runs. Or switch to a variable-length encoding (e.g., use 0x00 as an escape for large counts).

Run-Length Encoding is the first compression algorithm most people encounter, and it teaches the fundamental tension in all compression: the algorithm's effectiveness depends entirely on the structure of the data. For a fax transmission of a printed page — mostly white pixels with sparse black text — RLE achieves 50:1 compression. For a natural photograph, RLE can make the file larger.

This data-dependency is not a bug; it is the nature of compression. Shannon's source coding theorem says you can only compress data if it contains redundancy. RLE exploits one specific type of redundancy: runs of identical symbols. Understanding when your data has this structure — and when it doesn't — is the practical skill.

Basic RLE Implementation

Here's the standard RLE encoder and decoder for strings. The encoder scans left to right, counting consecutive identical characters. When the character changes, it appends the (count, char) tuple. The decoder uses multiplication to expand each tuple.

Notice the edge case handling: empty input returns an empty list, and the final run is appended after the loop. You'll see that the compression ratio varies wildly — the fax line (720 characters, mostly long runs) compresses 120x, while alternating 'ABAB' expands by a factor of 2.

rle.py · PYTHON

123456789101112131415161718192021222324252627282930313233343536

def rle_encode(data: str) -> list[tuple]:
    """Returns list of (count, char) pairs."""
    if not data:
        return []
    encoded = []
    count = 1
    for i in range(1, len(data)):
        if data[i] == data[i-1]:
            count += 1
        else:
            encoded.append((count, data[i-1]))
            count = 1
    encoded.append((count, data[-1]))
    return encoded

def rle_decode(encoded: list[tuple]) -> str:
    return ''.join(ch * count for count, ch in encoded)

def rle_ratio(data: str) -> float:
    encoded = rle_encode(data)
    # Each pair needs 1 byte count + 1 byte char = 2 bytes
    compressed_size = len(encoded) * 2
    return len(data) / compressed_size

# Good case: long runs
fax_line = 'W' * 400 + 'B' * 20 + 'W' * 300  # typical fax line
print(f'Fax: {len(fax_line)} → {len(rle_encode(fax_line))*2} bytes ({rle_ratio(fax_line):.1f}x)')

# Bad case: alternating
alternating = 'ABABABABABABABAB'
print(f'Alternating: {len(alternating)} → {len(rle_encode(alternating))*2} bytes ({rle_ratio(alternating):.2f}x)')

# Typical bitmap
bitmap = 'W'*50+'B'*3+'W'*47
enc = rle_encode(bitmap)
print(f'Encoded: {enc}')

▶ Output

Fax: 720 → 6 bytes (120.0x)
Alternating: 16 → 32 bytes (0.50x)
Encoded: [(50, 'W'), (3, 'B'), (47, 'W')]

📊 Production Insight

In a real fax pipeline, the encoder must handle runs that exceed 255 when using single-byte counts.

Always flush the last run outside the loop — a classic off-by-one bug causes data corruption.

Rule: test with empty input, single character, alternating pattern, and long runs.

🎯 Key Takeaway

RLE encode is O(n), decode is O(compressed size).

Compression ratio depends entirely on average run length.

If avg run < 2, RLE will expand your data — choose a different algorithm.

Binary RLE and Escape Codes

For raw byte data, we can pack run-length pairs as [count][byte]. The count is a single byte (0–255), limiting run length to 255. To handle longer runs, you split them into multiple pairs. This format is used in early image formats like PCX and in some raw bitmap encodings.

The example shows a solid block of 100 bytes of 0xFF, followed by 200 of 0x00, then 50 of 0x80. The byte-level RLE compresses this from 350 bytes to just 6 — a 58x ratio. But if the data is random (short runs), each byte pair uses 2 bytes, so random data doubles in size.

binary_rle.py · PYTHON

123456789101112131415161718

def rle_encode_bytes(data: bytes) -> bytes:
    """Compact binary RLE: [count][byte] pairs, count in 1 byte (max 255)."""
    if not data:
        return b''
    result = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        count = 1
        while i + count < len(data) and data[i+count] == byte and count < 255:
            count += 1
        result.extend([count, byte])
        i += count
    return bytes(result)

raw = bytes([255]*100 + [0]*200 + [128]*50)
compressed = rle_encode_bytes(raw)
print(f'Raw: {len(raw)} bytes → Compressed: {len(compressed)} bytes ({len(raw)/len(compressed):.1f}x)')

▶ Output

Raw: 350 bytes → Compressed: 6 bytes (58.3x)

📊 Production Insight

Binary RLE with a fixed 1-byte count fails on runs longer than 255 — you'll corrupt data silently.

The same caution applies to any fixed-size field in a binary compression format.

Rule: always cap run length at field max and emit multiple pairs, or use a variable-length encoding.

🎯 Key Takeaway

Byte-level RLE is simple but fragile with long runs.

Split runs longer than 255 into (255, byte) chunks.

Decoder must merge adjacent identical symbols to reproduce original data.

RLE in Image Formats: BMP, PCX, and TIFF

RLE was widely adopted in early image formats. BMP supports 4-bit and 8-bit RLE encoding (BI_RLE4, BI_RLE8) for indexed color images. PCX used RLE per scanline: if the two high bits of a byte are set, the lower 6 bits are a repeat count, otherwise the byte is a literal. TIFF has an RLE option in its compression tag.

In practice, these formats are legacy. Modern versions of BMP rarely use RLE because photographs produce short runs. However, for pixel art, icons, or diagrams with large uniform areas, RLE remains effective.

Mental Model

Mental model: RLE in image formats is like drawing with a paint bucket

Think of RLE as telling the computer "paint this color for the next N pixels" instead of listing every pixel.

A solid blue sky in a photo: runs are long, RLE shrinks data.
A grass texture with alternating green shades: runs are short, RLE expands.
BMP RLE writes a series of commands: (repeat, color) or (literal count, literals...).
PCX RLE distinguishes between literal and run bytes by checking the top bits.

📊 Production Insight

A common trap is assuming BMP RLE works on 24-bit images — it only applies to 8-bit or 4-bit color depth. Applying RLE to a true-color BMP will either fail or silently expand the file.

Rule: check the BMP header's bits-per-pixel field before choosing RLE encoding.

🎯 Key Takeaway

BMP RLE is limited to paletted images (<=8bpp).

PCX RLE uses a 6-bit count in the high nibble — max 63 repeats per byte.

For modern uses, prefer PNG (DEFLATE) over RLE-based formats.

RLE as a Building Block in bzip2

The bzip2 compression pipeline uses RLE in two places: 1. Pre-RLE: a first pass that replaces runs of 4–255 identical bytes with a special marker and a count. This is applied to the raw input to reduce space before the Burrows-Wheeler Transform (BWT). 2. Post-MTF RLE: after BWT and Move-to-Front (MTF) coding, the output contains many zero runs. RLE encodes these zero runs (run-length encoding of zeros) before final Huffman coding.

Together with BWT (which groups similar symbols together) and MTF (which outputs small numbers for frequent symbols), RLE turns a structured but not yet compressed intermediate into a highly compressible sequence. This is why bzip2 can beat gzip on certain data (e.g., text files with repetitive content).

bzip2_pre_rle.py · PYTHON

1234567891011121314151617181920

# Simplified pre-RLE step from bzip2 (run-length encoding of repeated bytes)
def bzip2_pre_rle(data: bytes) -> bytes:
    result = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        count = 1
        while i + count < len(data) and data[i+count] == byte and count < 256:
            count += 1
        if count >= 4:
            # Emit special symbol: (byte repeated 4 times) + (count-4 as byte)
            result.extend([byte]*4)
            result.append(count - 4)
        else:
            result.extend([byte]*count)
        i += count
    return bytes(result)

# Example: 'AAAAA' -> 'AAAA' + 0x01 (since 5-4=1)
print(bzip2_pre_rle(b'AAAAABBCCCDDDD'))

▶ Output

b'AAAA\x01BBCCCDDDD\x00' # DDDD (4) is represented as 4 D's + 0x00 (4-4=0)

📊 Production Insight

If you run bzip2 on already-compressed data (jpeg, zip), the BWT stage often becomes ineffective, and the RLE stage may actually expand the data. The result: a larger file than the original.

Rule: never compress already-compressed data. Always check file type headers before processing.

🎯 Key Takeaway

RLE in bzip2 exploits the zero-run structure after MTF — that's the critical insight.

Without BWT+MTF, RLE alone can't achieve high compression on text.

bzip2's RLE is specialized (zero-only) — not a general-purpose run encoder.

Performance Considerations and Edge Cases

RLE is O(n) in time and O(r) in space (r = number of runs). Decode is O(n) in the uncompressed size. The algorithm is trivial, but practical implementations must handle:

Empty input: return empty output.
Single character: the loop must still append the final run.
Maximum run length: if using fixed-size count field, cap and split.
Non-printable symbols: binary data must encode counts and bytes correctly, not confuse them with delimiters.
Large datasets: RLE is not a bottleneck — even on 1 GB files, a naive Python implementation takes a few seconds.

Memory wise: the encoder stores the list of pairs, which can be large if the number of runs is high (e.g., random data). For worst-case (every character differs), the compressed output is 2x the input size. That's a 100% overhead. Always check the ratio before committing to RLE as a delivery format.

📊 Production Insight

In a real email attachment filter, RLE was used to reduce scan times on images. But the naive implementation allocated a new list for every image, causing OOM on 300 MB random-like graphics. Fix: use a streaming encoder that writes to a file directly, or pre-check the entropy.

Rule: stream encoding avoids memory explosions on pathological inputs.

🎯 Key Takeaway

RLE is compute-cheap but memory-proportional to number of runs.

Worst-case memory is 2x input size (every char is a new run).

Always bound memory usage — use streaming or chunked processing for large inputs.

Should you use RLE?

IfData has long runs (avg run > 10)

→

UseRLE is excellent — expect 5x+ compression.

IfData is random or already compressed

→

UseRLE expands (up to 2x) — use Huffman, LZ77, or a modern codec.

IfYou need both compression and real-time performance

→

UseRLE decode is fast enough for video streams — combine with other methods.

IfYou must compress binary data with unknown max run length

→

UseUse variable-length counts (e.g., escape code) or cap at 255 and split.

🗂 RLE vs Other Lossless Compressors

When to use RLE and when to skip it

Criterion	RLE	Huffman	LZ77 (Deflate)
Time complexity	O(n) encode, O(n) decode	O(n) with frequency table	O(n) typical, O(n^2) worst-case
Compression ratio (text)	0.5x – 2x	2x – 3x	3x – 5x
Compression ratio (fax)	50x – 120x	~2x	~10x
Memory usage	O(runs) ~ O(n) worst	O(alphabet size)	Sliding window + dictionary
Patent issues	None	None	Patented in early days, now free
Streaming support	Trivial	Needs Huffman tree first	Yes, with window
Friendly for teaching	Yes	Yes	Moderate

🎯 Key Takeaways

RLE replaces consecutive identical symbols with (count, symbol) — O(n) encode and decode.
Compression ratio depends entirely on run length — 720 bytes of fax → 6 bytes (120x).
Fails on alternating or random data — can expand the input by up to 2x.
Used in fax (T4 standard), BMP, PCX, TIFF, and as a building block in bzip2.
Simplest possible compressor — good teaching introduction to the compression/entropy tradeoff.
Production pitfalls: fixed-size count fields, forgetting final run, applying RLE to wrong data.

⚠ Common Mistakes to Avoid

✕Using RLE on random or already-compressed data

Symptom

Output file is 2x the input size, or compression ratio is consistently < 1.0.

Fix

Check entropy: compute average run length. If < 2, switch to a dictionary-based compressor (gzip, zlib).

✕Not handling runs longer than the count field can store

Symptom

Silent data corruption in binary RLE when input contains solid blocks longer than 255 bytes.

Fix

Cap counts at 255 and emit multiple (255, byte) pairs, or use a variable-length encoding (e.g., 0x00 escape + 2-byte count).

✕Forgetting to append the last run after the encoding loop

Symptom

Decoded output is missing the final character or run. Common in first implementations.

Fix

Always add encoded.append((count, data[-1])) after the loop, or use a state machine that appends on character change and at end.

✕Using RLE for natural language text

Symptom

Compression ratio below 1, file size increases.

Fix

English text has avg run length ~1.1 – RLE makes it larger. Use Huffman or LZ77 instead.

Interview Questions on This Topic

QWhat type of data does RLE compress well, and what data does it expand?JuniorReveal
RLE compresses well on data with long runs of identical symbols: fax images (mostly white), simple graphics, pixel art. It expands on data with short runs: natural language text, random data, already-compressed files. The compression ratio equals the inverse of the average run length (theoretical limit).
QImplement RLE encode and decode for a string.JuniorReveal
``python def rle_encode(s): if not s: return [] res, count = [], 1 for i in range(1, len(s)): if s[i] == s[i-1]: count += 1 else: res.append((count, s[i-1])); count = 1 res.append((count, s[-1])) return res def rle_decode(enc): return ''.join(ch*c for c,ch in enc)`` Edge cases: empty string, single character, alternating patterns.
QHow is RLE used in the bzip2 compression pipeline?Mid-levelReveal
bzip2 applies RLE twice: first on the raw input (run-length encoding of identical bytes, runs >=4), and second after the Burrows-Wheeler Transform and Move-to-Front coding (encoding zero runs). The zero-run RLE exploits the large number of zeros produced by MTF. This is essential because BWT produces runs of repeated characters, and MTF maps them to zeros, so runs of zeros are long and compressible with RLE.
QWhat is the worst-case expansion ratio for RLE?SeniorReveal
The worst case is every character differs from its predecessor (e.g., alternating 'ABABAB...'). For a string of length n, RLE produces n pairs, each taking 2 units (count + symbol). So compressed size = 2n, original = n → expansion ratio 2.0. In practice, if the count field is larger than 1 byte or uses variable-length encoding, the expansion can be worse. The theoretical worst case for any lossless compressor is unbounded if the data is incompressible, but RLE's worst-case output size is O(input_size).

Frequently Asked Questions

Does JPEG use RLE?

JPEG uses RLE for zero-coefficient runs in its zig-zag scan step (encoding runs of zeros after quantisation), combined with Huffman coding. It is not pure RLE — the DCT and quantisation stages do the heavy lifting.

Is RLE used in modern formats like PNG?

PNG uses DEFLATE (LZ77 + Huffman), not standalone RLE. However, some webP and lossless JPEG variants include RLE-like steps. For modern lossless compression, RLE appears only as one part of a pipeline, never alone.

Can RLE compress encrypted data?

No. Encrypted data is indistinguishable from random data. RLE will expand it by roughly 2x. Encrypted data must be compressed before encryption, not after.

What is the difference between RLE and LZ77?

RLE only encodes runs of identical symbols. LZ77 encodes arbitrary repeated sequences (not necessarily repeated characters). RLE is a special case of LZ77 where the match is always length N of the same symbol. LZ77 is more general and achieves higher compression on text.

🔥

Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

About Naren Get in touch

Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged