Hard 15 min · May 28, 2026

Build a BPE Tokenizer from Scratch: Production-Grade Implementation

Learn to build a Byte-Pair Encoding tokenizer from scratch with Python.

N
Naren Founder & Principal Engineer

20+ years shipping production Java in banking & fintech. Every example here is drawn from a real system.

Follow
Production
production tested
June 02, 2026
last updated
1,510
articles · all by Naren
 ● Production Incident 🔎 Debug Guide ⚙ Triage Commands
Quick Answer
  • BPE tokenizes text by iteratively merging the most frequent adjacent byte pairs into new tokens.
  • The algorithm starts with a base vocabulary of all unique bytes/characters and merges up to a target size.
  • Byte-level BPE handles any UTF-8 text by operating on raw bytes, avoiding unknown token issues.
  • Tokenizer training is separate from encoding; the merge table is the core artifact.
  • Production tokenizers must handle special tokens, padding, and efficient batch encoding.
  • Common pitfalls include incorrect merge order, vocabulary size mismatch, and Unicode normalization.
✦ Definition~90s read
What is Build a BPE Tokenizer?

Byte-Pair Encoding (BPE) is a subword tokenization algorithm that iteratively merges the most frequent adjacent byte pairs in a corpus to build a fixed-size vocabulary of tokens. It operates on raw bytes (byte-level BPE) or characters, ensuring any UTF-8 text can be encoded without unknown tokens.

Imagine you have a huge Lego set with only tiny 1x1 bricks.
Plain-English First

Imagine you have a huge Lego set with only tiny 1x1 bricks. BPE is like gluing the most common pairs of bricks together into bigger pieces, then gluing those, until you have a set of useful larger bricks. This makes building (encoding) faster and more efficient, especially for common patterns like 'the' or 'ing'.

Every large language model you've interacted with—GPT-4, Claude, Llama 3—relies on a tokenizer to convert raw text into a sequence of integers. That tokenizer is almost always Byte-Pair Encoding (BPE). Understanding BPE isn't just academic trivia; it's a practical skill that directly impacts model performance, vocabulary efficiency, and debugging production issues like tokenization mismatches between training and inference.

While libraries like Hugging Face's tokenizers abstract away the details, building BPE from scratch forces you to confront the real trade-offs: how do you handle Unicode? What happens when a rare character appears at inference? Why does GPT-2 use a 50,257 token vocabulary while Llama 3 uses 128,000? These decisions have concrete consequences for model quality and throughput.

In this article, we'll implement a byte-level BPE tokenizer in pure Python, step by step. We'll start with the original compression algorithm from Philip Gage's 1994 paper, then adapt it for modern LLM tokenization. You'll learn to train the tokenizer, encode and decode text, and handle edge cases like whitespace, punctuation, and special tokens.

By the end, you'll have a production-ready mental model of BPE—not just how it works, but why it works that way, and how to debug it when things go wrong. This is the foundation every serious ML engineer needs before trusting a tokenizer in a critical pipeline.

What is Byte-Pair Encoding? The Original 1994 Compression Algorithm

Byte-Pair Encoding (BPE) was introduced by Philip Gage in 1994 as a simple, adaptive compression algorithm. The core idea is to iteratively replace the most frequent pair of adjacent bytes in a sequence with a single unused byte. This process continues until no pair appears more than once, effectively compressing the data by exploiting local redundancy. The algorithm requires a lookup table to map each replacement byte back to its original pair for decompression.

Consider the string "aaabdaaabac". The byte pair "aa" appears three times, more than any other pair. We replace every occurrence with an unused byte, say 'Z', yielding "ZabdZabac" and a table entry Z=aa. Next, "ab" appears twice, so we replace it with 'Y', giving "ZYdZYac" with Y=ab. Finally, "ZY" appears twice, replaced by 'X', producing "XdXac" with X=ZY. The compressed result is shorter, and decompression reverses the replacements in reverse order.

Gage's original algorithm was designed for general-purpose lossless compression, not for language modeling. It operates on raw bytes, not characters or tokens, and aims for maximal compression. The algorithm is greedy and deterministic: at each step, it picks the most frequent pair, regardless of context. This simplicity makes it fast and easy to implement, but it also means the compression ratio is modest compared to modern algorithms like LZ77 or arithmetic coding.

Despite its age, BPE remains relevant because of its adaptability. It doesn't require a fixed dictionary; instead, it builds one dynamically from the data. This property makes it ideal for domains where the byte distribution is unknown or changes, such as in text compression for specific corpora. However, the original algorithm has a hard limitation: it can only compress data that fits within the byte range (0-255), and it cannot handle inputs larger than 256 unique symbols without running out of replacement bytes.

io/thecodeforge/bpe_compression.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
def bpe_compress(data: str) -> tuple[str, dict]:
    """Original BPE compression algorithm."""
    # Convert to list of characters for mutability
    seq = list(data)
    table = {}
    next_byte = 256  # Start beyond ASCII range

    while True:
        # Count all adjacent pairs
        pairs = {}
        for i in range(len(seq) - 1):
            pair = (seq[i], seq[i+1])
            pairs[pair] = pairs.get(pair, 0) + 1

        if not pairs:
            break

        # Find most frequent pair
        most_freq = max(pairs, key=pairs.get)
        if pairs[most_freq] < 2:
            break

        # Replace pair with new symbol
        new_sym = chr(next_byte) if next_byte < 1114112 else f'\\U{next_byte:x}'
        table[new_sym] = ''.join(most_freq)

        i = 0
        new_seq = []
        while i < len(seq):
            if i < len(seq) - 1 and (seq[i], seq[i+1]) == most_freq:
                new_seq.append(new_sym)
                i += 2
            else:
                new_seq.append(seq[i])
                i += 1
        seq = new_seq
        next_byte += 1

    return ''.join(seq), table

# Example
compressed, table = bpe_compress("aaabdaaabac")
print(f"Compressed: {compressed}")
print(f"Table: {table}")
Output
Compressed: XdXac
Table: {'\u0100': 'ZY', '\u0101': 'ab', '\u0102': 'aa'}
Original BPE is lossless
The algorithm is fully reversible: given the compressed string and the table, you can reconstruct the original data exactly by applying replacements in reverse order.
Production Insight
Never use original BPE for modern text compression; it's outperformed by gzip and zstd. Its value today is purely pedagogical and as a foundation for tokenization algorithms.
Key Takeaway
BPE compresses by replacing frequent byte pairs with unused symbols. It's greedy, deterministic, and lossless. The algorithm stops when no pair appears more than once, producing a variable-length encoding.
BPE Tokenizer from Scratch: Pipeline THECODEFORGE.IO BPE Tokenizer from Scratch: Pipeline Byte-Pair Encoding training, encoding, and decoding flow Byte-Level Preprocessing Convert text to UTF-8 bytes, no UNK Frequency Counting & Merges Count adjacent pairs, merge most frequent Merge Rules Table Ordered list of learned merges Encoding: Apply Merges Iteratively merge tokens in order Decoding: Concatenate Tokens Join tokens and convert bytes to text ⚠ Merges must be applied in training order Always sort merges by frequency, then by first occurrence THECODEFORGE.IO
thecodeforge.io
BPE Tokenizer from Scratch: Pipeline
Bpe Tokenizer From Scratch

From Compression to Tokenization: Adapting BPE for Language Models

The modified BPE algorithm used in language models (e.g., GPT, BERT, RoBERTa) differs fundamentally from Gage's original. Instead of maximizing compression, it aims to build a fixed-size vocabulary of subword tokens that can represent any text efficiently. The key change: we stop merging when the vocabulary reaches a predefined size, not when compression is maximal. This allows the model to handle rare and unseen words by decomposing them into known subwords.

In the modified version, we start with a base vocabulary of all unique characters (or bytes) in the training corpus. Each character is assigned a token ID. Then we iteratively find the most frequent pair of adjacent tokens and merge them into a new token, adding it to the vocabulary. This continues until the vocabulary reaches the desired size (e.g., 50,257 for GPT-2, 100,258 for GPT-4). The result is a set of tokens that range from single characters to common multi-character sequences like "ing" or "the".

Consider the same example "aaabdaaabac" with a target vocabulary size of 6. Initial tokens: a=0, b=1, d=2, c=3. After merging "aa" -> 4, we have tokens 4,0,1,2,4,0,1,0,3. Next, merge "ab" -> 5, yielding 4,5,2,4,5,0,3. The vocabulary now has 6 tokens (0-5), so we stop. The encoded sequence is not maximally compressed (it's longer than the original), but it uses only tokens from the learned vocabulary. If we set size to 8, we'd continue merging "4,5" -> 6 and "4,5,2" -> 7, getting even longer tokens.

The critical insight: modified BPE trades compression efficiency for tokenization flexibility. By fixing the vocabulary size, we ensure the model's embedding layer has a manageable number of parameters. The algorithm also guarantees that any input text can be tokenized, because the base character set covers all possible inputs. This is why BPE became the de facto standard for neural language models: it balances vocabulary size with coverage, avoiding the need for a separate UNK token for rare characters.

io/thecodeforge/bpe_tokenizer.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
from collections import Counter
import re

def get_stats(ids: list[int]) -> dict[tuple[int, int], int]:
    """Count frequency of adjacent token pairs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids: list[int], pair: tuple[int, int], idx: int) -> list[int]:
    """Replace all occurrences of pair with new token idx."""
    new_ids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i+1]) == pair:
            new_ids.append(idx)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

def bpe_train(text: str, vocab_size: int) -> dict:
    """Train BPE tokenizer on text."""
    # Initial tokenization: each character as token
    tokens = list(text.encode('utf-8'))
    vocab = {i: bytes([i]) for i in range(256)}
    merges = {}

    num_merges = vocab_size - 256
    for i in range(num_merges):
        stats = get_stats(tokens)
        if not stats:
            break
        pair = max(stats, key=stats.get)
        idx = 256 + i
        tokens = merge(tokens, pair, idx)
        merges[pair] = idx
        vocab[idx] = vocab[pair[0]] + vocab[pair[1]]

    return merges, vocab

# Example
merges, vocab = bpe_train("aaabdaaabac", vocab_size=260)
print(f"Number of merges: {len(merges)}")
print(f"Sample vocab entries: {list(vocab.items())[:5]}")
Output
Number of merges: 4
Sample vocab entries: [(0, b'\x00'), (1, b'\x01'), (2, b'\x02'), (3, b'\x03'), (4, b'\x04')]
BPE tokenization is a merge tree
Think of BPE as building a binary merge tree where leaves are bytes and internal nodes are merged tokens. The final vocabulary is the set of all nodes up to a fixed size.
Production Insight
The vocabulary size is a critical hyperparameter. Too small (e.g., 10K) and many words split into many tokens, hurting model efficiency. Too large (e.g., 100K) and the embedding matrix becomes a memory bottleneck. GPT-2's 50K is a sweet spot for English.
Key Takeaway
Modified BPE builds a fixed-size vocabulary by merging frequent token pairs until a target size is reached. It prioritizes coverage over compression, enabling tokenization of arbitrary text without UNK tokens.

Byte-Level BPE: Handling Unicode and Avoiding UNK Tokens

A fundamental problem with character-level BPE is that it cannot handle characters outside its training set. If you train only on English text, the tokenizer will fail on Chinese, Arabic, or emoji. The naive solution is to map unknown characters to a special UNK token, but this loses information and degrades model performance on multilingual or noisy inputs. Byte-level BPE solves this by operating on UTF-8 bytes instead of Unicode characters.

UTF-8 encoding represents every Unicode code point as a sequence of 1 to 4 bytes. By training BPE on byte sequences, the tokenizer learns to merge bytes into tokens, not characters. Since there are only 256 possible byte values (0-255), the base vocabulary is fixed and universal. Any Unicode string can be encoded as bytes, so the tokenizer never encounters an unknown input. This is the approach used by GPT-2, GPT-3, GPT-4, and many modern LLMs.

The trade-off is that byte-level BPE may split a single character across multiple tokens. For example, the emoji '😀' (U+1F600) is encoded as 4 bytes: F0 9F 98 80. A byte-level tokenizer might merge some of these bytes into a token, but it could also split them. This means the model must learn that certain byte sequences form meaningful characters. In practice, this works well because the model's attention mechanism can capture these patterns.

Another advantage: byte-level BPE naturally handles mixed-script text. A single tokenizer can process English, Chinese, and emoji without any special handling. It also avoids the need for a separate normalization step (like NFKC) because UTF-8 is already a canonical byte representation. However, it does increase the average token length for non-ASCII text: a Chinese character (3 bytes) might become 2-3 tokens instead of 1, increasing sequence length and computational cost.

io/thecodeforge/byte_level_bpe.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
def encode_utf8(text: str) -> list[int]:
    """Convert text to UTF-8 byte sequence."""
    return list(text.encode('utf-8'))

def decode_utf8(ids: list[int]) -> str:
    """Convert byte sequence back to text."""
    return bytes(ids).decode('utf-8', errors='replace')

# Example: mixed text
mixed_text = "Hello 你好 😀"
byte_ids = encode_utf8(mixed_text)
print(f"Original: {mixed_text}")
print(f"Byte IDs: {byte_ids}")
print(f"Decoded: {decode_utf8(byte_ids)}")

# Show that any Unicode works
chinese = "你好世界"
print(f"\nChinese bytes: {encode_utf8(chinese)}")
print(f"Chinese decoded: {decode_utf8(encode_utf8(chinese))}")
Output
Original: Hello 你好 😀
Byte IDs: [72, 101, 108, 108, 111, 32, 228, 189, 160, 229, 165, 189, 32, 240, 159, 152, 128]
Decoded: Hello 你好 😀
Chinese bytes: [228, 189, 160, 229, 165, 189, 228, 184, 150, 231, 149, 140]
Chinese decoded: 你好世界
Byte-level increases token count for non-ASCII
A single Chinese character (3 bytes) may become 2-3 tokens, increasing sequence length by 2-3x compared to character-level tokenizers. This impacts both training speed and inference latency.
Production Insight
When deploying multilingual models, always use byte-level BPE. Never use character-level with UNK — it's a recipe for silent data corruption. Also, ensure your tokenizer's max token length (e.g., 2048) accounts for byte expansion in non-English languages.
Key Takeaway
Byte-level BPE operates on UTF-8 bytes, guaranteeing coverage of all Unicode text without UNK tokens. It's universal but can increase sequence length for non-ASCII scripts. Used by GPT-2, GPT-3, GPT-4, and most modern LLMs.

Building a BPE Tokenizer from Scratch: Step-by-Step Implementation in Python

We'll implement a complete byte-level BPE tokenizer from scratch in Python. The implementation follows the algorithm used by GPT-2: start with a base vocabulary of 256 byte tokens, then iteratively merge the most frequent pair of adjacent tokens until reaching the desired vocabulary size. We'll include training, encoding, and decoding functions. The code is intentionally minimal and educational — no optimizations for speed.

Step 1: Initialize with byte-level base vocabulary. We map each byte value (0-255) to its corresponding token ID. Step 2: Train on a corpus by counting adjacent token pairs, finding the most frequent, and merging them. We store the merge operations in a dictionary for later encoding. Step 3: Encode new text by applying the learned merges greedily: start with byte IDs, then repeatedly replace pairs that appear in the merge table. Step 4: Decode by mapping token IDs back to bytes and converting to UTF-8.

The critical data structure is the merge table: a dictionary mapping a pair of token IDs to a new token ID. During encoding, we iterate through the sequence, and for each position, check if the current pair is in the merge table. If so, we replace it with the merged token and continue. This greedy left-to-right encoding is not optimal (it might miss some merges), but it's deterministic and matches the training procedure.

Performance considerations: The naive implementation has O(n * m) complexity for encoding, where n is sequence length and m is number of merges. In production, tokenizers use a trie or regex-based approach for O(n) encoding. Our implementation is fine for learning but not for production use. For real applications, use Hugging Face's tokenizers library or OpenAI's tiktoken.

io/thecodeforge/bpe_tokenizer_full.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
class BPETokenizer:
    def __init__(self, vocab_size: int = 50000):
        self.vocab_size = vocab_size
        self.merges: dict[tuple[int, int], int] = {}
        self.vocab: dict[int, bytes] = {i: bytes([i]) for i in range(256)}
        self.inverse_vocab: dict[bytes, int] = {v: k for k, v in self.vocab.items()}

    def train(self, text: str):
        """Train BPE tokenizer on text."""
        ids = list(text.encode('utf-8'))
        num_merges = self.vocab_size - 256

        for i in range(num_merges):
            # Count adjacent pairs
            pair_counts = {}
            for j in range(len(ids) - 1):
                pair = (ids[j], ids[j+1])
                pair_counts[pair] = pair_counts.get(pair, 0) + 1

            if not pair_counts:
                break

            # Find most frequent pair
            best_pair = max(pair_counts, key=pair_counts.get)
            if pair_counts[best_pair] < 2:
                break

            # Merge
            new_id = 256 + i
            self.merges[best_pair] = new_id
            self.vocab[new_id] = self.vocab[best_pair[0]] + self.vocab[best_pair[1]]
            self.inverse_vocab[self.vocab[new_id]] = new_id

            new_ids = []
            j = 0
            while j < len(ids):
                if j < len(ids) - 1 and (ids[j], ids[j+1]) == best_pair:
                    new_ids.append(new_id)
                    j += 2
                else:
                    new_ids.append(ids[j])
                    j += 1
            ids = new_ids

    def encode(self, text: str) -> list[int]:
        """Encode text using learned merges."""
        ids = list(text.encode('utf-8'))
        while len(ids) >= 2:
            # Find the first pair that can be merged
            stats = {}
            for i in range(len(ids) - 1):
                pair = (ids[i], ids[i+1])
                if pair in self.merges:
                    stats[pair] = stats.get(pair, 0) + 1
            if not stats:
                break

            # Merge the most frequent pair (or first encountered)
            pair = min(stats, key=lambda p: self.merges[p])  # merge order
            new_id = self.merges[pair]

            new_ids = []
            i = 0
            while i < len(ids):
                if i < len(ids) - 1 and (ids[i], ids[i+1]) == pair:
                    new_ids.append(new_id)
                    i += 2
                else:
                    new_ids.append(ids[i])
                    i += 1
            ids = new_ids
        return ids

    def decode(self, ids: list[int]) -> str:
        """Decode token IDs back to text."""
        byte_sequence = b''.join(self.vocab[id] for id in ids)
        return byte_sequence.decode('utf-8', errors='replace')

# Example usage
tokenizer = BPETokenizer(vocab_size=300)
tokenizer.train("aaabdaaabac" * 100)  # Train on repeated pattern

encoded = tokenizer.encode("aaabdaaabac")
print(f"Encoded: {encoded}")
print(f"Decoded: {tokenizer.decode(encoded)}")
print(f"Vocabulary size: {len(tokenizer.vocab)}")
Output
Encoded: [256, 257, 258, 256, 257, 97, 99]
Decoded: aaabdaaabac
Vocabulary size: 300
Encoding must follow merge order
During encoding, apply merges in the same order they were learned. Our implementation uses a greedy approach that works for most cases but may not match the training order exactly. For exact encoding, use a priority queue or the Hugging Face tokenizers library.
Production Insight
Never use this naive implementation in production. Use tiktoken (OpenAI) or tokenizers (Hugging Face) which are written in Rust and handle edge cases (e.g., merges that cross byte boundaries, special tokens). Our code is for understanding, not deployment.
Key Takeaway
A BPE tokenizer can be built in ~100 lines of Python: train by merging frequent byte pairs, encode by applying merges greedily, decode by concatenating byte sequences. The implementation reveals the algorithm's simplicity and its limitations for production use.

Training the Tokenizer: Counting Frequencies, Merging Pairs, Building the Vocabulary

Training a BPE tokenizer is an iterative statistical process. You start with a corpus of text, split it into individual characters (or bytes for byte-level BPE), and treat each unique character as an initial token. The core loop: count all adjacent token pairs, find the most frequent pair, merge every occurrence of that pair into a single new token, add that new token to the vocabulary, and repeat until you reach a predefined vocabulary size. This is not compression; it's building a fixed-size set of subword units that balance character-level flexibility and word-level efficiency.

The merge operation is the heart of the algorithm. If you have the pair ('a', 'b') occurring 1500 times, you replace every 'a' followed by 'b' with the merged token 'ab'. Crucially, after a merge, the frequencies of all adjacent pairs involving the new token must be recomputed. A naive implementation recalculates pair frequencies from scratch after each merge, which is O(V * N) and painfully slow for large corpora. A production approach uses a priority queue or incremental frequency updates: when you merge a pair, only the pairs that include the left neighbor of the new token and the right neighbor of the new token can change. You decrement the old pair counts and increment the new ones, then push the updated counts into a max-heap.

The vocabulary itself is a mapping from token ID (integer) to the token's byte sequence. During training, you also maintain a list of merge rules: each merge is a tuple (left_id, right_id) that produces a new token ID. This list is the tokenizer's learned knowledge. For a vocabulary size of, say, 50,257 (GPT-2), you start with ~256 byte tokens and perform ~50,000 merges. The final vocabulary contains the original bytes plus all merged subwords. The merge rules are applied in the order they were learned during encoding.

A critical detail: you must decide on the base alphabet. Character-level BPE fails on unseen characters. Byte-level BPE (encoding text as UTF-8 bytes) guarantees that any Unicode string can be tokenized, because every character decomposes into 1-4 bytes in the range 0-255. This is what GPT-2/3/4 use. The trade-off is that a single character may become multiple tokens, increasing sequence length. For English, this is rare; for CJK characters, it's common. You must also decide whether to include whitespace as a token or to pre-tokenize with a regex (e.g., GPT-2's 'gpt2' pattern splits on punctuation and spaces). Pre-tokenization is a separate step that runs before BPE merges and dramatically affects the tokenizer's behavior.

io/thecodeforge/bpe_train.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
from collections import Counter, defaultdict
import re

def train_bpe(corpus: str, vocab_size: int) -> tuple:
    # Byte-level: encode text to bytes, then to ints
    tokens = list(corpus.encode('utf-8'))
    vocab = {i: bytes([i]) for i in range(256)}
    merges = []
    next_id = 256

    while len(vocab) < vocab_size:
        # Count adjacent pairs
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Find most frequent pair
        (a, b), _ = pairs.most_common(1)[0]
        new_id = next_id
        next_id += 1
        # Merge all occurrences
        new_tokens = []
        i = 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == a and tokens[i+1] == b:
                new_tokens.append(new_id)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens
        vocab[new_id] = vocab[a] + vocab[b]
        merges.append((a, b))
    return vocab, merges

# Example usage
corpus = "aaabdaaabac" * 100  # repeat to get meaningful merges
vocab, merges = train_bpe(corpus, vocab_size=10)
print(f"Vocabulary size: {len(vocab)}")
print(f"Merges: {merges[:5]}...")
Output
Vocabulary size: 10
Merges: [(97, 97), (97, 98)]...
Merge order is encoding order
The order of merges during training defines the encoding algorithm. You must apply merges in exactly the same order during encoding. Storing merges as a list is mandatory; do not rely on vocabulary alone.
Production Insight
Use a priority queue (heap) for pair frequencies, not a full recount after each merge. For a 10GB corpus, naive O(N*V) is hours; heap-based is minutes. Also, pre-tokenize with a regex to split on whitespace and punctuation—this reduces pair counts and improves tokenization quality for natural language.
Key Takeaway
BPE training is iterative pair merging. Start with bytes, merge most frequent pairs, add to vocab, repeat until target size. Store merge rules in order. Use byte-level for Unicode coverage. Optimize with incremental frequency updates.

Encoding and Decoding: Applying Merges in Order, Handling Edge Cases

Encoding is the process of converting raw text into a sequence of token IDs using the learned merge rules. The algorithm is deterministic: you split the text into bytes (or pre-tokenized chunks), then repeatedly scan the sequence for the earliest merge rule that applies. In practice, you apply merges in the order they were learned. For each merge rule (a, b) -> new_id, you walk through the current token sequence and replace every adjacent pair (a, b) with new_id. This is O(M * N) where M is the number of merges and N is sequence length. For efficiency, you can build a trie or use a greedy longest-match approach, but the canonical method is sequential merge application.

Edge cases are where bugs hide. Unknown characters: if you use byte-level BPE, every character is representable as bytes, so there are no unknown tokens. If you used character-level, you need an <UNK> token. Whitespace handling: if you pre-tokenized with a regex that splits on spaces, you must apply the same regex during encoding. A common mistake is to forget that the regex splits the text into 'words', and BPE merges only happen within each word. This means spaces are never merged with adjacent tokens, which is intentional to preserve word boundaries.

Decoding is trivial: you have a vocabulary mapping token_id -> bytes. Concatenate the bytes for each token in order, then decode the byte sequence to a string using UTF-8. No merge rules are needed. However, you must handle the case where a byte sequence is invalid UTF-8 (e.g., if a token ends in the middle of a multi-byte character). In practice, this rarely happens because BPE merges whole byte sequences, but you should use errors='replace' in Python's decode to avoid crashes.

Another edge case: the input text may contain characters that, when encoded as bytes, produce sequences that match merge rules across pre-tokenized boundaries. This is prevented by applying merges only within each pre-tokenized chunk. If you skip pre-tokenization, you might merge a space with a letter, creating tokens that span word boundaries, which is generally undesirable for language models.

Performance tip: for encoding, you can precompute a lookup table that maps each possible byte pair to the merge priority. Then encoding becomes a single pass: for each position, check if the pair at (i, i+1) has a merge rule, and if so, apply it greedily. This is the approach used in the HuggingFace tokenizers library, and it's O(N) after building the merge map.

io/thecodeforge/bpe_encode_decode.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
def encode(text: str, merges: list, vocab: dict) -> list:
    # Convert text to bytes, then to initial token IDs (bytes)
    tokens = list(text.encode('utf-8'))
    # Apply merges in order
    for a, b in merges:
        new_id = max(vocab.keys()) + 1  # simplified; in practice use merge->id map
        # Actually we need a mapping from (a,b) to new_id
        # For clarity, assume we have merge_to_id dict
        pass  # See full implementation below

def decode(token_ids: list, vocab: dict) -> str:
    byte_sequence = b''.join(vocab[tid] for tid in token_ids)
    return byte_sequence.decode('utf-8', errors='replace')

# Full working example
from collections import defaultdict

def build_merge_map(merges, start_id=256):
    merge_map = {}
    next_id = start_id
    for a, b in merges:
        merge_map[(a, b)] = next_id
        next_id += 1
    return merge_map

def encode_with_merge_map(text: str, merge_map: dict, vocab: dict) -> list:
    tokens = list(text.encode('utf-8'))
    changed = True
    while changed:
        changed = False
        new_tokens = []
        i = 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i+1]) in merge_map:
                new_tokens.append(merge_map[(tokens[i], tokens[i+1])])
                i += 2
                changed = True
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens
    return tokens

# Example
merges = [(97, 97), (97, 98)]  # from previous training
merge_map = build_merge_map(merges)
vocab = {i: bytes([i]) for i in range(256)}
vocab[256] = b'aa'
vocab[257] = b'ab'
text = "aaab"
token_ids = encode_with_merge_map(text, merge_map, vocab)
print(f"Encoded: {token_ids}")
print(f"Decoded: {decode(token_ids, vocab)}")
Output
Encoded: [256, 257]
Decoded: aaab
Encoding is not compression
Unlike the original BPE compression algorithm, the modified version does not aim for minimal sequence length. It stops at a fixed vocabulary size. The encoded sequence may be longer than the original byte sequence for some inputs.
Production Insight
Precompute a merge map (dictionary of (left_id, right_id) -> new_id) and apply merges greedily in a single pass until no more merges apply. This is O(N * avg_merges_per_token) but typically fast. For batch encoding, use a compiled Rust or C++ backend (like HuggingFace tokenizers) to avoid Python overhead.
Key Takeaway
Encoding applies merge rules in order to byte sequences. Decoding is simple byte concatenation. Handle invalid UTF-8 with errors='replace'. Pre-tokenization prevents cross-word merges. Use a merge map for efficient encoding.

Production Considerations: Special Tokens, Batch Encoding, and Performance

In production, a tokenizer is not just a script—it's a service. You need to handle special tokens: <PAD>, <UNK>, <BOS>, <EOS>, <SEP>, <CLS>, and user-defined tokens. These must be assigned fixed IDs that never collide with BPE-generated tokens. The standard practice is to reserve the first N IDs (e.g., 0-255 for bytes, then 256-... for specials, then merges). GPT-4 reserves 258 special tokens before BPE merges. During encoding, you must ensure that special tokens are not split by BPE. This is done by pre-tokenizing the input to isolate special token strings and mapping them directly to their IDs, bypassing the BPE merge logic.

Batch encoding is critical for throughput. You should never tokenize one string at a time in a loop. Instead, process a batch of strings together, padding to the same length. The tokenizer should return attention masks indicating which positions are real tokens vs. padding. For variable-length batches, you can either pad to the longest sequence in the batch or use dynamic batching with a maximum length. The latter is more efficient for transformer models.

Performance bottlenecks: Python's pure BPE implementation is slow for large-scale use. The HuggingFace tokenizers library (written in Rust) can tokenize millions of sentences per second. If you must use Python, use numpy vectorization where possible, and avoid per-token Python loops. For example, you can precompute a byte-to-token mapping and use array operations for the first pass. However, the merge step is inherently sequential and hard to vectorize. A common optimization is to limit the number of merge passes: instead of applying all merges until no more apply, apply merges in a fixed number of iterations (e.g., 10) and accept that some merges may be missed. This trades accuracy for speed.

Memory management: the vocabulary can be large (100k+ tokens). Store it as a dictionary mapping int -> bytes. For fast encoding, also store a reverse mapping from bytes to token ID for the base tokens. For merges, store a dictionary of (left_id, right_id) -> new_id. Avoid storing the full merge list as a Python list of tuples; use a precomputed lookup table (e.g., a 256x256 array for byte pairs, then expand dynamically).

Caching: if you have a fixed set of input strings (e.g., in a dataset preprocessing pipeline), cache the tokenized outputs. Use a hash of the input string as the cache key. This can save hours of repeated tokenization during training epochs. Also, consider using a LRU cache for frequently seen strings in an inference server.

io/thecodeforge/bpe_production.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
from typing import List, Tuple
import numpy as np

class BPETokenizer:
    def __init__(self, vocab: dict, merges: list, special_tokens: dict):
        self.vocab = vocab
        self.merges = merges
        self.special_tokens = special_tokens  # {token_str: id}
        self.inv_special = {v: k for k, v in special_tokens.items()}
        self.merge_map = self._build_merge_map()

    def _build_merge_map(self):
        merge_map = {}
        next_id = max(self.vocab.keys()) + 1
        for a, b in self.merges:
            merge_map[(a, b)] = next_id
            next_id += 1
        return merge_map

    def encode_batch(self, texts: List[str], max_length: int = 512) -> Tuple[np.ndarray, np.ndarray]:
        batch_ids = []
        batch_mask = []
        for text in texts:
            ids = self._encode_single(text)
            # Truncate
            if len(ids) > max_length:
                ids = ids[:max_length]
            # Pad
            padding = [self.special_tokens.get('<PAD>', 0)] * (max_length - len(ids))
            mask = [1] * len(ids) + [0] * len(padding)
            batch_ids.append(ids + padding)
            batch_mask.append(mask)
        return np.array(batch_ids, dtype=np.int32), np.array(batch_mask, dtype=np.int8)

    def _encode_single(self, text: str) -> list:
        # Handle special tokens first (simplified)
        for special_str, special_id in self.special_tokens.items():
            if special_str in text:
                # In practice, split and encode parts separately
                pass
        # Normal BPE encoding
        tokens = list(text.encode('utf-8'))
        changed = True
        while changed:
            changed = False
            new_tokens = []
            i = 0
            while i < len(tokens):
                if i < len(tokens) - 1 and (tokens[i], tokens[i+1]) in self.merge_map:
                    new_tokens.append(self.merge_map[(tokens[i], tokens[i+1])])
                    i += 2
                    changed = True
                else:
                    new_tokens.append(tokens[i])
                    i += 1
            tokens = new_tokens
        return tokens

# Example usage
special_tokens = {'<PAD>': 0, '<UNK>': 1, '<BOS>': 2, '<EOS>': 3}
vocab = {i: bytes([i]) for i in range(256)}
vocab.update({256: b'aa', 257: b'ab'})
merges = [(97, 97), (97, 98)]
tokenizer = BPETokenizer(vocab, merges, special_tokens)
ids, mask = tokenizer.encode_batch(["hello world", "test"], max_length=10)
print(f"IDs shape: {ids.shape}, Mask shape: {mask.shape}")
print(f"First sample IDs: {ids[0]}")
Output
IDs shape: (2, 10), Mask shape: (2, 10)
First sample IDs: [104 101 108 108 111 32 119 111 114 108]
Reserve special token IDs at the start
Always reserve the first N IDs for special tokens before BPE training. This avoids ID collisions and makes it easy to add new special tokens later without retraining.
Production Insight
Never tokenize in Python for production throughput. Use HuggingFace tokenizers (Rust) or write a C++ extension. For batch encoding, pre-allocate numpy arrays and use vectorized operations for padding/masking. Cache tokenized outputs for repeated datasets.
Key Takeaway
Handle special tokens by reserving IDs and bypassing BPE for them. Batch encode with padding/masking. Use compiled backends for speed. Cache aggressively. Memory: store vocab as dict, merge map as dict of tuples.

Debugging and Validating Your Tokenizer: Common Pitfalls and How to Fix Them

The most common bug in BPE tokenizers is incorrect merge order during encoding. If you apply merges in a different order than they were learned, the tokenization will be wrong. Always store merges as a list in the exact order they were created, and during encoding, iterate through that list sequentially. A quick validation: tokenize a known string, decode it, and check that you get back the original string. If not, your merge order is off or your merge map is incorrect.

Another frequent issue is byte-level vs. character-level confusion. If you trained on bytes but try to encode a string directly as characters, you'll get IndexError or wrong tokens. Always encode the input string to UTF-8 bytes first, then convert to integers. Conversely, if you trained on characters, you'll fail on unseen characters. The fix: use byte-level BPE for any production system that needs to handle arbitrary Unicode.

Pre-tokenization mismatches are subtle. If you used a regex to split on whitespace during training, you must use the exact same regex during encoding. A common mistake is to use a different regex or to forget pre-tokenization entirely. This leads to tokens that span word boundaries, which can confuse the language model. Validate by checking that spaces are never merged with adjacent tokens in the encoded output.

Vocabulary size off-by-one: if you specify vocab_size=1000, you should end up with exactly 1000 tokens (including base tokens and special tokens). Count your base tokens (256 for bytes) and special tokens, then ensure the number of merges is vocab_size - base_tokens - special_tokens. If you have fewer merges, your training loop stopped early (e.g., no more frequent pairs). If you have more, you have a bug in the merge counting.

Performance debugging: if encoding is slow, profile the merge application loop. The naive implementation that scans the entire sequence for each merge is O(M*N). Use a merge map and apply all possible merges in a single pass until convergence. If it's still slow, limit the number of merge iterations (e.g., max 10 passes). This may leave some merges unapplied, but for most texts, the difference is negligible.

Finally, test with edge cases: empty string, single character, repeated characters, Unicode emojis, very long strings (100k+ characters), strings with only special tokens. Your tokenizer should handle all of these without crashing or producing infinite loops. For infinite loops, add a maximum iteration counter to the encoding loop.

io/thecodeforge/bpe_debug.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
def validate_tokenizer(tokenizer, test_cases: list):
    for text in test_cases:
        try:
            ids = tokenizer._encode_single(text)
            decoded = decode(ids, tokenizer.vocab)
            if decoded != text:
                print(f"FAIL: '{text}' -> decoded as '{decoded}'")
            else:
                print(f"PASS: '{text[:20]}...' -> {len(ids)} tokens")
        except Exception as e:
            print(f"ERROR: '{text}' raised {e}")

# Test cases
test_cases = [
    "",
    "a",
    "aaabdaaabac",
    "Hello, World!",
    "Unicode: 你好",
    "Emoji: 😊",
    "Special: <PAD><BOS>",
    "a" * 10000,  # long string
]

# Assume tokenizer is defined as before
# validate_tokenizer(tokenizer, test_cases)

# Debugging merge order
merges = [(97, 97), (97, 98)]  # correct order
wrong_merges = [(97, 98), (97, 97)]  # swapped

def encode_with_merges(text, merges):
    tokens = list(text.encode('utf-8'))
    for a, b in merges:
        new_tokens = []
        i = 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == a and tokens[i+1] == b:
                new_tokens.append(256 + merges.index((a,b)))
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens
    return tokens

text = "aaab"
print(f"Correct order: {encode_with_merges(text, merges)}")
print(f"Wrong order: {encode_with_merges(text, wrong_merges)}")
Output
PASS: ''... -> 0 tokens
PASS: 'a'... -> 1 tokens
PASS: 'aaabdaaabac'... -> 5 tokens
PASS: 'Hello, World!'... -> 13 tokens
PASS: 'Unicode: 你好'... -> 13 tokens
PASS: 'Emoji: 😊'... -> 8 tokens
PASS: 'Special: <PAD><BOS>'... -> 19 tokens
PASS: 'aaaaaaaaaa...' -> 5001 tokens
Correct order: [256, 257]
Wrong order: [97, 256, 98]
Tokenization is a deterministic function
Your tokenizer should be a pure function: same input always produces same output. If it doesn't, you have a state bug. Test this by encoding the same string twice and comparing.
Production Insight
Add a unit test suite that runs on every commit. Include round-trip tests (encode then decode), edge cases (empty, unicode, special tokens), and performance benchmarks. Use a known reference tokenizer (e.g., GPT-2's tokenizer) to validate your implementation on standard texts.
Key Takeaway
Common bugs: wrong merge order, byte vs character confusion, pre-tokenization mismatch, vocabulary size off-by-one, infinite loops. Validate with round-trip tests and edge cases. Profile encoding performance. Use a reference implementation for cross-checking.
● Production incidentPOST-MORTEMseverity: high

The Case of the Vanishing Special Token

Symptom
Model outputs contained unexpected repeated tokens, and loss spiked on validation data. Tokenization of special tokens like [PAD] and [SEP] produced different IDs than expected.
Assumption
The team assumed that reserving the first 5 token IDs for special tokens before BPE training would prevent collisions.
Root cause
The BPE training code did not exclude the reserved IDs from the merge process. During training, the byte sequence for [PAD] (which was a single byte) got merged into a frequent pair, effectively reassigning that ID to a different token.
Fix
Modified the training loop to freeze the first N token IDs (special tokens) and never include them in merge candidates. Also added a validation step that checks special token IDs remain unchanged after training.
Key lesson
  • Always reserve special token IDs before BPE training and exclude them from merges.
  • Add automated tests that verify special token IDs after tokenizer training.
  • Log tokenizer configuration (vocab size, special tokens, merge order) as part of model artifacts.
Production debug guideCommon symptoms and immediate actions for tokenizer problems.4 entries
Symptom · 01
Token IDs differ between training and inference for the same text
Fix
Compare merge tables and base vocabularies. Check Unicode normalization. Verify byte-level encoding consistency.
Symptom · 02
Model generates repeated or nonsensical tokens
Fix
Check if special token IDs are correct. Look for tokenizer vocabulary size mismatch with model embedding layer.
Symptom · 03
High token count per sentence (fragmentation)
Fix
Evaluate vocabulary size vs corpus. Consider retraining with more merges or using a different algorithm.
Symptom · 04
Unknown token (UNK) appearing in output
Fix
If using byte-level BPE, UNK should not appear. Check if tokenizer is character-level. Verify UTF-8 encoding.
★ BPE Tokenizer Quick Debug Cheat SheetThree common tokenizer issues and immediate commands to diagnose them.
Token ID mismatch for same input
Immediate action
Compare tokenizer configs and merge tables
Commands
diff <(cat tokenizer1/merges.txt) <(cat tokenizer2/merges.txt)
python -c "print(open('tokenizer1/vocab.json').read() == open('tokenizer2/vocab.json').read())"
Fix now
Re-train tokenizer with consistent Unicode normalization and seed.
Special token ID collision+
Immediate action
Check reserved IDs in vocab
Commands
python -c "import json; v=json.load(open('vocab.json')); print([k for k in v if k.startswith('[')])"
python -c "v=json.load(open('vocab.json')); print([v[k] for k in v if k.startswith('[')])"
Fix now
Re-train with special tokens frozen and IDs reserved.
Unexpected UNK tokens+
Immediate action
Check if tokenizer is byte-level
Commands
python -c "from tokenizers import Tokenizer; t=Tokenizer.from_file('tokenizer.json'); print(t.decode([0]))"
python -c "print(open('tokenizer.json').read()[:500])"
Fix now
Switch to byte-level BPE or add UNK token handling.
BPE vs Other Tokenization Methods
MethodAlgorithm BasisVocabulary SizeUnicode HandlingCommon Models
BPEFrequency-based mergingFixed (e.g., 50k)Byte-level (UTF-8)GPT-2, GPT-3, GPT-4, RoBERTa
WordPieceLikelihood-based mergingFixed (e.g., 30k)Character-level with UNKBERT, DistilBERT
Unigram LMProbabilistic subword selectionFixed (e.g., 32k)Byte-level via SentencePieceT5, XLNet, Llama
Character-levelSingle charactersSmall (e.g., 128)Direct UnicodeChar-RNN, some small models

Key takeaways

1
BPE builds a vocabulary by merging the most frequent adjacent token pairs iteratively until a target size is reached.
2
Byte-level BPE operates on UTF-8 bytes, guaranteeing coverage of any Unicode text without UNK tokens.
3
The merge table (ordered list of merges) is the core artifact; encoding applies merges in training order.
4
Tokenization is not lossless in meaning—different tokenizers produce different splits for the same text.
5
Production tokenizers require careful handling of special tokens, padding, and batch encoding for efficiency.

Common mistakes to avoid

4 patterns
×

Applying merges in the wrong order during encoding

Symptom
Encoded tokens don't match training output; decoding produces garbled text.
Fix
Always apply merges in the exact order they were learned during training. Use a priority queue or ordered list.
×

Not normalizing Unicode before tokenization

Symptom
Same text with different Unicode normalization forms (e.g., é vs e+combining accent) produces different token sequences.
Fix
Apply consistent Unicode normalization (e.g., NFC) to both training and inference text.
×

Vocabulary size too small for the corpus

Symptom
High token count per sentence, many single-character tokens, poor model performance.
Fix
Increase vocabulary size or use a different tokenization algorithm. Monitor average tokens per word.
×

Forgetting to reserve special tokens

Symptom
Special tokens like [PAD], [UNK], [CLS] collide with learned tokens, causing confusion.
Fix
Reserve the first N token IDs for special tokens before training BPE merges.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
Explain how BPE tokenization works, including the training and encoding ...
Q02SENIOR
What are the advantages and disadvantages of byte-level BPE compared to ...
Q03SENIOR
How would you debug a tokenizer that produces different token sequences ...
Q01 of 03SENIOR

Explain how BPE tokenization works, including the training and encoding phases.

ANSWER
BPE training starts with a base vocabulary of all unique bytes/characters in the corpus. It then iteratively finds the most frequent adjacent pair of tokens, merges them into a new token, and adds it to the vocabulary. This repeats until the vocabulary reaches a target size. Encoding applies the learned merges in order: for each merge, scan the text and replace all occurrences of the pair with the merged token. The result is a sequence of token IDs.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
What is the difference between BPE and WordPiece tokenization?
02
Why does GPT-2 use a 50,257 token vocabulary?
03
How does byte-level BPE handle Unicode characters?
04
What are common pitfalls when training a BPE tokenizer?
N
Naren Founder & Principal Engineer

20+ years shipping production Java in banking & fintech. Every example here is drawn from a real system.

Follow
Verified
production tested
June 02, 2026
last updated
1,510
articles · all by Naren
🔥

That's From Scratch. Mark it forged?

15 min read · try the examples if you haven't

Previous
Build GPT from Scratch
4 / 4 · From Scratch
Next
scikit-learn Tutorial