Build a BPE Tokenizer from Scratch: Production-Grade Implementation
Learn to build a Byte-Pair Encoding tokenizer from scratch with Python.
20+ years shipping production Java in banking & fintech. Every example here is drawn from a real system.
- BPE tokenizes text by iteratively merging the most frequent adjacent byte pairs into new tokens.
- The algorithm starts with a base vocabulary of all unique bytes/characters and merges up to a target size.
- Byte-level BPE handles any UTF-8 text by operating on raw bytes, avoiding unknown token issues.
- Tokenizer training is separate from encoding; the merge table is the core artifact.
- Production tokenizers must handle special tokens, padding, and efficient batch encoding.
- Common pitfalls include incorrect merge order, vocabulary size mismatch, and Unicode normalization.
Imagine you have a huge Lego set with only tiny 1x1 bricks. BPE is like gluing the most common pairs of bricks together into bigger pieces, then gluing those, until you have a set of useful larger bricks. This makes building (encoding) faster and more efficient, especially for common patterns like 'the' or 'ing'.
Every large language model you've interacted with—GPT-4, Claude, Llama 3—relies on a tokenizer to convert raw text into a sequence of integers. That tokenizer is almost always Byte-Pair Encoding (BPE). Understanding BPE isn't just academic trivia; it's a practical skill that directly impacts model performance, vocabulary efficiency, and debugging production issues like tokenization mismatches between training and inference.
While libraries like Hugging Face's tokenizers abstract away the details, building BPE from scratch forces you to confront the real trade-offs: how do you handle Unicode? What happens when a rare character appears at inference? Why does GPT-2 use a 50,257 token vocabulary while Llama 3 uses 128,000? These decisions have concrete consequences for model quality and throughput.
In this article, we'll implement a byte-level BPE tokenizer in pure Python, step by step. We'll start with the original compression algorithm from Philip Gage's 1994 paper, then adapt it for modern LLM tokenization. You'll learn to train the tokenizer, encode and decode text, and handle edge cases like whitespace, punctuation, and special tokens.
By the end, you'll have a production-ready mental model of BPE—not just how it works, but why it works that way, and how to debug it when things go wrong. This is the foundation every serious ML engineer needs before trusting a tokenizer in a critical pipeline.
What is Byte-Pair Encoding? The Original 1994 Compression Algorithm
Byte-Pair Encoding (BPE) was introduced by Philip Gage in 1994 as a simple, adaptive compression algorithm. The core idea is to iteratively replace the most frequent pair of adjacent bytes in a sequence with a single unused byte. This process continues until no pair appears more than once, effectively compressing the data by exploiting local redundancy. The algorithm requires a lookup table to map each replacement byte back to its original pair for decompression.
Consider the string "aaabdaaabac". The byte pair "aa" appears three times, more than any other pair. We replace every occurrence with an unused byte, say 'Z', yielding "ZabdZabac" and a table entry Z=aa. Next, "ab" appears twice, so we replace it with 'Y', giving "ZYdZYac" with Y=ab. Finally, "ZY" appears twice, replaced by 'X', producing "XdXac" with X=ZY. The compressed result is shorter, and decompression reverses the replacements in reverse order.
Gage's original algorithm was designed for general-purpose lossless compression, not for language modeling. It operates on raw bytes, not characters or tokens, and aims for maximal compression. The algorithm is greedy and deterministic: at each step, it picks the most frequent pair, regardless of context. This simplicity makes it fast and easy to implement, but it also means the compression ratio is modest compared to modern algorithms like LZ77 or arithmetic coding.
Despite its age, BPE remains relevant because of its adaptability. It doesn't require a fixed dictionary; instead, it builds one dynamically from the data. This property makes it ideal for domains where the byte distribution is unknown or changes, such as in text compression for specific corpora. However, the original algorithm has a hard limitation: it can only compress data that fits within the byte range (0-255), and it cannot handle inputs larger than 256 unique symbols without running out of replacement bytes.
From Compression to Tokenization: Adapting BPE for Language Models
The modified BPE algorithm used in language models (e.g., GPT, BERT, RoBERTa) differs fundamentally from Gage's original. Instead of maximizing compression, it aims to build a fixed-size vocabulary of subword tokens that can represent any text efficiently. The key change: we stop merging when the vocabulary reaches a predefined size, not when compression is maximal. This allows the model to handle rare and unseen words by decomposing them into known subwords.
In the modified version, we start with a base vocabulary of all unique characters (or bytes) in the training corpus. Each character is assigned a token ID. Then we iteratively find the most frequent pair of adjacent tokens and merge them into a new token, adding it to the vocabulary. This continues until the vocabulary reaches the desired size (e.g., 50,257 for GPT-2, 100,258 for GPT-4). The result is a set of tokens that range from single characters to common multi-character sequences like "ing" or "the".
Consider the same example "aaabdaaabac" with a target vocabulary size of 6. Initial tokens: a=0, b=1, d=2, c=3. After merging "aa" -> 4, we have tokens 4,0,1,2,4,0,1,0,3. Next, merge "ab" -> 5, yielding 4,5,2,4,5,0,3. The vocabulary now has 6 tokens (0-5), so we stop. The encoded sequence is not maximally compressed (it's longer than the original), but it uses only tokens from the learned vocabulary. If we set size to 8, we'd continue merging "4,5" -> 6 and "4,5,2" -> 7, getting even longer tokens.
The critical insight: modified BPE trades compression efficiency for tokenization flexibility. By fixing the vocabulary size, we ensure the model's embedding layer has a manageable number of parameters. The algorithm also guarantees that any input text can be tokenized, because the base character set covers all possible inputs. This is why BPE became the de facto standard for neural language models: it balances vocabulary size with coverage, avoiding the need for a separate UNK token for rare characters.
Byte-Level BPE: Handling Unicode and Avoiding UNK Tokens
A fundamental problem with character-level BPE is that it cannot handle characters outside its training set. If you train only on English text, the tokenizer will fail on Chinese, Arabic, or emoji. The naive solution is to map unknown characters to a special UNK token, but this loses information and degrades model performance on multilingual or noisy inputs. Byte-level BPE solves this by operating on UTF-8 bytes instead of Unicode characters.
UTF-8 encoding represents every Unicode code point as a sequence of 1 to 4 bytes. By training BPE on byte sequences, the tokenizer learns to merge bytes into tokens, not characters. Since there are only 256 possible byte values (0-255), the base vocabulary is fixed and universal. Any Unicode string can be encoded as bytes, so the tokenizer never encounters an unknown input. This is the approach used by GPT-2, GPT-3, GPT-4, and many modern LLMs.
The trade-off is that byte-level BPE may split a single character across multiple tokens. For example, the emoji '😀' (U+1F600) is encoded as 4 bytes: F0 9F 98 80. A byte-level tokenizer might merge some of these bytes into a token, but it could also split them. This means the model must learn that certain byte sequences form meaningful characters. In practice, this works well because the model's attention mechanism can capture these patterns.
Another advantage: byte-level BPE naturally handles mixed-script text. A single tokenizer can process English, Chinese, and emoji without any special handling. It also avoids the need for a separate normalization step (like NFKC) because UTF-8 is already a canonical byte representation. However, it does increase the average token length for non-ASCII text: a Chinese character (3 bytes) might become 2-3 tokens instead of 1, increasing sequence length and computational cost.
Building a BPE Tokenizer from Scratch: Step-by-Step Implementation in Python
We'll implement a complete byte-level BPE tokenizer from scratch in Python. The implementation follows the algorithm used by GPT-2: start with a base vocabulary of 256 byte tokens, then iteratively merge the most frequent pair of adjacent tokens until reaching the desired vocabulary size. We'll include training, encoding, and decoding functions. The code is intentionally minimal and educational — no optimizations for speed.
Step 1: Initialize with byte-level base vocabulary. We map each byte value (0-255) to its corresponding token ID. Step 2: Train on a corpus by counting adjacent token pairs, finding the most frequent, and merging them. We store the merge operations in a dictionary for later encoding. Step 3: Encode new text by applying the learned merges greedily: start with byte IDs, then repeatedly replace pairs that appear in the merge table. Step 4: Decode by mapping token IDs back to bytes and converting to UTF-8.
The critical data structure is the merge table: a dictionary mapping a pair of token IDs to a new token ID. During encoding, we iterate through the sequence, and for each position, check if the current pair is in the merge table. If so, we replace it with the merged token and continue. This greedy left-to-right encoding is not optimal (it might miss some merges), but it's deterministic and matches the training procedure.
Performance considerations: The naive implementation has O(n * m) complexity for encoding, where n is sequence length and m is number of merges. In production, tokenizers use a trie or regex-based approach for O(n) encoding. Our implementation is fine for learning but not for production use. For real applications, use Hugging Face's tokenizers library or OpenAI's tiktoken.
Training the Tokenizer: Counting Frequencies, Merging Pairs, Building the Vocabulary
Training a BPE tokenizer is an iterative statistical process. You start with a corpus of text, split it into individual characters (or bytes for byte-level BPE), and treat each unique character as an initial token. The core loop: count all adjacent token pairs, find the most frequent pair, merge every occurrence of that pair into a single new token, add that new token to the vocabulary, and repeat until you reach a predefined vocabulary size. This is not compression; it's building a fixed-size set of subword units that balance character-level flexibility and word-level efficiency.
The merge operation is the heart of the algorithm. If you have the pair ('a', 'b') occurring 1500 times, you replace every 'a' followed by 'b' with the merged token 'ab'. Crucially, after a merge, the frequencies of all adjacent pairs involving the new token must be recomputed. A naive implementation recalculates pair frequencies from scratch after each merge, which is O(V * N) and painfully slow for large corpora. A production approach uses a priority queue or incremental frequency updates: when you merge a pair, only the pairs that include the left neighbor of the new token and the right neighbor of the new token can change. You decrement the old pair counts and increment the new ones, then push the updated counts into a max-heap.
The vocabulary itself is a mapping from token ID (integer) to the token's byte sequence. During training, you also maintain a list of merge rules: each merge is a tuple (left_id, right_id) that produces a new token ID. This list is the tokenizer's learned knowledge. For a vocabulary size of, say, 50,257 (GPT-2), you start with ~256 byte tokens and perform ~50,000 merges. The final vocabulary contains the original bytes plus all merged subwords. The merge rules are applied in the order they were learned during encoding.
A critical detail: you must decide on the base alphabet. Character-level BPE fails on unseen characters. Byte-level BPE (encoding text as UTF-8 bytes) guarantees that any Unicode string can be tokenized, because every character decomposes into 1-4 bytes in the range 0-255. This is what GPT-2/3/4 use. The trade-off is that a single character may become multiple tokens, increasing sequence length. For English, this is rare; for CJK characters, it's common. You must also decide whether to include whitespace as a token or to pre-tokenize with a regex (e.g., GPT-2's 'gpt2' pattern splits on punctuation and spaces). Pre-tokenization is a separate step that runs before BPE merges and dramatically affects the tokenizer's behavior.
Encoding and Decoding: Applying Merges in Order, Handling Edge Cases
Encoding is the process of converting raw text into a sequence of token IDs using the learned merge rules. The algorithm is deterministic: you split the text into bytes (or pre-tokenized chunks), then repeatedly scan the sequence for the earliest merge rule that applies. In practice, you apply merges in the order they were learned. For each merge rule (a, b) -> new_id, you walk through the current token sequence and replace every adjacent pair (a, b) with new_id. This is O(M * N) where M is the number of merges and N is sequence length. For efficiency, you can build a trie or use a greedy longest-match approach, but the canonical method is sequential merge application.
Edge cases are where bugs hide. Unknown characters: if you use byte-level BPE, every character is representable as bytes, so there are no unknown tokens. If you used character-level, you need an <UNK> token. Whitespace handling: if you pre-tokenized with a regex that splits on spaces, you must apply the same regex during encoding. A common mistake is to forget that the regex splits the text into 'words', and BPE merges only happen within each word. This means spaces are never merged with adjacent tokens, which is intentional to preserve word boundaries.
Decoding is trivial: you have a vocabulary mapping token_id -> bytes. Concatenate the bytes for each token in order, then decode the byte sequence to a string using UTF-8. No merge rules are needed. However, you must handle the case where a byte sequence is invalid UTF-8 (e.g., if a token ends in the middle of a multi-byte character). In practice, this rarely happens because BPE merges whole byte sequences, but you should use errors='replace' in Python's decode to avoid crashes.
Another edge case: the input text may contain characters that, when encoded as bytes, produce sequences that match merge rules across pre-tokenized boundaries. This is prevented by applying merges only within each pre-tokenized chunk. If you skip pre-tokenization, you might merge a space with a letter, creating tokens that span word boundaries, which is generally undesirable for language models.
Performance tip: for encoding, you can precompute a lookup table that maps each possible byte pair to the merge priority. Then encoding becomes a single pass: for each position, check if the pair at (i, i+1) has a merge rule, and if so, apply it greedily. This is the approach used in the HuggingFace tokenizers library, and it's O(N) after building the merge map.
Production Considerations: Special Tokens, Batch Encoding, and Performance
In production, a tokenizer is not just a script—it's a service. You need to handle special tokens: <PAD>, <UNK>, <BOS>, <EOS>, <SEP>, <CLS>, and user-defined tokens. These must be assigned fixed IDs that never collide with BPE-generated tokens. The standard practice is to reserve the first N IDs (e.g., 0-255 for bytes, then 256-... for specials, then merges). GPT-4 reserves 258 special tokens before BPE merges. During encoding, you must ensure that special tokens are not split by BPE. This is done by pre-tokenizing the input to isolate special token strings and mapping them directly to their IDs, bypassing the BPE merge logic.
Batch encoding is critical for throughput. You should never tokenize one string at a time in a loop. Instead, process a batch of strings together, padding to the same length. The tokenizer should return attention masks indicating which positions are real tokens vs. padding. For variable-length batches, you can either pad to the longest sequence in the batch or use dynamic batching with a maximum length. The latter is more efficient for transformer models.
Performance bottlenecks: Python's pure BPE implementation is slow for large-scale use. The HuggingFace tokenizers library (written in Rust) can tokenize millions of sentences per second. If you must use Python, use numpy vectorization where possible, and avoid per-token Python loops. For example, you can precompute a byte-to-token mapping and use array operations for the first pass. However, the merge step is inherently sequential and hard to vectorize. A common optimization is to limit the number of merge passes: instead of applying all merges until no more apply, apply merges in a fixed number of iterations (e.g., 10) and accept that some merges may be missed. This trades accuracy for speed.
Memory management: the vocabulary can be large (100k+ tokens). Store it as a dictionary mapping int -> bytes. For fast encoding, also store a reverse mapping from bytes to token ID for the base tokens. For merges, store a dictionary of (left_id, right_id) -> new_id. Avoid storing the full merge list as a Python list of tuples; use a precomputed lookup table (e.g., a 256x256 array for byte pairs, then expand dynamically).
Caching: if you have a fixed set of input strings (e.g., in a dataset preprocessing pipeline), cache the tokenized outputs. Use a hash of the input string as the cache key. This can save hours of repeated tokenization during training epochs. Also, consider using a LRU cache for frequently seen strings in an inference server.
Debugging and Validating Your Tokenizer: Common Pitfalls and How to Fix Them
The most common bug in BPE tokenizers is incorrect merge order during encoding. If you apply merges in a different order than they were learned, the tokenization will be wrong. Always store merges as a list in the exact order they were created, and during encoding, iterate through that list sequentially. A quick validation: tokenize a known string, decode it, and check that you get back the original string. If not, your merge order is off or your merge map is incorrect.
Another frequent issue is byte-level vs. character-level confusion. If you trained on bytes but try to encode a string directly as characters, you'll get IndexError or wrong tokens. Always encode the input string to UTF-8 bytes first, then convert to integers. Conversely, if you trained on characters, you'll fail on unseen characters. The fix: use byte-level BPE for any production system that needs to handle arbitrary Unicode.
Pre-tokenization mismatches are subtle. If you used a regex to split on whitespace during training, you must use the exact same regex during encoding. A common mistake is to use a different regex or to forget pre-tokenization entirely. This leads to tokens that span word boundaries, which can confuse the language model. Validate by checking that spaces are never merged with adjacent tokens in the encoded output.
Vocabulary size off-by-one: if you specify vocab_size=1000, you should end up with exactly 1000 tokens (including base tokens and special tokens). Count your base tokens (256 for bytes) and special tokens, then ensure the number of merges is vocab_size - base_tokens - special_tokens. If you have fewer merges, your training loop stopped early (e.g., no more frequent pairs). If you have more, you have a bug in the merge counting.
Performance debugging: if encoding is slow, profile the merge application loop. The naive implementation that scans the entire sequence for each merge is O(M*N). Use a merge map and apply all possible merges in a single pass until convergence. If it's still slow, limit the number of merge iterations (e.g., max 10 passes). This may leave some merges unapplied, but for most texts, the difference is negligible.
Finally, test with edge cases: empty string, single character, repeated characters, Unicode emojis, very long strings (100k+ characters), strings with only special tokens. Your tokenizer should handle all of these without crashing or producing infinite loops. For infinite loops, add a maximum iteration counter to the encoding loop.
The Case of the Vanishing Special Token
- Always reserve special token IDs before BPE training and exclude them from merges.
- Add automated tests that verify special token IDs after tokenizer training.
- Log tokenizer configuration (vocab size, special tokens, merge order) as part of model artifacts.
diff <(cat tokenizer1/merges.txt) <(cat tokenizer2/merges.txt)python -c "print(open('tokenizer1/vocab.json').read() == open('tokenizer2/vocab.json').read())"Key takeaways
Common mistakes to avoid
4 patternsApplying merges in the wrong order during encoding
Not normalizing Unicode before tokenization
Vocabulary size too small for the corpus
Forgetting to reserve special tokens
Interview Questions on This Topic
Explain how BPE tokenization works, including the training and encoding phases.
Frequently Asked Questions
20+ years shipping production Java in banking & fintech. Every example here is drawn from a real system.
That's From Scratch. Mark it forged?
15 min read · try the examples if you haven't