About BPE Tokenization

Byte-Pair Encoding (BPE) is the dominant tokenization algorithm in modern large language models, including GPT and LLaMA. During training, BPE starts from individual characters and repeatedly merges the most frequent adjacent token pair, recording each merge as a rule; at inference time, those learned rules are replayed on new text. This produces a subword vocabulary that balances character-level and word-level representations: common words become single tokens, while rare words are split into meaningful subword units. This visualization walks through the merge process step by step, showing how raw text is transformed into the token sequence a language model actually processes.

Complexity Analysis

Time Complexity
O(n × m), where n is the length of the input text and m is the number of merge rules
Space Complexity
O(n)
Difficulty
beginner

Key Concepts

Subword Tokenization

BPE produces tokens that sit between characters and full words. Common words like 'the' become single tokens, while rare words like 'unfathomable' are split into subword pieces like 'un', 'fath', 'om', 'able'. This gives a fixed-size vocabulary that can represent any text built from the base alphabet.

Merge Rules Are Learned

The merge rules are learned from a training corpus by repeatedly finding the most frequent adjacent pair and merging it. At inference time, these pre-learned rules are applied in order to tokenize new text. The order of rules matters — earlier rules are applied first.
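The learning loop can be sketched in a few lines. This is a minimal illustration rather than a production tokenizer; the `learn_merges` name and the toy corpus are invented for the example:

```python
from collections import Counter

def learn_merges(corpus, num_merges):
    # Each word starts as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs across all words, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Rewrite every word with the winning pair merged into one symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges
```

On the toy corpus 'low low low lower lowest', the first two learned rules merge 'l'+'o' and then 'lo'+'w', so every word comes to begin with the single token 'low'.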

Greedy Left-to-Right Application

Each merge rule is applied greedily from left to right across the token sequence. When a matching pair is found, it is merged immediately and the scan continues from the same position (to handle consecutive matches). This deterministic process ensures the same input always produces the same tokens.
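The application step can be sketched as follows, assuming a hypothetical `apply_merges` helper that receives the rules in learned order:

```python
def apply_merges(text, merges):
    # Start from individual characters.
    tokens = list(text)
    # Rules must be applied in the order they were learned.
    for a, b in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                # Merge in place; stay at i so the scan can catch
                # a consecutive match involving the new token.
                tokens[i:i + 2] = [a + b]
            else:
                i += 1
    return tokens
```

With the rules [('l', 'o'), ('lo', 'w')], the input 'lowest' becomes ['low', 'e', 's', 't'].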

Vocabulary Size Trade-off

More merge rules mean a larger vocabulary and shorter token sequences (fewer tokens per text), but more memory for the embedding table. Fewer rules mean a smaller vocabulary but longer sequences. GPT models typically use 50,000-100,000 merge rules.
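To make the memory cost concrete, the embedding table holds one vector per vocabulary entry, so its parameter count grows linearly with the number of merges. A back-of-the-envelope calculation (the `embedding_params` helper is invented for illustration; 50,257 and 768 are GPT-2's published vocabulary size and embedding width):

```python
def embedding_params(vocab_size, d_model):
    # One d_model-dimensional embedding vector per token id.
    return vocab_size * d_model

# GPT-2's sizes: roughly 38.6M parameters in the embedding table alone.
print(embedding_params(50257, 768))  # → 38597376
```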

Common Pitfalls

Merge order matters

Applying merge rules in a different order can produce different token sequences. The rules must be applied in the exact order they were learned during training. Shuffling or reordering rules will break tokenization.

BPE is not morphological

BPE splits are based purely on statistical frequency, not linguistic morphology. The subword units may not correspond to meaningful morphemes. For example, 'unhappy' might split as 'un' + 'happy' (linguistically meaningful) or as 'unh' + 'appy' (not meaningful), depending on the training data.

Unknown characters

If the input text contains characters not seen during training, BPE cannot merge them and they remain as individual character tokens. Modern implementations use byte-level BPE to handle any input byte, avoiding true out-of-vocabulary situations.
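The byte-level trick can be shown directly: encoding to UTF-8 bytes guarantees a closed base alphabet of exactly 256 symbols. A minimal sketch (the `to_byte_tokens` helper is illustrative):

```python
def to_byte_tokens(text):
    # Any character, even one absent from the training data, decomposes
    # into UTF-8 bytes drawn from a fixed 256-symbol base alphabet.
    return [bytes([b]) for b in text.encode("utf-8")]

# ASCII characters map to one byte each; a character like '€' maps to three.
```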