BPE Tokenization
Category: Generative AI
Difficulty: Beginner
Time Complexity: O(n × m)
Space Complexity: O(n)
Overview
Byte-Pair Encoding (BPE) is the dominant tokenization algorithm in modern large language models, including GPT, LLaMA, and others. Starting from individual characters, BPE iteratively applies learned merge rules that combine the most frequent adjacent token pairs into single tokens. This process produces a subword vocabulary that balances character-level and word-level representations: common words become single tokens, while rare words are split into meaningful subword units. This visualization walks through the merge process step by step, showing how raw text is transformed into the token sequence that a language model actually processes.
Try It
- Web: Open in Eigenvue →
- Python:
```python
import eigenvue
eigenvue.show("tokenization-bpe")
```
Default Inputs
```json
{
  "text": "lowest",
  "mergeRules": [
    [["l", "o"], "lo"],
    [["lo", "w"], "low"],
    [["e", "s"], "es"],
    [["es", "t"], "est"],
    [["low", "est"], "lowest"]
  ]
}
```
Input Examples
Default — “lowest”
```json
{
  "text": "lowest",
  "mergeRules": [
    [["l", "o"], "lo"],
    [["lo", "w"], "low"],
    [["e", "s"], "es"],
    [["es", "t"], "est"],
    [["low", "est"], "lowest"]
  ]
}
```
Simple — “ab”
```json
{
  "text": "aaabdaaabac",
  "mergeRules": [
    [["a", "a"], "aa"],
    [["aa", "a"], "aaa"],
    [["a", "b"], "ab"],
    [["aaa", "b"], "aaab"]
  ]
}
```
No merges apply
```json
{
  "text": "xyz",
  "mergeRules": [
    [["a", "b"], "ab"],
    [["c", "d"], "cd"]
  ]
}
```
Pseudocode
```
function tokenizeBPE(text, mergeRules):
    tokens = splitIntoCharacters(text)
    for each (left, right) → replacement in mergeRules:
        i = 0
        while i < length(tokens) - 1:
            if tokens[i] == left AND tokens[i+1] == right:
                tokens = tokens[0..i] + [replacement] + tokens[i+2..]
            else:
                i = i + 1
    return tokens
```
Python
```python
def tokenize_bpe(text: str, merge_rules: list[tuple[tuple[str, str], str]]) -> list[str]:
    tokens = list(text)
    for (left, right), replacement in merge_rules:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens = tokens[:i] + [replacement] + tokens[i + 2:]
            else:
                i += 1
    return tokens
```
JavaScript
```javascript
function tokenizeBPE(text, mergeRules) {
  let tokens = [...text];
  for (const [[left, right], replacement] of mergeRules) {
    let i = 0;
    while (i < tokens.length - 1) {
      if (tokens[i] === left && tokens[i + 1] === right) {
        tokens = [...tokens.slice(0, i), replacement, ...tokens.slice(i + 2)];
      } else {
        i++;
      }
    }
  }
  return tokens;
}
```
Key Concepts
Subword Tokenization
BPE produces tokens that are between characters and full words. Common words like ‘the’ become single tokens, while rare words like ‘unfathomable’ are split into subword pieces like ‘un’, ‘fath’, ‘om’, ‘able’. This gives a fixed-size vocabulary that can represent any text.
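To see this concretely, here is a small sketch that runs the greedy tokenizer from the Python section above on the default merge rules. The inputs ‘lowes’ and ‘xyz’ are illustrative examples chosen for this sketch, not part of the visualization’s defaults:

```python
def tokenize_bpe(text, merge_rules):
    """Greedy left-to-right BPE, repeated here from the Python section for self-containment."""
    tokens = list(text)
    for (left, right), replacement in merge_rules:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens = tokens[:i] + [replacement] + tokens[i + 2:]
            else:
                i += 1
    return tokens

# The default merge rules from the "Default Inputs" section
rules = [(("l", "o"), "lo"), (("lo", "w"), "low"),
         (("e", "s"), "es"), (("es", "t"), "est"),
         (("low", "est"), "lowest")]

print(tokenize_bpe("lowest", rules))  # ['lowest'] — a covered word collapses to one token
print(tokenize_bpe("lowes", rules))   # ['low', 'es'] — a partial match splits into subwords
print(tokenize_bpe("xyz", rules))     # ['x', 'y', 'z'] — unseen text stays at character level
```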
Merge Rules Are Learned
The merge rules are learned from a training corpus by repeatedly finding the most frequent adjacent pair and merging it. At inference time, these pre-learned rules are applied in order to tokenize new text. The order of rules matters — earlier rules are applied first.
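The training loop can be sketched as follows. This is a deliberately simplified version of BPE training — real tokenizers also weight words by their corpus frequency, respect word boundaries, and handle ties deterministically — but it shows the count-merge-repeat cycle:

```python
from collections import Counter

def learn_merge_rules(words, num_merges):
    """Simplified BPE training: repeatedly merge the most frequent adjacent pair."""
    corpus = [list(w) for w in words]  # each word starts as a list of characters
    rules = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus
        pairs = Counter()
        for tokens in corpus:
            for a, b in zip(tokens, tokens[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (left, right), _count = pairs.most_common(1)[0]
        replacement = left + right
        rules.append(((left, right), replacement))
        # Apply the new rule everywhere before counting again
        for tokens in corpus:
            i = 0
            while i < len(tokens) - 1:
                if tokens[i] == left and tokens[i + 1] == right:
                    tokens[i:i + 2] = [replacement]
                else:
                    i += 1
    return rules

rules = learn_merge_rules(["lo", "low", "lowest"], 2)
print(rules)  # [(('l', 'o'), 'lo'), (('lo', 'w'), 'low')]
```

On this toy corpus, (‘l’, ‘o’) appears three times and is merged first; the new ‘lo’ token then makes (‘lo’, ‘w’) the most frequent pair, which is why rule order encodes the training history.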
Greedy Left-to-Right Application
Each merge rule is applied greedily from left to right across the token sequence. When a matching pair is found, it is merged immediately and the scan continues from the same position (to handle consecutive matches). This deterministic process ensures the same input always produces the same tokens.
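A run of repeated characters shows the consecutive-match behavior. In this sketch (reusing the tokenizer from the Python section), the scan stays at the same position after each merge, so a run of ‘a’s is consumed pairwise from the left:

```python
def tokenize_bpe(text, merge_rules):
    # Same greedy left-to-right implementation as in the Python section
    tokens = list(text)
    for (left, right), replacement in merge_rules:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens = tokens[:i] + [replacement] + tokens[i + 2:]
            else:
                i += 1
    return tokens

# After merging at position i, the scan does not advance, so the next
# pair starting at i is checked before moving on.
print(tokenize_bpe("aaaa", [(("a", "a"), "aa")]))   # ['aa', 'aa']
print(tokenize_bpe("aaaaa", [(("a", "a"), "aa")]))  # ['aa', 'aa', 'a']
```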
Vocabulary Size Trade-off
More merge rules mean a larger vocabulary and shorter token sequences (fewer tokens per text), but require more memory for the embedding table. Fewer rules mean a smaller vocabulary but longer sequences. GPT models typically use 50,000-100,000 merge rules.
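The sequence-length side of this trade-off is easy to demonstrate with the default rules: each additional rule shortens the token sequence for text the rules cover. A sketch, reusing the tokenizer from the Python section:

```python
def tokenize_bpe(text, merge_rules):
    # Greedy left-to-right BPE, as in the Python section
    tokens = list(text)
    for (left, right), replacement in merge_rules:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens = tokens[:i] + [replacement] + tokens[i + 2:]
            else:
                i += 1
    return tokens

rules = [(("l", "o"), "lo"), (("lo", "w"), "low"), (("e", "s"), "es"),
         (("es", "t"), "est"), (("low", "est"), "lowest")]

# Tokenize "lowest" with the first k rules, for k = 0..5
lengths = [len(tokenize_bpe("lowest", rules[:k])) for k in range(len(rules) + 1)]
print(lengths)  # [6, 5, 4, 3, 2, 1]
```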
Common Pitfalls
- Merge order matters: Applying merge rules in a different order can produce different token sequences. The rules must be applied in the exact order they were learned during training. Shuffling or reordering rules will break tokenization.
- BPE is not morphological: BPE splits are based purely on statistical frequency, not linguistic morphology. The subword units may not correspond to meaningful morphemes. For example, ‘unhappy’ might split as ‘un’ + ‘happy’ (linguistically meaningful) or as ‘unh’ + ‘appy’ (not meaningful), depending on the training data.
- Unknown characters: If the input text contains characters not seen during training, BPE cannot merge them and they remain as individual character tokens. Modern implementations use byte-level BPE to handle any input byte, avoiding true out-of-vocabulary situations.
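The byte-level idea can be sketched in a few lines. The key point (as in GPT-2-style tokenizers, which start from the 256 possible byte values rather than Unicode characters) is that any string decomposes into known byte tokens, so nothing is ever out-of-vocabulary:

```python
# Byte-level fallback sketch: decompose text into its UTF-8 bytes.
# Multi-byte characters like 'é' simply become several byte tokens.
text = "héllo"
byte_tokens = [bytes([b]) for b in text.encode("utf-8")]
print(byte_tokens)  # six single-byte tokens: 'é' contributes two of them
```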
Q1: Given tokens [‘l’, ‘o’, ‘w’] and merge rule ([‘l’, ‘o’], ‘lo’), what is the result?
- A) [‘lo’, ‘w’]
- B) [‘l’, ‘ow’]
- C) [‘low’]
- D) [‘l’, ‘o’, ‘w’]
Answer: A) [‘lo’, ‘w’]
The merge rule combines ‘l’ and ‘o’ into ‘lo’. The ‘w’ is not part of this merge and remains separate. Result: [‘lo’, ‘w’].
Q2: Why are BPE merge rules applied in a specific order?
- A) For computational efficiency only
- B) Because the order was learned from training data and affects the output
- C) The order doesn’t matter — any order gives the same result
- D) To ensure alphabetical token ordering
Answer: B) Because the order was learned from training data and affects the output
Merge rules are learned in order of frequency from the training corpus. Applying them in a different order can produce different tokenizations because earlier merges change which pairs are available for later merges.
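A minimal sketch makes this concrete. With two overlapping rules (these toy rules are illustrative, not from the defaults), whichever merge runs first consumes the shared ‘b’, so swapping the order changes the output:

```python
def tokenize_bpe(text, merge_rules):
    # Greedy left-to-right BPE, as in the Python section
    tokens = list(text)
    for (left, right), replacement in merge_rules:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens = tokens[:i] + [replacement] + tokens[i + 2:]
            else:
                i += 1
    return tokens

rule_ab = (("a", "b"), "ab")
rule_bc = (("b", "c"), "bc")

print(tokenize_bpe("abc", [rule_ab, rule_bc]))  # ['ab', 'c']
print(tokenize_bpe("abc", [rule_bc, rule_ab]))  # ['a', 'bc']
```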
Q3: What is the initial token sequence for the text ‘hello’?
- A) [‘hello’]
- B) [‘hel’, ‘lo’]
- C) [‘h’, ‘e’, ‘l’, ‘l’, ‘o’]
- D) [‘he’, ‘ll’, ‘o’]
Answer: C) [‘h’, ‘e’, ‘l’, ‘l’, ‘o’]
BPE always starts by splitting the input text into individual characters. Each character becomes its own token. Merge rules are then applied to combine adjacent tokens.
Further Reading
- Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2016) (article)
- Byte-Pair Encoding — Wikipedia (reference)
- Hugging Face Tokenizers Documentation (reference)