BPE Tokenization

Category: Generative AI
Difficulty: Beginner
Time Complexity: O(n × m)
Space Complexity: O(n)

Byte-Pair Encoding (BPE) is the dominant tokenization algorithm used in modern large language models including GPT, LLaMA, and others. Starting from individual characters, BPE iteratively applies learned merge rules that combine the most frequent adjacent token pairs into single tokens. This process produces a subword vocabulary that balances between character-level and word-level representations — common words become single tokens while rare words are split into meaningful subword units. This visualization walks through the merge process step by step, showing how raw text is transformed into the token sequence that a language model actually processes.

{
  "text": "lowest",
  "mergeRules": [
    [["l", "o"], "lo"],
    [["lo", "w"], "low"],
    [["e", "s"], "es"],
    [["es", "t"], "est"],
    [["low", "est"], "lowest"]
  ]
}
{
  "text": "aaabdaaabac",
  "mergeRules": [
    [["a", "a"], "aa"],
    [["aa", "a"], "aaa"],
    [["a", "b"], "ab"],
    [["aaa", "b"], "aaab"]
  ]
}
{
  "text": "xyz",
  "mergeRules": [
    [["a", "b"], "ab"],
    [["c", "d"], "cd"]
  ]
}
function tokenizeBPE(text, mergeRules):
    tokens = splitIntoCharacters(text)
    for each (left, right) → replacement in mergeRules:
        i = 0
        while i < length(tokens) - 1:
            if tokens[i] == left AND tokens[i+1] == right:
                tokens = tokens[0..i] + [replacement] + tokens[i+2..]
            else:
                i = i + 1
    return tokens
def tokenize_bpe(text: str, merge_rules: list[tuple[tuple[str, str], str]]) -> list[str]:
    # Start from individual characters, then apply each learned rule in order.
    tokens = list(text)
    for (left, right), replacement in merge_rules:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                # Merge and stay at the same index so consecutive matches are caught.
                tokens = tokens[:i] + [replacement] + tokens[i + 2:]
            else:
                i += 1
    return tokens
function tokenizeBPE(text, mergeRules) {
  let tokens = [...text];
  for (const [[left, right], replacement] of mergeRules) {
    let i = 0;
    while (i < tokens.length - 1) {
      if (tokens[i] === left && tokens[i + 1] === right) {
        tokens = [...tokens.slice(0, i), replacement, ...tokens.slice(i + 2)];
      } else {
        i++;
      }
    }
  }
  return tokens;
}

BPE produces tokens at a granularity between individual characters and full words. Common words like ‘the’ become single tokens, while rare words like ‘unfathomable’ are split into subword pieces such as ‘un’, ‘fath’, ‘om’, ‘able’. The result is a fixed-size vocabulary that can still represent any text built from characters seen during training.

The merge rules are learned from a training corpus by repeatedly finding the most frequent adjacent pair and merging it. At inference time, these pre-learned rules are applied in order to tokenize new text. The order of rules matters — earlier rules are applied first.
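The training loop that produces these rules can be sketched as follows. This is a minimal illustration on a toy corpus, not the optimized procedure production tokenizers use; the corpus, vocabulary size, and tie-breaking (insertion order via `Counter.most_common`) are assumptions for the example:

```python
from collections import Counter

def learn_merge_rules(corpus: list[str], num_merges: int):
    """Learn BPE merge rules by repeatedly merging the most frequent adjacent pair."""
    sequences = [list(word) for word in corpus]
    rules = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pairs = Counter()
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (left, right), _ = pairs.most_common(1)[0]
        replacement = left + right
        rules.append(((left, right), replacement))
        # Apply the new rule everywhere before counting again.
        for seq in sequences:
            i = 0
            while i < len(seq) - 1:
                if seq[i] == left and seq[i + 1] == right:
                    seq[i:i + 2] = [replacement]
                else:
                    i += 1
    return rules

# On this toy corpus the first two merges are ('l','o') -> 'lo', then ('lo','w') -> 'low'.
rules = learn_merge_rules(["low", "low", "lowest", "lower"], num_merges=3)
```

Because the most frequent pair is re-counted after every merge, earlier merges determine which pairs even exist for later merges, which is exactly why rule order matters at inference time.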

Each merge rule is applied greedily from left to right across the token sequence. When a matching pair is found, it is merged immediately and the scan continues from the same position (to handle consecutive matches). This deterministic process ensures the same input always produces the same tokens.
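Tracing this greedy process on the ‘lowest’ example makes it concrete. The sketch below inlines the tokenizer so it runs standalone, printing the token sequence after each rule:

```python
def tokenize_bpe(text, merge_rules):
    tokens = list(text)
    for (left, right), replacement in merge_rules:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens = tokens[:i] + [replacement] + tokens[i + 2:]
            else:
                i += 1
        print(f"after ({left!r}, {right!r}) -> {replacement!r}: {tokens}")
    return tokens

rules = [(("l", "o"), "lo"), (("lo", "w"), "low"),
         (("e", "s"), "es"), (("es", "t"), "est"),
         (("low", "est"), "lowest")]
result = tokenize_bpe("lowest", rules)
# ['l','o','w','e','s','t'] -> ['lo','w','e','s','t'] -> ['low','e','s','t']
# -> ['low','es','t'] -> ['low','est'] -> ['lowest']
```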

More merge rules mean larger vocabulary and shorter token sequences (fewer tokens per text), but require more memory for the embedding table. Fewer rules mean smaller vocabulary but longer sequences. GPT models typically use 50,000-100,000 merge rules.

  • Merge order matters: Applying merge rules in a different order can produce different token sequences. The rules must be applied in the exact order they were learned during training. Shuffling or reordering rules will break tokenization.
  • BPE is not morphological: BPE splits are based purely on statistical frequency, not linguistic morphology. The subword units may not correspond to meaningful morphemes. For example, ‘unhappy’ might split as ‘un’ + ‘happy’ (linguistically meaningful) or as ‘unh’ + ‘appy’ (not meaningful), depending on the training data.
  • Unknown characters: If the input text contains characters not seen during training, BPE cannot merge them and they remain as individual character tokens. Modern implementations use byte-level BPE to handle any input byte, avoiding true out-of-vocabulary situations.
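The byte-level variant mentioned above sidesteps unknown characters by starting from UTF-8 bytes instead of Unicode characters. A minimal sketch of the initial tokenization step (real implementations such as GPT-2's additionally remap bytes to printable characters; the `<0x..>` token naming here is a hypothetical convention for illustration):

```python
def initial_byte_tokens(text: str) -> list[str]:
    # Encode to UTF-8 first: every input maps onto the 256 possible byte values,
    # so no character is ever out of vocabulary.
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]

tokens = initial_byte_tokens("héllo")
# 'é' encodes to two bytes (0xC3, 0xA9), so the 5-character word yields 6 byte tokens.
```

Merge rules then operate on these byte tokens exactly as they do on characters.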

Q1: Given tokens [‘l’, ‘o’, ‘w’] and merge rule ([‘l’, ‘o’], ‘lo’), what is the result?

  • A) [‘lo’, ‘w’]
  • B) [‘l’, ‘ow’]
  • C) [‘low’]
  • D) [‘l’, ‘o’, ‘w’]

Answer: A) [‘lo’, ‘w’]

The merge rule combines ‘l’ and ‘o’ into ‘lo’. The ‘w’ is not part of this merge and remains separate. Result: [‘lo’, ‘w’].

Q2: Why are BPE merge rules applied in a specific order?

  • A) For computational efficiency only
  • B) Because the order was learned from training data and affects the output
  • C) The order doesn’t matter — any order gives the same result
  • D) To ensure alphabetical token ordering

Answer: B) Because the order was learned from training data and affects the output

Merge rules are learned in order of frequency from the training corpus. Applying them in a different order can produce different tokenizations because earlier merges change which pairs are available for later merges.

Q3: What is the initial token sequence for the text ‘hello’?

  • A) [‘hello’]
  • B) [‘hel’, ‘lo’]
  • C) [‘h’, ‘e’, ‘l’, ‘l’, ‘o’]
  • D) [‘he’, ‘ll’, ‘o’]

Answer: C) [‘h’, ‘e’, ‘l’, ‘l’, ‘o’]

BPE always starts by splitting the input text into individual characters. Each character becomes its own token. Merge rules are then applied to combine adjacent tokens.