BPE Tokenization

Category: Generative AI
Difficulty: Beginner
Time Complexity: O(n × m)
Space Complexity: O(n)

Byte-Pair Encoding (BPE) is the dominant tokenization algorithm used in modern large language models including GPT, LLaMA, and others. Starting from individual characters, BPE iteratively applies learned merge rules that combine the most frequent adjacent token pairs into single tokens. This process produces a subword vocabulary that balances between character-level and word-level representations — common words become single tokens while rare words are split into meaningful subword units. This visualization walks through the merge process step by step, showing how raw text is transformed into the token sequence that a language model actually processes.

{
  "text": "lowest",
  "mergeRules": [
    [["l", "o"], "lo"],
    [["lo", "w"], "low"],
    [["e", "s"], "es"],
    [["es", "t"], "est"],
    [["low", "est"], "lowest"]
  ]
}
{
  "text": "aaabdaaabac",
  "mergeRules": [
    [["a", "a"], "aa"],
    [["aa", "a"], "aaa"],
    [["a", "b"], "ab"],
    [["aaa", "b"], "aaab"]
  ]
}
{
  "text": "xyz",
  "mergeRules": [
    [["a", "b"], "ab"],
    [["c", "d"], "cd"]
  ]
}
function tokenizeBPE(text, mergeRules):
    tokens = splitIntoCharacters(text)
    for each (left, right) → replacement in mergeRules:
        i = 0
        while i < length(tokens) - 1:
            if tokens[i] == left AND tokens[i+1] == right:
                tokens = tokens[0..i] + [replacement] + tokens[i+2..]
            else:
                i = i + 1
    return tokens
def tokenize_bpe(text: str, merge_rules: list[tuple[tuple[str, str], str]]) -> list[str]:
    # Start from individual characters, then apply each learned rule in order.
    tokens = list(text)
    for (left, right), replacement in merge_rules:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                # Merge and stay at the same index so consecutive matches are caught.
                tokens = tokens[:i] + [replacement] + tokens[i + 2:]
            else:
                i += 1
    return tokens
function tokenizeBPE(text, mergeRules) {
  let tokens = [...text];
  for (const [[left, right], replacement] of mergeRules) {
    let i = 0;
    while (i < tokens.length - 1) {
      if (tokens[i] === left && tokens[i + 1] === right) {
        tokens = [...tokens.slice(0, i), replacement, ...tokens.slice(i + 2)];
      } else {
        i++;
      }
    }
  }
  return tokens;
}

BPE produces tokens at a granularity between individual characters and full words. Common words like ‘the’ become single tokens, while rare words like ‘unfathomable’ are split into subword pieces such as ‘un’, ‘fath’, ‘om’, ‘able’. The result is a fixed-size vocabulary that can still represent any text built from characters seen during training.

The merge rules are learned from a training corpus by repeatedly finding the most frequent adjacent pair and merging it. At inference time, these pre-learned rules are applied in order to tokenize new text. The order of rules matters — earlier rules are applied first.
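The training loop that produces these rules can be sketched as follows. This is a minimal illustration on a toy corpus, not the optimized procedure production tokenizers use; the corpus, vocabulary size, and tie-breaking (insertion order via `Counter.most_common`) are assumptions for the example:

```python
from collections import Counter

def learn_merge_rules(corpus: list[str], num_merges: int):
    """Learn BPE merge rules by repeatedly merging the most frequent adjacent pair."""
    sequences = [list(word) for word in corpus]
    rules = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pairs = Counter()
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (left, right), _ = pairs.most_common(1)[0]
        replacement = left + right
        rules.append(((left, right), replacement))
        # Apply the new rule everywhere before counting again.
        for seq in sequences:
            i = 0
            while i < len(seq) - 1:
                if seq[i] == left and seq[i + 1] == right:
                    seq[i:i + 2] = [replacement]
                else:
                    i += 1
    return rules

# On this toy corpus the first two merges are ('l','o') -> 'lo', then ('lo','w') -> 'low'.
rules = learn_merge_rules(["low", "low", "lowest", "lower"], num_merges=3)
```

Because the most frequent pair is re-counted after every merge, earlier merges determine which pairs even exist for later merges, which is exactly why rule order matters at inference time.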

Each merge rule is applied greedily from left to right across the token sequence. When a matching pair is found, it is merged immediately and the scan continues from the same position (to handle consecutive matches). This deterministic process ensures the same input always produces the same tokens.
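Tracing this greedy process on the ‘lowest’ example makes it concrete. The sketch below inlines the tokenizer so it runs standalone, printing the token sequence after each rule:

```python
def tokenize_bpe(text, merge_rules):
    tokens = list(text)
    for (left, right), replacement in merge_rules:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens = tokens[:i] + [replacement] + tokens[i + 2:]
            else:
                i += 1
        print(f"after ({left!r}, {right!r}) -> {replacement!r}: {tokens}")
    return tokens

rules = [(("l", "o"), "lo"), (("lo", "w"), "low"),
         (("e", "s"), "es"), (("es", "t"), "est"),
         (("low", "est"), "lowest")]
result = tokenize_bpe("lowest", rules)
# ['l','o','w','e','s','t'] -> ['lo','w','e','s','t'] -> ['low','e','s','t']
# -> ['low','es','t'] -> ['low','est'] -> ['lowest']
```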

More merge rules mean larger vocabulary and shorter token sequences (fewer tokens per text), but require more memory for the embedding table. Fewer rules mean smaller vocabulary but longer sequences. GPT models typically use 50,000-100,000 merge rules.

  • Merge order matters: Applying merge rules in a different order can produce different token sequences. The rules must be applied in the exact order they were learned during training. Shuffling or reordering rules will break tokenization.
  • BPE is not morphological: BPE splits are based purely on statistical frequency, not linguistic morphology. The subword units may not correspond to meaningful morphemes. For example, ‘unhappy’ might split as ‘un’ + ‘happy’ (linguistically meaningful) or as ‘unh’ + ‘appy’ (not meaningful), depending on the training data.
  • Unknown characters: If the input text contains characters not seen during training, BPE cannot merge them and they remain as individual character tokens. Modern implementations use byte-level BPE to handle any input byte, avoiding true out-of-vocabulary situations.
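The byte-level variant mentioned above sidesteps unknown characters by starting from UTF-8 bytes instead of Unicode characters. A minimal sketch of the initial tokenization step (real implementations such as GPT-2's additionally remap bytes to printable characters; the `<0x..>` token naming here is a hypothetical convention for illustration):

```python
def initial_byte_tokens(text: str) -> list[str]:
    # Encode to UTF-8 first: every input maps onto the 256 possible byte values,
    # so no character is ever out of vocabulary.
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]

tokens = initial_byte_tokens("héllo")
# 'é' encodes to two bytes (0xC3, 0xA9), so the 5-character word yields 6 byte tokens.
```

Merge rules then operate on these byte tokens exactly as they do on characters.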

Q1: Given tokens [‘l’, ‘o’, ‘w’] and merge rule ([‘l’, ‘o’], ‘lo’), what is the result?

  • A) [‘lo’, ‘w’]
  • B) [‘l’, ‘ow’]
  • C) [‘low’]
  • D) [‘l’, ‘o’, ‘w’]

Answer: A) [‘lo’, ‘w’]

The merge rule combines ‘l’ and ‘o’ into ‘lo’. The ‘w’ is not part of this merge and remains separate. Result: [‘lo’, ‘w’].

Q2: Why are BPE merge rules applied in a specific order?

  • A) For computational efficiency only
  • B) Because the order was learned from training data and affects the output
  • C) The order doesn’t matter — any order gives the same result
  • D) To ensure alphabetical token ordering

Answer: B) Because the order was learned from training data and affects the output

Merge rules are learned in order of frequency from the training corpus. Applying them in a different order can produce different tokenizations because earlier merges change which pairs are available for later merges.

Q3: What is the initial token sequence for the text ‘hello’?

  • A) [‘hello’]
  • B) [‘hel’, ‘lo’]
  • C) [‘h’, ‘e’, ‘l’, ‘l’, ‘o’]
  • D) [‘he’, ‘ll’, ‘o’]

Answer: C) [‘h’, ‘e’, ‘l’, ‘l’, ‘o’]

BPE always starts by splitting the input text into individual characters. Each character becomes its own token. Merge rules are then applied to combine adjacent tokens.