Token Embeddings

Category: Generative AI
Difficulty: Intermediate
Time Complexity: O(n × d)
Space Complexity: O(V × d)

Token embeddings are the first step in any transformer-based language model. Each token in the vocabulary is assigned a dense vector of real numbers — its embedding. These vectors are learned during training so that semantically similar tokens end up with similar vectors (high cosine similarity). This visualization shows the embedding lookup process step by step: given a sequence of tokens, each token is mapped to its embedding vector, and pairwise cosine similarities are computed to illustrate the geometric relationships between tokens in the embedding space.

{
  "tokens": ["The", "cat", "sat"],
  "embeddingDim": 4
}
{
  "tokens": ["king", "queen", "man", "woman"],
  "embeddingDim": 6
}
function embedTokens(tokens, embeddingTable, d):
    embeddings = empty matrix [len(tokens), d]
    for i = 0 to len(tokens) - 1:
        tokenId = lookupId(tokens[i])
        embeddings[i] = embeddingTable[tokenId]  // d-dimensional vector
    return embeddings
// Cosine similarity: cos(θ) = (A · B) / (‖A‖ × ‖B‖)
import numpy as np

def embed_tokens(tokens: list[str], embedding_table: np.ndarray, token_to_id: dict[str, int]) -> np.ndarray:
    """Look up embeddings for a list of tokens."""
    ids = [token_to_id[t] for t in tokens]
    return embedding_table[ids]  # shape: [len(tokens), d]

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compute cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

An embedding layer is essentially a lookup table. Each token ID maps to a row in the embedding matrix. There is no computation involved — just an index-based retrieval of a pre-stored vector. The embedding matrix has shape [V, d] where V is the vocabulary size and d is the embedding dimension.
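The lookup can be sketched in a few lines of NumPy. The vocabulary size, dimension, and token IDs below are made up for illustration; the key point is that retrieval is pure array indexing:

```python
import numpy as np

V, d = 8, 4                                # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(V, d))  # shape [V, d], one row per token

token_to_id = {"The": 0, "cat": 1, "sat": 2}
ids = [token_to_id[t] for t in ["The", "cat", "sat"]]

embeddings = embedding_table[ids]          # pure indexing, no arithmetic
print(embeddings.shape)                    # (3, 4)
```

Note that `embedding_table[ids]` is NumPy fancy indexing: it gathers the rows listed in `ids`, which is exactly what an embedding layer does at inference time.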

Cosine similarity measures the angle between two vectors, ignoring their magnitude. It is computed as cos(θ) = (A · B) / (‖A‖ × ‖B‖) and ranges from -1 (opposite directions) to +1 (same direction). In embedding space, high cosine similarity between two token vectors suggests semantic relatedness.
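A quick worked example (with the helper redefined so the snippet is self-contained) makes the range concrete:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])   # perpendicular to a
c = np.array([3.0, 0.0])   # same direction as a, 3x the magnitude

print(cosine_similarity(a, b))   # 0.0  — orthogonal
print(cosine_similarity(a, c))   # 1.0  — magnitude is ignored
print(cosine_similarity(a, -a))  # -1.0 — opposite direction
```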

The embedding dimension d determines the expressiveness of token representations. Small models use d=64 or d=128, while GPT-3 uses d=12288. Higher dimensions can capture more nuanced semantic relationships but require more memory and computation. The choice of d is a key architectural hyperparameter.
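To get a rough sense of the memory cost, the embedding matrix alone takes V × d × bytes-per-parameter. The vocabulary sizes below are illustrative (GPT-3's BPE vocabulary is 50,257 tokens; fp32 storage is assumed for simplicity):

```python
def embedding_memory_gb(vocab_size: int, d: int, bytes_per_param: int = 4) -> float:
    """Memory for the [V, d] embedding matrix in gigabytes."""
    return vocab_size * d * bytes_per_param / 1e9

# Small model: 32k vocabulary, d=128
print(embedding_memory_gb(32_000, 128))     # ~0.016 GB
# GPT-3 scale: 50,257-token vocabulary, d=12288
print(embedding_memory_gb(50_257, 12_288))  # ~2.47 GB
```

Even before any transformer layers, the embedding matrix at GPT-3 scale is already a multi-gigabyte parameter block, which is one reason large models often tie input and output embeddings.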

Embedding vectors are not hand-crafted — they are learned during model training via backpropagation. The training process adjusts vectors so that tokens appearing in similar contexts end up with similar embeddings. This is the distributional hypothesis: "a word is characterized by the company it keeps."

  • Embeddings are learned, not fixed: A common misconception is that embeddings are predetermined or based on character similarity. In reality, the embedding for each token is learned during training. Two occurrences of the same string, such as ‘cat’, always map to the same embedding, but ‘cat’ and ‘bat’ may have very different embeddings despite differing by one character.
  • Confusing embedding dimension with vocabulary size: The embedding matrix has shape [V, d] where V is the vocabulary size (number of unique tokens) and d is the embedding dimension (vector length). These are independent parameters. A vocabulary of 50,000 tokens with 512-dimensional embeddings produces a 50,000 × 512 matrix.
  • Cosine similarity is not distance: Cosine similarity measures the angle between vectors, not their Euclidean distance. Two vectors can have high cosine similarity (pointing in the same direction) but very different magnitudes. For some applications, L2 distance or dot product similarity may be more appropriate.
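The last point above can be demonstrated in two lines: vectors that point the same way have cosine similarity 1.0 yet can be far apart in Euclidean terms.

```python
import numpy as np

a = np.array([1.0, 1.0])
b = np.array([10.0, 10.0])  # same direction, 10x the magnitude

cos_ab = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
l2_ab = float(np.linalg.norm(a - b))

print(cos_ab)  # 1.0   — perfectly aligned
print(l2_ab)   # ~12.73 — yet far apart by L2 distance
```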

Q1: What is the shape of an embedding matrix for a vocabulary of 10,000 tokens with embedding dimension 256?

  • A) [256, 10000]
  • B) [10000, 256]
  • C) [256, 256]
  • D) [10000, 10000]

Answer: B) [10000, 256]

The embedding matrix has shape [V, d] = [10000, 256]. Each of the 10,000 tokens has a 256-dimensional embedding vector, so each row corresponds to one token.

Q2: What does a cosine similarity of 0 between two embedding vectors indicate?

  • A) The tokens are identical
  • B) The tokens are opposite in meaning
  • C) The embedding vectors are orthogonal (perpendicular)
  • D) One of the vectors is a zero vector

Answer: C) The embedding vectors are orthogonal (perpendicular)

A cosine similarity of 0 means the two vectors are orthogonal — they point in perpendicular directions in the embedding space. This suggests the tokens are unrelated in the learned representation. A similarity of +1 means identical direction, and -1 means opposite direction.