About Token Embeddings
Token embeddings are the first step in any transformer-based language model. Each token in the vocabulary is assigned a dense vector of real numbers — its embedding. These vectors are learned during training so that semantically similar tokens end up with similar vectors (high cosine similarity). This visualization shows the embedding lookup process step by step: given a sequence of tokens, each token is mapped to its embedding vector, and pairwise cosine similarities are computed to illustrate the geometric relationships between tokens in the embedding space.
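The lookup-then-compare pipeline described above can be sketched in a few lines of NumPy. The vocabulary size, dimension, and token IDs below are toy values, and the embeddings are random rather than trained:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: vocabulary of 10 tokens, 4-dimensional embeddings.
V, d = 10, 4
embedding_matrix = rng.normal(size=(V, d))

token_ids = [3, 7, 3]                  # an input sequence of token IDs
vectors = embedding_matrix[token_ids]  # embedding lookup: shape (3, d)

# Pairwise cosine similarities between the looked-up vectors.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
similarities = unit @ unit.T           # (3, 3) matrix; diagonal is 1.0
```

Because positions 0 and 2 hold the same token ID, their cosine similarity is exactly 1: identical tokens always retrieve the same embedding vector.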
Complexity Analysis
- Time Complexity: O(n × d), where n is the sequence length
- Space Complexity: O(V × d), where V is the vocabulary size and d is the embedding dimension
- Difficulty: Intermediate
Key Concepts
Embedding as Lookup
An embedding layer is essentially a lookup table. Each token ID maps to a row in the embedding matrix. There is no computation involved — just an index-based retrieval of a pre-stored vector. The embedding matrix has shape [V, d] where V is the vocabulary size and d is the embedding dimension.
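To illustrate the "lookup, not computation" point, plain array indexing does exactly what an embedding layer's forward pass does. The matrix values here are arbitrary, chosen only to make the retrieved row obvious:

```python
import numpy as np

# Toy embedding matrix with V = 5 tokens and d = 3 dimensions.
E = np.arange(15, dtype=float).reshape(5, 3)

token_id = 2
vector = E[token_id]  # pure row retrieval; no arithmetic is performed
```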
Cosine Similarity
Cosine similarity measures the angle between two vectors, ignoring their magnitude. It is computed as cos(θ) = (A · B) / (‖A‖ × ‖B‖) and ranges from -1 (opposite directions) to +1 (same direction). In embedding space, high cosine similarity between two token vectors suggests semantic relatedness.
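The formula translates directly into code. A minimal NumPy sketch, with the three limiting cases from the text:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (A . B) / (||A|| * ||B||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Same direction -> +1, opposite direction -> -1, orthogonal -> 0.
same = cosine_similarity(np.array([1.0, 0.0]), np.array([3.0, 0.0]))
opposite = cosine_similarity(np.array([1.0, 0.0]), np.array([-2.0, 0.0]))
ortho = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 5.0]))
```

Note that magnitude is divided out: scaling either vector leaves the result unchanged.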
Embedding Dimensions
The embedding dimension d determines the expressiveness of token representations. Small models use d=64 or d=128, while GPT-3 uses d=12288. Higher dimensions can capture more nuanced semantic relationships but require more memory and computation. The choice of d is a key architectural hyperparameter.
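The memory cost of this choice is easy to quantify. For a hypothetical model with a 50,000-token vocabulary and d = 512 stored in float32:

```python
# Embedding table size scales as V * d.
V, d = 50_000, 512
params = V * d           # 25,600,000 embedding parameters
bytes_fp32 = params * 4  # 102,400,000 bytes, roughly 100 MB
```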
Learned Representations
Embedding vectors are not hand-crafted — they are learned during model training via backpropagation. The training process adjusts vectors so that tokens appearing in similar contexts end up with similar embeddings. This is the distributional hypothesis, often credited to J. R. Firth: 'a word is characterized by the company it keeps.'
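A minimal sketch of the idea, not any real training objective: take two tokens assumed to co-occur and nudge their embeddings toward each other by gradient ascent on the dot product. All sizes and the learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4
E = rng.normal(scale=0.1, size=(V, d))  # randomly initialized embeddings

a, b = 1, 2  # two token IDs assumed to appear in similar contexts
before = float(E[a] @ E[b])

lr = 0.05
for _ in range(50):
    # d(E[a] . E[b]) / dE[a] = E[b], and symmetrically for E[b].
    ga, gb = E[b].copy(), E[a].copy()
    E[a] += lr * ga
    E[b] += lr * gb

after = float(E[a] @ E[b])  # the two embeddings are now more similar
```

Real objectives (e.g. next-token prediction) are far richer, but the mechanism is the same: gradients flow into the embedding rows of the tokens that appeared in the batch.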
Common Pitfalls
Embeddings are learned, not fixed
A common misconception is that embeddings are predetermined or based on character similarity. In reality, each token's embedding is learned during training. The same token string (e.g. 'cat') always maps to the same embedding vector, but 'cat' and 'bat' may have very different embeddings despite differing by only one character.
Confusing embedding dimension with vocabulary size
The embedding matrix has shape [V, d] where V is the vocabulary size (number of unique tokens) and d is the embedding dimension (vector length). These are independent parameters. A vocabulary of 50,000 tokens with 512-dimensional embeddings produces a 50,000 × 512 matrix.
Cosine similarity is not distance
Cosine similarity measures the angle between vectors, not their Euclidean distance. Two vectors can have high cosine similarity (pointing in the same direction) but very different magnitudes. For some applications, L2 distance or dot product similarity may be more appropriate.
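A quick sketch of the distinction: two vectors pointing the same way have cosine similarity 1 no matter how far apart they are in Euclidean terms. The values are arbitrary illustrations:

```python
import numpy as np

a = np.array([1.0, 1.0])
b = np.array([10.0, 10.0])  # same direction, 10x the magnitude

cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # exactly 1.0
l2 = float(np.linalg.norm(a - b))                             # about 12.73
```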