Self-Attention (Scaled Dot-Product)
Category: Generative AI
Difficulty: Intermediate
Time Complexity: O(n² × d)
Space Complexity: O(n² + n × d)
Overview
Self-attention is the core mechanism inside Transformer models. Given a sequence of token embeddings, each token creates three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what information do I provide?). Attention scores are computed as the dot product of queries and keys, scaled by 1/√d_k to prevent gradient vanishing, then normalized with softmax to produce a probability distribution. Each token’s output is a weighted sum of all value vectors, where the weights reflect how relevant each other token is. This allows every token to directly attend to every other token in the sequence, capturing long-range dependencies that recurrent models struggle with. Self-attention is the building block of the Transformer architecture introduced in ‘Attention Is All You Need’ (Vaswani et al., 2017).
Try It
- Web: Open in Eigenvue →
- Python:
```python
import eigenvue

eigenvue.show("self-attention")
```
Default Inputs
```json
{ "tokens": ["The", "cat", "sat"], "embeddingDim": 4 }
```

Input Examples

3 tokens, dim 4 (default)

```json
{ "tokens": ["The", "cat", "sat"], "embeddingDim": 4 }
```

4 tokens, dim 4

```json
{ "tokens": ["The", "cat", "sat", "down"], "embeddingDim": 4 }
```

Short sentence, dim 3

```json
{ "tokens": ["I", "love", "AI"], "embeddingDim": 3 }
```

5 tokens, dim 4

```json
{ "tokens": ["The", "quick", "brown", "fox", "jumps"], "embeddingDim": 4 }
```

Pseudocode
```
function SelfAttention(X, W_Q, W_K, W_V):
    Q = X × W_Q                          // Query projection
    K = X × W_K                          // Key projection
    V = X × W_V                          // Value projection
    scores = Q × Kᵀ                      // Raw attention scores
    scaled = scores / √d_k               // Scale to stabilize gradients
    weights = softmax(scaled, dim=row)   // Normalize each row
    for each query token q:
        output[q] = Σ weights[q][j] × V[j]   // Weighted sum
    return output
```

Key Concepts
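The pseudocode above can be sketched as a runnable single-head implementation in NumPy. This is an illustrative sketch, not Eigenvue's internal code; the function name and shapes are arbitrary:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention.

    X: (n, d_model) token embeddings; W_Q, W_K, W_V: (d_model, d_k) projections.
    Returns an (n, d_k) matrix: one attended output vector per token.
    """
    Q = X @ W_Q                       # queries: "what am I looking for?"
    K = X @ W_K                       # keys:    "what do I contain?"
    V = X @ W_V                       # values:  "what information do I provide?"
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n) scaled pairwise relevance
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                # each row: weighted sum of all values
```

With the default input (3 tokens, embedding dimension 4) and any 4×4 projection matrices, the output has shape (3, 4): one contextualized vector per token.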
Query, Key, Value Intuition

Think of attention as an information retrieval system. The Query is a search query (‘what am I looking for?’), the Key is an index entry (‘what do I contain?’), and the Value is the actual content (‘here is my information’). The dot product of Q and K measures relevance, and the result is used to weight the Values.
Softmax Normalization
Softmax converts raw attention scores into a probability distribution where all weights are non-negative and each row sums to exactly 1.0. This means each token distributes 100% of its attention across all tokens in the sequence, with higher weights on more relevant tokens.
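A minimal sketch of the row-wise normalization (the helper name is illustrative):

```python
import numpy as np

def softmax_rows(scores):
    """Apply softmax independently to each row of the score matrix."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # stability shift
    e = np.exp(shifted)                                    # all entries positive
    return e / e.sum(axis=-1, keepdims=True)               # each row sums to 1

scores = np.array([[2.0, 1.0, 0.5],
                   [0.0, 3.0, 1.0],
                   [1.0, 1.0, 1.0]])
weights = softmax_rows(scores)
print(weights.sum(axis=-1))  # every row sums to 1 (up to float rounding)
```

Note that the third row, with equal scores, yields uniform weights of 1/3: equal relevance means attention is spread evenly.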
Scaling by 1/√d_k
When the embedding dimension d_k is large, dot products grow in magnitude (their variance scales with d_k). Large dot products push softmax into saturated regions where gradients are extremely small, making learning slow. Dividing by √d_k keeps the variance at approximately 1, ensuring healthy gradients.
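The variance claim is easy to verify empirically. For unit-variance query and key entries, the raw dot-product variance grows roughly linearly with d_k, and dividing by √d_k brings it back to roughly 1:

```python
import numpy as np

rng = np.random.default_rng(42)
for d_k in (4, 64, 512):
    # 10,000 random query/key pairs with unit-variance entries
    q = rng.normal(size=(10_000, d_k))
    k = rng.normal(size=(10_000, d_k))
    dots = (q * k).sum(axis=-1)         # one dot product per pair
    print(d_k, dots.var(), (dots / np.sqrt(d_k)).var())
    # raw variance ≈ d_k; after scaling by 1/√d_k it is ≈ 1
```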
Quadratic Complexity
Self-attention computes a score for every pair of tokens, giving O(n²) time and space complexity in sequence length. This is why Transformers struggle with very long sequences (e.g., 100K+ tokens) and why research into efficient attention variants (linear attention, sparse attention) is active.
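A back-of-envelope check of the quadratic cost: the n × n score matrix alone (one head, fp32, no activations or gradients) grows from megabytes to tens of gigabytes as n increases:

```python
BYTES_PER_FP32 = 4
for n in (1_000, 10_000, 100_000):
    gb = n * n * BYTES_PER_FP32 / 1e9
    print(f"n = {n:>7,}: {gb:8.3f} GB for the n×n fp32 score matrix")
```

At n = 100,000 the score matrix alone needs 40 GB, which is why long-context models rely on memory-efficient attention kernels or sub-quadratic variants.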
Common Pitfalls
- Scaling factor is √d_k, NOT √d_model: In multi-head attention, each head has dimension d_k = d_model / num_heads. The scaling factor uses d_k (the per-head dimension), not d_model (the full model dimension). Using d_model would over-scale and flatten the attention distribution.
- Softmax is applied row-wise, not element-wise: Softmax must be applied independently to each row of the score matrix. Each row corresponds to one query token’s attention distribution. Applying softmax to the entire matrix or column-wise would produce incorrect attention weights.
- Attention weights do not encode position: Pure self-attention is permutation-equivariant — it treats the input as a set, not a sequence. Without positional encodings added to the embeddings, the model cannot distinguish ‘the cat sat’ from ‘sat cat the’.
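The third pitfall can be demonstrated directly: permuting the input rows just permutes the output rows, so without positional encodings the model sees a bag of tokens. A small sketch (the `attn` helper re-implements the scaled dot-product from the pseudocode; names and sizes are illustrative):

```python
import numpy as np

def attn(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention with no positional encoding."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                # embeddings for e.g. "the cat sat"
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
perm = [2, 1, 0]                           # reversed order: "sat cat the"
out, out_perm = attn(X, Wq, Wk, Wv), attn(X[perm], Wq, Wk, Wv)
# Permutation equivariance: reordering inputs only reorders outputs.
print(np.allclose(out[perm], out_perm))    # → True
```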
Q1: What does the softmax function ensure about each row of the attention weight matrix?
- A) All values are between -1 and 1
- B) All values are non-negative and each row sums to 1.0
- C) The matrix is symmetric
- D) The diagonal values are the largest
Show answer
Answer: B) All values are non-negative and each row sums to 1.0
Softmax exponentiates each score (making it positive) and divides by the row sum, guaranteeing all weights are non-negative and each row sums to exactly 1.0. This creates a valid probability distribution over the key tokens for each query token.
Q2: Why are attention scores divided by √d_k before applying softmax?
- A) To make the matrix square
- B) To reduce memory usage
- C) To prevent large dot products from causing vanishing gradients in softmax
- D) To normalize the output to unit length
Show answer
Answer: C) To prevent large dot products from causing vanishing gradients in softmax
When d_k is large, dot products tend to have large magnitudes. Large inputs to softmax produce outputs very close to 0 or 1, where gradients are near zero. Scaling by 1/√d_k keeps dot product magnitudes manageable, ensuring softmax operates in a region with healthy gradients.
Q3: What is the time complexity of self-attention with respect to sequence length n?
- A) O(n)
- B) O(n log n)
- C) O(n² × d)
- D) O(n³)
Show answer
Answer: C) O(n² × d)
Self-attention computes the dot product of every query with every key (n² pairs), and each dot product involves d_k dimensions, giving O(n² × d) time. The n² factor is why self-attention is expensive for long sequences.
Further Reading
- Attention Is All You Need (Vaswani et al., 2017) (paper)
- The Illustrated Transformer — Jay Alammar (article)
- Self-Attention — Wikipedia (reference)