About Self-Attention (Scaled Dot-Product)

Self-attention is the core mechanism inside Transformer models. Given a sequence of token embeddings, each token is projected (through learned weight matrices) into three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what information do I provide?). Attention scores are computed as the dot product of queries and keys, scaled by 1/√d_k to keep softmax out of its saturated regime where gradients vanish, then normalized with softmax to produce a probability distribution. Each token's output is a weighted sum of all value vectors, where the weights reflect how relevant each other token is. This lets every token attend directly to every other token in the sequence, capturing long-range dependencies that recurrent models struggle with. Self-attention is the building block of the Transformer architecture introduced in 'Attention Is All You Need' (Vaswani et al., 2017).
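The whole mechanism fits in a few lines. A minimal NumPy sketch (single head, no mask, random inputs standing in for token embeddings):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Sketch of scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n, n) relevance matrix
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                            # each row: weighted sum of values

rng = np.random.default_rng(0)
n, d = 4, 8
x = rng.normal(size=(n, d))
# Self-attention: Q, K, V all derived from the same sequence (here, identically).
out = scaled_dot_product_attention(x, x, x)
```

In a real Transformer, Q, K, and V come from three separate learned linear projections of x; using x directly keeps the sketch minimal.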

Complexity Analysis

Time Complexity
O(n² × d)
Space Complexity
O(n² + n × d)
Difficulty
intermediate

Key Concepts

Query, Key, Value Intuition

Think of attention as an information retrieval system. The Query is a search query ('what am I looking for?'), the Key is an index entry ('what do I contain?'), and the Value is the actual content ('here is my information'). The dot product of Q and K measures relevance, and the result is used to weight the Values.
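A toy numerical version of this retrieval picture (values chosen by hand for illustration): a query aligned with one key gets a larger dot product, so the output is pulled toward that key's value.

```python
import numpy as np

q = np.array([1.0, 0.0])        # the "search query"
K = np.array([[1.0, 0.0],       # key 0: matches the query
              [0.0, 1.0]])      # key 1: orthogonal, irrelevant
V = np.array([[10.0], [20.0]])  # the content behind each key

scores = K @ q                          # relevance: [1.0, 0.0]
w = np.exp(scores) / np.exp(scores).sum()  # softmax weights
output = w @ V                          # blend of values, weighted toward value 0
```

Key 0 receives about 73% of the weight, so the output lands near 12.7, closer to 10 than to 20.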

Softmax Normalization

Softmax converts raw attention scores into a probability distribution where all weights are non-negative and each row sums to exactly 1.0. This means each token distributes 100% of its attention across all tokens in the sequence, with higher weights on more relevant tokens.
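A quick check of both properties on a hand-picked score matrix: every weight is non-negative, every row sums to 1, and a row of identical scores yields a uniform distribution.

```python
import numpy as np

scores = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])   # equal scores -> uniform attention

# Row-wise softmax (max subtracted per row for numerical stability)
exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = exp / exp.sum(axis=-1, keepdims=True)
```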

Scaling by 1/√d_k

When the key dimension d_k is large, dot products grow in magnitude (their variance scales with d_k for unit-variance inputs). Large scores push softmax into saturated regions where gradients are extremely small, making learning slow. Dividing by √d_k keeps the score variance at approximately 1, ensuring healthy gradients.
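The variance claim is easy to verify empirically. For unit-variance q and k, the dot product has variance close to d_k, and dividing by √d_k brings it back near 1:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 256
q = rng.normal(size=(100_000, d_k))   # 100k unit-variance query samples
k = rng.normal(size=(100_000, d_k))   # 100k unit-variance key samples

raw = (q * k).sum(axis=-1)            # unscaled dot products: variance ~ d_k
scaled = raw / np.sqrt(d_k)           # scaled: variance ~ 1
```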

Quadratic Complexity

Self-attention computes a score for every pair of tokens, giving O(n²) time and space complexity in sequence length. This is why Transformers struggle with very long sequences (e.g., 100K+ tokens) and why research into efficient attention variants (linear attention, sparse attention) is active.
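A back-of-envelope calculation makes the pain concrete: the n × n score matrix alone, stored in float32, grows quadratically with sequence length.

```python
def attn_matrix_bytes(n):
    """Memory for one n x n attention score matrix in float32 (4 bytes/score)."""
    return n * n * 4

mb_at_1k = attn_matrix_bytes(1_000) / 1e6       # 1K tokens  ->  4 MB per head
gb_at_100k = attn_matrix_bytes(100_000) / 1e9   # 100K tokens -> 40 GB per head
```

Multiply by the number of heads and layers and the materialized-matrix approach becomes infeasible at long context, which is what motivates sparse, linear, and memory-efficient (recomputation-based) attention variants.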

Common Pitfalls

Scaling factor is √d_k, NOT √d_model

In multi-head attention, each head has dimension d_k = d_model / num_heads. The scaling factor uses d_k (the per-head dimension), not d_model (the full model dimension). Using d_model would over-scale and flatten the attention distribution.
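With typical (illustrative) sizes, the two choices differ substantially; dividing by the larger √d_model shrinks scores roughly √num_heads times too much, pushing the softmax toward uniform.

```python
import numpy as np

# Hypothetical but common shapes: d_model = 512 split across 8 heads.
d_model, num_heads = 512, 8
d_k = d_model // num_heads            # per-head dimension: 64

correct_scale = np.sqrt(d_k)          # sqrt(64)  = 8
wrong_scale = np.sqrt(d_model)        # sqrt(512) ~ 22.6
# Using wrong_scale divides scores ~2.8x (= sqrt(num_heads)) more than intended,
# flattening each attention distribution toward uniform.
```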

Softmax is applied row-wise, not element-wise

Softmax must be applied independently to each row of the score matrix. Each row corresponds to one query token's attention distribution. Applying softmax to the entire matrix or column-wise would produce incorrect attention weights.
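The bug is usually just a wrong axis argument. On a small score matrix, only the row-wise version gives each query a valid distribution:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

scores = np.array([[3.0, 1.0],
                   [0.0, 2.0]])

row_wise = softmax(scores, axis=-1)  # correct: each query row sums to 1
col_wise = softmax(scores, axis=0)   # wrong for attention: columns sum to 1 instead
```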

Attention weights do not encode position

Pure self-attention is permutation-equivariant — it treats the input as a set, not a sequence. Without positional encodings added to the embeddings, the model cannot distinguish 'the cat sat' from 'sat cat the'.
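Permutation equivariance can be demonstrated directly: shuffling the input rows of a bare attention layer (no positional encoding) just shuffles the output rows the same way.

```python
import numpy as np

def attention(x):
    """Bare self-attention with Q = K = V = x and no positional encoding."""
    d_k = x.shape[-1]
    s = x @ x.T / np.sqrt(d_k)
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    w = e / e.sum(axis=-1, keepdims=True)
    return w @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
perm = np.array([3, 0, 4, 1, 2])     # an arbitrary reordering of the 5 tokens

out = attention(x)
out_perm = attention(x[perm])        # attend over the permuted sequence
# out_perm equals out[perm]: the layer never sees token order.
```

This is exactly why positional encodings (sinusoidal, learned, rotary, etc.) must be injected before or inside attention.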

Prerequisites

Understanding these algorithms first will help: