About Multi-Head Attention

Multi-Head Attention is a core mechanism in the Transformer architecture that runs several attention functions in parallel. Instead of computing a single attention pass with d_model-dimensional queries, keys, and values, the model splits them into h heads, each operating on d_k = d_model / h dimensions. Each head learns to attend to different aspects of the input — one head might focus on syntactic relationships while another captures semantic similarity. After all heads compute their attention independently, their outputs are concatenated back into a d_model-dimensional vector and passed through a final linear projection W_O, which mixes information across heads. The key insight is that several lower-dimensional attention operations, at roughly the same total cost as one full-dimensional pass, let the model jointly attend to information from different representation subspaces at different positions.
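The flow described above — project, split into heads, attend per head, concatenate, project with W_O — can be sketched in NumPy. This is an illustrative, unoptimized sketch; the names (multi_head_attention, W_q, W_k, W_v, W_o, split_heads) are this example's own choices, not a reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    n, d_model = X.shape
    d_k = d_model // num_heads  # per-head dimension

    # Project, then split the feature axis: [n, d_model] -> [num_heads, n, d_k]
    def split_heads(M):
        return M.reshape(n, num_heads, d_k).transpose(1, 0, 2)

    Q = split_heads(X @ W_q)
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)

    # Scaled dot-product attention per head; note the sqrt(d_k) scaling.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # [num_heads, n, n]
    weights = softmax(scores, axis=-1)
    head_out = weights @ V                            # [num_heads, n, d_k]

    # Concatenate heads back: [num_heads, n, d_k] -> [n, d_model], then mix with W_o.
    concat = head_out.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 8, 2
X = rng.standard_normal((n, d_model))
Ws = [rng.standard_normal((d_model, d_model)) for _ in range(4)]
out = multi_head_attention(X, *Ws, num_heads)
print(out.shape)  # (5, 8) — same shape as the input
```

Note that the output has the same shape as the input, which is what allows attention blocks to be stacked.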

Complexity Analysis

Time Complexity
O(h × n² × d_k), where n is the sequence length — equivalent to O(n² × d_model), since h × d_k = d_model
Space Complexity
O(h × n² + n × d_model)
Difficulty
Advanced

Key Concepts

Why Multiple Heads?

A single attention head can only learn one type of relationship between tokens. Multiple heads allow the model to simultaneously attend to different aspects — for example, one head might learn syntactic dependencies (subject-verb agreement) while another learns semantic relationships (word meaning similarity). This is analogous to having multiple 'perspectives' on the same data.

d_k vs d_model

Each head operates on d_k = d_model / numHeads dimensions, NOT the full d_model. This means the total computation is roughly the same as single-head attention with full dimensionality, but the model gains the expressiveness of multiple independent attention patterns. The scaling factor in each head uses √d_k, not √d_model.
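A quick numeric check of the split, using the common configuration d_model = 512 with 8 heads (these particular numbers are just an illustration):

```python
import math

d_model, num_heads = 512, 8
d_k = d_model // num_heads     # 64 dimensions per head

print(d_k)                     # 64
print(math.sqrt(d_k))          # 8.0  <- the scaling factor each head uses
print(math.sqrt(d_model))      # ~22.63 <- NOT the correct scaling factor
```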

Concatenation and W_O Projection

After all heads compute their outputs (each of shape [n, d_k]), they are concatenated along the feature dimension to form a [n, d_model] matrix. The final W_O projection (shape [d_model, d_model]) mixes information across heads, allowing the model to combine the different perspectives into a unified representation.
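The shapes involved in the concatenation step can be verified directly. A minimal sketch (the sizes and the identity W_O here are arbitrary illustrations):

```python
import numpy as np

n, num_heads, d_k = 4, 3, 5
d_model = num_heads * d_k      # 15

# Each head produces an [n, d_k] output.
heads = [np.full((n, d_k), float(i)) for i in range(num_heads)]

# Concatenate along the feature dimension -> [n, d_model].
concat = np.concatenate(heads, axis=-1)

# W_O has shape [d_model, d_model]; identity used here just to check shapes.
W_O = np.eye(d_model)
out = concat @ W_O

print(concat.shape, out.shape)  # (4, 15) (4, 15)
```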

Common Pitfalls

d_model must be divisible by numHeads

If d_model is not evenly divisible by numHeads, the dimension split is impossible. For example, d_model=7 with numHeads=2 fails because 7/2 = 3.5 is not an integer. Always choose numHeads to be a factor of d_model.
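A defensive check like the following catches this misconfiguration early; the helper name head_dim is this sketch's own:

```python
def head_dim(d_model, num_heads):
    """Return the per-head dimension d_k, rejecting invalid configurations."""
    if d_model % num_heads != 0:
        raise ValueError(
            f"d_model={d_model} is not divisible by num_heads={num_heads}"
        )
    return d_model // num_heads

print(head_dim(8, 2))   # 4
# head_dim(7, 2) raises ValueError: d_model=7 is not divisible by num_heads=2
```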

Scaling factor uses d_k, not d_model

A common mistake is scaling by √d_model instead of √d_k. Each head's attention scores are dot products of d_k-dimensional vectors, so their variance grows with d_k. Dividing by the larger √d_model shrinks the scores too much, flattening the softmax and producing overly uniform attention weights.
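The effect is easy to demonstrate: dividing the same scores by the larger √d_model acts like a higher softmax temperature, so the resulting distribution is flatter (its largest weight is smaller). A small sketch with arbitrary random scores:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, num_heads = 64, 8
d_k = d_model // num_heads

q = rng.standard_normal(d_k)
keys = rng.standard_normal((10, d_k))
scores = keys @ q                       # dot products of d_k-dim vectors

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

right = softmax(scores / np.sqrt(d_k))      # correct scaling
wrong = softmax(scores / np.sqrt(d_model))  # over-damped scaling

# The wrongly scaled distribution is closer to uniform:
print(right.max() > wrong.max())  # True
```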

Head outputs must be concatenated, not summed

The outputs from different heads are concatenated along the feature axis, not summed or averaged. Summing would lose information and reduce the effective dimensionality. Concatenation preserves all head outputs and lets W_O learn how to combine them.
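Comparing the two operations side by side makes the difference concrete (sizes here are arbitrary illustrations):

```python
import numpy as np

n, num_heads, d_k = 2, 4, 3
heads = [np.full((n, d_k), float(i)) for i in range(num_heads)]

concat = np.concatenate(heads, axis=-1)  # keeps every head's output
summed = sum(heads)                      # collapses all heads into one

print(concat.shape)  # (2, 12) — full d_model = num_heads * d_k preserved
print(summed.shape)  # (2, 3)  — reduced to d_k; individual head outputs lost
```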

Prerequisites

Understanding these algorithms first will help: