Self-Attention (Scaled Dot-Product)
Category: Generative AI
Difficulty: Intermediate
Time Complexity: O(n² × d)
Space Complexity: O(n² + n × d)
Overview
Self-attention is the core mechanism inside Transformer models. Given a sequence of token embeddings, each token creates three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what information do I provide?). Attention scores are computed as the dot product of queries and keys, scaled by 1/√d_k to prevent gradient vanishing, then normalized with softmax to produce a probability distribution. Each token’s output is a weighted sum of all value vectors, where the weights reflect how relevant each other token is. This allows every token to directly attend to every other token in the sequence, capturing long-range dependencies that recurrent models struggle with. Self-attention is the building block of the Transformer architecture introduced in ‘Attention Is All You Need’ (Vaswani et al., 2017).
Try It
- Web: Open in Eigenvue →
- Python:
```python
import eigenvue

eigenvue.show("self-attention")
```
Default Inputs
```json
{ "tokens": ["The", "cat", "sat"], "embeddingDim": 4 }
```

Input Examples

3 tokens, dim 4 (default)

```json
{ "tokens": ["The", "cat", "sat"], "embeddingDim": 4 }
```

4 tokens, dim 4

```json
{ "tokens": ["The", "cat", "sat", "down"], "embeddingDim": 4 }
```

Short sentence, dim 3

```json
{ "tokens": ["I", "love", "AI"], "embeddingDim": 3 }
```

5 tokens, dim 4

```json
{ "tokens": ["The", "quick", "brown", "fox", "jumps"], "embeddingDim": 4 }
```

Pseudocode
```
function SelfAttention(X, W_Q, W_K, W_V):
    Q = X × W_Q                          // Query projection
    K = X × W_K                          // Key projection
    V = X × W_V                          // Value projection
    scores = Q × Kᵀ                      // Raw attention scores
    scaled = scores / √d_k               // Scale to stabilize gradients
    weights = softmax(scaled, dim=row)   // Normalize each row
    for each query token q:
        output[q] = Σ weights[q][j] × V[j]   // Weighted sum
    return output
```

Key Concepts
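The pseudocode above can be sketched as a runnable single-head implementation in NumPy. This is an illustrative sketch, not Eigenvue's internal code; the function name and shapes are arbitrary:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention.

    X: (n, d_model) token embeddings; W_Q, W_K, W_V: (d_model, d_k) projections.
    Returns an (n, d_k) matrix: one attended output vector per token.
    """
    Q = X @ W_Q                       # queries: "what am I looking for?"
    K = X @ W_K                       # keys:    "what do I contain?"
    V = X @ W_V                       # values:  "what information do I provide?"
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n) scaled pairwise relevance
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                # each row: weighted sum of all values
```

With the default input (3 tokens, embedding dimension 4) and any 4×4 projection matrices, the output has shape (3, 4): one contextualized vector per token.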
Query, Key, Value Intuition

Think of attention as an information retrieval system. The Query is a search query (‘what am I looking for?’), the Key is an index entry (‘what do I contain?’), and the Value is the actual content (‘here is my information’). The dot product of Q and K measures relevance, and the result is used to weight the Values.
Softmax Normalization
Softmax converts raw attention scores into a probability distribution where all weights are non-negative and each row sums to exactly 1.0. This means each token distributes 100% of its attention across all tokens in the sequence, with higher weights on more relevant tokens.
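A minimal sketch of the row-wise normalization (the helper name is illustrative):

```python
import numpy as np

def softmax_rows(scores):
    """Apply softmax independently to each row of the score matrix."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # stability shift
    e = np.exp(shifted)                                    # all entries positive
    return e / e.sum(axis=-1, keepdims=True)               # each row sums to 1

scores = np.array([[2.0, 1.0, 0.5],
                   [0.0, 3.0, 1.0],
                   [1.0, 1.0, 1.0]])
weights = softmax_rows(scores)
print(weights.sum(axis=-1))  # every row sums to 1 (up to float rounding)
```

Note that the third row, with equal scores, yields uniform weights of 1/3: equal relevance means attention is spread evenly.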
Scaling by 1/√d_k
When the embedding dimension d_k is large, dot products grow in magnitude (their variance scales with d_k). Large dot products push softmax into saturated regions where gradients are extremely small, making learning slow. Dividing by √d_k keeps the variance at approximately 1, ensuring healthy gradients.
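The variance claim is easy to verify empirically. For unit-variance query and key entries, the raw dot-product variance grows roughly linearly with d_k, and dividing by √d_k brings it back to roughly 1:

```python
import numpy as np

rng = np.random.default_rng(42)
for d_k in (4, 64, 512):
    # 10,000 random query/key pairs with unit-variance entries
    q = rng.normal(size=(10_000, d_k))
    k = rng.normal(size=(10_000, d_k))
    dots = (q * k).sum(axis=-1)         # one dot product per pair
    print(d_k, dots.var(), (dots / np.sqrt(d_k)).var())
    # raw variance ≈ d_k; after scaling by 1/√d_k it is ≈ 1
```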
Quadratic Complexity
Self-attention computes a score for every pair of tokens, giving O(n²) time and space complexity in sequence length. This is why Transformers struggle with very long sequences (e.g., 100K+ tokens) and why research into efficient attention variants (linear attention, sparse attention) is active.
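A back-of-envelope check of the quadratic cost: the n × n score matrix alone (one head, fp32, no activations or gradients) grows from megabytes to tens of gigabytes as n increases:

```python
BYTES_PER_FP32 = 4
for n in (1_000, 10_000, 100_000):
    gb = n * n * BYTES_PER_FP32 / 1e9
    print(f"n = {n:>7,}: {gb:8.3f} GB for the n×n fp32 score matrix")
```

At n = 100,000 the score matrix alone needs 40 GB, which is why long-context models rely on memory-efficient attention kernels or sub-quadratic variants.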
Common Pitfalls
- Scaling factor is √d_k, NOT √d_model: In multi-head attention, each head has dimension d_k = d_model / num_heads. The scaling factor uses d_k (the per-head dimension), not d_model (the full model dimension). Using d_model would over-scale and flatten the attention distribution.
- Softmax is applied row-wise, not element-wise: Softmax must be applied independently to each row of the score matrix. Each row corresponds to one query token’s attention distribution. Applying softmax to the entire matrix or column-wise would produce incorrect attention weights.
- Attention weights do not encode position: Pure self-attention is permutation-equivariant — it treats the input as a set, not a sequence. Without positional encodings added to the embeddings, the model cannot distinguish ‘the cat sat’ from ‘sat cat the’.
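The third pitfall can be demonstrated directly: permuting the input rows just permutes the output rows, so without positional encodings the model sees a bag of tokens. A small sketch (the `attn` helper re-implements the scaled dot-product from the pseudocode; names and sizes are illustrative):

```python
import numpy as np

def attn(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention with no positional encoding."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                # embeddings for e.g. "the cat sat"
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
perm = [2, 1, 0]                           # reversed order: "sat cat the"
out, out_perm = attn(X, Wq, Wk, Wv), attn(X[perm], Wq, Wk, Wv)
# Permutation equivariance: reordering inputs only reorders outputs.
print(np.allclose(out[perm], out_perm))    # → True
```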
Q1: What does the softmax function ensure about each row of the attention weight matrix?
- A) All values are between -1 and 1
- B) All values are non-negative and each row sums to 1.0
- C) The matrix is symmetric
- D) The diagonal values are the largest
Show answer
Answer: B) All values are non-negative and each row sums to 1.0
Softmax exponentiates each score (making it positive) and divides by the row sum, guaranteeing all weights are non-negative and each row sums to exactly 1.0. This creates a valid probability distribution over the key tokens for each query token.
Q2: Why are attention scores divided by √d_k before applying softmax?
- A) To make the matrix square
- B) To reduce memory usage
- C) To prevent large dot products from causing vanishing gradients in softmax
- D) To normalize the output to unit length
Show answer
Answer: C) To prevent large dot products from causing vanishing gradients in softmax
When d_k is large, dot products tend to have large magnitudes. Large inputs to softmax produce outputs very close to 0 or 1, where gradients are near zero. Scaling by 1/√d_k keeps dot product magnitudes manageable, ensuring softmax operates in a region with healthy gradients.
Q3: What is the time complexity of self-attention with respect to sequence length n?
- A) O(n)
- B) O(n log n)
- C) O(n² × d)
- D) O(n³)
Show answer
Answer: C) O(n² × d)
Self-attention computes the dot product of every query with every key (n² pairs), and each dot product involves d_k dimensions, giving O(n² × d) time. The n² factor is why self-attention is expensive for long sequences.
Further Reading
- Attention Is All You Need (Vaswani et al., 2017) (paper)
- The Illustrated Transformer — Jay Alammar (article)
- Self-Attention — Wikipedia (reference)