Transformer Block
Category: Generative AI
Difficulty: Advanced
Time Complexity: O(n² × d + n × d × d_ff)
Space Complexity: O(n² + n × d_ff)
Overview
A transformer block is the fundamental building unit of models like BERT, GPT, and LLaMA. It processes input embeddings through a sequence of sublayers: Multi-Head Self-Attention → Add & Layer Norm → Feed-Forward Network → Add & Layer Norm. Residual connections (skip connections) add the input of each sublayer to its output before normalization, preserving the original signal and enabling deep networks. This visualization shows every computation: attention scores, residual additions, layer normalization, and the two-layer FFN with ReLU activation.
Try It
- Web: Open in Eigenvue →
- Python:
import eigenvue
eigenvue.show("transformer-block")
Default Inputs
{ "tokens": ["The", "cat", "sat"], "embeddingDim": 4, "ffnDim": 8, "numHeads": 1 }
Input Examples
3 tokens, 1 head (dim=4, ffn=8)
{ "tokens": ["The", "cat", "sat"], "embeddingDim": 4, "ffnDim": 8, "numHeads": 1 }
4 tokens, 2 heads (dim=8, ffn=16)
{ "tokens": ["The", "cat", "sat", "down"], "embeddingDim": 8, "ffnDim": 16, "numHeads": 2 }
Pseudocode
function TransformerBlock(X, W_attn, W_ffn):
    // Sublayer 1: Self-Attention + Add & Norm
    attn_output = MultiHeadAttention(X, W_attn)
    residual_1 = X + attn_output          // residual connection
    norm_1 = LayerNorm(residual_1)

    // Sublayer 2: FFN + Add & Norm
    ffn_output = FFN(norm_1, W_ffn)       // ReLU(norm_1 × W₁ + b₁) × W₂ + b₂
    residual_2 = norm_1 + ffn_output      // residual connection
    norm_2 = LayerNorm(residual_2)

    return norm_2
Python
def transformer_block(X, attn_params, ffn_params):
    # Self-Attention + Add & Norm
    attn_out = multi_head_attention(X, **attn_params)
    residual_1 = X + attn_out
    norm_1 = layer_norm(residual_1)

    # FFN + Add & Norm (weights taken from ffn_params)
    W1, b1 = ffn_params["W1"], ffn_params["b1"]
    W2, b2 = ffn_params["W2"], ffn_params["b2"]
    ffn_out = relu(norm_1 @ W1 + b1) @ W2 + b2
    residual_2 = norm_1 + ffn_out
    norm_2 = layer_norm(residual_2)
    return norm_2
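The helper functions above (multi_head_attention, layer_norm, relu) are assumed to exist elsewhere. Below is a minimal NumPy sketch that fills them in, using a single-head attention as an illustrative stand-in rather than Eigenvue's actual implementation, and runs the block on the default inputs (3 tokens, embeddingDim 4, ffnDim 8); the weight shapes and parameter names are assumptions for the example.

import numpy as np

def relu(x):
    # Element-wise max(0, x)
    return np.maximum(0.0, x)

def layer_norm(x, eps=1e-5):
    # Normalize each token's vector (last axis) to mean 0, variance 1
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_attention(X, W_q, W_k, W_v):
    # Single-head stand-in (numHeads = 1); real multi-head attention
    # splits the projections into per-head subspaces.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # [n, n] attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V

# Toy call matching the default inputs: 3 tokens, d_model = 4, d_ff = 8
rng = np.random.default_rng(0)
n, d_model, d_ff = 3, 4, 8
X = rng.normal(size=(n, d_model))
attn_params = {name: rng.normal(size=(d_model, d_model)) for name in ("W_q", "W_k", "W_v")}
ffn_params = {
    "W1": rng.normal(size=(d_model, d_ff)), "b1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d_model)), "b2": np.zeros(d_model),
}
out = transformer_block(X, attn_params, ffn_params)
print(out.shape)  # (3, 4): the output keeps the input shape

Because the output has the same shape as the input, the result of one block can be fed straight into the next.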
Key Concepts
Residual Connections
Each sublayer’s output is added to its input: output = x + sublayer(x). This ‘skip connection’ ensures that gradients can flow directly through the network, enabling training of very deep models (100+ layers). Without residuals, deep transformers fail to train.
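A minimal numeric sketch of that addition (the values are arbitrary):

import numpy as np

x = np.array([1.0, 2.0, -1.0, 0.5])              # sublayer input for one token
sublayer_out = np.array([0.1, -0.3, 0.2, 0.0])   # e.g. that token's attention output
residual = x + sublayer_out                      # output = x + sublayer(x)
# If the sublayer output were all zeros, the block would simply pass x through,
# which is why gradients always have a direct path back to earlier layers.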
Layer Normalization
After each residual addition, layer normalization normalizes each token’s vector to have mean ≈ 0 and variance ≈ 1. This stabilizes the internal activations and helps the model train faster. Unlike batch normalization, layer norm operates on individual examples.
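A small sketch of the per-token behavior (values are arbitrary; full layer norm also applies a learned per-dimension scale and shift, omitted here):

import numpy as np

# Two token vectors with very different scales
tokens = np.array([[1.0, 2.0, 3.0, 4.0],
                   [100.0, 0.0, -100.0, 0.0]])
mean = tokens.mean(axis=-1, keepdims=True)   # statistics per token (per row) ...
std = tokens.std(axis=-1, keepdims=True)     # ... not across the batch
normed = (tokens - mean) / (std + 1e-5)
print(normed.mean(axis=-1))  # each entry ≈ 0
print(normed.std(axis=-1))   # each entry ≈ 1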
Feed-Forward Network
The FFN applies two linear transformations with a ReLU activation: FFN(x) = max(0, x·W₁ + b₁)·W₂ + b₂. The hidden dimension (d_ff) is typically 4× the model dimension. This gives each token a nonlinear transformation independent of other tokens.
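A sketch with illustrative dimensions and random placeholder weights; the last line checks that each token is processed independently of the others:

import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 512, 4 * 512                  # hidden dim is typically 4x d_model
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    # FFN(x) = max(0, x @ W1 + b1) @ W2 + b2
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

tokens = rng.normal(size=(3, d_model))        # 3 tokens
print(ffn(tokens).shape)                      # (3, 512)
print(np.allclose(ffn(tokens)[0], ffn(tokens[:1])[0]))  # True: token 0's output ignores the other tokens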
Block Stacking
Real transformers stack many identical blocks: 6 in the original Transformer, 12 in BERT-base, 24 in BERT-large, 48 in GPT-2 XL, and 96 in GPT-3. Each block refines the representations. The output of one block is the input to the next.
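Stacking is then just repeated application of transformer_block. In the sketch below, blocks is a hypothetical list of per-layer (attn_params, ffn_params) pairs, for example 12 entries for a BERT-base-sized model:

def transformer_stack(X, blocks):
    # Apply the same block architecture repeatedly, each layer with its own weights
    h = X
    for attn_params, ffn_params in blocks:
        h = transformer_block(h, attn_params, ffn_params)  # one block's output feeds the next
    return h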
Common Pitfalls
- Add Then Norm, Not Norm Then Add: The original Transformer uses Post-Norm: LayerNorm(x + sublayer(x)). Many later models (e.g. GPT-2, LLaMA) use Pre-Norm: x + sublayer(LayerNorm(x)). This visualization uses Post-Norm as in the original paper; see the sketch after this list for both variants side by side.
- FFN Is Per-Token: Unlike attention which mixes information across tokens, the FFN processes each token independently with the same weights. Cross-token interaction only happens in the attention sublayer.
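A side-by-side sketch of the two wrappers (sublayer stands for either the attention or the FFN sublayer):

import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def post_norm(x, sublayer):
    # Post-Norm (original paper, and this visualization): add first, then normalize
    return layer_norm(x + sublayer(x))

def pre_norm(x, sublayer):
    # Pre-Norm variant (common in newer models such as GPT-2 and LLaMA): normalize first, add after
    return x + sublayer(layer_norm(x))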
Q1: What is the purpose of the residual connection in a transformer block?
- A) To reduce the number of parameters
- B) To allow gradients to flow directly through the network
- C) To increase the embedding dimension
- D) To apply nonlinearity
Show answer
Answer: B) To allow gradients to flow directly through the network
Residual connections create a direct path for gradients, preventing the vanishing gradient problem in deep networks. The identity mapping lets the network learn the ‘delta’ (what to add) rather than the full transformation.
Q2: If d_model = 512 and d_ff = 2048, how many parameters does the FFN have (ignoring biases)?
- A) 512 × 512
- B) 512 × 2048
- C) 2 × 512 × 2048
- D) 512 × 2048 × 512
Show answer
Answer: C) 2 × 512 × 2048
The FFN has two weight matrices: W₁ [512, 2048] and W₂ [2048, 512]. Total parameters = 512×2048 + 2048×512 = 2 × 512 × 2048 = 2,097,152.
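A quick check of the arithmetic:

d_model, d_ff = 512, 2048
print(d_model * d_ff + d_ff * d_model)  # 2097152, i.e. 2 x 512 x 2048 (biases would add d_ff + d_model more)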
Further Reading
- Vaswani et al. 2017 — Attention Is All You Need (paper)
- The Annotated Transformer — Harvard NLP (tutorial)