About the Transformer Block
A transformer block is the fundamental building unit of models like BERT, GPT, and LLaMA. It processes input embeddings through a sequence of sublayers: Multi-Head Self-Attention → Add & Layer Norm → Feed-Forward Network → Add & Layer Norm. Residual connections (skip connections) add the input of each sublayer to its output before normalization, preserving the original signal and enabling deep networks. This visualization shows every computation: attention scores, residual additions, layer normalization, and the two-layer FFN with ReLU activation.
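The pipeline above can be sketched end-to-end in NumPy. This is a minimal, hypothetical sketch: single-head attention instead of multi-head, and the learnable layer-norm gain/bias parameters are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's vector (last axis) to mean 0, variance 1."""
    m, v = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - m) / np.sqrt(v + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head attention; multi-head splits d into h parallel heads."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (n, n) attention scores
    return scores @ v

def block(x, Wq, Wk, Wv, W1, W2):
    """Post-Norm block: attention -> add & norm -> FFN -> add & norm."""
    x = layer_norm(x + self_attention(x, Wq, Wk, Wv))
    x = layer_norm(x + np.maximum(0, x @ W1) @ W2)    # two-layer FFN, ReLU
    return x

n, d, d_ff = 4, 8, 32                 # toy sizes for illustration
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
W1 = rng.standard_normal((d, d_ff)) * 0.1
W2 = rng.standard_normal((d_ff, d)) * 0.1
out = block(rng.standard_normal((n, d)), Wq, Wk, Wv, W1, W2)  # shape (n, d)
```

The block maps an (n, d) input to an (n, d) output, which is what lets identical blocks be stacked.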
Complexity Analysis
- Time Complexity: O(n² × d + n × d × d_ff), where n is the sequence length, d the model dimension, and d_ff the FFN hidden dimension
- Space Complexity: O(n² + n × d_ff)
- Difficulty: Advanced
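A quick back-of-envelope check of these terms, using hypothetical BERT-base-like sizes (constant factors dropped):

```python
n, d, d_ff = 512, 768, 3072   # sequence length, model dim, FFN hidden dim

attention_time = n**2 * d     # n² × d: attention scores + weighted sum
ffn_time = n * d * d_ff       # n × d × d_ff: the two FFN linear layers
print(f"attention ~{attention_time:,} ops, FFN ~{ffn_time:,} ops")
```

At n = 512 the FFN term dominates; the quadratic n² × d attention term takes over only at much longer sequence lengths.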
Key Concepts
Residual Connections
Each sublayer's output is added to its input: output = x + sublayer(x). This 'skip connection' ensures that gradients can flow directly through the network, enabling training of very deep models (100+ layers). Without residuals, deep transformers fail to train.
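The pattern output = x + sublayer(x) is a one-liner. A toy sketch with a stand-in sublayer (the scaling lambda is purely illustrative):

```python
import numpy as np

def residual(x, sublayer):
    """Apply a sublayer and add the input back: output = x + sublayer(x)."""
    return x + sublayer(x)

x = np.ones((4, 8))                    # 4 tokens, model dim 8
out = residual(x, lambda t: 0.5 * t)   # toy sublayer: scale by 0.5
```

Because the identity path is untouched, the gradient of `out` with respect to `x` always contains a direct term of 1, which is what keeps gradients flowing through deep stacks.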
Layer Normalization
After each residual addition, layer normalization normalizes each token's vector to mean ≈ 0 and variance ≈ 1. This stabilizes the internal activations and helps the model train faster. Unlike batch normalization, which normalizes each feature across a batch of examples, layer norm operates within each individual example, so its behavior does not depend on batch size.
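A minimal sketch of the normalization step (real layer norm also applies learnable gain and bias parameters, omitted here):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's vector (last axis) to mean 0, variance 1."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)  # eps avoids division by zero

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)) * 3 + 5   # 4 tokens with shifted, scaled activations
y = layer_norm(x)                          # each row: mean ~0, std ~1
```

Note the statistics are computed per token (per row), not across the batch, which is exactly the contrast with batch normalization.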
Feed-Forward Network
The FFN applies two linear transformations with a ReLU activation: FFN(x) = max(0, x·W₁ + b₁)·W₂ + b₂. The hidden dimension (d_ff) is typically 4× the model dimension. This gives each token a nonlinear transformation independent of other tokens.
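The formula translates directly into two matrix multiplications with a ReLU in between; a sketch with toy dimensions:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x·W1 + b1)·W2 + b2, applied to each token row."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d, d_ff = 8, 32                          # d_ff = 4 × d, the common ratio
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d)), np.zeros(d)

x = rng.standard_normal((4, d))          # 4 tokens
out = ffn(x, W1, b1, W2, b2)             # expands to d_ff, projects back to d
```

W1 expands each token from d to d_ff dimensions and W2 projects it back, so the output shape matches the input shape.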
Block Stacking
Real transformers stack 6 (original Transformer), 12 (BERT-base), 24 (GPT-2 medium), or 96 (GPT-3) identical blocks. Each block refines the representations; the output of one block is the input to the next.
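Stacking is just function composition in a loop. A sketch with trivial stand-in "blocks" (real blocks would be the attention + FFN unit above):

```python
def transformer(x, blocks):
    """The output of each block is the input to the next."""
    for block in blocks:
        x = block(x)
    return x

# 12 identical toy blocks, each a stand-in that just adds 1
out = transformer(0.0, [lambda t: t + 1] * 12)
```

Because every block maps shape (n, d) to shape (n, d), any number of them can be chained this way.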
Common Pitfalls
Add Then Norm, Not Norm Then Add
The original Transformer uses Post-Norm: LayerNorm(x + sublayer(x)). Some implementations use Pre-Norm: x + sublayer(LayerNorm(x)). This visualization uses Post-Norm as in the original paper.
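The two orderings differ only in where the normalization sits relative to the residual add; a side-by-side sketch (learnable norm parameters omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    m, v = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - m) / np.sqrt(v + eps)

def post_norm(x, sublayer):
    """Original Transformer: normalize after the residual add."""
    return layer_norm(x + sublayer(x))

def pre_norm(x, sublayer):
    """Alternative: normalize the input; the residual add stays outside."""
    return x + sublayer(layer_norm(x))

x = np.arange(8.0).reshape(2, 4)         # 2 tokens, model dim 4
sub = lambda t: 2.0 * t                  # toy sublayer for illustration
a, b = post_norm(x, sub), pre_norm(x, sub)
```

One observable difference: Post-Norm output is always normalized (each token has mean ≈ 0), while Pre-Norm output keeps an unnormalized residual path.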
FFN Is Per-Token
Unlike attention which mixes information across tokens, the FFN processes each token independently with the same weights. Cross-token interaction only happens in the attention sublayer.
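Per-token independence is easy to verify directly: perturbing one token's input leaves every other token's FFN output unchanged.

```python
import numpy as np

d, d_ff = 8, 32
rng = np.random.default_rng(1)
W1 = rng.standard_normal((d, d_ff))
W2 = rng.standard_normal((d_ff, d))

def ffn(x):
    """Row-wise FFN: each token (row) is transformed independently."""
    return np.maximum(0, x @ W1) @ W2

x = rng.standard_normal((4, d))
y = ffn(x)

x2 = x.copy()
x2[0] += 10.0          # perturb only token 0
y2 = ffn(x2)
# Tokens 1-3 produce identical outputs: the FFN never mixes rows.
```

The same check run through the attention sublayer would show every token's output changing, since attention mixes information across positions.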