About the Transformer Block

A transformer block is the fundamental building unit of models like BERT, GPT, and LLaMA. It processes input embeddings through a sequence of sublayers: Multi-Head Self-Attention → Add & Layer Norm → Feed-Forward Network → Add & Layer Norm. Residual connections (skip connections) add the input of each sublayer to its output before normalization, preserving the original signal and enabling deep networks. This visualization shows every computation: attention scores, residual additions, layer normalization, and the two-layer FFN with ReLU activation.
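The whole pipeline can be sketched in a few lines of NumPy. This is a minimal single-head version (the visualization's multi-head attention runs several such heads in parallel and concatenates them); the parameter dictionary keys are hypothetical names chosen for this sketch, and learned gain/bias terms of layer norm are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector (last axis) to mean ~0, variance ~1.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Single head for brevity; scores mix information across tokens.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return scores @ V

def ffn(x, W1, b1, W2, b2):
    # Two linear layers with ReLU, applied to each token independently.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def transformer_block(x, p):
    # Post-Norm order: Attention -> Add & Norm -> FFN -> Add & Norm
    x = layer_norm(x + self_attention(x, p['Wq'], p['Wk'], p['Wv']))
    x = layer_norm(x + ffn(x, p['W1'], p['b1'], p['W2'], p['b2']))
    return x

# Toy usage: 4 tokens, d_model = 8, d_ff = 32 (the usual 4x ratio).
rng = np.random.default_rng(0)
d, d_ff = 8, 32
p = {'Wq': rng.normal(size=(d, d)), 'Wk': rng.normal(size=(d, d)),
     'Wv': rng.normal(size=(d, d)),
     'W1': rng.normal(size=(d, d_ff)), 'b1': np.zeros(d_ff),
     'W2': rng.normal(size=(d_ff, d)), 'b2': np.zeros(d)}
x = rng.normal(size=(4, d))
y = transformer_block(x, p)   # same shape as the input: (4, 8)
```

Note that the block is shape-preserving: input and output are both (tokens × d_model), which is what lets blocks be stacked.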

Complexity Analysis

Time Complexity
O(n² × d + n × d × d_ff)
Space Complexity
O(n² + n × d_ff)
Difficulty
advanced

Key Concepts

Residual Connections

Each sublayer's output is added to its input: output = x + sublayer(x). This 'skip connection' ensures that gradients can flow directly through the network, enabling training of very deep models (100+ layers). Without residuals, deep transformers fail to train.
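The residual pattern itself is one line. A quick sketch, using a deliberately trivial sublayer to show that the input always survives the addition:

```python
import numpy as np

def residual(x, sublayer):
    # output = x + sublayer(x): the identity path carries the input
    # forward unchanged, giving gradients a direct route through the block.
    return x + sublayer(x)

# Even a sublayer that contributes nothing leaves x intact,
# so early layers are never "overwritten" by later ones.
x = np.ones((2, 4))
out = residual(x, lambda t: np.zeros_like(t))   # out == x
```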

Layer Normalization

After each residual addition, layer normalization normalizes each token's vector to have mean ≈ 0 and variance ≈ 1. This stabilizes the internal activations and helps the model train faster. Unlike batch normalization, layer norm operates on individual examples.
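A minimal NumPy sketch of layer norm, without the learned gain and bias that real implementations add after normalization. The key detail is the axis: statistics are computed over each token's feature vector, not over the batch.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-token normalization: mean and variance over the last axis,
    # so each token's vector ends up with mean ~0 and variance ~1.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)   # eps avoids division by zero

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(3, 8))   # 3 tokens, d_model = 8
out = layer_norm(x)
```

Because each row is normalized independently, the result for one token does not depend on which other tokens are in the batch, unlike batch normalization.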

Feed-Forward Network

The FFN applies two linear transformations with a ReLU activation: FFN(x) = max(0, x·W₁ + b₁)·W₂ + b₂. The hidden dimension (d_ff) is typically 4× the model dimension. This gives each token a nonlinear transformation independent of other tokens.
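The formula above translates directly into code. A sketch assuming d_model = 8 and the conventional d_ff = 4 × d_model:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x·W1 + b1)·W2 + b2
    # Expand to d_ff, apply ReLU, project back to d_model.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 8, 32   # d_ff = 4 × d_model
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(5, d_model))   # 5 tokens
y = ffn(x, W1, b1, W2, b2)          # shape (5, d_model): width is restored
```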

Block Stacking

Real transformers stack many identical blocks: 6 in the original Transformer, 12 in BERT-base, 24 in BERT-large, 48 in GPT-2 XL, and 96 in GPT-3. Each block refines the representations; the output of one block is the input to the next.
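Stacking is just a loop, possible because each block maps (tokens × d_model) to the same shape. A toy sketch with the attention sublayer omitted for brevity (each "block" here is only a residual ReLU sublayer plus layer norm):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def toy_block(h, W):
    # Stand-in for one transformer block (attention omitted):
    # residual sublayer followed by layer norm, Post-Norm style.
    return layer_norm(h + np.maximum(0, h @ W))

rng = np.random.default_rng(0)
n_layers, d = 12, 8   # e.g. 12 blocks, as in BERT-base
Ws = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_layers)]

h = rng.normal(size=(4, d))   # 4 tokens
for W in Ws:                  # output of one block feeds the next
    h = toy_block(h, W)
```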

Common Pitfalls

Add Then Norm, Not Norm Then Add

The original Transformer uses Post-Norm: LayerNorm(x + sublayer(x)). Many later implementations use Pre-Norm: x + sublayer(LayerNorm(x)), which tends to train more stably in very deep stacks and is the choice in models such as GPT-2 and LLaMA. This visualization uses Post-Norm as in the original paper.
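The two orderings differ only in where the normalization sits relative to the residual addition. A side-by-side sketch (layer norm again without learned parameters):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def post_norm(x, sublayer):
    # Original Transformer: normalize AFTER the residual addition.
    return layer_norm(x + sublayer(x))

def pre_norm(x, sublayer):
    # Common modern variant: normalize the sublayer INPUT,
    # then add the raw (un-normalized) residual.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
sub = lambda t: t @ W          # a toy linear sublayer
x = rng.normal(size=(4, 8))

y_post = post_norm(x, sub)     # every token normalized at the output
y_pre = pre_norm(x, sub)       # output retains the un-normalized residual
```

One observable consequence: the Post-Norm output is always normalized per token, while the Pre-Norm output generally is not, since the residual bypasses the norm.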

FFN Is Per-Token

Unlike attention which mixes information across tokens, the FFN processes each token independently with the same weights. Cross-token interaction only happens in the attention sublayer.
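Token independence is easy to verify empirically: if the FFN touches each token in isolation, shuffling the tokens before the FFN must give the same result as shuffling its outputs. A sketch of that check:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # The same weights act on every token row; no cross-row mixing.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)
x = rng.normal(size=(5, 8))    # 5 tokens

perm = rng.permutation(5)
# Permute-then-FFN equals FFN-then-permute: each row's output
# depends only on that row, so the FFN is permutation-equivariant.
a = ffn(x[perm], W1, b1, W2, b2)
b = ffn(x, W1, b1, W2, b2)[perm]
```

The same check would fail for the attention sublayer, whose output for one token depends on all the others.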

Prerequisites

Understanding these algorithms first will help: