
Transformer Block

Category: Generative AI
Difficulty: Advanced
Time Complexity: O(n² × d + n × d × d_ff)
Space Complexity: O(n² + n × d_ff)

A transformer block is the fundamental building unit of models like BERT, GPT, and LLaMA. It processes input embeddings through a sequence of sublayers: Multi-Head Self-Attention → Add & Layer Norm → Feed-Forward Network → Add & Layer Norm. Residual connections (skip connections) add the input of each sublayer to its output before normalization, preserving the original signal and enabling deep networks. This visualization shows every computation: attention scores, residual additions, layer normalization, and the two-layer FFN with ReLU activation.

{
"tokens": [
"The",
"cat",
"sat"
],
"embeddingDim": 4,
"ffnDim": 8,
"numHeads": 1
}
{
"tokens": [
"The",
"cat",
"sat",
"down"
],
"embeddingDim": 8,
"ffnDim": 16,
"numHeads": 2
}
function TransformerBlock(X, W_attn, W_ffn):
    // Sublayer 1: Self-Attention + Add & Norm
    attn_output = MultiHeadAttention(X)
    residual_1 = X + attn_output          // residual connection
    norm_1 = LayerNorm(residual_1)
    // Sublayer 2: FFN + Add & Norm
    ffn_output = FFN(norm_1)
               = ReLU(norm_1 × W₁ + b₁) × W₂ + b₂
    residual_2 = norm_1 + ffn_output      // residual connection
    norm_2 = LayerNorm(residual_2)
    return norm_2
def transformer_block(X, attn_params, ffn_params):
    W1, b1, W2, b2 = ffn_params          # unpack FFN weights (previously undefined)
    # Sublayer 1: Self-Attention + Add & Norm
    attn_out = multi_head_attention(X, **attn_params)
    residual_1 = X + attn_out            # residual connection
    norm_1 = layer_norm(residual_1)
    # Sublayer 2: FFN + Add & Norm
    ffn_out = relu(norm_1 @ W1 + b1) @ W2 + b2
    residual_2 = norm_1 + ffn_out        # residual connection
    norm_2 = layer_norm(residual_2)
    return norm_2
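The snippet above leaves `multi_head_attention`, `layer_norm`, and `relu` undefined. A minimal self-contained NumPy sketch follows — single attention head, no learnable layer-norm gain/bias, and randomly initialized weights purely for shape-checking; it is an illustration, not the visualization's actual code:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector (each row) to mean ≈ 0, variance ≈ 1
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def single_head_attention(X, Wq, Wk, Wv):
    # Scaled dot-product self-attention over the whole sequence
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

def transformer_block(X, attn_params, ffn_params):
    W1, b1, W2, b2 = ffn_params
    attn_out = single_head_attention(X, *attn_params)
    norm_1 = layer_norm(X + attn_out)                     # Add & Norm
    ffn_out = np.maximum(0, norm_1 @ W1 + b1) @ W2 + b2   # ReLU FFN
    return layer_norm(norm_1 + ffn_out)                   # Add & Norm

rng = np.random.default_rng(0)
n, d, d_ff = 3, 4, 8        # matches the "The cat sat" preset above
X = rng.normal(size=(n, d))
attn_params = [rng.normal(size=(d, d)) for _ in range(3)]
ffn_params = (rng.normal(size=(d, d_ff)), np.zeros(d_ff),
              rng.normal(size=(d_ff, d)), np.zeros(d))
out = transformer_block(X, attn_params, ffn_params)
print(out.shape)   # (3, 4): output shape matches input
```

Note that the block maps a (tokens × d_model) matrix to another matrix of the same shape, which is what lets blocks be stacked.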

Each sublayer’s output is added to its input: output = x + sublayer(x). This ‘skip connection’ ensures that gradients can flow directly through the network, enabling training of very deep models (100+ layers). Without residuals, deep transformers fail to train.
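The identity path is easy to see in code: if a sublayer contributes nothing, the input passes through unchanged. A toy sketch (the zero sublayer is a stand-in for "a sublayer that has learned to do nothing"):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
sublayer = lambda v: np.zeros_like(v)   # stand-in: sublayer that outputs zeros
out = x + sublayer(x)                   # residual: output = x + sublayer(x)
print(np.array_equal(out, x))           # True: the input survives untouched
```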

After each residual addition, layer normalization normalizes each token’s vector to have mean ≈ 0 and variance ≈ 1. This stabilizes the internal activations and helps the model train faster. Unlike batch normalization, layer norm operates on individual examples.
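The per-token normalization can be verified directly. A sketch without the learnable gain/bias (γ, β) that production layer norms add; the eps value is a common default, not taken from the source:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's vector across its features (the last axis)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 0.0, -10.0, 0.0]])
y = layer_norm(x)
print(y.mean(axis=-1))  # ≈ [0, 0]: each token normalized independently
print(y.var(axis=-1))   # ≈ [1, 1]
```

Because the statistics are computed per row, each token is normalized on its own — no other example (or token) in the batch affects the result, unlike batch normalization.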

The FFN applies two linear transformations with a ReLU activation: FFN(x) = max(0, x·W₁ + b₁)·W₂ + b₂. The hidden dimension (d_ff) is typically 4× the model dimension. This gives each token a nonlinear transformation independent of other tokens.
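The per-token claim can be checked numerically: applying the FFN to the whole token matrix and to each row separately gives identical results. Weights below are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ff = 4, 16                # d_ff = 4 × d_model, the common ratio
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    # FFN(x) = max(0, x·W1 + b1)·W2 + b2
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

X = rng.normal(size=(3, d_model))    # 3 tokens
whole = ffn(X)
per_token = np.stack([ffn(row) for row in X])
print(np.allclose(whole, per_token))  # True: no cross-token mixing
```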

Real transformers stack 6 (original), 12 (BERT-base), 24 (GPT-2), or 96+ (GPT-4) identical blocks. Each block refines the representations. The output of one block is the input to the next.
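Stacking is just repeated application of the same block shape. A simplified stand-in block (attention omitted, so only the stacking pattern is illustrated; weights are random):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def block(x, params):
    # Stand-in for one transformer block: Add & Norm around a per-token map
    W1, W2 = params
    return layer_norm(x + np.maximum(0, x @ W1) @ W2)

rng = np.random.default_rng(2)
d, d_ff, n_layers = 8, 16, 12        # 12 blocks, as in BERT-base
params = [(rng.normal(size=(d, d_ff)) * 0.1,
           rng.normal(size=(d_ff, d)) * 0.1) for _ in range(n_layers)]

x = rng.normal(size=(3, d))          # 3 tokens
for p in params:                     # output of one block feeds the next
    x = block(x, p)
print(x.shape)                       # (3, 8): shape preserved through the stack
```

Because every block maps (tokens × d_model) to (tokens × d_model), the loop composes without any reshaping.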

  • Add Then Norm, Not Norm Then Add: The original Transformer uses Post-Norm: LayerNorm(x + sublayer(x)). Some implementations use Pre-Norm: x + sublayer(LayerNorm(x)). This visualization uses Post-Norm as in the original paper.
  • FFN Is Per-Token: Unlike attention which mixes information across tokens, the FFN processes each token independently with the same weights. Cross-token interaction only happens in the attention sublayer.
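The Post-Norm vs. Pre-Norm distinction from the first point can be made concrete. A sketch where `sublayer` is a stand-in for attention or the FFN; the two orderings produce different outputs:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def sublayer(x):
    return np.tanh(x)                       # stand-in for attention or FFN

def post_norm(x):
    return layer_norm(x + sublayer(x))      # original Transformer: LayerNorm(x + sublayer(x))

def pre_norm(x):
    return x + sublayer(layer_norm(x))      # Pre-Norm variant, used by e.g. GPT-2

x = np.random.default_rng(3).normal(size=(3, 4))
print(np.allclose(post_norm(x), pre_norm(x)))  # False: the orderings are not equivalent
```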

Q1: What is the purpose of the residual connection in a transformer block?

  • A) To reduce the number of parameters
  • B) To allow gradients to flow directly through the network
  • C) To increase the embedding dimension
  • D) To apply nonlinearity
Show answer

Answer: B) To allow gradients to flow directly through the network

Residual connections create a direct path for gradients, preventing the vanishing gradient problem in deep networks. The identity mapping lets the network learn the ‘delta’ (what to add) rather than the full transformation.

Q2: If d_model = 512 and d_ff = 2048, how many parameters does the FFN have (ignoring biases)?

  • A) 512 × 512
  • B) 512 × 2048
  • C) 2 × 512 × 2048
  • D) 512 × 2048 × 512
Show answer

Answer: C) 2 × 512 × 2048

The FFN has two weight matrices: W₁ [512, 2048] and W₂ [2048, 512]. Total parameters = 512×2048 + 2048×512 = 2 × 512 × 2048 = 2,097,152.
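The arithmetic is quick to verify:

```python
d_model, d_ff = 512, 2048
ffn_params = d_model * d_ff + d_ff * d_model   # W1 [512, 2048] + W2 [2048, 512], biases ignored
print(ffn_params)                              # 2097152 = 2 × 512 × 2048
```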