
Transformer Block

Category: Generative AI
Difficulty: Advanced
Time Complexity: O(n² × d + n × d × d_ff)
Space Complexity: O(n² + n × d_ff)

A transformer block is the fundamental building unit of models like BERT, GPT, and LLaMA. It processes input embeddings through a sequence of sublayers: Multi-Head Self-Attention → Add & Layer Norm → Feed-Forward Network → Add & Layer Norm. Residual connections (skip connections) add the input of each sublayer to its output before normalization, preserving the original signal and enabling deep networks. This visualization shows every computation: attention scores, residual additions, layer normalization, and the two-layer FFN with ReLU activation.

{
"tokens": [
"The",
"cat",
"sat"
],
"embeddingDim": 4,
"ffnDim": 8,
"numHeads": 1
}
{
"tokens": [
"The",
"cat",
"sat",
"down"
],
"embeddingDim": 8,
"ffnDim": 16,
"numHeads": 2
}
function TransformerBlock(X, W_attn, W_ffn):
    // Sublayer 1: Self-Attention + Add & Norm
    attn_output = MultiHeadAttention(X)
    residual_1 = X + attn_output          // residual connection
    norm_1 = LayerNorm(residual_1)
    // Sublayer 2: FFN + Add & Norm
    ffn_output = FFN(norm_1)
               = ReLU(norm_1 × W₁ + b₁) × W₂ + b₂
    residual_2 = norm_1 + ffn_output      // residual connection
    norm_2 = LayerNorm(residual_2)
    return norm_2
def transformer_block(X, attn_params, ffn_params):
    W1, b1, W2, b2 = ffn_params          # unpack FFN weights (previously undefined)
    # Sublayer 1: Self-Attention + Add & Norm
    attn_out = multi_head_attention(X, **attn_params)
    residual_1 = X + attn_out            # residual connection
    norm_1 = layer_norm(residual_1)
    # Sublayer 2: FFN + Add & Norm
    ffn_out = relu(norm_1 @ W1 + b1) @ W2 + b2
    residual_2 = norm_1 + ffn_out        # residual connection
    norm_2 = layer_norm(residual_2)
    return norm_2
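The snippet above leaves `multi_head_attention`, `layer_norm`, and `relu` undefined. A minimal self-contained NumPy sketch follows — single attention head, no learnable layer-norm gain/bias, and randomly initialized weights purely for shape-checking; it is an illustration, not the visualization's actual code:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector (each row) to mean ≈ 0, variance ≈ 1
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def single_head_attention(X, Wq, Wk, Wv):
    # Scaled dot-product self-attention over the whole sequence
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

def transformer_block(X, attn_params, ffn_params):
    W1, b1, W2, b2 = ffn_params
    attn_out = single_head_attention(X, *attn_params)
    norm_1 = layer_norm(X + attn_out)                     # Add & Norm
    ffn_out = np.maximum(0, norm_1 @ W1 + b1) @ W2 + b2   # ReLU FFN
    return layer_norm(norm_1 + ffn_out)                   # Add & Norm

rng = np.random.default_rng(0)
n, d, d_ff = 3, 4, 8        # matches the "The cat sat" preset above
X = rng.normal(size=(n, d))
attn_params = [rng.normal(size=(d, d)) for _ in range(3)]
ffn_params = (rng.normal(size=(d, d_ff)), np.zeros(d_ff),
              rng.normal(size=(d_ff, d)), np.zeros(d))
out = transformer_block(X, attn_params, ffn_params)
print(out.shape)   # (3, 4): output shape matches input
```

Note that the block maps a (tokens × d_model) matrix to another matrix of the same shape, which is what lets blocks be stacked.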

Each sublayer’s output is added to its input: output = x + sublayer(x). This ‘skip connection’ ensures that gradients can flow directly through the network, enabling training of very deep models (100+ layers). Without residuals, deep transformers fail to train.
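The identity path is easy to see in code: if a sublayer contributes nothing, the input passes through unchanged. A toy sketch (the zero sublayer is a stand-in for "a sublayer that has learned to do nothing"):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
sublayer = lambda v: np.zeros_like(v)   # stand-in: sublayer that outputs zeros
out = x + sublayer(x)                   # residual: output = x + sublayer(x)
print(np.array_equal(out, x))           # True: the input survives untouched
```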

After each residual addition, layer normalization normalizes each token’s vector to have mean ≈ 0 and variance ≈ 1. This stabilizes the internal activations and helps the model train faster. Unlike batch normalization, layer norm operates on individual examples.
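The per-token normalization can be verified directly. A sketch without the learnable gain/bias (γ, β) that production layer norms add; the eps value is a common default, not taken from the source:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's vector across its features (the last axis)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 0.0, -10.0, 0.0]])
y = layer_norm(x)
print(y.mean(axis=-1))  # ≈ [0, 0]: each token normalized independently
print(y.var(axis=-1))   # ≈ [1, 1]
```

Because the statistics are computed per row, each token is normalized on its own — no other example (or token) in the batch affects the result, unlike batch normalization.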

The FFN applies two linear transformations with a ReLU activation: FFN(x) = max(0, x·W₁ + b₁)·W₂ + b₂. The hidden dimension (d_ff) is typically 4× the model dimension. This gives each token a nonlinear transformation independent of other tokens.
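The per-token claim can be checked numerically: applying the FFN to the whole token matrix and to each row separately gives identical results. Weights below are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ff = 4, 16                # d_ff = 4 × d_model, the common ratio
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    # FFN(x) = max(0, x·W1 + b1)·W2 + b2
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

X = rng.normal(size=(3, d_model))    # 3 tokens
whole = ffn(X)
per_token = np.stack([ffn(row) for row in X])
print(np.allclose(whole, per_token))  # True: no cross-token mixing
```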

Real transformers stack 6 (original), 12 (BERT-base), 24 (GPT-2), or 96+ (GPT-4) identical blocks. Each block refines the representations. The output of one block is the input to the next.
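Stacking is just repeated application of the same block shape. A simplified stand-in block (attention omitted, so only the stacking pattern is illustrated; weights are random):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def block(x, params):
    # Stand-in for one transformer block: Add & Norm around a per-token map
    W1, W2 = params
    return layer_norm(x + np.maximum(0, x @ W1) @ W2)

rng = np.random.default_rng(2)
d, d_ff, n_layers = 8, 16, 12        # 12 blocks, as in BERT-base
params = [(rng.normal(size=(d, d_ff)) * 0.1,
           rng.normal(size=(d_ff, d)) * 0.1) for _ in range(n_layers)]

x = rng.normal(size=(3, d))          # 3 tokens
for p in params:                     # output of one block feeds the next
    x = block(x, p)
print(x.shape)                       # (3, 8): shape preserved through the stack
```

Because every block maps (tokens × d_model) to (tokens × d_model), the loop composes without any reshaping.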

  • Add Then Norm, Not Norm Then Add: The original Transformer uses Post-Norm: LayerNorm(x + sublayer(x)). Some implementations use Pre-Norm: x + sublayer(LayerNorm(x)). This visualization uses Post-Norm as in the original paper.
  • FFN Is Per-Token: Unlike attention which mixes information across tokens, the FFN processes each token independently with the same weights. Cross-token interaction only happens in the attention sublayer.
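The Post-Norm vs. Pre-Norm distinction from the first point can be made concrete. A sketch where `sublayer` is a stand-in for attention or the FFN; the two orderings produce different outputs:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def sublayer(x):
    return np.tanh(x)                       # stand-in for attention or FFN

def post_norm(x):
    return layer_norm(x + sublayer(x))      # original Transformer: LayerNorm(x + sublayer(x))

def pre_norm(x):
    return x + sublayer(layer_norm(x))      # Pre-Norm variant, used by e.g. GPT-2

x = np.random.default_rng(3).normal(size=(3, 4))
print(np.allclose(post_norm(x), pre_norm(x)))  # False: the orderings are not equivalent
```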

Q1: What is the purpose of the residual connection in a transformer block?

  • A) To reduce the number of parameters
  • B) To allow gradients to flow directly through the network
  • C) To increase the embedding dimension
  • D) To apply nonlinearity
Show answer

Answer: B) To allow gradients to flow directly through the network

Residual connections create a direct path for gradients, preventing the vanishing gradient problem in deep networks. The identity mapping lets the network learn the ‘delta’ (what to add) rather than the full transformation.

Q2: If d_model = 512 and d_ff = 2048, how many parameters does the FFN have (ignoring biases)?

  • A) 512 × 512
  • B) 512 × 2048
  • C) 2 × 512 × 2048
  • D) 512 × 2048 × 512
Show answer

Answer: C) 2 × 512 × 2048

The FFN has two weight matrices: W₁ [512, 2048] and W₂ [2048, 512]. Total parameters = 512×2048 + 2048×512 = 2 × 512 × 2048 = 2,097,152.
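The arithmetic is quick to verify:

```python
d_model, d_ff = 512, 2048
ffn_params = d_model * d_ff + d_ff * d_model   # W1 [512, 2048] + W2 [2048, 512], biases ignored
print(ffn_params)                              # 2097152 = 2 × 512 × 2048
```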