
Backpropagation

Category: Deep Learning
Difficulty: Intermediate
Time Complexity: O(Σ n_l × n_{l+1})
Space Complexity: O(Σ n_l × n_{l+1})

Backpropagation (backward propagation of errors) is the algorithm that makes neural network training practical. After a forward pass computes the network's prediction, backpropagation computes the gradient of the loss function with respect to every weight in the network by applying the chain rule of calculus layer by layer, from the output back to the input. These gradients tell each weight which way to move to reduce the loss. Combined with an optimizer such as gradient descent, backpropagation enables the network to learn from data. Popularized by Rumelhart, Hinton, and Williams in 1986, backpropagation remains the foundation of modern deep learning training.

{
  "layerSizes": [2, 2, 1],
  "inputValues": [0.5, 0.8],
  "targets": [1],
  "activationFunction": "sigmoid",
  "lossFunction": "mse",
  "learningRate": 0.1,
  "seed": "backprop-default"
}
{
  "layerSizes": [2, 2, 1],
  "inputValues": [0.5, 0.8],
  "targets": [1],
  "activationFunction": "sigmoid",
  "lossFunction": "mse",
  "learningRate": 0.5,
  "seed": "backprop-default"
}
function backpropagation(network, input, target, learningRate):
    // Forward pass: cache pre-activations z[l] and activations a[l]
    activations, z = forwardPass(network, input)
    loss = computeLoss(activations[-1], target)

    // Backward pass: delta holds ∂Loss/∂z for the current layer,
    // starting at the output (loss derivative times activation derivative)
    delta = lossDeriv(activations[-1], target) * activDeriv(z[-1])
    for layer in reversed(range(num_layers - 1)):
        weightGrad = delta × activations[layer].T
        biasGrad = delta
        if layer > 0:
            // Propagate the error backward before this layer's weights change
            delta = (weights[layer].T × delta) * activDeriv(z[layer - 1])
        // Gradient-descent update
        weights[layer] -= learningRate * weightGrad
        biases[layer] -= learningRate * biasGrad
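As a concrete sketch, the pseudocode above can be implemented in plain Python with NumPy for the 2-2-1 sigmoid/MSE network from the demo config. The helper names (`backprop_step`, etc.) are illustrative choices, not from any library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_step(weights, biases, x, target, lr):
    """One forward/backward pass with gradient-descent updates (MSE loss)."""
    # Forward pass: cache pre-activations z and activations a
    activations, zs = [x], []
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    loss = 0.5 * np.sum((a - target) ** 2)

    # Backward pass: delta = dLoss/dz at the output layer
    delta = (a - target) * sigmoid_deriv(zs[-1])
    for layer in reversed(range(len(weights))):
        w_grad = np.outer(delta, activations[layer])
        b_grad = delta
        if layer > 0:
            # Propagate the error before overwriting this layer's weights
            delta = (weights[layer].T @ delta) * sigmoid_deriv(zs[layer - 1])
        weights[layer] -= lr * w_grad
        biases[layer] -= lr * b_grad
    return loss

# The [2, 2, 1] network and inputs from the demo config above
rng = np.random.default_rng(0)
weights = [rng.standard_normal((2, 2)), rng.standard_normal((1, 2))]
biases = [np.zeros(2), np.zeros(1)]
x, target = np.array([0.5, 0.8]), np.array([1.0])

losses = [backprop_step(weights, biases, x, target, lr=0.1) for _ in range(200)]
print(losses[0], losses[-1])  # the loss decreases over training
```

Repeating `backprop_step` on the same example drives the loss toward zero, which is exactly what the gradient-descent update rule below predicts.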

Backpropagation is an efficient application of the chain rule. To find how a weight in an early layer affects the loss, we multiply the local gradients along the path from the loss back to that weight: ∂Loss/∂w = ∂Loss/∂a × ∂a/∂z × ∂z/∂w.
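As a sanity check on that chain-rule product (an illustrative sketch, not part of the demo), the analytic gradient for a single sigmoid neuron with MSE loss can be compared against a numerical finite difference:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, x, target = 0.4, 0.5, 1.0

# Analytic gradient via the chain rule: dLoss/dw = dLoss/da * da/dz * dz/dw
z = w * x
a = sigmoid(z)
dLoss_da = a - target          # from Loss = 0.5 * (a - target)^2
da_dz = a * (1.0 - a)          # sigmoid derivative
dz_dw = x                      # z = w * x
analytic = dLoss_da * da_dz * dz_dw

# Numerical gradient via central finite differences
eps = 1e-6
loss = lambda w_: 0.5 * (sigmoid(w_ * x) - target) ** 2
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(analytic, numeric)  # the two values agree closely
```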

Gradients indicate the direction of steepest increase. To minimize the loss, we move weights in the opposite direction of the gradient: w_new = w_old - η × ∂Loss/∂w, where η is the learning rate.
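The update rule can be seen in isolation on a one-dimensional loss (a minimal sketch with a made-up quadratic, chosen so the gradient is easy to read off):

```python
# Loss(w) = (w - 3)^2 has its minimum at w = 3
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)  # dLoss/dw

eta = 0.1                   # learning rate
w = 0.0
w_new = w - eta * grad(w)   # step against the gradient: 0 - 0.1 * (-6) = 0.6
print(loss(w), loss(w_new)) # the loss drops after the step
```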

Training requires two passes: a forward pass to compute predictions and loss, then a backward pass to compute gradients. Both passes have the same computational cost — O(number of weights).
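For the demo's [2, 2, 1] network, the weight count Σ n_l × n_{l+1} works out as follows (a small illustrative helper):

```python
def num_weights(layer_sizes):
    """Sum n_l * n_{l+1} over consecutive layer pairs."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

print(num_weights([2, 2, 1]))  # 2*2 + 2*1 = 6 weights
```

Both the forward and backward pass touch each of these weights a constant number of times, which is where the O(Σ n_l × n_{l+1}) bound comes from.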

  • Learning rate too large: updates overshoot the minimum, so the loss oscillates or diverges. Too small, and convergence is painfully slow.
  • Vanishing gradients in deep networks: When using sigmoid or tanh activations, gradients shrink exponentially as they propagate backward through many layers, making early layers almost impossible to train.
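The vanishing-gradient effect is easy to quantify (an illustrative upper bound, not a full analysis): the sigmoid derivative peaks at 0.25, so a chain of n sigmoid layers multiplies the backpropagated gradient by at most 0.25^n:

```python
max_sigmoid_deriv = 0.25  # sigma'(z) = sigma(z) * (1 - sigma(z)) peaks at z = 0
for depth in (1, 5, 10, 20):
    print(depth, max_sigmoid_deriv ** depth)
# By depth 20 the bound is below 1e-12: early layers receive almost no signal
```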

Q1: What mathematical tool does backpropagation use to compute gradients?

  • A) Integration by parts
  • B) The chain rule of calculus
  • C) Taylor series expansion
  • D) Fourier transform

Answer: B) The chain rule of calculus

Backpropagation applies the chain rule to compute the gradient of the loss with respect to each weight by multiplying local gradients along the computation path.