
Backpropagation

Category: Deep Learning
Difficulty: Intermediate
Time Complexity: O(Σ n_l × n_{l+1})
Space Complexity: O(Σ n_l × n_{l+1})

Backpropagation (backward propagation of errors) is the algorithm that makes neural network training practical. After a forward pass computes the network's prediction, backpropagation computes the gradient of the loss function with respect to every weight in the network by applying the chain rule of calculus layer by layer, from the output back to the input. These gradients tell each weight which way to move to reduce the loss. Combined with an optimizer such as gradient descent, backpropagation enables the network to learn from data. Popularized by Rumelhart, Hinton, and Williams in 1986, backpropagation remains the foundation of modern deep learning training.

{
  "layerSizes": [2, 2, 1],
  "inputValues": [0.5, 0.8],
  "targets": [1],
  "activationFunction": "sigmoid",
  "lossFunction": "mse",
  "learningRate": 0.1,
  "seed": "backprop-default"
}
{
  "layerSizes": [2, 2, 1],
  "inputValues": [0.5, 0.8],
  "targets": [1],
  "activationFunction": "sigmoid",
  "lossFunction": "mse",
  "learningRate": 0.5,
  "seed": "backprop-default"
}
function backpropagation(network, input, target, learningRate):
    // Forward pass: cache pre-activations z[l] and activations a[l]
    activations, z = forwardPass(network, input)
    loss = computeLoss(activations[-1], target)

    // Backward pass: delta holds ∂Loss/∂z for the current layer,
    // starting at the output (loss derivative times activation derivative)
    delta = lossDeriv(activations[-1], target) * activDeriv(z[-1])
    for layer in reversed(range(num_layers - 1)):
        weightGrad = delta × activations[layer].T
        biasGrad = delta
        if layer > 0:
            // Propagate the error backward before this layer's weights change
            delta = (weights[layer].T × delta) * activDeriv(z[layer - 1])
        // Gradient-descent update
        weights[layer] -= learningRate * weightGrad
        biases[layer] -= learningRate * biasGrad
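As a concrete sketch, the pseudocode above can be implemented in plain Python with NumPy for the 2-2-1 sigmoid/MSE network from the demo config. The helper names (`backprop_step`, etc.) are illustrative choices, not from any library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_step(weights, biases, x, target, lr):
    """One forward/backward pass with gradient-descent updates (MSE loss)."""
    # Forward pass: cache pre-activations z and activations a
    activations, zs = [x], []
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    loss = 0.5 * np.sum((a - target) ** 2)

    # Backward pass: delta = dLoss/dz at the output layer
    delta = (a - target) * sigmoid_deriv(zs[-1])
    for layer in reversed(range(len(weights))):
        w_grad = np.outer(delta, activations[layer])
        b_grad = delta
        if layer > 0:
            # Propagate the error before overwriting this layer's weights
            delta = (weights[layer].T @ delta) * sigmoid_deriv(zs[layer - 1])
        weights[layer] -= lr * w_grad
        biases[layer] -= lr * b_grad
    return loss

# The [2, 2, 1] network and inputs from the demo config above
rng = np.random.default_rng(0)
weights = [rng.standard_normal((2, 2)), rng.standard_normal((1, 2))]
biases = [np.zeros(2), np.zeros(1)]
x, target = np.array([0.5, 0.8]), np.array([1.0])

losses = [backprop_step(weights, biases, x, target, lr=0.1) for _ in range(200)]
print(losses[0], losses[-1])  # the loss decreases over training
```

Repeating `backprop_step` on the same example drives the loss toward zero, which is exactly what the gradient-descent update rule below predicts.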

Backpropagation is an efficient application of the chain rule. To find how a weight in an early layer affects the loss, we multiply the local gradients along the path from the loss back to that weight: ∂Loss/∂w = ∂Loss/∂a × ∂a/∂z × ∂z/∂w.
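As a sanity check on that chain-rule product (an illustrative sketch, not part of the demo), the analytic gradient for a single sigmoid neuron with MSE loss can be compared against a numerical finite difference:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, x, target = 0.4, 0.5, 1.0

# Analytic gradient via the chain rule: dLoss/dw = dLoss/da * da/dz * dz/dw
z = w * x
a = sigmoid(z)
dLoss_da = a - target          # from Loss = 0.5 * (a - target)^2
da_dz = a * (1.0 - a)          # sigmoid derivative
dz_dw = x                      # z = w * x
analytic = dLoss_da * da_dz * dz_dw

# Numerical gradient via central finite differences
eps = 1e-6
loss = lambda w_: 0.5 * (sigmoid(w_ * x) - target) ** 2
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(analytic, numeric)  # the two values agree closely
```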

Gradients indicate the direction of steepest increase. To minimize the loss, we move weights in the opposite direction of the gradient: w_new = w_old - η × ∂Loss/∂w, where η is the learning rate.
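The update rule can be seen in isolation on a one-dimensional loss (a minimal sketch with a made-up quadratic, chosen so the gradient is easy to read off):

```python
# Loss(w) = (w - 3)^2 has its minimum at w = 3
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)  # dLoss/dw

eta = 0.1                   # learning rate
w = 0.0
w_new = w - eta * grad(w)   # step against the gradient: 0 - 0.1 * (-6) = 0.6
print(loss(w), loss(w_new)) # the loss drops after the step
```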

Training requires two passes: a forward pass to compute predictions and loss, then a backward pass to compute gradients. Both passes have the same computational cost — O(number of weights).
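For the demo's [2, 2, 1] network, the weight count Σ n_l × n_{l+1} works out as follows (a small illustrative helper):

```python
def num_weights(layer_sizes):
    """Sum n_l * n_{l+1} over consecutive layer pairs."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

print(num_weights([2, 2, 1]))  # 2*2 + 2*1 = 6 weights
```

Both the forward and backward pass touch each of these weights a constant number of times, which is where the O(Σ n_l × n_{l+1}) bound comes from.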

  • Learning rate too large: updates overshoot the minimum, so the loss oscillates or diverges. Too small, and convergence is painfully slow.
  • Vanishing gradients in deep networks: When using sigmoid or tanh activations, gradients shrink exponentially as they propagate backward through many layers, making early layers almost impossible to train.
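The vanishing-gradient effect is easy to quantify (an illustrative upper bound, not a full analysis): the sigmoid derivative peaks at 0.25, so a chain of n sigmoid layers multiplies the backpropagated gradient by at most 0.25^n:

```python
max_sigmoid_deriv = 0.25  # sigma'(z) = sigma(z) * (1 - sigma(z)) peaks at z = 0
for depth in (1, 5, 10, 20):
    print(depth, max_sigmoid_deriv ** depth)
# By depth 20 the bound is below 1e-12: early layers receive almost no signal
```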

Q1: What mathematical tool does backpropagation use to compute gradients?

  • A) Integration by parts
  • B) The chain rule of calculus
  • C) Taylor series expansion
  • D) Fourier transform

Answer: B) The chain rule of calculus

Backpropagation applies the chain rule to compute the gradient of the loss with respect to each weight by multiplying local gradients along the computation path.