Backpropagation
Category: Deep Learning
Difficulty: Intermediate
Time Complexity: O(Σ n_l × n_{l+1})
Space Complexity: O(Σ n_l × n_{l+1})
Overview
Backpropagation (backward propagation of errors) is the algorithm that makes neural network training possible. After a forward pass computes the network's prediction, backpropagation computes the gradient of the loss function with respect to every weight in the network by applying the chain rule of calculus layer by layer, from output back to input. These gradients tell each weight how to adjust to reduce the loss. Combined with an optimizer like gradient descent, backpropagation enables the network to learn from data. Popularized by Rumelhart, Hinton, and Williams in 1986, backpropagation remains the foundation of nearly all deep learning training.
Try It
- Web: Open in Eigenvue →
- Python:
import eigenvue

eigenvue.show("backpropagation")
Default Inputs
{
  "layerSizes": [2, 2, 1],
  "inputValues": [0.5, 0.8],
  "targets": [1],
  "activationFunction": "sigmoid",
  "lossFunction": "mse",
  "learningRate": 0.1,
  "seed": "backprop-default"
}
Input Examples
Default (2-2-1 network)
{
  "layerSizes": [2, 2, 1],
  "inputValues": [0.5, 0.8],
  "targets": [1],
  "activationFunction": "sigmoid",
  "lossFunction": "mse",
  "learningRate": 0.1,
  "seed": "backprop-default"
}
Higher learning rate
{
  "layerSizes": [2, 2, 1],
  "inputValues": [0.5, 0.8],
  "targets": [1],
  "activationFunction": "sigmoid",
  "lossFunction": "mse",
  "learningRate": 0.5,
  "seed": "backprop-default"
}
Pseudocode
function backpropagation(network, input, target, learningRate):
    // Forward pass
    activations = forwardPass(network, input)
    loss = computeLoss(activations[-1], target)

    // Backward pass: compute gradients
    delta = lossDeriv(activations[-1], target) * activDeriv(z[-1])
    for layer in reversed(range(num_layers - 1)):
        weightGrad = delta × activations[layer].T
        biasGrad = delta
        nextDelta = (weights[layer].T × delta) * activDeriv(z[layer - 1])

        // Update weights, then propagate delta to the previous layer
        weights[layer] -= learningRate * weightGrad
        biases[layer] -= learningRate * biasGrad
        delta = nextDelta
Key Concepts
Chain Rule of Calculus
Backpropagation is an efficient application of the chain rule. To find how a weight in an early layer affects the loss, we multiply the local gradients along the path from the loss back to that weight: ∂Loss/∂w = ∂Loss/∂a × ∂a/∂z × ∂z/∂w.
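As a numeric sanity check, here is the chain-rule product for a single sigmoid neuron with squared-error loss, verified against a finite-difference estimate. This is a minimal sketch with made-up values, independent of the visualization's seeded weights:

```python
import math

# Single neuron: z = w*x + b, a = sigmoid(z), Loss = (a - t)^2
w, b, x, t = 0.6, 0.0, 0.5, 1.0

z = w * x + b
a = 1.0 / (1.0 + math.exp(-z))

dLoss_da = 2.0 * (a - t)          # ∂Loss/∂a
da_dz = a * (1.0 - a)             # ∂a/∂z (sigmoid derivative)
dz_dw = x                         # ∂z/∂w
grad = dLoss_da * da_dz * dz_dw   # chain rule: ∂Loss/∂w

# Verify against a central finite-difference approximation
eps = 1e-6
def loss(w_):
    a_ = 1.0 / (1.0 + math.exp(-(w_ * x + b)))
    return (a_ - t) ** 2

numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)
assert abs(grad - numeric) < 1e-8
```

The gradient is negative here because the output (about 0.57) is below the target 1, so increasing the weight reduces the loss.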
Gradient Descent
Gradients indicate the direction of steepest increase. To minimize the loss, we move weights in the opposite direction of the gradient: w_new = w_old - η × ∂Loss/∂w, where η is the learning rate.
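A toy illustration of the update rule, on a hypothetical one-dimensional loss f(w) = (w - 3)^2 rather than the network's loss, also shows why the choice of η matters:

```python
# Gradient descent on f(w) = (w - 3)^2, whose gradient is 2*(w - 3).
def descend(eta, steps=50, w=0.0):
    for _ in range(steps):
        w -= eta * 2.0 * (w - 3.0)  # w_new = w_old - eta * f'(w_old)
    return w

w_good = descend(eta=0.1)  # converges toward the minimizer w = 3
w_bad = descend(eta=1.1)   # overshoots: |w - 3| grows every step

assert abs(w_good - 3.0) < 1e-3
assert abs(w_bad - 3.0) > 1e3
```

With η = 0.1 the error shrinks by a factor of 0.8 per step; with η = 1.1 it grows by a factor of 1.2 per step and the iterates diverge, which is the same failure mode as the learning-rate pitfall below.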
Forward Then Backward
Training requires two passes: a forward pass to compute predictions and loss, then a backward pass to compute gradients. Both passes have the same asymptotic cost, O(number of weights).
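The two passes can be sketched end to end in NumPy for the default 2-2-1 sigmoid/MSE configuration above. This is a minimal sketch: the weights are randomly initialized here, not Eigenvue's seeded values, and a finite-difference check on one weight confirms the analytic gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [2, 2, 1]
W = [rng.normal(0.0, 0.5, (sizes[i + 1], sizes[i])) for i in range(2)]
b = [np.zeros((sizes[i + 1], 1)) for i in range(2)]

x = np.array([[0.5], [0.8]])  # inputValues from the default config
t = np.array([[1.0]])         # target

def forward(W, b, x):
    a, acts = x, [x]
    for Wl, bl in zip(W, b):
        a = sigmoid(Wl @ a + bl)
        acts.append(a)
    return acts

acts = forward(W, b, x)
loss = float(np.mean((acts[-1] - t) ** 2))

# Backward pass: output delta = dLoss/da * sigmoid'(z), then walk layers
# in reverse, accumulating gradients and propagating delta.
delta = 2.0 * (acts[-1] - t) * acts[-1] * (1.0 - acts[-1])
gradW = [None, None]
gradb = [None, None]
for l in reversed(range(2)):
    gradW[l] = delta @ acts[l].T
    gradb[l] = delta
    if l > 0:
        delta = (W[l].T @ delta) * acts[l] * (1.0 - acts[l])

# Finite-difference check on W[0][0, 0]
eps = 1e-6
W_plus = [w.copy() for w in W]; W_plus[0][0, 0] += eps
W_minus = [w.copy() for w in W]; W_minus[0][0, 0] -= eps
lp = float(np.mean((forward(W_plus, b, x)[-1] - t) ** 2))
lm = float(np.mean((forward(W_minus, b, x)[-1] - t) ** 2))
numeric = (lp - lm) / (2 * eps)
assert abs(gradW[0][0, 0] - numeric) < 1e-7
```

A training step would then subtract learningRate times each gradient from the corresponding weight matrix, exactly as in the pseudocode.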
Common Pitfalls
- Learning rate too large: weight updates overshoot the minimum and can diverge; too small, and convergence is painfully slow.
- Vanishing gradients in deep networks: When using sigmoid or tanh activations, gradients shrink exponentially as they propagate backward through many layers, making early layers almost impossible to train.
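The shrinkage is easy to quantify in a small sketch: sigmoid's derivative a(1 - a) peaks at 0.25, so each extra sigmoid layer multiplies the backpropagated signal by at most 0.25 (and usually much less):

```python
import numpy as np

# Sigmoid's derivative, expressed in terms of the activation a, is a*(1 - a),
# which is maximized at a = 0.5 with value 0.25.
a = np.linspace(0.0, 1.0, 1001)
max_deriv = float(np.max(a * (1.0 - a)))
assert abs(max_deriv - 0.25) < 1e-9

# Even in the best case, the gradient factor through 20 sigmoid layers:
factor = 0.25 ** 20
assert factor < 1e-12  # the signal reaching early layers is vanishingly small
```

This geometric decay is why deep networks typically prefer activations like ReLU, whose derivative is 1 over half its domain.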
Q1: What mathematical tool does backpropagation use to compute gradients?
- A) Integration by parts
- B) The chain rule of calculus
- C) Taylor series expansion
- D) Fourier transform
Answer: B) The chain rule of calculus
Backpropagation applies the chain rule to compute the gradient of the loss with respect to each weight by multiplying local gradients along the computation path.
Further Reading
- Learning Representations by Back-Propagating Errors (Rumelhart et al., 1986) (paper)
- Backpropagation — 3Blue1Brown (video)