Gradient Descent
Category: Deep Learning
Difficulty: Intermediate
Time Complexity: O(T × d)
Space Complexity: O(d)
Overview
Gradient descent is the optimization algorithm that trains neural networks by iteratively adjusting parameters to minimize a loss function. At each step, it computes the gradient (vector of partial derivatives) of the loss with respect to the parameters, which points in the direction of steepest ascent. Moving parameters in the opposite direction (downhill) reduces the loss. The learning rate controls step size. Variants include SGD (stochastic gradient descent, which uses mini-batches), SGD with momentum (which accumulates velocity to overcome plateaus and dampen oscillations), and Adam (which adapts the learning rate per parameter using first and second moment estimates). Gradient descent visualized on a 2D loss landscape shows the optimization trajectory converging toward a minimum.
Try It
- Web: Open in Eigenvue →
- Python:
import eigenvue
eigenvue.show("gradient-descent")
Default Inputs
{ "startX": 3, "startY": 3, "learningRate": 0.1, "optimizer": "sgd", "numSteps": 20 }

Input Examples

Default (SGD)

{ "startX": 3, "startY": 3, "learningRate": 0.1, "optimizer": "sgd", "numSteps": 20 }

Momentum optimizer

{ "startX": -3, "startY": 4, "learningRate": 0.05, "optimizer": "momentum", "numSteps": 30 }

Adam optimizer

{ "startX": 4, "startY": -2, "learningRate": 0.1, "optimizer": "adam", "numSteps": 25 }

Pseudocode
function gradientDescent(f, startParams, learningRate, numSteps):
    params = startParams
    trajectory = [params]
    for step in range(numSteps):
        grad = computeGradient(f, params)      // ∂f/∂params
        params = params - learningRate * grad  // step downhill
        trajectory.append(params)
    return trajectory

Key Concepts
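The pseudocode above can be sketched in runnable Python. The `compute_gradient` helper here is an illustrative finite-difference approximation (an assumption for this sketch, not part of the visualizer's API); in practice a framework would supply gradients via automatic differentiation.

```python
import numpy as np

def compute_gradient(f, params, eps=1e-6):
    """Central-difference approximation of ∂f/∂params (illustrative only)."""
    grad = np.zeros_like(params, dtype=float)
    for i in range(len(params)):
        e = np.zeros_like(params, dtype=float)
        e[i] = eps
        grad[i] = (f(params + e) - f(params - e)) / (2 * eps)
    return grad

def gradient_descent(f, start_params, learning_rate, num_steps):
    params = np.asarray(start_params, dtype=float)
    trajectory = [params.copy()]
    for _ in range(num_steps):
        grad = compute_gradient(f, params)       # ∂f/∂params
        params = params - learning_rate * grad   # step downhill
        trajectory.append(params.copy())
    return trajectory

# Minimize f(x, y) = x² + y² from the default start (3, 3).
bowl = lambda p: p[0] ** 2 + p[1] ** 2
path = gradient_descent(bowl, [3.0, 3.0], learning_rate=0.1, num_steps=20)
```

With the default inputs, the loss along `path` decreases monotonically toward the minimum at the origin.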
Gradient as Direction of Steepest Ascent
The gradient ∇f(x) is a vector pointing in the direction where the function increases fastest. To minimize, we move in the opposite direction: x_new = x_old - η × ∇f(x_old).
Learning Rate
The learning rate η controls how large each step is. Too large → overshooting and divergence. Too small → slow convergence. Finding the right learning rate is one of the most important hyperparameter choices in deep learning.
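The overshoot-versus-converge behavior can be seen on the simplest case, f(x) = x² with gradient 2x: each update multiplies x by (1 - 2η), so convergence requires |1 - 2η| < 1. A minimal sketch:

```python
def descend(x0, lr, steps):
    """Gradient descent on f(x) = x**2, whose gradient is 2*x."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x  # each step multiplies x by (1 - 2*lr)
    return x

small = descend(3.0, lr=0.1, steps=20)  # |1 - 0.2| = 0.8 < 1: converges
large = descend(3.0, lr=1.1, steps=20)  # |1 - 2.2| = 1.2 > 1: diverges
```

The first run shrinks toward the minimum at 0; the second oscillates with growing amplitude and ends farther from the minimum than it started.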
Momentum
Momentum accumulates a velocity vector from past gradients, helping the optimizer build speed in consistent directions and dampen oscillations. It is like a ball rolling downhill that carries momentum past small bumps.
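A minimal sketch of the momentum update, using the common formulation in which the velocity is a decaying sum of past gradients (parameter names here are illustrative):

```python
def momentum_descent(grad_f, x0, lr=0.05, mu=0.9, steps=100):
    """SGD with momentum on a 1-D function: velocity accumulates past gradients."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = mu * v + grad_f(x)  # decaying accumulation of gradients
        x = x - lr * v          # step along the smoothed velocity
    return x

# Minimize f(x) = x**2 (gradient 2*x) starting from x = 4.
x_final = momentum_descent(lambda x: 2 * x, 4.0)
```

Because the velocity term keeps some of its previous direction (mu = 0.9 here), the trajectory overshoots and oscillates before settling near the minimum, rather than shrinking monotonically as plain gradient descent does.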
Adam Optimizer
Adam (Adaptive Moment Estimation) adapts the learning rate for each parameter individually using running averages of the first moment (mean) and second moment (uncentered variance) of gradients. This makes it robust across a wide range of problems.
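A 1-D sketch of the Adam update with its standard default coefficients (β₁ = 0.9, β₂ = 0.999), including the bias-correction terms that compensate for the moment estimates starting at zero:

```python
import math

def adam(grad_f, x0, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=200):
    """Adam on a 1-D function: step size scaled by moment estimates."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_f(x)
        m = b1 * m + (1 - b1) * g      # first moment (running mean)
        v = b2 * v + (1 - b2) * g * g  # second moment (running uncentered variance)
        m_hat = m / (1 - b1 ** t)      # bias correction for zero initialization
        v_hat = v / (1 - b2 ** t)
        x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Minimize f(x) = x**2 (gradient 2*x) starting from x = 4.
x_final = adam(lambda x: 2 * x, 4.0)
```

Note that the effective step is roughly lr × (mean gradient) / (gradient scale), so early steps have magnitude close to lr regardless of how large the raw gradient is; this per-parameter rescaling is what makes Adam robust to poorly scaled problems.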
Common Pitfalls
- Local minima and saddle points: Gradient descent can get stuck at local minima or saddle points where the gradient is zero. Momentum and Adam help escape these by maintaining velocity.
- Learning rate too high causes divergence: If the learning rate is too large, the optimizer overshoots the minimum and the loss increases instead of decreasing, potentially diverging to infinity.
Q1: In gradient descent, why do we subtract the gradient rather than add it?
- A) To increase the function value
- B) Because the gradient points uphill and we want to go downhill
- C) To normalize the parameters
- D) Because subtraction is faster than addition
Show answer
Answer: B) Because the gradient points uphill and we want to go downhill
The gradient points in the direction of steepest ascent. Since we want to minimize the loss, we move in the opposite direction by subtracting.
Further Reading
- Adam: A Method for Stochastic Optimization (Kingma & Ba, 2015) (paper)
- Gradient Descent — Wikipedia (article)