
Gradient Descent

Category: Deep Learning
Difficulty: Intermediate
Time Complexity: O(T × d)
Space Complexity: O(d)

Gradient descent is the optimization algorithm that trains neural networks by iteratively adjusting parameters to minimize a loss function. At each step, it computes the gradient (the vector of partial derivatives) of the loss with respect to the parameters, which points in the direction of steepest ascent; moving the parameters in the opposite direction (downhill) reduces the loss, with the learning rate controlling the step size. Variants include SGD (stochastic gradient descent, which estimates the gradient from mini-batches), SGD with momentum (which accumulates a velocity term to push through plateaus), and Adam (which adapts the learning rate per parameter using first- and second-moment estimates of the gradient). Visualized on a 2D loss landscape, gradient descent traces an optimization trajectory that converges toward a minimum; the presets below configure such a visualization.

{
  "startX": 3,
  "startY": 3,
  "learningRate": 0.1,
  "optimizer": "sgd",
  "numSteps": 20
}
{
  "startX": -3,
  "startY": 4,
  "learningRate": 0.05,
  "optimizer": "momentum",
  "numSteps": 30
}
{
  "startX": 4,
  "startY": -2,
  "learningRate": 0.1,
  "optimizer": "adam",
  "numSteps": 25
}
function gradientDescent(f, startParams, learningRate, numSteps):
    params = startParams
    trajectory = [params]
    for step in range(numSteps):
        grad = computeGradient(f, params)      // ∂f/∂params
        params = params - learningRate * grad  // step downhill
        trajectory.append(params)
    return trajectory
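The pseudocode above translates directly into Python. A minimal sketch, using a central-difference numerical gradient in place of the unspecified computeGradient (in practice, frameworks compute exact gradients via backpropagation):

```python
import numpy as np

def compute_gradient(f, params, eps=1e-6):
    """Approximate the gradient of f at params via central differences."""
    grad = np.zeros_like(params, dtype=float)
    for i in range(len(params)):
        step = np.zeros_like(params, dtype=float)
        step[i] = eps
        grad[i] = (f(params + step) - f(params - step)) / (2 * eps)
    return grad

def gradient_descent(f, start_params, learning_rate, num_steps):
    params = np.asarray(start_params, dtype=float)
    trajectory = [params.copy()]
    for _ in range(num_steps):
        grad = compute_gradient(f, params)
        params = params - learning_rate * grad  # step downhill
        trajectory.append(params.copy())
    return trajectory

# Minimize f(x, y) = x^2 + y^2 from (3, 3), matching the first preset above
f = lambda p: p[0] ** 2 + p[1] ** 2
traj = gradient_descent(f, [3.0, 3.0], learning_rate=0.1, num_steps=20)
```

After 20 steps each coordinate has shrunk by a factor of 0.8 per step, so the trajectory ends close to the minimum at the origin.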

The gradient ∇f(x) is a vector pointing in the direction where the function increases fastest. To minimize, we move in the opposite direction: x_new = x_old - η × ∇f(x_old).
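As a quick numeric check of the update rule (an illustrative example, not from the source): for f(x) = x² the gradient is 2x, so one step from x = 3 with η = 0.1 gives:

```python
x = 3.0
eta = 0.1
grad = 2 * x            # gradient of f(x) = x^2 at x = 3 is 6
x_new = x - eta * grad  # 3 - 0.1 * 6 = 2.4, closer to the minimum at 0
```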

The learning rate η controls how large each step is. Too large → overshooting and divergence. Too small → slow convergence. Finding the right learning rate is one of the most important hyperparameter choices in deep learning.
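All three regimes are easy to see on f(x) = x², where each update multiplies x by (1 − 2η). A sketch with illustrative step sizes:

```python
def run(eta, steps=20, x0=3.0):
    """Plain gradient descent on f(x) = x^2, whose gradient is 2x."""
    x = x0
    for _ in range(steps):
        x = x - eta * (2 * x)  # each step multiplies x by (1 - 2*eta)
    return x

too_small = run(0.01)  # factor 0.98 per step: barely moves in 20 steps
good = run(0.1)        # factor 0.8 per step: converges near 0
too_large = run(1.1)   # factor -1.2 per step: |x| grows, divergence
```

On this function any η in (0, 1) converges; η = 1.1 overshoots so badly that each step lands farther away than it started.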

Momentum accumulates a velocity vector from past gradients, helping the optimizer build speed in consistent directions and dampen oscillations. It is like a ball rolling downhill that carries momentum past small bumps.
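A sketch of the momentum update, using the momentum preset above (start (−3, 4), learning rate 0.05, 30 steps); the decay factor β = 0.9 is an assumed, conventional default not stated in the source:

```python
import numpy as np

def momentum_descent(grad_f, start, lr=0.05, beta=0.9, steps=30):
    """SGD with momentum: a velocity vector accumulates past gradients."""
    params = np.asarray(start, dtype=float)
    v = np.zeros_like(params)
    for _ in range(steps):
        v = beta * v - lr * grad_f(params)  # build speed in consistent directions
        params = params + v                 # move along the velocity
    return params

# f(x, y) = x^2 + y^2 has gradient (2x, 2y); preset start (-3, 4)
final = momentum_descent(lambda p: 2 * p, [-3.0, 4.0])
```

Like a rolling ball, the iterate overshoots the minimum and oscillates while the velocity decays, but it ends much closer to the origin than it started.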

Adam (Adaptive Moment Estimation) adapts the learning rate for each parameter individually using running averages of the first moment (mean) and second moment (variance) of gradients. This makes it robust across a wide range of problems.
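A sketch of the Adam update with bias correction, run against the Adam preset above (start (4, −2), learning rate 0.1, 25 steps); β₁ = 0.9, β₂ = 0.999, and ε = 1e-8 are the conventional defaults, assumed here rather than stated in the source:

```python
import numpy as np

def adam(grad_f, start, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=25):
    """Adam: per-parameter step sizes from running moment estimates."""
    params = np.asarray(start, dtype=float)
    m = np.zeros_like(params)  # first moment (running mean of gradients)
    v = np.zeros_like(params)  # second moment (running mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad_f(params)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction for zero initialization
        v_hat = v / (1 - beta2 ** t)
        params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params

# f(x, y) = x^2 + y^2 has gradient (2x, 2y); preset start (4, -2)
final = adam(lambda p: 2 * p, [4.0, -2.0])
```

Note the adaptive scaling: because each step is normalized by √v̂, Adam moves both coordinates at roughly the same speed (about lr per step) even though their gradients differ in magnitude.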

  • Local minima and saddle points: Gradient descent can get stuck at local minima or saddle points where the gradient is zero. Momentum and Adam help escape these by maintaining velocity.
  • Learning rate too high causes divergence: If the learning rate is too large, the optimizer overshoots the minimum and the loss increases instead of decreasing, potentially diverging to infinity.

Q1: In gradient descent, why do we subtract the gradient rather than add it?

  • A) To increase the function value
  • B) Because the gradient points uphill and we want to go downhill
  • C) To normalize the parameters
  • D) Because subtraction is faster than addition

Answer: B) Because the gradient points uphill and we want to go downhill

The gradient points in the direction of steepest ascent. Since we want to minimize the loss, we move in the opposite direction by subtracting.