
Gradient Descent

Category: Deep Learning
Difficulty: Intermediate
Time Complexity: O(T × d)
Space Complexity: O(d)

Gradient descent is the optimization algorithm that trains neural networks by iteratively adjusting parameters to minimize a loss function. At each step, it computes the gradient (the vector of partial derivatives) of the loss with respect to the parameters, which points in the direction of steepest ascent; moving the parameters in the opposite direction (downhill) reduces the loss, with the learning rate controlling the step size. Variants include SGD (stochastic gradient descent, which estimates the gradient from mini-batches), SGD with momentum (which accumulates a velocity term to push through plateaus), and Adam (which adapts the learning rate per parameter using first- and second-moment estimates of the gradient). Visualized on a 2D loss landscape, gradient descent traces an optimization trajectory that converges toward a minimum; the presets below configure such a visualization.

{
  "startX": 3,
  "startY": 3,
  "learningRate": 0.1,
  "optimizer": "sgd",
  "numSteps": 20
}
{
  "startX": -3,
  "startY": 4,
  "learningRate": 0.05,
  "optimizer": "momentum",
  "numSteps": 30
}
{
  "startX": 4,
  "startY": -2,
  "learningRate": 0.1,
  "optimizer": "adam",
  "numSteps": 25
}
function gradientDescent(f, startParams, learningRate, numSteps):
    params = startParams
    trajectory = [params]
    for step in range(numSteps):
        grad = computeGradient(f, params)      // ∂f/∂params
        params = params - learningRate * grad  // step downhill
        trajectory.append(params)
    return trajectory
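The pseudocode above translates directly into Python. A minimal sketch, using a central-difference numerical gradient in place of the unspecified computeGradient (in practice, frameworks compute exact gradients via backpropagation):

```python
import numpy as np

def compute_gradient(f, params, eps=1e-6):
    """Approximate the gradient of f at params via central differences."""
    grad = np.zeros_like(params, dtype=float)
    for i in range(len(params)):
        step = np.zeros_like(params, dtype=float)
        step[i] = eps
        grad[i] = (f(params + step) - f(params - step)) / (2 * eps)
    return grad

def gradient_descent(f, start_params, learning_rate, num_steps):
    params = np.asarray(start_params, dtype=float)
    trajectory = [params.copy()]
    for _ in range(num_steps):
        grad = compute_gradient(f, params)
        params = params - learning_rate * grad  # step downhill
        trajectory.append(params.copy())
    return trajectory

# Minimize f(x, y) = x^2 + y^2 from (3, 3), matching the first preset above
f = lambda p: p[0] ** 2 + p[1] ** 2
traj = gradient_descent(f, [3.0, 3.0], learning_rate=0.1, num_steps=20)
```

After 20 steps each coordinate has shrunk by a factor of 0.8 per step, so the trajectory ends close to the minimum at the origin.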

The gradient ∇f(x) is a vector pointing in the direction where the function increases fastest. To minimize, we move in the opposite direction: x_new = x_old - η × ∇f(x_old).
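As a quick numeric check of the update rule (an illustrative example, not from the source): for f(x) = x² the gradient is 2x, so one step from x = 3 with η = 0.1 gives:

```python
x = 3.0
eta = 0.1
grad = 2 * x            # gradient of f(x) = x^2 at x = 3 is 6
x_new = x - eta * grad  # 3 - 0.1 * 6 = 2.4, closer to the minimum at 0
```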

The learning rate η controls how large each step is. Too large → overshooting and divergence. Too small → slow convergence. Finding the right learning rate is one of the most important hyperparameter choices in deep learning.
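All three regimes are easy to see on f(x) = x², where each update multiplies x by (1 − 2η). A sketch with illustrative step sizes:

```python
def run(eta, steps=20, x0=3.0):
    """Plain gradient descent on f(x) = x^2, whose gradient is 2x."""
    x = x0
    for _ in range(steps):
        x = x - eta * (2 * x)  # each step multiplies x by (1 - 2*eta)
    return x

too_small = run(0.01)  # factor 0.98 per step: barely moves in 20 steps
good = run(0.1)        # factor 0.8 per step: converges near 0
too_large = run(1.1)   # factor -1.2 per step: |x| grows, divergence
```

On this function any η in (0, 1) converges; η = 1.1 overshoots so badly that each step lands farther away than it started.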

Momentum accumulates a velocity vector from past gradients, helping the optimizer build speed in consistent directions and dampen oscillations. It is like a ball rolling downhill that carries momentum past small bumps.
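A sketch of the momentum update, using the momentum preset above (start (−3, 4), learning rate 0.05, 30 steps); the decay factor β = 0.9 is an assumed, conventional default not stated in the source:

```python
import numpy as np

def momentum_descent(grad_f, start, lr=0.05, beta=0.9, steps=30):
    """SGD with momentum: a velocity vector accumulates past gradients."""
    params = np.asarray(start, dtype=float)
    v = np.zeros_like(params)
    for _ in range(steps):
        v = beta * v - lr * grad_f(params)  # build speed in consistent directions
        params = params + v                 # move along the velocity
    return params

# f(x, y) = x^2 + y^2 has gradient (2x, 2y); preset start (-3, 4)
final = momentum_descent(lambda p: 2 * p, [-3.0, 4.0])
```

Like a rolling ball, the iterate overshoots the minimum and oscillates while the velocity decays, but it ends much closer to the origin than it started.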

Adam (Adaptive Moment Estimation) adapts the learning rate for each parameter individually using running averages of the first moment (mean) and second moment (variance) of gradients. This makes it robust across a wide range of problems.
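A sketch of the Adam update with bias correction, run against the Adam preset above (start (4, −2), learning rate 0.1, 25 steps); β₁ = 0.9, β₂ = 0.999, and ε = 1e-8 are the conventional defaults, assumed here rather than stated in the source:

```python
import numpy as np

def adam(grad_f, start, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=25):
    """Adam: per-parameter step sizes from running moment estimates."""
    params = np.asarray(start, dtype=float)
    m = np.zeros_like(params)  # first moment (running mean of gradients)
    v = np.zeros_like(params)  # second moment (running mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad_f(params)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction for zero initialization
        v_hat = v / (1 - beta2 ** t)
        params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params

# f(x, y) = x^2 + y^2 has gradient (2x, 2y); preset start (4, -2)
final = adam(lambda p: 2 * p, [4.0, -2.0])
```

Note the adaptive scaling: because each step is normalized by √v̂, Adam moves both coordinates at roughly the same speed (about lr per step) even though their gradients differ in magnitude.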

  • Local minima and saddle points: Gradient descent can get stuck at local minima or saddle points where the gradient is zero. Momentum and Adam help escape these by maintaining velocity.
  • Learning rate too high causes divergence: If the learning rate is too large, the optimizer overshoots the minimum and the loss increases instead of decreasing, potentially diverging to infinity.

Q1: In gradient descent, why do we subtract the gradient rather than add it?

  • A) To increase the function value
  • B) Because the gradient points uphill and we want to go downhill
  • C) To normalize the parameters
  • D) Because subtraction is faster than addition

Answer: B) Because the gradient points uphill and we want to go downhill

The gradient points in the direction of steepest ascent. Since we want to minimize the loss, we move in the opposite direction by subtracting.