About Gradient Descent

Gradient descent is the optimization algorithm that trains neural networks by iteratively adjusting parameters to minimize a loss function. At each step, it computes the gradient (the vector of partial derivatives) of the loss with respect to the parameters, which points in the direction of steepest ascent. Moving the parameters in the opposite direction (downhill) reduces the loss, with the learning rate controlling the step size. Common variants include SGD (stochastic gradient descent, which estimates the gradient from mini-batches), SGD with momentum (which accumulates a velocity term to push through plateaus), and Adam (which adapts the learning rate per parameter using first- and second-moment estimates). Visualized on a 2D loss landscape, gradient descent traces a trajectory that converges toward a minimum.

Complexity Analysis

Time Complexity
O(T × d), where T is the number of iterations and d is the number of parameters
Space Complexity
O(d)
Difficulty
intermediate

Key Concepts

Gradient as Direction of Steepest Ascent

The gradient ∇f(x) is a vector pointing in the direction where the function increases fastest. To minimize, we move in the opposite direction: x_new = x_old - η × ∇f(x_old).
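This update rule can be sketched in a few lines. The function, its gradient, and the hyperparameters below are illustrative choices (a convex bowl f(x) = Σ xᵢ², whose gradient is 2x), not a definitive implementation:

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly apply x_new = x_old - eta * grad f(x_old)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)  # step opposite the direction of steepest ascent
    return x

grad_f = lambda x: 2 * x                      # gradient of f(x) = sum(x_i^2)
x_min = gradient_descent(grad_f, [3.0, -4.0])  # converges toward the origin
```

Each iteration shrinks every coordinate by a factor of (1 - 2η), so with η = 0.1 the iterate decays geometrically toward the minimizer at the origin.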

Learning Rate

The learning rate η controls how large each step is: too large and the optimizer overshoots and may diverge; too small and convergence is slow. Choosing a good learning rate is one of the most important hyperparameter decisions in deep learning.
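The trade-off is easy to see on a 1D toy problem. On f(x) = x², one gradient step multiplies x by (1 - 2η), so the three rates below (all illustrative values) produce three qualitatively different behaviors:

```python
def step(x, lr):
    """One gradient step on f(x) = x^2, whose gradient is 2x."""
    return x - lr * 2 * x

x_small, x_good, x_big = 1.0, 1.0, 1.0
for _ in range(50):
    x_small = step(x_small, 0.001)  # factor 0.998: barely moves in 50 steps
    x_good  = step(x_good, 0.1)     # factor 0.8: converges quickly
    x_big   = step(x_big, 1.1)      # factor -1.2, |factor| > 1: diverges
```

After 50 steps the small rate has covered only a fraction of the distance, the well-chosen rate is essentially at the minimum, and the large rate has blown up by orders of magnitude.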

Momentum

Momentum accumulates a velocity vector from past gradients, helping the optimizer build speed in consistent directions and dampen oscillations. It is like a ball rolling downhill that carries momentum past small bumps.
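A minimal sketch of the classic heavy-ball formulation (v ← βv + ∇f(x); x ← x - ηv), with illustrative hyperparameters on the same quadratic bowl:

```python
import numpy as np

def momentum_descent(grad, x0, lr=0.05, beta=0.9, steps=300):
    """Gradient descent with momentum: velocity accumulates past gradients."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        v = beta * v + grad(x)  # exponentially decaying sum of past gradients
        x = x - lr * v          # step along the accumulated velocity
    return x

grad_f = lambda x: 2 * x                        # gradient of f(x) = sum(x_i^2)
x_min = momentum_descent(grad_f, [3.0, -4.0])
```

Because the velocity keeps pointing downhill even where individual gradients are small, momentum builds speed along consistent directions; the decay factor β plays the role of friction that damps oscillations.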

Adam Optimizer

Adam (Adaptive Moment Estimation) adapts the learning rate for each parameter individually using running averages of the first moment (mean) and second moment (variance) of gradients. This makes it robust across a wide range of problems.
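The moment tracking and bias correction can be sketched as follows. The default β values match the common convention, but the learning rate and test problem here are illustrative (a real training run would typically use a much smaller lr such as 0.001):

```python
import numpy as np

def adam(grad, x0, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=200):
    """Minimal Adam sketch: per-parameter step scaled by moment estimates."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)  # first moment: running mean of gradients
    v = np.zeros_like(x)  # second moment: running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)  # bias correction: moments start at zero
        v_hat = v / (1 - b2 ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

x_min = adam(lambda x: 2 * x, [3.0, -4.0])
```

Dividing by the square root of the second moment normalizes each parameter's step, so coordinates with consistently large gradients take proportionally smaller steps than coordinates with small ones.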

Common Pitfalls

Local minima and saddle points

Gradient descent can get stuck at local minima or saddle points where the gradient is zero. Momentum and Adam help escape these by maintaining velocity.
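The zero-gradient condition is easy to verify on a standard saddle, f(x, y) = x² - y² (an illustrative example, not from the original text): at the origin both partial derivatives vanish, so a plain gradient step makes no progress at all.

```python
# Gradient of f(x, y) = x^2 - y^2; the origin is a saddle point.
grad = lambda x, y: (2 * x, -2 * y)

g = grad(0.0, 0.0)  # exactly (0.0, 0.0): a pure gradient step moves nowhere
```

A momentum-based optimizer arriving at this point with nonzero velocity would coast through it, which is why momentum and Adam tend to handle saddles better than vanilla gradient descent.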

Learning rate too high causes divergence

If the learning rate is too large, the optimizer overshoots the minimum and the loss increases instead of decreasing, potentially diverging to infinity.
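The mechanism is multiplicative: on f(x) = x² with gradient 2x, each step scales x by (1 - 2η), and once |1 - 2η| > 1 the iterate grows without bound (the rate below is an illustrative bad choice):

```python
# With lr = 1.1 the per-step factor is 1 - 2*1.1 = -1.2, so |x| grows
# by 20% per iteration while flipping sign: overshoot, then divergence.
x = 1.0
for _ in range(30):
    x -= 1.1 * 2 * x
```

After 30 steps |x| has grown by a factor of 1.2³⁰, i.e. the loss has increased by many orders of magnitude instead of decreasing.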

Prerequisites

Understanding these algorithms first will help: