Understanding Gradient Descent
An introduction to gradient descent and its applications in machine learning and deep learning.
1. Introduction
Gradient descent (GD) is an iterative first-order optimisation algorithm used to find a local minimum of a given function (or a local maximum, when run as gradient ascent). It is commonly used in machine learning (ML) and deep learning (DL) to minimise a cost/loss function.
The method was proposed by Augustin-Louis Cauchy in 1847, long before the era of modern computers. Since then, developments in computer science and numerical methods have produced many improved variants of gradient descent; in this article, however, we focus on the basic ("vanilla") version.
2. What is a Cost/Loss Function?
In the context of machine learning, a cost function (also known as a loss function or objective function) is a mathematical formula used to measure the performance of a machine learning model.
The value of the cost function quantifies how well the model’s predictions align with the actual data. The primary goal in training a machine learning model is to minimize this cost function by adjusting the model’s parameters (like weights in a neural network).
Example: consider the hypothesis and the cost function for univariate linear regression (a linear equation in two variables), as presented in the Coursera machine learning course.
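In that setting the hypothesis and the mean-squared-error cost function take the standard form:

h_θ(x) = θ₀ + θ₁x
J(θ₀, θ₁) = (1 / 2m) * Σᵢ (h_θ(xᵢ) − yᵢ)²

where m is the number of training examples and (xᵢ, yᵢ) is the i-th example. Training the model means finding the values of θ₀ and θ₁ that minimise J.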
3. Requirements of a Cost Function
Differentiability
A cost function should ideally be differentiable. This means it has a derivative at each point in its domain. Gradient descent relies on computing these derivatives.
Typical sources of non-differentiability are jumps (step discontinuities), cusps, and other points where the function has no well-defined slope.
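For example, f(x) = |x| has a cusp at x = 0: the slope is −1 immediately to the left and +1 immediately to the right, so f'(0) does not exist and gradient descent has no well-defined direction at that point.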
Convexity
In some ML scenarios (e.g., linear/logistic regression), it's beneficial if the cost function is convex. A convex cost function has no spurious local minima, so gradient descent with a suitable learning rate converges towards the global minimum rather than getting trapped in a local one.
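For a twice-differentiable function of one variable, a standard check is the second derivative: f is convex if f''(x) ≥ 0 everywhere. For example, f(x) = x² is convex (f''(x) = 2 > 0), so its only minimum is the global one at x = 0, while f(x) = x⁴ − 2x² is non-convex and has two separate minima at x = ±1.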
4. What Does Gradient Mean?
A gradient refers to a vector that points in the direction of the greatest rate of increase of a function.
One-Dimensional: The Derivative
The derivative of a function measures how its output changes with respect to its input.
// Example: f(x) = x²
// Derivative: f'(x) = 2x
// At x = 3, f'(3) = 6

Multiple Dimensions: The Gradient
The gradient generalizes the derivative to multiple dimensions. It is a vector of partial derivatives.
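Formally, for a function of n variables:

∇f(x₁, …, xₙ) = [∂f/∂x₁, …, ∂f/∂xₙ]

For example, f(x, y) = x² + y² has the gradient ∇f(x, y) = [2x, 2y]; at the point (1, 2) this is [2, 4], a vector pointing directly away from the minimum at the origin, which is why gradient descent moves in the opposite direction.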
5. What Does Gradient Descent Mean?
Gradient Descent minimizes the cost function by moving in the direction of the negative gradient.
Update Rule:
θ = θ - η * ∇f(θ)

where:
- θ – parameters
- η – learning rate
- ∇f(θ) – gradient at current parameters
Stopping Criteria:
- Fixed number of iterations
- Minimal change in objective function
- Gradient magnitude below threshold
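Putting these together, here is a minimal Python sketch of the vanilla loop, combining the update rule with two of the stopping criteria above (the function and variable names, and the quadratic example at the end, are purely illustrative):

import numpy as np

def gradient_descent(grad_f, theta0, learning_rate=0.1, max_iters=1000, tol=1e-6):
    """Vanilla gradient descent.

    grad_f        -- callable returning the gradient ∇f(θ) at the given parameters
    theta0        -- initial parameter vector
    learning_rate -- step size η
    max_iters     -- stopping criterion: fixed number of iterations
    tol           -- stopping criterion: gradient magnitude below threshold
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        grad = grad_f(theta)
        if np.linalg.norm(grad) < tol:        # gradient (almost) zero -> stop early
            break
        theta = theta - learning_rate * grad  # update rule: θ = θ - η * ∇f(θ)
    return theta

# Example: minimise f(θ) = θ₀² + θ₁², whose gradient is ∇f(θ) = 2θ.
print(gradient_descent(lambda theta: 2 * theta, theta0=[3.0, -4.0]))  # values very close to [0, 0]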
Contour plots help visualize this in 2D, where concentric rings indicate equal values of the cost function.
6. Types of Gradient Descent Algorithms
Batch Gradient Descent
- Computes the gradient over the entire dataset
- Accurate but slow for large datasets
Stochastic Gradient Descent (SGD)
- Uses one sample at a time
- Fast but noisy
Mini-batch Gradient Descent
- A compromise between batch and SGD
- Commonly used in practice
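A sketch of the difference in code (assuming, purely for illustration, a linear-regression setup with a feature matrix X, a target vector y, parameters theta, and a mean-squared-error cost; none of these names come from the article):

import numpy as np

rng = np.random.default_rng(0)

def mse_gradient(theta, X, y):
    """Gradient of the mean-squared-error cost over the given samples."""
    return X.T @ (X @ theta - y) / len(y)

def gradient_step(theta, X, y, lr, batch_size=None):
    """One parameter update.

    batch_size=None -> batch GD (whole dataset)
    batch_size=1    -> stochastic GD (one random sample)
    batch_size=k    -> mini-batch GD (k random samples)
    """
    if batch_size is None:
        Xb, yb = X, y                                   # exact but expensive gradient
    else:
        idx = rng.choice(len(y), size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]                         # cheap but noisy estimate
    return theta - lr * mse_gradient(theta, Xb, yb)

The only thing that changes between the three variants is how many rows of the dataset are used to estimate the gradient before each update.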
7. Alpha — The Learning Rate
The learning rate α (written η in the update rule earlier) controls the step size of each update.
θ_new = θ_old - α * ∇f(θ_old)

Effects:
- Too Large: May overshoot or diverge
- Too Small: Very slow convergence
- Ideal: Balanced and efficient
Often selected via empirical testing, grid search, or learning rate schedules.
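A tiny illustrative experiment (assumed here, not taken from the article): minimising f(x) = x² from x = 1 with 20 updates shows all three regimes.

# f(x) = x², f'(x) = 2x; each update is x = x - α * f'(x) = (1 - 2α) * x
for alpha in (1.1, 0.01, 0.4):
    x = 1.0
    for _ in range(20):
        x -= alpha * 2 * x
    print(f"alpha = {alpha:>4}: x = {x:.3g}")

# alpha = 1.1  -> x ≈ 38      (diverges: |1 - 2α| > 1)
# alpha = 0.01 -> x ≈ 0.67    (converges, but very slowly)
# alpha = 0.4  -> x ≈ 1e-14   (converges quickly)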
8. Challenges of Gradient Descent Algorithm
1. Learning Rate Sensitivity
Choosing the correct learning rate is critical for stable and efficient convergence.
2. Local Minima & Saddle Points
- Non-convex functions can trap gradient descent in local minima.
- Saddle points are points where the gradient is zero but that are neither local minima nor local maxima, so plain gradient descent can stall near them.
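For example, f(x, y) = x² − y² has a saddle point at the origin: the gradient ∇f = [2x, −2y] is zero there, yet the origin is a minimum along the x-axis and a maximum along the y-axis, so gradient descent can slow to a crawl nearby without being at a true minimum.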
3. Initial Point Dependence
Different starting points can lead to different outcomes or speeds of convergence.
4. High Dimensionality
- Curse of dimensionality: the size of the parameter space grows exponentially with the number of parameters
- Gradient computation becomes more expensive
5. Plateaus and Flat Regions
Very small gradients slow down convergence in flat areas of the loss surface.
6. Noisy Gradients & Non-Smooth Objectives
- SGD introduces variance in updates
- Sharp points in functions make gradients undefined
7. Vanishing & Exploding Gradients
- Common in deep neural networks
- Makes training unstable or stagnant
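As a rough back-of-the-envelope illustration: backpropagation through L layers multiplies roughly L per-layer derivative factors together, so if a typical factor has magnitude c, the gradient reaching the earliest layers scales like c^L. With L = 20, c = 0.5 gives about 10⁻⁶ (vanishing), while c = 1.5 gives about 3,300 (exploding).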
Gradient descent is the cornerstone of most machine learning algorithms, but understanding its behavior, limitations, and improvements is essential for applying it effectively.