Gradient Descent in Machine Learning - Overview
- What is Gradient Descent?
- How It Works
- Types of Gradient Descent
- Summary Comparison Table
- Gradient Descent Used in Transformer Architecture
- Challenges and Solutions
- Challenges
- Local Minimum
- Saddle Point
- Summary
Gradient descent is one of the most important algorithms in machine learning and optimization. It is used in supervised learning to minimize errors (the loss function) and improve performance.
This article comes primarily from ChatGPT, with some modifications on my part. I hope to make it more personal in the future.
What is Gradient Descent?
Gradient descent is an optimization algorithm. It's used to find the minimum of a function, typically a cost function or loss function in machine learning. The goal is to adjust the model's parameters (like weights in a neural network) so that the predictions get closer to the actual values. This adjustment is performed through the update rule.
Think of it like hiking down a hill in the fog. You can’t see the bottom, but you can feel the slope beneath your feet and step in the direction that goes downward. That’s exactly what gradient descent does: it follows the slope of the cost function to reach the lowest point.
Cost & loss function
The cost (or loss) function measures the difference, or error, between the actual y and the predicted y at the model's current position.
The model uses it to adjust the parameters, minimizing the error to find the local or global minimum. It iterates continuously, moving along the direction of steepest descent (the negative gradient) until the cost function is close to or at zero. At this point, the model stops learning.
There is a slight difference between the cost function and the loss function:
- The loss function refers to the error of one training example.
- The cost function calculates the average error across an entire training set.

Reference: IBM - What is gradient descent?
Loss Function (error quantification)
The loss function measures how far off a single prediction is from the actual result. It quantifies the “error” for one example.
Example: In regression tasks, a common loss function is Mean Squared Error (MSE) for one data point: \(\begin{aligned} \text{Loss} = (y_{\text{true}} - y_{\text{pred}})^2 \end{aligned}\)
Cost Function (holistic view of the model’s performance)
The cost function is the average of the loss function over the entire dataset. It gives a holistic view of the model’s performance.
If you have m examples:
\(\begin{aligned}
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (y^{(i)}_{\text{true}} - y^{(i)}_{\text{pred}})^2
\end{aligned}\)
In short:
- Loss = error for one example
- Cost = average loss over all examples
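To make the distinction concrete, here is a minimal Python sketch computing the loss for one example and the cost over a small dataset (the numbers are invented for illustration):

```python
import numpy as np

# Made-up toy data: true targets and model predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Loss: squared error for a single example
loss_first = (y_true[0] - y_pred[0]) ** 2   # error of one training example

# Cost J(theta): the average of the per-example losses over all m examples
cost = np.mean((y_true - y_pred) ** 2)      # mean squared error

print(f"loss on first example: {loss_first:.3f}")
print(f"cost over the dataset: {cost:.3f}")
```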
How It Works
Here’s a simplified breakdown:
- Initialize parameters (randomly or with some strategy).
- Compute the gradient — the vector of partial derivatives that points in the direction of steepest increase.
- Update the parameters by moving them in the opposite direction of the gradient.
- Repeat this process until the function converges (i.e., further updates no longer significantly reduce the cost).
The Update Rule
For the parameter vector θ, the update rule looks like this: \(\begin{aligned} \theta = \theta - \alpha \cdot \nabla J(\theta) \end{aligned}\)
Where:
- θ is the parameter vector.
- α is the learning rate — a small constant that controls the step size.
- ∇J(θ) is the gradient of the cost function with respect to θ.
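As a rough illustration, here is a short Python sketch applying this update rule to the convex function J(θ) = (θ − 3)², whose gradient is 2(θ − 3); the function and all constants are chosen arbitrarily for the example:

```python
# Minimize J(theta) = (theta - 3)^2 with plain gradient descent.
# Its gradient is dJ/dtheta = 2 * (theta - 3), so the minimum sits at theta = 3.

theta = 0.0        # initial parameter (could also be random)
alpha = 0.1        # learning rate

for step in range(100):
    grad = 2 * (theta - 3)          # gradient of the cost at the current theta
    theta = theta - alpha * grad    # step in the opposite direction of the gradient
    if abs(grad) < 1e-6:            # converged: updates barely change theta anymore
        break

print(f"theta ≈ {theta:.6f} after {step + 1} steps")
```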
Types of Gradient Descent
There are three types of gradient descent learning algorithms: batch gradient descent, stochastic gradient descent and mini-batch gradient descent.
- Batch Gradient Descent: Uses the entire dataset to compute the gradient. It’s accurate but can be slow for large datasets.
- Stochastic Gradient Descent (SGD): Updates parameters using one training example at a time. It’s faster but more erratic.
- Mini-Batch Gradient Descent: A compromise that uses a small, random subset of the data. It’s widely used in practice.
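The sketch below shows how the three variants differ only in how much data feeds each gradient computation; the toy linear-regression data, the seed, and the batch size b = 32 are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                                   # toy features
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=1000)

def gradient(theta, X_part, y_part):
    """Gradient of the MSE cost computed on the given slice of data."""
    error = X_part @ theta - y_part
    return 2 * X_part.T @ error / len(y_part)

theta = np.zeros(5)
alpha = 0.05

# Batch: the full dataset for every update
theta -= alpha * gradient(theta, X, y)

# Stochastic: one randomly chosen example per update
i = rng.integers(len(y))
theta -= alpha * gradient(theta, X[i:i+1], y[i:i+1])

# Mini-batch: a small random subset (here b = 32) per update
idx = rng.choice(len(y), size=32, replace=False)
theta -= alpha * gradient(theta, X[idx], y[idx])
```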
Summary Comparison Table
| Type | Data per Update | Computational Efficiency | Speed on Large Datasets | Convergence Stability | Memory Usage | Use Case |
|---|---|---|---|---|---|---|
| Batch Gradient Descent | All data | Good (vectorized over the full batch) | Slow (must hold all of the data in memory) | High | High | Small datasets |
| Stochastic Gradient Descent (SGD) | 1 sample | Poor (updates parameters one training example at a time) | Fast (easy to fit in memory) | Low (noisy) | Low | Online learning, huge data |
| Mini-Batch Gradient Descent | Small subset | Between batch and stochastic | Between batch and stochastic | Medium-High | Medium | Deep learning, practical use |
Gradient Descent Used in Transformer Architecture
Transformers — like BERT, GPT, T5, etc. — use gradient descent (specifically variants of it like Adam) for training.
How?
- Each transformer model has millions (or even billions) of parameters.
- During training, we:
- Pass input sequences through the model.
- Calculate the loss between the predicted output and the target output (like next word in a sentence).
- Use backpropagation to compute gradients of the loss with respect to all parameters.
- Update the parameters using gradient descent-based optimizers (e.g., Adam, AdamW).
So while the architecture is complex, the core optimization method is still gradient descent, just highly optimized for performance and scale.
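The loop below is a minimal PyTorch sketch of that cycle. A tiny linear model stands in for the transformer (a real transformer is just a far larger module plugged into the same loop), and all shapes and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

# A tiny stand-in model; a real transformer is a much bigger nn.Module,
# but it trains with exactly the same loop.
model = nn.Linear(16, 4)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

inputs = torch.randn(8, 16)            # fake input batch
targets = torch.randint(0, 4, (8,))    # fake target classes

for step in range(100):
    logits = model(inputs)             # 1. pass inputs through the model
    loss = loss_fn(logits, targets)    # 2. loss between prediction and target
    optimizer.zero_grad()
    loss.backward()                    # 3. backpropagation computes all gradients
    optimizer.step()                   # 4. gradient-descent-based update (AdamW)
```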
Difference between Adam and standard gradient descent algorithm
In the standard gradient descent algorithm, the learning rate α is fixed, meaning we need to start at a high learning rate and manually change the alpha in steps or with some learning schedule.
This leads to several challenges:
- A low learning rate at the onset would lead to very slow convergence.
- A very high rate at the start might overshoot and miss the minimum.
Adam solves this problem by adapting the learning rate α for each parameter θ, enabling faster convergence compared to standard gradient descent with a constant global learning rate.
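For reference, here is a compact NumPy sketch of the Adam update (using the standard defaults β₁ = 0.9, β₂ = 0.999, ε = 1e-8; the gradient passed in is a placeholder you would supply):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m and v are running moment estimates, t the step count (from 1)."""
    m = beta1 * m + (1 - beta1) * grad        # 1st moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # 2nd moment: running uncentered variance
    m_hat = m / (1 - beta1 ** t)              # bias correction for the warm-up phase
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter step size
    return theta, m, v
```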
Reference: builtin.com - Complete Guide to the Adam Optimization Algorithm
Challenges and Solutions
Gradient descent comes with its own set of challenges, depending notably on whether the problem is convex or not.
Convex Functions / Problems
- Only one minimum, which is also the global minimum.
- Gradient descent will always reach that bottom point as long as the learning rate is properly set.
(Sketch: a single U-shaped valley.)
Non-Convex Problems
Most real-world machine learning problems (especially deep learning) are non-convex, meaning:
- The cost surface can have multiple valleys (local minima) and flat regions (saddle points).
- Gradient descent might not reach the best solution.
(Sketch: a landscape with several valleys of different depths.)
Challenges
When the slope (gradient) ≈ 0, gradient descent stops or slows down — this is interpreted as a “minimum” (but not necessarily the best one).
Three places this can happen:
Global Minimum
- The ideal point.
- Gradient is zero and cost is the lowest possible.
Local Minimum
- Gradient is zero, but it’s not the best spot.
- The curve still looks like a “U”, but it’s a smaller dip than the global minimum.
Saddle Point
- The cost function looks like a saddle:
- Going one direction: slope goes up (a local maximum).
- Going the other direction: slope goes down (a local minimum).
- Gradient ≈ 0 at the saddle’s center → confuses gradient descent into thinking it’s stuck.

Summary
Gradient descent is powerful but not without its issues:
- Learning rate tuning: Too small and it’s slow; too large and it might overshoot the minimum.
- Local minima: In complex models, the function might have many minima. The algorithm might get stuck.
- Plateaus and saddle points: Areas where the gradient is close to zero but not at a minimum.
To deal with these, several enhancements exist, such as:
- Momentum: Adds a fraction of the previous update to the current one.
- RMSProp and Adam: Adaptive learning rates that adjust over time for each parameter.
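As a rough sketch of the momentum idea, reusing the toy function from earlier (the coefficients are common defaults but arbitrary here):

```python
# Gradient descent with momentum on J(theta) = (theta - 3)^2.
# velocity accumulates a decaying sum of past gradients, smoothing the path.

theta, velocity = 0.0, 0.0
alpha, beta = 0.1, 0.9   # learning rate and momentum coefficient

for step in range(200):
    grad = 2 * (theta - 3)              # gradient of the cost
    velocity = beta * velocity + grad   # fraction of the previous update + current gradient
    theta = theta - alpha * velocity    # step along the smoothed direction

print(f"theta ≈ {theta:.4f}")
```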

FAQ
Stochastic gradient descent
What is the stochastic gradient descent principle?
- Start with some initial θ's (for example random or null).
- Select one training sample randomly from the set and update the θ's with this example: \(\begin{aligned} \theta_i \leftarrow \theta_i - \alpha \, (h_\theta(x_n) - y_n) \, x_{n,i} \end{aligned}\)
- Repeat step 2 until convergence.
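A minimal NumPy sketch of this principle for a linear hypothesis \(h_\theta(x) = \theta^T x\) (the data, seed, and learning rate are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                # toy training inputs
y = X @ np.array([2.0, -1.0, 0.5])           # toy targets from a known theta

theta = np.zeros(3)                          # step 1: initial (null) parameters
alpha = 0.01

for _ in range(5000):                        # step 3: repeat until convergence
    n = rng.integers(len(y))                 # step 2: pick one sample at random
    h = theta @ X[n]                         # linear hypothesis h_theta(x_n)
    theta -= alpha * (h - y[n]) * X[n]       # the per-sample update rule above

print(theta)   # wanders close to [2.0, -1.0, 0.5]
```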
What about the convergence with stochastic gradient descent?
The convergence is trickier to observe, as we may select favorable and less favorable points in the training set. In practice we may compute an “averaged” cost function over the past U updates, typically with U ≤ N, and U ≪ N if N is really large. Stochastic gradient descent never actually converges the way batch gradient descent does, but ends up wandering around close to the global minimum. In practice this isn't a problem: as long as we are close to the minimum, that's probably OK.
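One way to monitor this is a simple running average of the last U per-sample losses, sketched below (U = 100 is an arbitrary choice and `record` is a hypothetical helper):

```python
from collections import deque

U = 100
recent_losses = deque(maxlen=U)   # keeps only the last U per-sample losses

def record(loss):
    """Call after each SGD update; returns the averaged cost over the past U updates."""
    recent_losses.append(loss)
    return sum(recent_losses) / len(recent_losses)
```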
Stochastic gradient descent properties? And its advantages?
We update the parameters for each training sample, which is faster. We “generally” move in the direction of the global minimum, but not always. Advantages:
- It can handle very large sets of data
- It allows for incremental learning, i.e. on-the-fly adaptation of the model on new incoming data.
- The learning principle is generalisable to many other forms of “hypothesis families”.
List some advantages of the gradient descent algorithm
- It can handle very large sets of data (especially the stochastic or mini-batch form of gradient descent)
- It allows for incremental learning, i.e. on-the-fly adaptation of the model on new incoming data
- The learning principle is generalisable to many other forms of “hypothesis families” hθ(x), including neural networks, deep neural networks, etc.
What are the “Full” batch gradient descent principles?
- Start with some initial θ's (for example random or null).
- Visit the full training set to compute new values of the θ's that reduce J(θ).
- Repeat step 2 until convergence. The new values of the θ's are chosen according to the gradient of J(θ), i.e. in the opposite direction of the slope.
What is the gradient descent update rule?
The parameter θ is replaced by its last value plus a step of size α in the opposite direction (hence the minus sign) of the gradient of the cost function.
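In symbols, consistent with the update rule given earlier in this article: \(\begin{aligned} \theta = \theta - \alpha \cdot \nabla J(\theta) \end{aligned}\)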
What is mini-batch gradient descent and what are its advantages?
We randomly select b samples on which we perform batch gradient descent. Advantages:
- The adaptation noise of stochastic gradient descent is reduced.
- It allows for vectorised implementations (more efficient, use of GPU).
- It allows mini-batches to be distributed over several machines or cores (map/reduce pattern): map = distribute the mini-batches (for example 50); reduce = update the parameters with the results of the 50 mini-batches.
A disadvantage is that the batch size b needs to be optimised.
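A vectorised mini-batch step might look like the sketch below (`minibatch_step` is a hypothetical helper; the MSE cost, shapes, and b = 32 are illustrative):

```python
import numpy as np

def minibatch_step(theta, X, y, rng, alpha=0.05, b=32):
    """One vectorised mini-batch gradient descent step on the MSE cost."""
    idx = rng.choice(len(y), size=b, replace=False)   # draw b samples at random
    Xb, yb = X[idx], y[idx]                           # the mini-batch
    grad = 2 * Xb.T @ (Xb @ theta - yb) / b           # one vectorised gradient computation
    return theta - alpha * grad
```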
Reference
- IBM - What is gradient descent?
- builtin.com - Complete Guide to the Adam Optimization Algorithm
- Wikipedia - Gradient descent
- Deep Learning — Part 2: Gradient Descent and variants
- ChatGPT, with the input: “Write me an article about gradient descent. a) Explain cost function and loss function b) Write all math formula in mathjax latex for markdown (bracket) c) Is Gradiant descent use in transformer architecture?”