If you have been following my previous tutorials, you should already be familiar with other optimization techniques for solving ML problems. In this tutorial, I am going to introduce you to a numerical optimization technique called stochastic gradient descent (SGD), which is widely used in modern machine-learning models such as neural networks. This tutorial is very important and will serve as the foundation for the deep learning tutorials, so please don’t skip it.
Now, why does SGD stand out? Well, unlike some other methods that might be a bit slow and meticulous, SGD is like a speedy learner. It adapts quickly, updating its knowledge with every new example it sees. This makes it stand out in the crowd, especially when dealing with massive datasets where efficiency matters.
Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to fitting linear classifiers and regressors under convex loss functions such as (linear) Support Vector Machines and Logistic Regression. Even though SGD has been around in the machine learning community for a long time, it has received a considerable amount of attention only recently in the context of large-scale learning. SGD has been successfully applied to the large-scale and sparse machine-learning problems often encountered in text classification and natural language processing. Strictly speaking, SGD is merely an optimization technique and does not correspond to a specific family of machine learning models; it is only a way to train a model.
(Source: Scikit-learn documentation)
Table of Contents
Prerequisites
- Linear Algebra
- Calculus
- Probability & Statistics
- Python
- Familiarity with Classical Machine Learning Models
What You Will Learn
- Advanced Calculus Basics For SGD
- Batch GD
- True SGD
- Mini-Batch SGD
- Momentum SGD or SGDM
- Adaptive Gradient (AdaGrad)
- RMSprop
- Adam
- SGD Application in Sklearn
Calculus For Stochastic Gradient Descent
Before we describe the concepts related to gradient descent, it’s important to go through some calculus basics that matter for this optimization technique. Please go through the file below, where I have covered everything you need to know, and then move on to the next section. SGD is an optimization technique for finding the optimum of the loss function, so it is more mathematical than conceptual and is very easy to understand. Please don’t share this file without our permission.
Intuition For Gradient Descent Or Batch GD
Imagine we are explorers on a hilly landscape, searching for the lowest point—the global minimum. This landscape symbolizes the cost function of a machine-learning model. Our objective is to navigate down the slope, iteratively adjusting our position to find the optimal location. The cost function tells us how much our model’s guesses differ from the real answers. We want to minimize this difference.
Gradient Descent is like guiding a ball down the slope. We want to adjust our model to move downhill, making it better with each step. It’s like steering the ball to the lowest point on the hill – the best possible performance for our model. We need to know which way is downhill, right? In machine learning, this is the slope of our cost function hill. We call this slope the “gradient.” It’s like saying, “Hey, move this way to get better!”
Now, we want to take steps to reach the lowest point. Imagine we’re adjusting our model a little bit each time. The size of these steps is called the “learning rate.” If it’s too big, we might skip over the best point. If it’s too small, it takes forever to get there. Choosing how big our steps are is super important.
We need to watch the scenery – in our case, the cost function – as we go. If it’s not changing much anymore, we’ve probably found the best spot. Our model is then at its best possible form!
If our hill is weirdly shaped, with steep and shallow parts, we might struggle. Normalizing or standardizing our input data makes the journey smoother. This is what gradient descent is in simple words. That’s all!
One thing to keep in mind here: the gradient points uphill, but our goal is to minimize the cost function, so we move in the negative (opposite) direction of the gradient. This is why you will see the negative sign in the gradient descent equation.
Gradient Descent Derivation:
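A minimal sketch, assuming a linear regression model with mean squared error (other models follow the same pattern). For hypothesis h(x) = θᵀx and cost

J(θ) = (1 / 2m) * Σᵢ (h(xᵢ) - yᵢ)²

the gradient with respect to θ is

∇J(θ) = (1 / m) * Σᵢ (h(xᵢ) - yᵢ) * xᵢ

and each batch step moves against it:

θ := θ - α * ∇J(θ)

where α is the learning rate and m is the number of training samples. The minus sign is exactly the “move downhill” direction discussed above.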
True SGD
Now let’s come back to our main topic stochastic gradient descent (SGD) which extends the idea of Gradient Descent. In regular Gradient Descent, we update the model parameters by computing the gradient of the cost function with respect to the parameters over the entire dataset. This process can be computationally expensive for large datasets.
SGD involves random sampling of data points. This introduces a stochastic (random) element, making it computationally more efficient for large datasets. SGD aims to minimize an objective function (often a cost function) by adjusting model parameters. The objective function is typically the average of the individual loss functions over the dataset. In each iteration, only a single random data point (sample) is used.
The gradient is computed using this single sample, and the model parameters are updated accordingly. This process is repeated until convergence. The learning rate in SGD is crucial, as it determines the step size for parameter updates, and it needs to be carefully tuned to balance convergence speed and stability.
Due to the random sampling of data points, the parameter updates in SGD are noisy. While the noise in updates introduces variability, SGD also benefits from frequent updates. These updates allow the algorithm to explore different regions of the parameter space more rapidly than regular Gradient Descent. The algorithm may occasionally take steps that do not strictly follow the gradient but deviate due to the random sampling of data points or mini-batches (see mini-batch SGD).
In the presence of poor local minima, where the gradient is small or zero in multiple dimensions, SGD’s stochastic updates provide a chance for the algorithm to escape these suboptimal points. In situations where there’s symmetry in the gradient or the cost function, deterministic algorithms might get stuck in a symmetrical configuration. Stochastic updates break this symmetry, aiding in exploration. The noise introduced by stochastic updates acts as a regularizer. It prevents the model from overly relying on specific patterns in the data, which can help generalize better to unseen data.
SGD may exhibit more oscillations in the convergence path compared to regular Gradient Descent. However, it often converges faster, especially in scenarios with redundant or noisy data. In scenarios with noisy or redundant data, SGD’s noisy updates help prevent overfitting by preventing the model from fitting the noise in the training data too closely. Although individual updates are noisy, the overall convergence of SGD can be faster due to the more frequent updates. The algorithm can make progress even before seeing the entire dataset.
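To make the loop concrete, here is a minimal NumPy sketch of true SGD for linear regression with squared loss; the function and variable names are mine, for illustration:

```python
import numpy as np

def true_sgd(X, y, lr=0.01, epochs=10):
    """Plain SGD: one randomly chosen sample per parameter update."""
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)
    for _ in range(epochs):
        for _ in range(n_samples):
            i = np.random.randint(n_samples)   # pick a random sample
            error = X[i] @ theta - y[i]        # prediction error for that sample
            grad = error * X[i]                # gradient of the squared loss on one sample
            theta -= lr * grad                 # step against the gradient
    return theta
```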
Mini-Batch Gradient Descent
What I described above is true SGD; in practice, we use mini-batch SGD. Mini-Batch Gradient Descent is a compromise between the efficiency of SGD and the stability of using the entire dataset. Instead of using only one data point or the entire dataset, mini-batch SGD uses a small, randomly selected subset of the data (a mini-batch) in each iteration.
The size of the mini-batch is a hyperparameter that affects the algorithm’s performance. A smaller batch size introduces more noise but can lead to faster convergence, while a larger batch size provides a more accurate estimate of the gradient. Using mini-batches also smooths the updates, which helps avoid the extreme swings that can happen with true SGD.
Suppose you have a dataset of 1000 samples. In SGD, you update the model parameters using only one random sample in each iteration. In Mini-Batch GD, you might choose a mini-batch size of 50, and in each iteration, update the model parameters using these 50 random samples.
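Under the same assumptions as the sketch above, the mini-batch variant changes only the inner loop: the gradient is averaged over a random batch (here of size 50) instead of computed from a single sample.

```python
def minibatch_sgd(X, y, lr=0.01, epochs=10, batch_size=50):
    """Mini-batch SGD: average the gradient over a small random subset."""
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)
    for _ in range(epochs):
        indices = np.random.permutation(n_samples)     # shuffle once per epoch
        for start in range(0, n_samples, batch_size):
            batch = indices[start:start + batch_size]
            errors = X[batch] @ theta - y[batch]
            grad = X[batch].T @ errors / len(batch)    # gradient averaged over the batch
            theta -= lr * grad
    return theta
```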
| | Batch Gradient Descent | True Stochastic Gradient Descent | Mini-Batch Gradient Descent |
|---|---|---|---|
| Definition | An optimization algorithm that calculates the gradient over the entire dataset to update model parameters. | An optimization algorithm that updates model parameters using the gradient of a single randomly chosen data point. | An optimization algorithm that updates model parameters using the gradient of a randomly chosen subset (mini-batch) of the dataset. |
| When to Use | Suitable for small to medium-sized datasets, when computational resources allow processing the entire dataset in each iteration. | Useful for large datasets; suitable for online learning scenarios with continuous data streams. | A compromise between Batch GD and True SGD; efficient for large datasets; commonly used in practice for training deep learning models. |
| Good Things | Guaranteed convergence to the global minimum for convex functions; stable and predictable updates; simple to implement and understand. | Can escape local minima due to frequent updates; suitable for online learning; handles large datasets efficiently. | Balances computational efficiency and stability; smooths updates, avoiding extreme changes; can be parallelized for further speed-up. |
| Bad Things | Computationally expensive for large datasets; memory-intensive as it uses the entire dataset; not suitable for online learning or streaming data. | Noisy updates due to single-sample updates; potentially slow convergence, especially in flat regions; sensitive to learning rate choice. | Requires tuning of the mini-batch size; potentially slower convergence than Batch GD; sensitive to learning rate choice. |
Application of SGD In Sklearn:
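Here is a minimal example of what this can look like with scikit-learn’s SGDClassifier; the dataset and hyperparameters are illustrative only. With loss="hinge" the model is a linear SVM trained by SGD, while loss="log_loss" would give logistic regression.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy binary classification data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# SGD is sensitive to feature scale, so standardize first
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

clf = SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, random_state=42)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```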
Deep Learning Optimizers
Below are some modern optimizers that extend SGD and are used mainly in deep-learning problems. Please make sure to understand them, as they will build the foundation for upcoming tutorials on deep learning.
Momentum in SGD or SGDM
Think of momentum as a ball rolling down the hill. As it moves, it gains momentum, allowing it to roll through shallow regions and resist getting stuck. This is the key intuition behind momentum in SGD – it helps the optimization process by incorporating information about the direction of previous updates.
Consider a scenario where the loss landscape has a long, shallow valley. In standard SGD, the updates might oscillate or take small steps, making slow progress. With momentum, the algorithm accumulates speed in the direction of consistent gradients, allowing it to roll faster through the shallow region and make more substantial progress.
Momentum shines in optimization landscapes with flat or gently sloping regions. It helps the algorithm roll through these areas efficiently. In scenarios with large datasets, momentum can smooth out updates and prevent the algorithm from getting stuck in local minima. It also has some challenges: choosing the right momentum term (β) requires careful tuning, and suboptimal choices may hinder performance. Momentum might not provide significant benefits in all scenarios; for simple convex problems, the additional complexity may not be necessary.
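For reference, a minimal sketch of the classical momentum update (same notation as the Nesterov steps below, with β typically around 0.9):

v_{t+1} = β * v_t + α * ∇J(θ_t)
θ_{t+1} = θ_t - v_{t+1}

The velocity v accumulates a decaying sum of past gradients, which is what lets the update “roll” through shallow regions instead of stalling.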
Nesterov Momentum is an enhancement of the traditional momentum approach. In regular momentum, the gradient is calculated at the current position to update the velocity, and then this velocity is used to perform the parameter update.
In Nesterov Momentum, the idea is to “look ahead” by calculating the gradient not at the current position but at an adjusted position, which is slightly ahead in the direction of the current velocity. This allows the algorithm to anticipate the upcoming update and adjust the velocity accordingly.
- Compute the “look-ahead” position: θ̃_t = θ_t - β * v_t, where θ_t is the current parameter vector, β is the momentum term, and v_t is the velocity.
- Calculate the gradient at the “look-ahead” position: ∇J(θ̃_t)
- Update the velocity: v_{t+1} = β * v_t + α * ∇J(θ̃_t), where α is the learning rate.
- Update the parameters: θ_{t+1} = θ_t - v_{t+1}
Nesterov Momentum often helps to reduce oscillations and overshooting, leading to faster convergence in certain scenarios.
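A one-step NumPy sketch of the procedure above, assuming a helper grad_fn that computes ∇J at a given parameter vector (the helper name is hypothetical):

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, lr=0.01, beta=0.9):
    """One Nesterov momentum update; grad_fn(params) returns the gradient there."""
    lookahead = theta - beta * v                # peek ahead along the current velocity
    v_new = beta * v + lr * grad_fn(lookahead)  # velocity uses the look-ahead gradient
    theta_new = theta - v_new                   # step with the updated velocity
    return theta_new, v_new
```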
Adaptive Gradient (AdaGrad)
In deep learning, objective functions may have non-differentiable points (discontinuities, corners, etc.). Traditional gradient methods break down in such cases. The subgradient generalizes the concept of a gradient to non-smooth functions. Subgradient methods are typically applied to optimize convex functions, which are mathematical functions with certain desirable properties. Subgradient Descent is a variant of gradient descent tailored for non-smooth functions.
If you are unaware of what a subgradient is, please take a moment to read the explanation here, and then come back.
AdaGrad is a family of sub-gradient algorithms for stochastic optimization. Adagrad adapts the learning rates for each parameter during training, allowing it to perform larger updates for infrequently occurring features and smaller updates for frequently occurring features. This helps in handling sparse data and converging faster in certain dimensions.
In practice, Adagrad is effective in scenarios where features have different frequencies.
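The per-parameter AdaGrad update (in its standard formulation) is:

θ_{t+1,i} = θ_{t,i} - η / √(G_{t,ii} + ϵ) * g_{t,i}

where g_{t,i} is the current gradient for parameter i, G_{t,ii} = Σ_{τ=1}^{t} g_{τ,i}² is the accumulated sum of its squared historical gradients, η is the global learning rate, and ϵ is a small constant for numerical stability.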
The key point is that, as the optimization progresses, G_{t,ii} accumulates information about the historical gradients for each parameter i. Since it appears in the denominator, the learning rate becomes smaller for parameters that have larger accumulated gradients. This means that parameters with frequent and large updates will have smaller learning rates, and parameters with infrequent and small updates will have larger learning rates. Here is an example:
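(The numbers below are made up purely for illustration.) Take η = 0.1 and ignore ϵ. If parameter 1 has accumulated G_{t,11} = 100 while parameter 2 has only accumulated G_{t,22} = 1, their effective learning rates are 0.1 / √100 = 0.01 and 0.1 / √1 = 0.1. The rarely-updated parameter therefore takes steps ten times larger than the frequently-updated one.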
This adaptability helps Adagrad to handle sparse data and adjust the learning rates for each parameter individually. However, one drawback of Adagrad is that the learning rates can become very small over time, potentially leading to slow convergence. This issue has led to the development of other adaptive optimization algorithms like RMSprop and Adam, which address some of the limitations of Adagrad.
RMSprop – Root Mean Square Propagation
RMSprop is a modification of Adagrad to address its tendency to excessively reduce the learning rates. It uses a moving average of squared gradients to normalize the learning rates. It adapts the learning rates like Adagrad but mitigates its aggressive reductions.
The main idea behind RMSprop is to adapt the learning rates for each parameter during training. It does so by dividing the learning rate for each parameter by the root mean square (RMS) of the recent gradients for that parameter. This helps to give smaller learning rates for parameters with large and frequent updates and larger learning rates for parameters with small and infrequent updates. RMSprop was introduced by Geoffrey Hinton in his lecture on neural networks for machine learning.
Initialization:
- Initialize model parameters: θ
- Initialize the moving average of squared gradients: E[g²] = 0
Iterative Update:
For each iteration t:
- Compute the gradient g_t of the loss function with respect to the parameters.
- Update the moving average of squared gradients:
E[g²]_t = β * E[g²]_{t-1} + (1 - β) * g_t²
where β is a hyperparameter typically set to a value like 0.9.
- Update the parameters:
θ_{t+1} = θ_t - η / (√(E[g²]_t) + ϵ) * g_t
where η is the learning rate and ϵ is a small constant added for numerical stability.
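The same update as a short NumPy sketch (names are illustrative):

```python
import numpy as np

def rmsprop_step(theta, avg_sq, grad, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSprop update; avg_sq is the running E[g²] from previous steps."""
    avg_sq = beta * avg_sq + (1 - beta) * grad**2      # moving average of squared gradients
    theta = theta - lr * grad / (np.sqrt(avg_sq) + eps)
    return theta, avg_sq
```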
Key Components and Benefits:
- Adaptive Learning Rates:
- RMSprop adapts the learning rates for each parameter individually based on the history of gradients. Parameters with large gradients get smaller effective learning rates, and parameters with small gradients get larger effective learning rates.
- Normalization:
- The division by the root mean square of the recent gradients helps to normalize the updates, making the algorithm less sensitive to the scale of the gradients.
- Stability:
- The inclusion of the small constant ϵ in the denominator helps prevent division by zero and improves the stability of the algorithm.
Adam (short for Adaptive Moment Estimation)
Adam combines ideas from two other optimization algorithms: RMSprop (Root Mean Square Propagation) and Momentum. It has been shown to work well in practice for a wide range of deep learning tasks and is known for its robustness and efficiency.
Here is a simple worked sketch of one Adam update.
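Using the same notation as the RMSprop section, with the usual defaults β₁ = 0.9, β₂ = 0.999, and ϵ = 1e-8:

m_t = β₁ * m_{t-1} + (1 - β₁) * g_t        (first moment: a momentum-style average of gradients)
v_t = β₂ * v_{t-1} + (1 - β₂) * g_t²       (second moment: an RMSprop-style average of squared gradients)
m̂_t = m_t / (1 - β₁^t),  v̂_t = v_t / (1 - β₂^t)   (bias correction for the zero-initialized averages)
θ_{t+1} = θ_t - η * m̂_t / (√(v̂_t) + ϵ)

The bias-correction step is what distinguishes Adam from simply stacking momentum on top of RMSprop; without it, the early updates would be biased toward zero.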
That’s all for now. I will talk about their application in deep learning tutorials.
Further Reading:
- https://optimization.cbe.cornell.edu/index.php?title=AdaGrad
- https://optimization.cbe.cornell.edu/index.php?title=Adam
- https://optimization.cbe.cornell.edu/index.php?title=RMSProp
- https://arxiv.org/pdf/2301.11235.pdf