As a deep learning practitioner, you’ll use optimizers to find model parameters that minimize your loss function, resulting in better predictions.
Choosing the right optimizer is essential because different optimizers are better suited to different problems.
Some popular optimizers include SGD, Adam, and RMSProp. SGD is simple but effective, while Adam and RMSProp do a little more computation per step but often converge faster. They are all gradient-descent-based algorithms, meaning they iteratively update the model weights in the direction that reduces the loss.
Think of optimizers as the steering wheel for your model’s learning process – selecting the right one can mean the difference between aimlessly wandering and smoothly sailing toward your prediction goals.
In this post, I’ll provide an intuitive explanation of SGD, Adam, and RMSProp.
SGD
Since the 2010s, SGD has become a widely used optimization algorithm in machine learning.
But it has its roots in a 1950s paper by Herbert Robbins and Sutton Monro called A Stochastic Approximation Method. The Robbins-Monro algorithm is an iterative method for finding the root of a function from noisy observations (applied to a gradient, this locates a minimum), and its principles formed the foundation of SGD. SGD is a simple yet effective optimization algorithm that updates the model weights using the gradient of the loss with respect to the weights, computed on a single training example.
The “stochastic” component of the algorithm refers to the fact that the algorithm uses random samples of the data to determine the direction of the descent. By using random samples, the algorithm can make cheaper and more frequent updates, because it does not need the entire dataset to make each decision. Each individual update is noisier than one computed on the full dataset, but the noise tends to average out over many steps.
The “gradient” component of the algorithm refers to the calculation of the gradient of the cost function. The gradient is a vector that points in the direction in which the cost function increases fastest. By calculating the gradient, the algorithm can determine the direction it needs to move to minimize the cost.
The “descent” component of the algorithm refers to the fact that the algorithm is moving in the opposite direction of the gradient. By moving in the opposite direction of the gradient, the algorithm can reduce the cost and find the best values for the parameters.
Here’s an intuitive explanation of SGD, step by step:
You start by setting your model’s parameters (weights and biases) to random or predetermined values. Then, you’ll go through each training example and:
 Calculate your model’s prediction error (loss) on the example.
 Compute the gradient of the loss with respect to the model parameters.
 Update your model parameters in the opposite direction of the gradient, using a step size (learning rate) to control the update size.
You’ll repeat this process for several iterations or until the model parameters converge on a good solution.
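The loop above can be sketched in a few lines of plain Python. Treat this as a minimal sketch rather than a reference implementation: the toy dataset (points on the line y = 2x + 1), the learning rate of 0.05, and the name `sgd_fit` are all assumptions I picked for illustration.

```python
import random


def sgd_fit(data, lr=0.05, epochs=200, seed=0):
    """Fit y = w * x + b with per-example SGD on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0                      # predetermined starting values
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)                # the "stochastic" part: random order
        for x, y in data:
            error = (w * x + b) - y      # prediction error on this one example
            grad_w = 2 * error * x       # d(error^2)/dw
            grad_b = 2 * error           # d(error^2)/db
            w -= lr * grad_w             # step against the gradient
            b -= lr * grad_b
    return w, b


# Hypothetical toy data: noiseless points on y = 2x + 1.
toy_data = [(k / 10, 2 * (k / 10) + 1) for k in range(-10, 11)]
w, b = sgd_fit(toy_data)
print(w, b)
```

On this toy data the learned `w` and `b` should land very close to 2 and 1. With real, noisy data you would not expect an exact fit, and you would usually need to tune `lr`.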
But I like to think about it in a more…delicious way.
Imagine you are a chef trying to find the perfect combination of ingredients to make the perfect batch of cookies.
You start with a basic recipe that you think might work, but you’re not yet sure it will produce the most delicious cookies. So, you decide to make a small batch of cookies and taste-test them to see how they turn out. If the cookies are too sweet, reduce the sugar in the recipe. If they aren’t sweet enough, add more sugar.
You continue to tweak the recipe and repeatedly taste-test until you find the perfect balance of ingredients that yields the Bugatti of Biscuits, the Most Wondrous of Wafers…the Cookie of the Gods!
SGD works similarly.
It starts with a set of initial model parameters (the “recipe”) and then makes minor adjustments to the parameters based on how well the model is doing on a training example (the “taste test”). It repeats this process for a fixed number of iterations or until the model parameters (the “recipe”) converge to a satisfactory solution (the “perfect batch of cookies”).
SGD is commonly used in machine learning and deep learning and is often considered the “default” optimization algorithm for many models.
Here are a few examples where SGD may be a good choice:

Large datasets: SGD can train machine learning models efficiently because it processes training examples one at a time and does not require the entire dataset to be loaded into memory at once.

Efficient hyperparameter tuning: SGD is relatively insensitive to the choice of hyperparameters compared to other optimizers, such as Adam or RMSProp. However, this does not mean that tuning is unimportant for SGD. The learning rate in particular is an essential hyperparameter that can significantly impact the model’s performance.

Sparse data: SGD can be more efficient than batch gradient descent when training on sparse data because it only processes a small number of nonzero data points at a time.

Simple implementation: SGD is a simple optimization algorithm, making it a good choice for prototyping and experimentation.
Important hyperparameters for SGD
Several hyperparameters can be adjusted in SGD to control the learning process. Here are some of the most important ones:

Learning rate
: This determines the size of the step that SGD takes towards the minimum of the loss function. A lower learning rate means that the algorithm will take smaller steps and require more iterations to converge, but it will also be less likely to overshoot the minimum. A larger learning rate means that the algorithm will take larger steps, leading to faster convergence, but it can also cause the algorithm to oscillate or diverge. 
Momentum
: This is the amount of inertia the optimization algorithm has in its step. In other words, it determines the amount of the previous update carried over to the current update. This can help the algorithm overcome shallow local minima, escape from saddle points and help the algorithm converge faster. 
Nesterov momentum
: The basic idea behind Nesterov momentum is to “look ahead” at the direction in which the optimization algorithm moves and use this information to adjust the step size and direction. To do this, Nesterov modulates the standard momentum algorithm by using a “lookahead” gradient to determine the direction in which to take the step rather than the current gradient. 
Learning rate decay
: The learning rate is multiplied by this factor at each iteration so that it decreases gradually over time. This can help the algorithm converge more smoothly and avoid overshooting the minimum. An intuition behind this is that we first want to quickly approximate the solution, which is done with a higher learning rate. But when we are close to a minimum, we want to be more precise in our updates, so we need a lower learning rate.
Batch size
: This determines the number of samples used to calculate the gradient of the loss function at each iteration. Larger batch sizes lead to more stable gradients and often faster convergence, but they can also be computationally expensive. A smaller batch size leads to noisier gradients, but it can also allow the algorithm to be more responsive to changes in the data.
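To make momentum, Nesterov momentum, and learning rate decay a bit more concrete, here is a small sketch on a one-dimensional toy loss, (w - 3)^2. The function name `sgd_momentum`, the toy loss, and every default value are assumptions for illustration; real frameworks may parameterize these updates slightly differently.

```python
def sgd_momentum(grad_fn, w0, lr=0.1, momentum=0.9, nesterov=False,
                 lr_decay=1.0, steps=300):
    """Minimize a 1-D function from its gradient using SGD with momentum."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        # Nesterov evaluates the gradient at the "look-ahead" position.
        lookahead = w + momentum * velocity if nesterov else w
        g = grad_fn(lookahead)
        # Carry over part of the previous update, then add the new step.
        velocity = momentum * velocity - lr * g
        w += velocity
        lr *= lr_decay                   # multiply the learning rate by the decay factor
    return w


def toy_grad(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2 * (w - 3)


plain = sgd_momentum(toy_grad, w0=0.0, momentum=0.0)
heavy = sgd_momentum(toy_grad, w0=0.0, momentum=0.9)
nag = sgd_momentum(toy_grad, w0=0.0, momentum=0.9, nesterov=True, lr_decay=0.99)
print(plain, heavy, nag)
```

All three variants should end up close to 3 here. The toy problem is too easy to show momentum's benefits, so the point is only to show where each hyperparameter enters the update.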
RMSProp
RMSProp is an optimization algorithm that was introduced by Geoff Hinton in a Coursera lecture in 2012. It was never published as a formal paper; the usual reference is the lecture slides titled “Lecture 6e: rmsprop: Divide the gradient by a running average of its recent magnitude.”
The algorithm is an extension of the famous stochastic gradient descent (SGD) algorithm. The key idea behind RMSProp is to scale the gradient of each weight in the model by dividing it by the root mean square (RMS) of that weight’s recent gradients. This helps prevent weights with large gradients from learning too quickly while allowing weights with small gradients to keep learning at a useful pace.
The result is a more stable and effective training process.
Here’s an intuitive explanation of RMSProp, step by step:

1. Initialize the RMSProp state variable. This state will store the moving average of the squared gradients.

2. Compute the gradient of the loss function with respect to each weight in the model.

3. Update the RMSProp state variable by calculating the moving average of the squared gradients.

4. Scale the gradients by dividing them by the square root of the RMSProp state (i.e., the moving average of the squared gradients). This helps prevent weights with large gradients from learning too quickly while allowing weights with small gradients to continue learning faster.

5. Update the model parameters using the scaled gradients and a learning rate.

6. Repeat steps 2-5 for a predetermined number of iterations or until the model reaches convergence.
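Here is how those steps might look in plain Python. This is a sketch under assumptions: the function name `rmsprop`, the default values, and the toy loss (100 * w0^2 + 0.01 * w1^2, chosen so the two weights see gradients of very different sizes) are mine, not taken from Hinton's slides.

```python
import math


def rmsprop(grad_fn, w, lr=0.01, decay=0.9, eps=1e-8, steps=500):
    """RMSProp on a list of weights; grad_fn returns one gradient per weight."""
    w = list(w)
    avg_sq = [0.0] * len(w)              # state: moving average of squared gradients
    for _ in range(steps):
        grads = grad_fn(w)               # gradient of the loss for each weight
        for i, g in enumerate(grads):
            # Update the moving average of the squared gradient.
            avg_sq[i] = decay * avg_sq[i] + (1 - decay) * g * g
            # Scale the gradient by the root of that average.
            scaled = g / (math.sqrt(avg_sq[i]) + eps)
            # Take a step with the scaled gradient.
            w[i] -= lr * scaled
    return w


def toy_grad(w):
    """Gradient of 100 * w0^2 + 0.01 * w1^2: very steep in w0, very flat in w1."""
    return [200 * w[0], 0.02 * w[1]]


final_w = rmsprop(toy_grad, [1.0, 1.0])
print(final_w)
```

Even though the raw gradients differ by a factor of 10,000, both weights should move toward 0 at a similar pace, because each gradient is divided by its own running RMS. With a fixed learning rate the weights typically end up hovering in a small neighborhood of the minimum rather than settling on it exactly.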
In a nutshell: RMSProp scales the gradient for each weight based on the magnitude of the error gradients.
If a weight is experiencing large gradients, the error it contributes to the prediction is significant, and the weight should be adjusted more slowly. If a weight is experiencing small gradients, it means that the error it contributes is minimal, and the weight can be adjusted more quickly.
Another way to think of RMSProp is as a way to add “friction” to the training process.
Imagine that you’re pulling a box across a floor…
If the floor is very smooth, the box will continue to move even after you stop pulling. But if you add friction to the floor (for example, by putting a rug down), the box will eventually stop. In the same way, RMSProp can add “friction” to the training process by keeping a decaying average of the previous squared gradients.
This helps avoid overshooting the optimal weights for the model.
Overall, RMSProp is a versatile optimization algorithm that can be effective in many training scenarios. It is worth considering as an option when training your machine learning models.
Here are a few heuristics or use cases where RMSProp may be a good choice:

If you are training a model with many parameters and are experiencing issues with the model diverging or oscillating during training, RMSProp can help stabilize the training process by adjusting to the gradient.

If the learning rate is hard to tune, RMSProp can be effective because it scales the gradients, which can help the optimization process converge more smoothly regardless of the learning rate.

If you train a model on a noisy or irregularly shaped loss function, RMSProp can smooth out short-term fluctuations and highlight long-term trends, which helps mitigate the effects of noise and allows the model to converge more quickly.
Important hyperparameters for RMSProp
Several hyperparameters can be adjusted in RMSProp to control the learning process. Here are some of the most important ones:
Learning rate
Batch size

Decay rate
: This factor is multiplied by the moving average of the squared gradients at each iteration, which controls how quickly old gradients are forgotten. A decay rate closer to 1 keeps more of the past average, so the RMS value changes slowly and the effective learning rate is more stable. A lower decay rate forgets past gradients faster, making the effective learning rate more sensitive to recent changes in the gradients.
Epsilon
: This is a small value that is added to the RMS value to avoid division by zero and to stabilize the learning process.
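A quick way to get a feel for the decay rate and epsilon is to feed a made-up gradient sequence into the running RMS and watch how it responds. The helper name `running_rms` and the spike-then-zeros sequence below are hypothetical, chosen only for illustration.

```python
import math


def running_rms(grads, decay, eps=1e-8):
    """Return the RMS value (plus eps) after each gradient in the sequence."""
    avg_sq, history = 0.0, []
    for g in grads:
        avg_sq = decay * avg_sq + (1 - decay) * g * g
        history.append(math.sqrt(avg_sq) + eps)
    return history


# Hypothetical gradient sequence: one spike followed by zeros.
grads = [10.0] + [0.0] * 20

slow = running_rms(grads, decay=0.99)    # long memory
fast = running_rms(grads, decay=0.5)     # short memory
print(slow[0], slow[-1])
print(fast[0], fast[-1])
```

The short-memory version reacts more strongly to the spike and then forgets it quickly, while the long-memory version barely moves and then lingers. Epsilon only matters when the average is near zero, where it keeps the division well defined.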
Adam
The Adam optimization algorithm, also known as the Adam gradient descent algorithm, was introduced in a 2015 paper by Diederik Kingma and Jimmy Ba titled “Adam: A Method for Stochastic Optimization.”
Adam stands for adaptive moment estimation and is a stochastic gradient descent optimization algorithm that uses an adaptive learning rate based on estimates of the first and second moments of the gradients. It maintains exponential moving averages of the gradients and of the squared gradients, which it uses to scale the learning rate. In other words, Adam uses estimates of the mean and uncentered variance of the gradients to adaptively scale the learning rate during training, which can improve the speed and stability of the optimization process.
This allows the learning rate to be adjusted on the fly based on the model’s current state rather than a fixed value.
Here’s an intuitive explanation of Adam, step by step:
 Initialize the model weights with some starting values.
 For each training iteration:
 Calculate the gradient of the loss function with respect to the model weights.
 Calculate the exponential moving average of the gradients and the exponential moving average of the squared gradients.
 Use these moving averages to adjust the learning rate for the current iteration.
 Update the model weights using the adjusted learning rate and the gradients calculated in step 2.
 Repeat steps 2 and 3 until the model has converged or the maximum number of iterations has been reached.
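The steps above can be sketched in plain Python as follows. One addition the steps don't mention: the Adam paper also applies a bias correction to both moving averages, because they start at zero, so I've included it. The function name `adam`, the toy loss, and the learning rate of 0.05 are assumptions for illustration.

```python
import math


def adam(grad_fn, w, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam on a list of weights; grad_fn returns one gradient per weight."""
    w = list(w)
    m = [0.0] * len(w)                   # moving average of gradients
    v = [0.0] * len(w)                   # moving average of squared gradients
    for t in range(1, steps + 1):
        grads = grad_fn(w)
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            # Bias correction: both averages start at 0 and are too small early on.
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


def toy_grad(w):
    """Gradient of (w0 - 1)^2 + 10 * (w1 + 2)^2, minimized at (1, -2)."""
    return [2 * (w[0] - 1), 20 * (w[1] + 2)]


final_w = adam(toy_grad, [0.0, 0.0], lr=0.05, steps=2000)
print(final_w)
```

On this toy problem the weights should end up close to 1 and -2. The learning rate here is much larger than the 0.001 default because the problem is tiny; don't read it as a recommendation.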
Here’s another way to think about it…
Imagine driving a car down a winding road and wanting to get to your destination as quickly and smoothly as possible.
The car’s engine is like the model weights in a machine learning algorithm, and the gas pedal is like the learning rate. As you drive, you need to adjust the gas pedal (learning rate) to find the right balance between going too fast (overshooting your destination) and going too slow (not making progress). If you go too fast, you may overshoot your turn or lose control of the car. On the other hand, if you go too slow, you may never reach your destination.
Adam is like a GPS for your car that helps you find the optimal gas pedal setting (learning rate).
It does this by continuously monitoring the car’s speed and the curvature of the road (the gradient of the loss function with respect to the model weights). It then adjusts the gas pedal (learning rate) to help you find the fastest and smoothest path to your destination.
Here are a few heuristics or use cases for selecting Adam as your optimizer:
 When you want a fast and efficient optimization algorithm: Adam requires relatively little memory and computation, making it a fast and efficient choice for training deep learning models.
 When you have noisy or sparse gradients: Adam is well suited for optimizing models with noisy or sparse gradients, as it can use the additional information provided by the gradients to adapt the learning rate on the fly.
 When you want to try a “plug-and-play” optimization algorithm: Adam requires relatively little tuning. It is a good choice if you want to train your model quickly.
That said, it’s always a good idea to try a few different optimization algorithms and see which works best for your application. Adam may not always be the best choice, and it’s worth experimenting to see if you can get better results with a different optimizer.
“In my experience, Adam is much more forgiving to hyperparameters, including a bad learning rate” – Andrej Karpathy.
Important hyperparameters for Adam
You can adjust several hyperparameters in Adam to control the learning process. Here are some of the most important ones:
Learning rate
Epsilon
Batch size

Beta1 and Beta2
: These factors control the decay rate of the moving averages of the gradients and the squared gradients, respectively. Beta1 and Beta2 should be set to values between 0 and 1, with values closer to 1 resulting in a slower decay rate and a longer “memory” of the past gradients.
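To put numbers on that “memory,” here is a small sketch that computes how long a past gradient keeps influencing an exponential moving average. The values 0.9 and 0.999 are the defaults suggested in the Adam paper; the helper names `half_life` and `effective_window` are my own, and the 1 / (1 - beta) window is a common rule of thumb rather than an exact quantity.

```python
import math


def half_life(beta):
    """Steps until a past gradient's weight in the moving average halves."""
    return math.log(0.5) / math.log(beta)


def effective_window(beta):
    """Rough number of recent steps the moving average 'remembers'."""
    return 1.0 / (1.0 - beta)


for name, beta in [("Beta1", 0.9), ("Beta2", 0.999)]:
    print(name, half_life(beta), effective_window(beta))
```

With these defaults, the average of the gradients looks back over roughly the last ten steps, while the average of the squared gradients looks back over roughly the last thousand.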
When to use what
The choice of optimization algorithm will depend on your model’s specific characteristics and data. However, here are a few general guidelines that you can follow.
If you are…

Starting with deep learning and not having much experience with optimization algorithms, you can use Adam or SGD with a moderate learning rate (e.g., 0.0001 to 0.01). These algorithms are relatively easy to implement and work well in practice.

Working with a huge dataset, consider using SGD or RMSProp, as they are more efficient in terms of memory and computational requirements.

Working with a very deep neural network (e.g., with many layers), use RMSProp or Adam, as they can help prevent the gradients from getting too large or too small, which can slow down the optimization.

Working with a very noisy or unstable dataset, you should use Adam or RMSProp, as they can help smooth the optimization process and provide more stable convergence.
In general, try out a few different optimization algorithms and see which one works best for your specific problem. You can also use different learning rates and other hyperparameters to see if they impact the optimization algorithm’s performance.
Assessing the impact of training
To assess the impact of using a particular optimization algorithm on your model’s training, you can compare the model’s performance using different optimization algorithms.
Here are a few things you can try:

Train the same model using different optimization algorithms, such as Adam, SGD, and RMSProp. You can use the same hyperparameters (e.g., learning rate) for all the algorithms or adjust them to see if they impact the optimization algorithm’s performance.

Measure the model’s performance using a performance metric relevant to your problem, such as accuracy for a classification task or mean squared error for a regression task.

Compare the model’s performance using different optimization algorithms and see which performs the best. You may also want to compare the convergence speed of the optimization algorithms, i.e., how quickly they reach the optimal solution.
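As a toy version of that comparison, the sketch below runs three optimizers on the same one-dimensional loss, (w - 3)^2, with the same learning rate and reports the final loss for each. Everything here is an assumption for illustration: the update functions are simplified single-weight versions, the shared learning rate of 0.1 is arbitrary, and a real comparison would use your actual model, data, and metric.

```python
import math


def sgd_update(g, state, t, lr=0.1):
    return -lr * g


def rmsprop_update(g, state, t, lr=0.1, decay=0.9, eps=1e-8):
    state["v"] = decay * state.get("v", 0.0) + (1 - decay) * g * g
    return -lr * g / (math.sqrt(state["v"]) + eps)


def adam_update(g, state, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * g
    state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * g * g
    m_hat = state["m"] / (1 - beta1 ** t)    # bias-corrected averages
    v_hat = state["v"] / (1 - beta2 ** t)
    return -lr * m_hat / (math.sqrt(v_hat) + eps)


def final_loss(update_fn, steps=200, w=0.0, target=3.0):
    """Run one optimizer on the toy loss (w - target)^2 and return the final loss."""
    state = {}
    for t in range(1, steps + 1):
        g = 2 * (w - target)                 # gradient of the toy loss
        w += update_fn(g, state, t)
    return (w - target) ** 2


results = {
    "SGD": final_loss(sgd_update),
    "RMSProp": final_loss(rmsprop_update),
    "Adam": final_loss(adam_update),
}
for name, loss in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: final loss {loss:.2e}")
```

Don't read much into the ranking on a problem this simple. The useful part is the pattern: same starting point, same budget of steps, one metric, several optimizers.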
You should use cross-validation to get a more robust estimate of the model’s performance, especially if a single train/test split would be too noisy to trust. This involves dividing the dataset into several folds, training the model on all but one fold using each optimization algorithm, evaluating on the held-out fold, and averaging the results across folds.
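One common form of cross-validation is k-fold. Here is a minimal sketch of how the index splitting might work; the helper name `k_fold_indices` is hypothetical, and in practice you would usually shuffle the data first or use a library routine instead.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first few folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, k=5))
for train, test in folds:
    print("train:", train, "test:", test)
```

For each fold, you would train a fresh model with each optimizer on the `train` indices, evaluate on the `test` indices, and then average each optimizer's scores across folds.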
Code example
I put up a notebook on Kaggle because even though I pay for Google Colab, they won’t let me access my GPUs.
Check it out here: