Learning Rate Warmups
The learning rate is undoubtedly one of the most important hyperparameters for training a deep neural network.
The learning rate controls the step size for updating a model’s weights during training. If you choose a value that is too large, your model will overshoot the optimal solution. If you choose a value that is too small, your model will converge too slowly. Finding the right learning rate is crucial for the model to converge.
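To see why the learning rate acts as a step size, consider the plain gradient descent update. The function and numbers below are only an illustration, not code from the paper or library:

```python
import numpy as np

def sgd_step(weights: np.ndarray, gradients: np.ndarray, lr: float) -> np.ndarray:
    # The learning rate scales the gradient, so it directly sets the step size.
    return weights - lr * gradients

weights = np.array([0.5, -1.2])
gradients = np.array([0.1, -0.3])

print(sgd_step(weights, gradients, lr=0.01))  # small, cautious step
print(sgd_step(weights, gradients, lr=1.0))   # same direction, 100x larger step
```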
But using a constant learning rate for the entire training process can cause your model’s error to oscillate around the optimum.
As the model approaches an optimum, it should take smaller steps, but a constant learning rate can force it to keep taking steps that are too large. Simply lowering the learning rate doesn’t completely solve the problem: the oscillations shrink but are still present. Raising the learning rate can make things worse, causing the model to jump out of a good valley with a low minimum and into a new valley with a much higher minimum. It can be difficult to find a learning rate that converges quickly but doesn’t overshoot valleys or get stuck bouncing around at the bottom.
An alternative is to change the learning rate as the training process progresses.
One such technique is the learning rate warmup, introduced in the 2017 paper “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour” by Priya Goyal et al.
Warmup uses a relatively small step size during the initial phase of training, then increases the learning rate (either constantly or gradually) to a specified rate over several epochs. This scheme helps because the model starts from randomly initialized parameters, which makes the early stages of training unstable and sensitive to large steps. How can you increase the learning rate during the warmup period?
One way is constant warmup: keep a low, constant learning rate for the first few training epochs, then switch to the target learning rate. The paper’s authors found this strategy helpful for prototyping object detection and segmentation methods that fine-tune pre-trained layers along with newly initialized ones.
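Here is a minimal sketch of what such a schedule might look like; the function name and the specific values are illustrative, not taken from the paper:

```python
def constant_warmup_lr(epoch: int, base_lr: float = 0.1,
                       warmup_lr: float = 0.01, warmup_epochs: int = 5) -> float:
    """Constant warmup: hold a low learning rate for the first few epochs,
    then switch to the target (base) learning rate."""
    if epoch < warmup_epochs:
        return warmup_lr
    return base_lr

# Epochs 0-4 train at 0.01; from epoch 5 on, the LR jumps straight to 0.1.
print([constant_warmup_lr(e) for e in range(8)])
# [0.01, 0.01, 0.01, 0.01, 0.01, 0.1, 0.1, 0.1]
```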
However, they found that a sudden increase in the learning rate can cause the training error to spike after the warm-up period.
An alternative warmup strategy is gradual warmup, which ramps the learning rate gradually from a small value up to the target value. This avoids a sudden jump in the learning rate and allows healthy convergence at the start of training. Once the warmup period ends, you can resume your original learning rate schedule.
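As a rough sketch, gradual warmup could look like the following; the linear ramp and the step decay used after warmup are illustrative assumptions, not the paper’s exact schedule:

```python
def gradual_warmup_lr(epoch: int, base_lr: float = 0.1,
                      warmup_epochs: int = 5) -> float:
    """Gradual warmup: ramp the LR linearly from a small value up to the
    target LR, then hand off to the regular schedule."""
    if epoch < warmup_epochs:
        # Linear ramp: a fraction of base_lr at epoch 0, the full base_lr at the end.
        return base_lr * (epoch + 1) / warmup_epochs
    # After warmup, fall back to whatever schedule you would normally use
    # (here, a simple step decay every 30 epochs, purely for illustration).
    return base_lr * (0.1 ** (epoch // 30))

print([round(gradual_warmup_lr(e), 3) for e in range(8)])
# [0.02, 0.04, 0.06, 0.08, 0.1, 0.1, 0.1, 0.1]
```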
We’ve integrated an LR scheduling callback as a tool in the SuperGradients training library.
In SG, this callback gradually increases the learning rate (LR) of a model during training using a method called “linear step warmup.”
During the warmup period, the LR starts at a lower value (warmup_initial_lr) and increases in even steps until it reaches the initial LR value. If warmup_initial_lr is not specified, it defaults to initial_lr/(1+warmup_epochs).
This helps the model adjust gradually to training and avoids any sudden changes in the LR, which can cause it to perform poorly.
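To make that step rule concrete, here is a small sketch of the schedule in plain Python. This is not the library’s implementation; the names initial_lr, warmup_epochs, and warmup_initial_lr simply mirror the parameters described above:

```python
from typing import Optional

def linear_step_warmup_lr(epoch: int, initial_lr: float, warmup_epochs: int,
                          warmup_initial_lr: Optional[float] = None) -> float:
    """Sketch of the warmup rule described above: start at warmup_initial_lr
    and climb in even steps until reaching initial_lr."""
    if warmup_initial_lr is None:
        # Default starting point when warmup_initial_lr is not specified.
        warmup_initial_lr = initial_lr / (1 + warmup_epochs)
    if epoch >= warmup_epochs:
        return initial_lr
    # Even steps from warmup_initial_lr (epoch 0) up to initial_lr (end of warmup).
    step = (initial_lr - warmup_initial_lr) / warmup_epochs
    return warmup_initial_lr + step * epoch

print([round(linear_step_warmup_lr(e, initial_lr=0.1, warmup_epochs=4), 4)
       for e in range(6)])
# [0.02, 0.04, 0.06, 0.08, 0.1, 0.1]
```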