What Andrej Karpathy Taught Me About Training Neural Networks

In 2018, Andrej Karpathy wrote a Twitter thread outlining the most common mistakes people make when training neural networks.

A short while after that initial thread, Karpathy wrote a blog post called “A Recipe for Training Neural Networks.” He builds this recipe on two key principles: neural networks are leaky abstractions, and neural networks fail silently.

Training neural networks is difficult because there are many ways to misconfigure them, and it can be hard to tell when that has happened. A misconfigured network will often still train; it just won’t work as well as it should, and that kind of failure is easy to miss.

How do we avoid these silent failures?

Karpathy advises against a “fast and furious” approach to training neural networks because it invites exactly these silent failures. Instead, he suggests being “thorough, defensive, and paranoid” and using visualizations to surface potential issues, because patience and attention to detail are the most important qualities for success in deep learning.

Karpathy presents a six-part recipe for training your neural networks:

  1. Become one with the data
  2. Set up an end-to-end training/evaluation + get dumb baselines
  3. Overfit
  4. Regularize
  5. Tune
  6. Squeeze out the juice

In this post, I’ve summarized the spirit of each recipe step and, where I can, added some practical advice for executing each step.

Become one with the data

The first step in Karpathy’s recipe for training a neural network: don’t start by training a neural network.

Instead, inspect your data.

Thoroughly.

This means looking for patterns, checking for data imbalances or biases, and paying attention to how you classify the data. You can also use simple code to sort and visualize the data and look for outliers, which often reveal problems with the data or preprocessing. This step is crucial because the neural network is essentially a condensed version of your dataset, and if it’s giving you predictions that don’t align with what you see in the data, something needs to be fixed.

Karpathy advises spending as much time as possible (hours, even) going through thousands of examples and understanding their distribution to ensure the best results.

Here are some ideas for becoming one with the data:

  • Visualizing your data: You can use tools such as Matplotlib or OpenCV to display the images and get a sense of what they look like. You can also create a montage of images to visualize many examples (a short code sketch after this list shows a simple montage and class-balance check). Another option is using a tool like Kangas to explore your data.

  • Analyzing image properties: You can compute the mean, standard deviation, and range of pixel values in the images to understand their distribution. You can also compute image sizes and aspect ratios to see if there are any variations in these properties.

  • Examining the labels: Check for errors or inconsistencies in the labels and imbalances in the distribution of classes.

  • Analyzing image metadata: If available, you can examine metadata such as EXIF data to get a sense of the conditions under which the images were captured.

  • Preprocessing the data: Consider whether you need to resize, crop, or apply other transformations to the images to make them more suitable for your model.

  • Analyzing class balance: If you have imbalanced classes (i.e., one class is significantly larger than another), this can impact the model’s ability to learn from the data.
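
To make a couple of these ideas concrete, here is a minimal sketch (my own, not from Karpathy’s post) that plots a montage of random images and counts examples per class. It assumes an ImageFolder-style dataset at a hypothetical path:

import random
from collections import Counter

import matplotlib.pyplot as plt
from torchvision import datasets, transforms

# Hypothetical path; point this at your own image folder (one subfolder per class)
dataset = datasets.ImageFolder("data/train", transform=transforms.ToTensor())

# Montage of 16 random examples
fig, axes = plt.subplots(4, 4, figsize=(8, 8))
for ax in axes.flat:
    img, label = dataset[random.randrange(len(dataset))]
    ax.imshow(img.permute(1, 2, 0))              # CHW -> HWC for matplotlib
    ax.set_title(dataset.classes[label], fontsize=8)
    ax.axis("off")
plt.tight_layout()
plt.show()

# Class balance: count examples per class and look for imbalance
counts = Counter(label for _, label in dataset.samples)
for idx, n in sorted(counts.items()):
    print(f"{dataset.classes[idx]}: {n}")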

Set up an end-to-end training/evaluation + get dumb baselines

Before you start using a complicated machine learning model, ensure that you have a solid foundation.

You do this by developing a system to train and evaluate simpler models. This helps ensure everything works correctly and allows you to try different things and see what works best for your data. Keep track of the model’s loss, accuracy, and other relevant metrics, and try out different model versions to see what happens.

This helps you understand your data better and ensure you are on the right track before using a more advanced model.

Here’s how you can set up a system to train and evaluate simpler models:

  1. Split your data into training and testing sets. It’s important to evaluate your model on data it hasn’t seen before, so it’s a good idea to set aside a portion of your data for testing.
  2. Choose a simple model, such as a linear classifier or a small convolutional neural network.
  3. Train the model on your training data and track the training loss and other relevant metrics, such as accuracy.
  4. Evaluate the model on your testing data and track the testing loss and other relevant metrics.
  5. Visualize the results of the training and testing to get a better understanding of how well the model is performing.
  6. Consider trying out different model versions, such as changing the network size or using different hyperparameters and comparing the results. This is known as ablation testing and can help you understand which factors are most important for model performance.

By setting up a system to train and evaluate simpler models, you can better understand how well your model performs and identify areas for improvement before moving on to more advanced models.
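
Here is a rough sketch of what such a pipeline might look like in PyTorch, with placeholder data standing in for your own dataset and a single linear layer as the “dumb” baseline:

import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader, random_split

X = torch.rand(1000, 3, 32, 32)          # placeholder data; substitute your own
y = torch.randint(0, 10, (1000,))

train_set, test_set = random_split(TensorDataset(X, y), [800, 200])
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = DataLoader(test_set, batch_size=64)

# A "dumb" baseline: a single linear layer on flattened pixels
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()

    model.eval()
    correct = 0
    with torch.no_grad():
        for xb, yb in test_loader:
            correct += (model(xb).argmax(dim=1) == yb).sum().item()
    print(f"epoch {epoch}: test accuracy {correct / len(test_set):.3f}")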

Karpathy lays out several tips and tricks for this stage.

These are my favourite ones:

  1. When you first train your model, it’s important to check that the loss value starts at the correct number.

This is called “verifying the loss at initialization.” You can do this by looking at the loss value and comparing it to a known good value. For example, if you are using a softmax function on your final layer and have set it up correctly, the loss value should be -log(1/n_classes) at the start of training. This same method can be used to check the loss value for other losses like L2 regression and Huber losses.

These losses also have default values you can use to check your initialization.
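
As a quick illustration (my own sketch, using a toy 10-class classifier), the cross-entropy loss of a freshly initialized model should land near -log(1/10) ≈ 2.303:

import math
import torch
from torch import nn

n_classes = 10
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, n_classes))  # toy model

x = torch.rand(64, 3, 32, 32)                     # random inputs, random labels
y = torch.randint(0, n_classes, (64,))
loss = nn.CrossEntropyLoss()(model(x), y)

expected = -math.log(1 / n_classes)               # ≈ 2.303 for 10 classes
print(f"initial loss {loss.item():.3f}, expected ≈ {expected:.3f}")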

  2. It’s important to initialize the weights of your final layer correctly.

This can help your model learn faster and avoid a “hockey stick” loss curve, where the loss value changes significantly in the first few iterations of training. To initialize the final layer weights correctly, you can consider the characteristics of your data. For example, if the values you are trying to predict have a mean of 50, you can initialize the final layer bias to 50. Suppose you have an imbalanced dataset with more negative than positive examples. In that case, you can set the bias on your logits so that the network initially predicts a probability of 0.1 (assuming a ratio of 1:10).

These types of careful initialization can help your model converge more quickly and avoid having to spend time learning the bias in the early stages of training.
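
Here’s a minimal sketch of that bias trick for a binary classifier with a 1:10 positive:negative ratio; the feature size is a hypothetical placeholder:

import math
import torch
from torch import nn

p = 0.1                                  # desired initial probability for the positive class
final_layer = nn.Linear(128, 1)          # hypothetical final layer (128 input features)

with torch.no_grad():
    final_layer.bias.fill_(math.log(p / (1 - p)))   # sigmoid(bias) == p

# With zero inputs, the output equals the bias, so the network predicts ~0.1 out of the box
print(torch.sigmoid(final_layer(torch.zeros(1, 128))))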

  3. Try to overfit a single batch of just a few examples to check your model’s capacity.

To overfit the batch, you can increase the complexity of your model by adding layers or filters. It is best to verify that you can reach the lowest possible loss value, typically zero. It’s also a good idea to visualize the labels and predictions in the same plot and ensure they are perfectly aligned when the loss value is at its minimum.

If they are not aligned, your model may have a bug, and you should troubleshoot before moving on to the next stage.
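
A minimal sketch of the single-batch test, again with a toy model; the loss on this one batch should drive to (near) zero within a few hundred steps:

import torch
from torch import nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # toy classifier
xb = torch.rand(4, 3, 32, 32)            # one tiny batch of 4 examples
yb = torch.randint(0, 10, (4,))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    optimizer.step()

print(f"loss after overfitting one batch: {loss.item():.4f}")
print("predictions:", model(xb).argmax(dim=1).tolist(), "labels:", yb.tolist())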

  4. Use backpropagation for debugging.

You do this by setting a simple loss function, like the sum of all outputs of example i. Then, you can run the backward pass to the input and ensure that you get a non-zero gradient only on the i-th input.

This can help you identify problems in your code, like if you’ve used the wrong function (e.g. view instead of transpose/permute) and accidentally mixed information across the batch dimension.

Here’s how you can do that in code:


import torch
from torch import nn

# A toy model for illustration; substitute your own network here
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10))

# 1. Create a multi-example input batch
x = torch.rand(4, 3, 224, 224)

# 2. Make the input differentiable
x.requires_grad = True

# 3. Run a forward pass
out = model(x)

# 4. Define the loss so that it depends on only one of the inputs (here, example 2)
loss = out[2].sum()

# 5. Run backprop
loss.backward()

# 6. Verify that only x[2] has non-zero gradients
for i in range(4):
    if i != 2:
        assert (x.grad[i] == 0.0).all()
    else:
        assert (x.grad[i] != 0.0).any()

Overfit

You should now have a good understanding of your dataset and a functional training and evaluation pipeline.

You can use this to compute a trustworthy metric for any given model. You should also have a baseline performance for an input-independent model, some dumb baselines (which you should aim to outperform), and an idea of how a human would perform on the task (which should be your goal). Now it’s time to iterate on finding a good model.

Karpathy’s approach to this involves two stages:

  1. Create a model that is large enough to overfit the training data
  2. Then, regularize it to improve performance on new data.

It’s important to do this in these two steps because if you can’t get a low error rate with any model, it may indicate problems or bugs in your implementation.

Here are two of my favourite tips from this part of the post:

  1. Choose a simple and well-established model that has already performed well on similar data.

This will help you get a baseline result and ensure a solid foundation to build upon. Don’t try to be too creative or experimental in the early stages of development; that tends to lead to overcomplication and suboptimal performance. Instead, find a model that has been used successfully in a similar context and use it as a starting point. For example, if you are classifying images, consider a ResNet-50, which has been widely used and has shown good results in many image classification tasks.

You can always try something more custom or experimental later once you understand the data and the problem better.
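
For example, here is a short sketch of that starting point using torchvision (the weights argument assumes torchvision 0.13 or newer): load a pretrained ResNet-50 and swap the final layer for your own number of classes.

import torch
from torch import nn
from torchvision import models

n_classes = 10                                    # hypothetical; set to your own task
model = models.resnet50(weights="IMAGENET1K_V2")  # pretrained ImageNet weights
model.fc = nn.Linear(model.fc.in_features, n_classes)

# Sanity-check the output shape on a dummy batch
print(model(torch.rand(2, 3, 224, 224)).shape)    # torch.Size([2, 10])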

Before going on to my second favourite tip, I want to explain a term: learning rate decay.

This is a technique that gradually reduces the learning rate over the course of training. It’s useful because it can help your model converge on a solution quickly and efficiently. Let me explain how it works using an analogy.

Imagine that you are trying to find your way through a maze.

The learning rate is like the size of your steps. A high learning rate would be like taking large steps, while a low learning rate would be like taking small steps. Taking larger steps might get you to the end of the maze faster, but you are more likely to make mistakes and get lost. On the other hand, taking smaller steps might take longer, but you are less likely to make mistakes and are more likely to find the correct path through the maze.

Learning rate decay is like gradually decreasing the size of your steps as you get closer to the end of the maze.

This helps ensure you don’t overshoot the exit and end up lost while still making progress toward the end of the maze.
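
In PyTorch, a simple decay schedule might look like the sketch below. StepLR with a drop every 30 epochs is just the classic ImageNet-style default, not a recommendation, which is exactly what the next tip warns about:

import torch
from torch import nn

model = nn.Linear(10, 2)                                   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... train for one epoch here ...
    optimizer.step()      # stand-in for the real optimization step
    scheduler.step()      # shrink the learning rate on schedule
    if epoch % 30 == 0:
        print(epoch, scheduler.get_last_lr())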

Now, on to my second favourite tip from this section…

  2. Be cautious when using learning rate decay because different decay schedules may be necessary for different problems.

A decay schedule copied from another problem may not be appropriate for your dataset. For example, if you are not training on ImageNet, a schedule that drops the learning rate at epoch 30 (as in typical ImageNet training) may kick in at the wrong time for your data. An inappropriate schedule can drive the learning rate to zero too early and prevent the model from converging.

A good practice is to disable learning rate decay entirely (train with a constant learning rate) and only tune the decay schedule at the very end of the process.

Regularize

The tips Karpathy shares here are straightforward.

• Get more data
• Use augmentations (and get creative if you have to)
• Use a smaller batch size
• etc

I recommend checking out the full blog post for more details here.
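
As one concrete example of the augmentation tip, here is a common torchvision starting point; the specific transforms and the dataset path are assumptions you should adapt to your own data:

from torchvision import datasets, transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Hypothetical path; apply the augmentations only to the training split
train_set = datasets.ImageFolder("data/train", transform=train_transform)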

Tune

Karpathy shares two tips here.

  1. Prefer random search over grid search

Why random search over grid search?

Random search is often preferred over grid search because neural networks tend to be sensitive to some hyperparameters but not others.

For the insensitive hyperparameters, a grid wastes most of its budget re-testing values that barely change the outcome. It is more important to explore many distinct values of the sensitive hyperparameters than to try every combination of all of them.

That is exactly what random search does: it samples a fresh value for every hyperparameter on each trial, which makes it a more effective way to find good hyperparameter values for a neural network model.
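
Here is a minimal random search sketch; train_and_evaluate is a hypothetical function standing in for your own training pipeline, and the search ranges are placeholders:

import random

def sample_config():
    # Draw the learning rate log-uniformly and weight decay uniformly
    return {
        "lr": 10 ** random.uniform(-5, -2),
        "weight_decay": random.uniform(0.0, 1e-3),
    }

results = []
for trial in range(20):
    config = sample_config()
    # accuracy = train_and_evaluate(**config)     # plug in your own pipeline here
    accuracy = random.random()                    # stand-in so the sketch runs
    results.append((accuracy, config))

best_accuracy, best_config = max(results, key=lambda r: r[0])
print(best_accuracy, best_config)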

  2. Hyperparameter optimization: Karpathy suggests exploring a wide space of values for your hyperparameters (or using an intern :laughing:)

Squeeze out the juice

By this point, you will likely have found the best combination of architecture and hyperparameters.

You’re ready to squeeze the last bit of performance from your model. To get there, Karpathy suggests:

  1. Using ensembles
  2. Training for many epochs

Ensemble models combine the predictions of multiple models.

This can often lead to a small improvement in accuracy; Karpathy calls it a pretty much guaranteed way to gain about 2%. If you don’t have the resources to run an ensemble of models at test time, you can use a technique called “knowledge distillation.”
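
As a sketch of the ensembling idea, you can average the softmax outputs of several independently trained models (placeholder models here) and take the argmax of the averaged probabilities:

import torch
from torch import nn

# Three placeholder "trained" models standing in for independently trained networks
models = [nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)) for _ in range(3)]

x = torch.rand(8, 3, 32, 32)
with torch.no_grad():
    probs = torch.stack([m(x).softmax(dim=1) for m in models]).mean(dim=0)

predictions = probs.argmax(dim=1)
print(predictions)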

Knowledge distillation in deep learning is a technique for transferring knowledge from a complex model, called the “teacher,” to a simpler model, called the “student.”

It works by having the teacher model make predictions on a dataset and then training the student model to match those predictions. This is like how a teacher might train a student by providing them with the answers to a set of problems and then having the student try to solve similar problems independently. The student model can learn from the teacher’s “knowledge” and achieve similar performance with a simpler architecture, which can be more efficient.
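
Here is a minimal sketch of a standard distillation loss (this is not SuperGradients’ API; the temperature and weighting are hypothetical defaults to tune): the student is pushed to match the teacher’s softened predictions while also fitting the true labels.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-softened distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: the usual cross-entropy against the true labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard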

You can use a tool like SuperGradients to perform knowledge distillation easily.

Leaving a model training for longer than seems necessary is also a good trick to try.

Sometimes a model’s performance on a validation dataset will start to level off, but if you let it keep training, it can still improve further. This is especially true if you have a lot of data and a complex model. Karpathy mentions once leaving a model training over a winter break and, when he checked on it again, finding that it had surpassed the state-of-the-art performance on its task.

Conclusion

I admit.

I haven’t added anything new here. This is just me chewing on the wisdom from Andrej’s blog, digesting it, putting it into my own words to help me understand it, and adding in some details via research where I needed it.

I learn something best when I can digest it, distill it, and express it. If you’re the same, hit me up and consider contributing some written content to our Deep Learning Daily community. This process has helped me better understand what he’s communicating in the blog and helped deepen my understanding of how to train a neural network properly.

I hope it’s helped you in some way.

4 Likes

It’s a treat to read and see all of this knowledge written up into this long, rich, and very informative blog post from a 250-280 character tweet. Nice digestion of a healthy, light-looking recipe!
Now that I’m more enlightened, I will definitely take these steps into consideration in my learning/relearning journey.
Thank you for sharing this, as usual you are very inspiring Harpreet!

2 Likes

Thanks @Raouf! Glad you enjoyed it. Feel free to share your notes on topics with the community if you’d like!

2 Likes

I am printing this article and will keep it on my desk. Such a detailed post covering every aspect of training a NN. This would be my go-to reference sheet!

:raised_hands: @harpreet.sahota

2 Likes

Karpathy is a legendary teacher and explainer.

I know you’re working on a deep learning from scratch series. You might find this super informative and beneficial, also from AK:

1 Like

Amazing! This was a worthwhile read, for sure!

2 Likes