In December of 2015, a paper was published that rocked the deep learning world.
This paper is widely regarded as one of the most influential papers in modern deep learning and has been cited over 110,000 times.
The name of this paper?
Deep Residual Learning for Image Recognition (aka, the ResNet paper).
Prevailing wisdom of the time suggested adding more layers to neural networks would lead to better results.
But researchers observed that the accuracy of deep networks would increase up to a saturation point before levelling off.
On top of that, an unusual phenomenon was observed: when layers were added to an already deep network, the training error would actually increase.
Deeper networks were running into two problems:

Vanishing/exploding gradients

The degradation problem
The vanishing/exploding gradients problem is a byproduct of the chain rule.
During backpropagation, the chain rule multiplies together the local gradients of every layer between the loss and a given weight.
Multiplying lots of values that are less than one will result in smaller and smaller values.
As those error gradients travel back toward the earlier layers of a network, their values tend to zero.
This results in smaller and smaller updates to earlier layers (not much learning happening).
The inverse problem is the exploding gradient: when the multiplied values are greater than one, the error gradients grow as they accumulate and result in massive updates to the model weights in the earlier layers.
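Here's a quick way to see both effects with nothing but arithmetic. This is a toy sketch, not a real network: the helper `backprop_scale` and the per-layer factors 0.9 and 1.1 are made up for illustration, and it assumes every layer scales the gradient by the same amount.

```python
from math import prod

def backprop_scale(per_layer_factor: float, num_layers: int) -> float:
    """Multiply the same (hypothetical) per-layer gradient factor across all layers."""
    return prod([per_layer_factor] * num_layers)

# Illustrative factors only: real networks have a different factor at every layer.
vanishing = backprop_scale(0.9, 50)  # each layer shrinks the gradient slightly
exploding = backprop_scale(1.1, 50)  # each layer grows the gradient slightly

print(f"50 layers at 0.9 per layer: {vanishing:.4f}")
print(f"50 layers at 1.1 per layer: {exploding:.1f}")
```

Even with factors this close to one, 50 layers is enough to shrink the gradient to roughly half a percent of its original size, or blow it up more than a hundredfold.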
The degradation problem is unexpected because it's not caused by overfitting.
Researchers were finding that as networks got deeper, the training loss would decrease but then shoot back up as more layers were added to the networks.
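Which is counterintuitive…
Because you’d expect your training error to decrease, converge, and plateau as the number of layers in your network increases. After all, a deeper network could in principle copy a shallower one and leave its extra layers as identity mappings.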
Both of these issues threatened to halt progress on deep neural networks, until this paper came out…
The ResNet paper introduced a novel solution to these two pesky problems that plagued the architects of deep neural networks:
The Skip Connection.
Skip connections, which are housed in residual blocks, allow you to take the activation value from an earlier layer and pass it to a deeper layer in a network.
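Skip connections make it easy for deep networks to learn the identity function: if the layers inside a residual block output zero, the block simply passes its input through.
Learning the identity function allows a deeper layer to perform as well as an earlier layer, or at the very least it won’t perform any worse.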
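To make the identity idea concrete, here's a minimal sketch of a residual block in plain Python. It assumes tiny fully connected layers with no biases, batch norm, or convolutions (real ResNet blocks have those), and the helper names `plain_block` and `residual_block` are mine, not from the paper or any library. The only difference between the two blocks is the `+ x` shortcut.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]

def linear(weights, v):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]

def plain_block(x, w1, w2):
    """Two stacked layers with no shortcut: relu(W2 . relu(W1 . x))."""
    return relu(linear(w2, relu(linear(w1, x))))

def residual_block(x, w1, w2):
    """Same two layers plus a skip connection: relu(F(x) + x)."""
    fx = linear(w2, relu(linear(w1, x)))
    return relu([f + xi for f, xi in zip(fx, x)])

# Non-negative input, like the output of an earlier ReLU layer.
x = [0.5, 1.0, 2.0]
# All-zero weights stand in for layers that have learned "nothing to add".
zeros = [[0.0] * 3 for _ in range(3)]

plain_out = plain_block(x, zeros, zeros)
residual_out = residual_block(x, zeros, zeros)

print("plain block:   ", plain_out)
print("residual block:", residual_out)
```

With the weights at zero, the plain block wipes out its input, while the residual block hands it through untouched. That's why stacking extra residual blocks shouldn't make things worse: doing nothing is the easy default.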
The result is smoother gradient flow, since gradients can travel back through the shortcut as well as through the layers, which helps preserve important features during training.
The invention of the skip connection has given us the ability to build deeper and deeper networks while avoiding the problems of vanishing/exploding gradients and degradation.
Wanna see a ResNet in action? Check out this short notebook that I’ve prepared for you using the SuperGradients training library.
You’ll perform transfer learning on the MiniPlaces dataset and perform inference on unseen images👇🏽
Check out the recording of the live session here:
Follow along with the notebook here: Google Colab