Do you have any heuristics for transfer learning?

Here are some heuristics for transfer learning that I’ve been adhering to. What are your thoughts? Would you add anything else?

1) Your dataset is small and similar to the dataset the model was pre-trained on.

When your images are similar, it’s likely that both low-level features (like edges) and high-level features (like shapes) will be similar.

What to do: Freeze the weights up to the last layer, replace the fully connected layer, and retrain.

Why? Less data means you can overfit if you train the entire network.

2) Your dataset is large and similar to the dataset the model was pre-trained on.

What to do: Freeze the earlier layer weights. Then retrain the later layers together with a new fully connected layer.

3) Your dataset is small and different from the dataset the model was pre-trained on.

This is the most difficult situation to deal with. The pre-trained network is already finely-tuned at each layer. You don’t want any of the high-level features and you can’t afford to retrain because you run the risk of overfitting.

What to do: Remove the fully connected layers and convolutional layers closer to the output. Retrain the convolutional layers closer to the input.

4) Your dataset is large and different from the dataset the model was pre-trained on.

You could try initializing from the pre-trained model weights to speed up training (lots of the low-level convolutions will have similar weights), or keep just the architecture and use a decent initialization.

What to do: Retrain the entire network from scratch, making sure to replace the fully connected output layer.
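To make the "freeze and replace the fully connected layer" recipe from case 1 concrete, here is a minimal PyTorch sketch. `TinyCNN` is a hypothetical stand-in for a pre-trained network (it loads no real checkpoint), and the attribute name `fc`, the class counts, and the learning rate are assumptions for illustration only.

```python
import torch
from torch import nn


class TinyCNN(nn.Module):
    """Hypothetical stand-in for a pre-trained network (no real checkpoint)."""

    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(16, num_classes)

    def forward(self, x):
        return self.fc(self.features(x))


def freeze_and_replace_head(model: TinyCNN, num_classes: int) -> TinyCNN:
    # Freeze every existing weight so the "pre-trained" features stay fixed.
    for param in model.parameters():
        param.requires_grad = False
    # Swap in a new head; a fresh nn.Linear is trainable by default.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model


model = freeze_and_replace_head(TinyCNN(), num_classes=5)
trainable = [name for name, p in model.named_parameters() if p.requires_grad]

# Hand only the trainable parameters to the optimizer.
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-2
)
logits = model(torch.randn(2, 3, 32, 32))
```

For case 2, the same idea applies with a smaller frozen region: leave `requires_grad` set to `True` on the later convolutional layers as well as the new head.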

Conjuring some insight/thoughts/wisdom from @EKhvedchenya @OBaratz @salmankhaliq22 @sGx_tweets @kbaheti @lu.riera @kausthubk @richmond @chris

As a rule of thumb, the less data I have - the more regularization I apply.
This mostly includes augmentations (including quite extreme ones) and self-supervised learning if the task allows.

It is very beneficial to use pretrained models for transfer learning. Regarding which layers to freeze - I usually don’t bother freezing any layers and just use the RAdam optimizer to keep the initial weights from being destroyed at the start of training.
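A rough sketch of that no-freezing setup, assuming PyTorch: every layer stays trainable and the optimizer is `torch.optim.RAdam`. The tiny model, the random batch, and the learning rate are placeholders, not the settings used in practice.

```python
import torch
from torch import nn

# Placeholder for a pre-trained model; nothing is frozen.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# RAdam over all parameters; the learning rate here is an assumed value.
optimizer = torch.optim.RAdam(model.parameters(), lr=1e-4)

# One training step on random data, just to show the loop shape.
x = torch.randn(16, 4)
y = torch.randint(0, 2, (16,))
loss = nn.functional.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```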

Maybe the only scenario where transfer learning may not be helpful is when you have a really huge domain shift (say, an ImageNet image classifier that you want to reuse for segmentation of microscopy images). However, even in this case you should check whether the pretrained weights make things better or worse.

Yes, it means that you will lose the generalization of the model, since you do not have enough data to start with. Training the whole network would only make the calculations faster, but it would overfit the model for sure.

Yes, that is what I do as well, and sometimes I train the last convolutional layer as well even if the dataset is big.

I would start by freezing the whole network and training only the fully connected layers, and then unfreeze the convolutional layers one at a time to train the high-level features.
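One way to picture that staged unfreezing is as a schedule of which blocks are trainable at each stage. The sketch below is plain Python; the block names are hypothetical, and unfreezing from the output side first is my assumption about the intended order.

```python
from typing import List


def unfreeze_schedule(conv_blocks: List[str], head: str = "fc") -> List[List[str]]:
    """Stage 0 trains only the head; each later stage also unfreezes one
    more convolutional block, starting with the one closest to the output."""
    stages = [[head]]
    for block in reversed(conv_blocks):
        stages.append(stages[-1] + [block])
    return stages


# Hypothetical block names, listed in input -> output order.
blocks = ["conv1", "conv2", "conv3"]
schedule = unfreeze_schedule(blocks)

for stage, trainable in enumerate(schedule):
    print(stage, trainable)
```

In a real training loop, each stage would set `requires_grad = True` only on parameters belonging to the listed blocks, then train for a few epochs before moving on.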

I would do exactly the same thing as I suggested in point number 3.