I am a newbie to deep learning and after watching a video on DL, I got to thinking.

How does one choose the number of hidden layers to use for a network? Also, how do you choose the number of neurons to use for each hidden layer?

Thank you.

That's such a great question, with a horribly unsatisfying answer:

It depends.

I know, not helpful. But alas, it's true.

What I'll say is that there are a number of resources that might help you decide on a good place to start - but nothing will ever beat the king of model/layer/neuron selection: **Iteration.**

However...

...for some general rules/places to get started, check out this blog post, which has the following chart - which can be a helpful place to get the party started!

Excuse my terrible formatting:

**______** Number of Hidden layers **______**

| Num Hidden Layers | Result |
|---|---|
| none | Only capable of representing linearly separable functions or decisions. |
| 1 | Can approximate any function that contains a continuous mapping from one finite space to another. |
| 2 | Can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions and can approximate any smooth mapping to any accuracy. |
| >2 | Additional layers can learn complex representations (sort of automatic feature engineering) for later layers. |

You'll notice that `num_hidden_layers` requires you to have a good understanding of your data, which is a trend you'll notice everywhere in this field!

**_______________________________________**

The blog also references a set of "back of the napkin" rules I often see quoted when it comes to `num_neurons` in a hidden layer:

**_________** Number of Neurons **_________**

- The number of hidden neurons should be between the size of the input layer and the size of the output layer.
- The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.
- The number of hidden neurons should be less than twice the size of the input layer.

**_______________________________________**
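To make the three rules of thumb above concrete, here is a minimal sketch that just does the arithmetic. The function name and the example sizes (30 inputs, 3 outputs) are made up for illustration, and the results are only starting points to iterate from:

```python
def hidden_size_heuristics(n_inputs: int, n_outputs: int) -> dict:
    """Starting points for the size of one hidden layer, per the three rules of thumb."""
    low, high = sorted((n_inputs, n_outputs))
    return {
        # Rule 1: somewhere between the input size and the output size.
        "between_input_and_output": (low, high),
        # Rule 2: 2/3 of the input size, plus the output size.
        "two_thirds_input_plus_output": round(2 * n_inputs / 3 + n_outputs),
        # Rule 3: strictly less than twice the input size.
        "upper_bound_exclusive": 2 * n_inputs,
    }


print(hidden_size_heuristics(n_inputs=30, n_outputs=3))
# {'between_input_and_output': (3, 30), 'two_thirds_input_plus_output': 23, 'upper_bound_exclusive': 60}
```

For this example the rules agree on something in the low twenties as a reasonable first guess, which you would then adjust by experiment.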

Remember though, these are starting points - and only through careful iteration and experimentation will you find the "correct" answer!

There are also computational complexity/cost considerations that were glossed over - but models don't exist for free. So remember: the more complex a model, the more it'll likely cost to train!

I don't think there is an actual method that will definitively lay out how to build a neural network architecture, although the information shared by Chris above is a great way to get started and try some things out.

This part of Machine Learning is an evolving field and to be honest, if we did know this, we would probably be a lot closer to some form of general intelligence in machines.

The current methodology is centered around a ton of experimentation and testing - with many engineers involved in the process - as well as a ton of resources (time, compute, money).

There is even a branch of ML that uses ML itself to help with this. It is known as Neural Architecture Search (NAS) and it leverages machine learning to help search for the optimal neural network architecture for specific tasks. Instead of optimizing the weights of a Neural Network, NAS seeks to optimize for an architecture in a space of many different architectures. This process is also highly compute intensive - though it has yielded some promising results.
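As a rough illustration of "searching over architectures instead of weights", here is a toy sketch using plain random search, which is about the simplest possible search strategy (real NAS systems typically use something smarter, such as reinforcement learning, evolutionary methods, or gradient-based search). The search space, `toy_evaluate`, and all names here are made up; in practice the evaluation step would train each candidate and return a validation metric, which is where the compute cost comes from:

```python
import random

# Hypothetical search space: number of hidden layers and neurons per layer.
SEARCH_SPACE = {
    "num_hidden_layers": [1, 2, 3, 4],
    "num_neurons": [16, 32, 64, 128],
}


def sample_architecture(rng):
    """Pick one option per dimension of the search space."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}


def random_search(evaluate, num_trials=20, seed=0):
    """Return the best (architecture, score) seen; higher score is better."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(num_trials):
        arch = sample_architecture(rng)
        score = evaluate(arch)  # in practice: train the candidate, measure validation accuracy
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score


def toy_evaluate(arch):
    # Stand-in for training: a made-up score that peaks at 2 layers of 64 neurons.
    return -abs(arch["num_hidden_layers"] - 2) - abs(arch["num_neurons"] - 64) / 64


best_arch, best_score = random_search(toy_evaluate)
print(best_arch, best_score)
```

Note that the inner loop never touches weights directly: the thing being optimized is the architecture description itself.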

Here is an article talking a bit about this process: Neural Architecture Search

Oftentimes, the architecture part of ML is done by teams of highly experienced researchers (Google, Facebook, Uber, etc.), and the implementation of these architectures can be done by the rest of us.

This is where something like Transfer Learning could be applied. You can take a model that has already been built and performs decently, then fine-tune its weights by training some of its layers further on data specific to the task you are trying to accomplish.

Frankly, we do not know. It depends on so many factors: model architecture,

There is no magic formula, and people usually use common sense to pick the number of parameters. (For instance, what we see in ResNets: if we halve the spatial size, let's double the number of channels as well.)
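Here is a small sketch of that halve-the-size, double-the-channels pattern. The function name is made up, and the example numbers (56x56 feature maps with 64 channels over 4 stages) are chosen to resemble the stage layout of the smaller ResNets. The usual rationale is that a conv layer's cost scales roughly with height x width x input channels x output channels, so halving both spatial dimensions while doubling both channel counts keeps the per-layer compute roughly constant:

```python
def resnet_style_stages(input_size, base_channels, num_stages):
    """List (spatial_size, channels) per stage: halve the size, double the channels."""
    stages = []
    size, channels = input_size, base_channels
    for _ in range(num_stages):
        stages.append((size, channels))
        size //= 2      # spatial resolution is halved between stages
        channels *= 2   # channel count is doubled to compensate
    return stages


print(resnet_style_stages(input_size=56, base_channels=64, num_stages=4))
# [(56, 64), (28, 128), (14, 256), (7, 512)]
```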

Recently, the EfficientNet paper showed how to scale a model's depth and width together with the input image size. So some research is going on to answer this question in a grounded fashion.
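For a feel of what that scaling looks like, here is a minimal sketch of the compound scaling idea: a single coefficient `phi` scales depth, width, and resolution together as `alpha**phi`, `beta**phi`, and `gamma**phi`. The constants below are the base values I recall the paper reporting for its baseline network (chosen so that `alpha * beta**2 * gamma**2` is roughly 2); treat them and the function name as assumptions and check the paper before relying on them:

```python
# Base coefficients as reported for the EfficientNet baseline (assumption: quoted from memory).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution


def compound_scale(phi):
    """Multipliers for depth, width, and input resolution at scaling coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


for phi in range(4):
    print(phi, {k: round(v, 2) for k, v in compound_scale(phi).items()})
```

The point is less the exact constants and more that all three dimensions grow together, rather than making the network only deeper or only wider.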

When designing a custom model, a trial-and-error approach can work:

- Start with a model with a large number of channels and ensure it can overfit the training set.
- Halve the number of channels and check whether it can still overfit.
- Repeat until the model struggles to train. This gives you a lower bound on the model's capacity.
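The steps above can be sketched as a simple loop. `can_overfit` is a hypothetical callback you would write yourself: it should build a model with the given number of channels, train it, and return `True` if the training loss gets close to zero. The toy stand-in in the usage example simply pretends that anything with at least 24 channels can overfit:

```python
def find_capacity_lower_bound(can_overfit, start_channels, min_channels=1):
    """Halve the channel count until the model can no longer overfit the training set.

    Returns the smallest channel count tried that still overfit, or None if even
    the starting model could not overfit.
    """
    channels = start_channels
    last_ok = None
    while channels >= min_channels and can_overfit(channels):
        last_ok = channels
        channels //= 2  # try a model with half the channels next
    return last_ok


# Toy stand-in for real training runs (assumption, for illustration only).
print(find_capacity_lower_bound(lambda c: c >= 24, start_channels=256))
# 32
```

Each call to `can_overfit` is a full training run, so in practice you would only do a handful of halvings.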

Oh yeah, this is a rabbit hole with more questions than answers.

From a computer vision perspective, you want to think less in terms of the number of neurons, because there are so many more of them. Instead, we think at a slightly higher level (convolutional layers, kernel sizes, and tensor reduction).

E.g. you start off with a 512x512 RGB image - that's a tensor of dimensions 512x512x3. You want to process that until you get a fully connected layer which is your embedding - i.e. it encodes the important information in the image. On top of this we add a multi-layer perceptron to learn how to use the embedding to do things like classification.

So how do I reshape it to that "1xlots" fully connected layer? In a traditional CNN you use multiple methods (pooling, convolving, etc.).

So let's think in **Convolutional Layers** (this operation is highly parallelizable, which is why we can actually model it under the hood with neurons and weights). The neuron count is a big number to think about, so it's easier for me to decide whether I want 2 conv layers or 3 conv layers in my CNN (or lots more). This carries similar considerations to selecting the number of hidden layers in an MLP - just at a much higher scale.

We work backwards in a sense - the kernel sizes we choose control the output tensor size of each conv layer, so we define that size somewhat indirectly. This is essentially similar to selecting the number of neurons in each layer but, again, at a much higher scale.
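To see how kernel size (plus stride and padding) indirectly sets the tensor sizes, here is a small sketch using the standard output-size formula for a convolution without dilation, `floor((size + 2*padding - kernel) / stride) + 1`; pooling layers follow the same formula. The layer stack and the 32-channel count are made up for illustration:

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output height/width of a conv or pooling layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical stack: 3x3 conv, 2x2 max-pool (stride 2), 3x3 conv, 2x2 max-pool (stride 2).
layers = [(3, 1), (2, 2), (3, 1), (2, 2)]  # (kernel, stride) pairs

size = 512  # the 512x512 input image
sizes = []
for kernel, stride in layers:
    size = conv_output_size(size, kernel, stride)
    sizes.append(size)

print(sizes)
# [510, 255, 253, 126]

channels = 32  # assumed number of output channels in the last conv layer
flattened_length = size * size * channels
print(flattened_length)
# 508032
```

That final number is the length of the flattened vector the fully connected part would receive, which is why the choice of kernels, strides, and pooling matters so much more than counting individual neurons.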

The trial-and-error approach above by @EKhvedchenya makes sense and is pretty common to be honest - it'll give you an intuition around a few things like:

- capacity of a network (how much information it can actually discern clearly)
- the computational restrictions of a network (how much GPU you need)
- additional side-effects (vanishing gradients and the use cases for tools such as auxiliary losses etc. - there's a lot here that has plenty of material online to read)

This is a hot topic that many of my students ask me about. I always show them the same tutorial, and it helps them grasp the basics of neural networks. Check it out: