Estimating the number of neurons and the number of layers of an artificial neural network

I am looking for a way to calculate the number of layers and the number of neurons per layer. As input, I only have the size of the input vector, the size of the output vector, and the size of the training set.

Usually the best network is determined by trying different network topologies and choosing the one with the least error. Unfortunately, I cannot do that.

artificial-intelligence deep-learning machine-learning neural-network
3 answers

This is a very difficult problem.

The more internal structure a network has, the better it can represent complex solutions. On the other hand, too much internal structure is slower to train, may cause training to diverge, or may lead to overfitting - which would prevent your network from generalizing well to new data.

People have traditionally approached this problem in several ways:

  • Try different configurations and see what works best. You can split your training set into two parts - one for training, one for evaluation - and then train and evaluate different approaches. Unfortunately, it sounds like this experimental approach is not available in your case.

  • Use rules of thumb. Many people have come up with guesses about what works best. For the number of neurons in the hidden layer, people have suggested that (for example) it should (a) be somewhere between the size of the input and output layers, (b) be set to something near (inputs + outputs) * 2/3, or (c) never be larger than twice the size of the input layer.

    The problem with rules of thumb is that they do not always take into account vital pieces of information, such as how “difficult” the problem is, the size of the training and testing sets, etc. Consequently, these rules are often used as rough starting points for the “let's try a bunch of things and see what works best” approach (a sketch of that workflow appears after this list).

  • Use an algorithm that dynamically adapts the network configuration. Algorithms like Cascade Correlation start with a minimal network and then add hidden nodes during training. This can make your experimental setup a bit easier and (in theory) can result in better performance (because you won't accidentally use an inappropriate number of hidden nodes).
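A minimal sketch of the “try a few settings on a held-out split” workflow (my own illustration, not part of the original answer), assuming scikit-learn and a synthetic stand-in dataset; the candidate hidden-layer sizes are taken from the rules of thumb above:

    # Sketch only: compare rule-of-thumb hidden-layer sizes on a validation split.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                               n_classes=3, random_state=0)
    n_inputs, n_outputs = X.shape[1], len(set(y))

    # Candidates: (a) midway between input and output sizes,
    # (b) roughly (inputs + outputs) * 2/3, (c) twice the input size as an upper bound.
    candidates = sorted({(n_inputs + n_outputs) // 2,
                         2 * (n_inputs + n_outputs) // 3,
                         2 * n_inputs})

    # Hold out part of the training set purely for evaluation.
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                      random_state=0)

    best_size, best_score = None, -1.0
    for size in candidates:
        net = MLPClassifier(hidden_layer_sizes=(size,), max_iter=2000,
                            random_state=0).fit(X_train, y_train)
        score = net.score(X_val, y_val)
        if score > best_score:
            best_size, best_score = size, score
    print(f"best hidden-layer size: {best_size} (validation accuracy {best_score:.3f})")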

There is a lot of research on this topic, so if you are really interested, there is plenty to read. Check out the citations in this summary.


In practice, this is not difficult (based on having coded and trained dozens of MLPs).

In the strict, textbook sense of getting the architecture “right” - that is, configuring your network architecture so that performance (resolution) cannot be improved by further optimizing the architecture - I agree that this is hard. But that degree of optimization is required only in rare cases.

In practice, to meet or exceed the prediction accuracy required of your neural network by your specification, you almost never need to spend much time on the network architecture - there are three reasons why this is so:

  • most of the parameters required to specify the network architecture are fixed once you have decided on your data model (the number of features in the input vector, whether the desired response variable is numeric or categorical, and, if the latter, how many unique class labels you have);

  • the few remaining architecture parameters that are actually configurable are almost always (100% of the time in my experience) tightly constrained by those fixed parameters - i.e., their values are tightly bounded between a max and a min; and

  • the optimal architecture does not have to be determined before training begins; indeed, it is very common for neural network code to include a small module that programmatically adjusts the network architecture during training (by removing nodes whose weight values approach zero - this is usually called “pruning”).

[Image: a table with the six parameters that define the network architecture - for each of the input, hidden, and output layer types, the number of layers and the number of nodes per layer]

According to the table above, a neural network's architecture is completely specified by six parameters (the six cells in the interior grid). Two of these (the number of layers for the input and output layer types) are always one and one - neural networks have a single input layer and a single output layer. Your NN must have at least one input layer and one output layer - no more, no fewer. Second, the number of nodes comprising each of these two layers is fixed - the input layer by the size of the input vector, i.e., the number of nodes in the input layer equals the length of the input vector (actually, one more neuron is nearly always added to the input layer as a bias node).

Similarly, the size of the output layer is fixed by the response variable: a single node for a numeric response variable and, assuming softmax is used when the response variable is a class label, a number of nodes in the output layer simply equal to the number of unique class labels.

This leaves just two parameters for which there is any discretion at all - the number of hidden layers and the number of nodes comprising each of those layers.
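To make that point concrete, here is a hypothetical helper (my own illustration, not code from this answer) that derives the fixed layer sizes from the data model and leaves only the hidden-layer settings as free choices:

    # Sketch only: given the data model, the input and output sizes are fixed;
    # only the hidden-layer configuration is left to choose.
    def architecture(n_features, response, n_classes=None,
                     n_hidden_layers=1, n_hidden_nodes=None):
        """Return the layer sizes of a simple feed-forward network."""
        input_size = n_features + 1          # +1 for the bias node
        if response == "numeric":
            output_size = 1                  # single output node
        else:                                # categorical response, softmax output
            output_size = n_classes          # one node per unique class label
        if n_hidden_nodes is None:
            n_hidden_nodes = n_features      # discretionary; discussed below
        return [input_size] + [n_hidden_nodes] * n_hidden_layers + [output_size]

    print(architecture(n_features=10, response="categorical", n_classes=4))
    # -> [11, 10, 4]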

The number of hidden layers

If your data is linearly separable (which you often know by the time you begin writing the NN code), then you don't need any hidden layers at all. (If that is the case, I would not use a NN for this problem - choose a simpler linear classifier instead.) As for the first parameter - the number of hidden layers - the answer is almost always one. There is a lot of empirical weight behind this presumption - in practice, very few problems that cannot be solved with a single hidden layer become solvable by adding another one. Likewise, there is a consensus on the performance difference from adding additional hidden layers: the situations in which performance improves with a second (or third, etc.) hidden layer are very rare. One hidden layer is sufficient for the large majority of problems.

In your question, you mentioned that for whatever reason you cannot find the optimal network architecture by trial and error. Another way to tune your NN configuration (without using trial and error) is pruning. The gist of this technique is removing nodes from the network during training by identifying those nodes which, if removed, would not noticeably affect network performance (i.e., resolution of the data). (Even without using a formal pruning technique, you can get a rough idea of which nodes are not important by looking at your weight matrix after training: look for weights very close to zero - it is the nodes at either end of those weights that are often removed during pruning.) Obviously, if you use a pruning algorithm during training, then begin with a network configuration that is more likely to have excess (i.e., “prunable”) nodes - in other words, when deciding on a network architecture, err on the side of more neurons if you add a pruning step.

Put another way, by applying a pruning algorithm to your network during training, you can get much closer to an optimized network configuration than any a priori theory will ever give you.
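Here is a rough sketch (my own toy example, not code from this answer) of the informal “inspect the weight matrix” check described above; the tolerance and the matrices are made up, and a hidden node whose incoming and outgoing weights are all near zero is flagged as a pruning candidate:

    # Sketch only: flag hidden nodes whose weights are all close to zero.
    import numpy as np

    def prunable_hidden_nodes(w_in, w_out, tol=1e-2):
        """w_in: (n_inputs, n_hidden) weights into the hidden layer;
        w_out: (n_hidden, n_outputs) weights out of it.
        Returns indices of hidden nodes whose weights are all below tol."""
        small_in = np.all(np.abs(w_in) < tol, axis=0)
        small_out = np.all(np.abs(w_out) < tol, axis=1)
        return np.flatnonzero(small_in & small_out)

    # Toy weight matrices: hidden node 2 has only near-zero weights.
    w_in = np.array([[0.8, -0.5, 0.001],
                     [0.3,  0.9, -0.002]])
    w_out = np.array([[1.1], [-0.7], [0.003]])
    print(prunable_hidden_nodes(w_in, w_out))   # -> [2]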

The number of nodes comprising the hidden layer

But what about the number of nodes comprising the hidden layer? This value is more or less unconstrained - that is, it can be smaller or larger than the size of the input layer. Beyond that, as you probably know, there is a mountain of commentary on the question of hidden layer configuration in NNs (see the famous NN FAQ for an excellent summary of that commentary). There are many empirically derived rules of thumb; of these, the most commonly relied upon is that the size of the hidden layer is between the input and output layers. Jeff Heaton, author of Introduction to Neural Networks in Java, offers a few more, which are listed on the page I just linked to. Likewise, a scan of the application-oriented neural network literature will almost certainly reveal that the hidden layer size is usually between the input and output layer sizes. But between does not mean in the middle; in fact, it is usually better to set the hidden layer size closer to the size of the input vector. The reason is that if the hidden layer is too small, the network may have difficulty converging. For the initial configuration, err on the larger side - a larger hidden layer gives the network more capacity, which helps it converge, compared with a smaller hidden layer. Indeed, this justification is often used to recommend a hidden layer size larger than (i.e., with more nodes than) the input layer: start with an initial architecture that encourages quick convergence, after which you can prune the “excess” nodes (identify the nodes in the hidden layer with very low weight values and eliminate them from your refactored network).
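A short sketch of that “err on the larger side, then prune” advice (my own illustration, assuming scikit-learn; the 1.25 factor and the 0.05 threshold are arbitrary choices, not values from this answer):

    # Sketch only: start with a hidden layer a bit larger than the input layer,
    # train, then count hidden nodes whose outgoing weights stayed near zero.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=500, n_features=15, random_state=0)
    start_size = int(1.25 * X.shape[1])      # a bit larger than the input layer
    net = MLPClassifier(hidden_layer_sizes=(start_size,), alpha=1e-2,
                        max_iter=2000, random_state=0).fit(X, y)

    w_out = net.coefs_[1]                    # (n_hidden, n_outputs) weight matrix
    weak = np.all(np.abs(w_out) < 0.05, axis=1)
    print(f"started with {start_size} hidden nodes; "
          f"{weak.sum()} look prunable (near-zero outgoing weights)")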


I have used an MLP for commercial software, and it has only one hidden layer with only one node. Since the input and output nodes are fixed, I only ever got to change the number of hidden layers and play with the generalization achieved. I never really saw much difference, compared with just one hidden layer and one node, from changing the number of hidden layers, so I simply used one hidden layer with one node. It worked quite well, and the reduced computation was very tempting for my software.
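For illustration (assuming scikit-learn; this is not the answerer's commercial code), a network in that configuration - one hidden layer with a single node - can be set up and evaluated like this:

    # Sketch only: the smallest possible MLP, one hidden layer with one hidden node.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    tiny_net = MLPClassifier(hidden_layer_sizes=(1,), max_iter=2000,
                             random_state=0).fit(X_train, y_train)
    print(f"test accuracy with one hidden node: {tiny_net.score(X_test, y_test):.3f}")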



