In practice, this is not difficult (I say this having coded and trained dozens of MLPs).
If by the "right" architecture you mean the textbook-optimal one, that is, a network configured so that performance (resolution) cannot be improved by further tuning the architecture, then I agree that finding it is hard. But only in rare cases is that degree of optimization required.
In practice, to meet or exceed the prediction accuracy your specification requires, you almost never need to spend much time on the network architecture. There are three reasons why this is so:
most of the parameters needed to specify the network architecture are fixed once you have decided on your data model (the number of features in the input vector, whether the desired response variable is numeric or categorical, and, if the latter, how many unique class labels it has);
the few architecture parameters that actually remain configurable are almost always (100% of the time in my experience) tightly constrained by those fixed parameters, i.e., their values are boxed in by clear maximum and minimum bounds; and
the optimal architecture does not have to be settled before training begins; indeed, it is very common to bolt a small module onto the neural network code that programmatically tunes the architecture during training, by removing nodes whose weight values approach zero, a technique usually called "pruning".

As the table above indicates, the architecture of a neural network is completely specified by six parameters (the six cells in the interior of the grid). Two of these, the number of layers of the input type and of the output type, are always one and one: every neural network has exactly one input layer and one output layer, no more and no less. Next, the number of nodes in each of those two layers is also fixed. The input layer's size is set by the input vector: the number of nodes in the input layer equals the length of the input vector (in practice one more neuron is almost always added to the input layer as a bias node).
Similarly, the output layer's size is fixed by the response variable: a single node for a numeric response variable, and, assuming softmax is used when the response variable is a class label, a number of nodes equal to the number of unique class labels.
That leaves only two parameters over which you have any discretion at all: the number of hidden layers and the number of nodes in each of those layers.
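A minimal sketch of that bookkeeping in plain NumPy (the dataset here is made up purely for illustration): once the data model is decided, the only number left to choose is the hidden-layer size.

```python
import numpy as np

# Hypothetical data model: 200 samples, 10 features, categorical response with 3 classes
X = np.random.rand(200, 10)
y = np.random.randint(0, 3, size=200)

# Fixed by the data model -- not architecture decisions at all
n_input = X.shape[1]            # input layer size = length of the input vector
                                # (+1 bias node, usually added implicitly by the library)
n_output = len(np.unique(y))    # output layer size = number of unique class labels (softmax);
                                # for a numeric response this would simply be 1

# The only genuinely free choices
n_hidden_layers = 1             # one hidden layer suffices for the large majority of problems
n_hidden = None                 # the single value left to tune (discussed below)

print(n_input, n_output)        # 10 3
```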
The number of hidden layers
if your data is linearly separable (which you often know by the time you begin coding the NN), then you don't need any hidden layers at all. (If that is the case, though, I would not use an NN for the problem; choose a simpler linear classifier instead.) Assuming you do need hidden layers, the first of the two free parameters, the number of hidden layers, is nearly always one. There is a lot of empirical weight behind this presumption: in practice, very few problems that cannot be solved with a single hidden layer become solvable by adding another. Likewise, there is a consensus on the performance difference from adding further hidden layers: the situations in which performance improves with a second (or third, etc.) hidden layer are very few. One hidden layer is sufficient for the large majority of problems.
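As a concrete illustration (a sketch using scikit-learn, with synthetic stand-in data; substitute your own X and y): the linear baseline tells you whether you need an NN at all, and if you do, the whole architecture decision collapses to the single number passed to hidden_layer_sizes, because the input and output layer sizes are inferred from the data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data, just for the sketch
X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Quick linear-separability check: if a plain linear classifier already reaches
# the accuracy your spec requires, skip the NN entirely.
linear = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("linear baseline:", linear.score(X_test, y_test))

# Otherwise: one hidden layer, sized between the input and output layers
# (erring toward the larger end), is almost always enough.
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
mlp.fit(X_train, y_train)
print("one hidden layer:", mlp.score(X_test, y_test))
```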
In your question you mentioned that, for whatever reason, you cannot find the optimal network architecture by trial and error. Another way to tune your NN configuration (without trial and error) is pruning. The gist of this technique is to remove nodes from the network during training by identifying those nodes which, if removed, would not noticeably affect network performance (i.e., resolution of the data). (Even without a formal pruning technique, you can get a rough idea of which nodes are unimportant by looking at your weight matrix after training: weights very close to zero point to the nodes at either end of those weights, which are the ones pruning would typically remove.) Obviously, if you use a pruning algorithm during training, start with a network configuration that is likely to have excess (i.e., prunable) nodes; in other words, when deciding on the architecture, err on the side of more neurons if you add a pruning step.
Put another way, applying a pruning algorithm to your network during training can get you much closer to an optimized network configuration than any a priori theory ever will.
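Here is a rough sketch of the informal, after-training version of that inspection (assuming a fitted scikit-learn MLP like the one above; a formal pruning routine would also retrain after removal, which is omitted here, and the threshold is an arbitrary, data-dependent assumption):

```python
import numpy as np

def prunable_hidden_nodes(mlp, threshold=1e-2):
    """Indices of hidden nodes whose attached weights are all near zero."""
    # mlp.coefs_[0]: input -> hidden weights, shape (n_input, n_hidden)
    # mlp.coefs_[1]: hidden -> output weights, shape (n_hidden, n_output)
    W_in, W_out = mlp.coefs_[0], mlp.coefs_[1]
    incoming = np.abs(W_in).max(axis=0)    # strongest incoming weight per hidden node
    outgoing = np.abs(W_out).max(axis=1)   # strongest outgoing weight per hidden node
    return np.where((incoming < threshold) & (outgoing < threshold))[0]

# Using the mlp fitted in the earlier sketch:
print("hidden nodes that could likely be removed:", prunable_hidden_nodes(mlp))
# If this list is non-empty, retrain with a smaller hidden layer
# (or excise those rows/columns and fine-tune) and compare test accuracy.
```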
The number of nodes in the hidden layer
but what about the number of nodes in the hidden layer? This value is more or less unconstrained: it can be smaller or larger than the input layer. Beyond that, as you probably know, there is a mountain of commentary on hidden-layer configuration in NNs (see the famous NN FAQ for an excellent summary). There are many empirically derived rules of thumb; of these, the most commonly relied upon is that the hidden-layer size falls between the input and output layer sizes. Jeff Heaton, author of Introduction to Neural Networks in Java, offers a few more, which are listed on the page I just linked to. Likewise, a scan of the applied neural-network literature will almost certainly show that the hidden-layer size is usually between the input and output sizes. But "between" does not mean "in the middle"; in fact, it is usually better to set the hidden-layer size closer to the size of the input vector. The reason is that if the hidden layer is too small, the network may struggle to converge. So err toward a larger initial configuration: a larger hidden layer gives the network more capacity, which helps it converge, compared with a smaller one. Indeed, this justification is often used to recommend a hidden-layer size that exceeds (has more nodes than) the input layer; that is, start with an initial architecture that encourages quick convergence, after which you can prune away the "excess" nodes (identify hidden-layer nodes with very low weight values and eliminate them from your refactored network).
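One way to encode that rule of thumb as a starting point (purely a heuristic; the specific multipliers here are my own assumptions, not values from the literature):

```python
def initial_hidden_size(n_input, n_output, generous=True):
    """Heuristic starting size for a single hidden layer.

    'Between input and output size' is the common rule of thumb; erring toward
    the input size (or beyond it) gives the network more capacity and usually
    helps convergence. Prune back afterwards if the layer turns out oversized.
    """
    if generous:
        # start somewhat larger than the input layer, then prune
        return int(1.5 * n_input)
    # conservative: midpoint between input and output sizes
    return (n_input + n_output) // 2

print(initial_hidden_size(10, 3))          # 15
print(initial_hidden_size(10, 3, False))   # 6
```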