Instance normalization vs batch normalization

I understand that batch normalization helps speed up training by turning the activations towards a unit Gaussian distribution, thus tackling the vanishing-gradient problem. Batch norm is applied differently during training (it uses the mean/variance of each batch) and at test time (it uses the finalized running mean/variance accumulated during the training phase).
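For concreteness, here is a toy NumPy sketch of that train/test behaviour (the function name, momentum value, and running-average update rule are illustrative assumptions, not any particular framework's exact implementation):

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training, momentum=0.1, eps=1e-5):
    """Toy batch norm over an (N, C, H, W) tensor: one mean/var per channel."""
    if training:
        mean = x.mean(axis=(0, 2, 3))            # statistics of the current batch
        var = x.var(axis=(0, 2, 3))
        # keep exponential running averages for use at test time
        running_mean[:] = (1 - momentum) * running_mean + momentum * mean
        running_var[:] = (1 - momentum) * running_var + momentum * var
    else:
        mean, var = running_mean, running_var    # fixed statistics at test time
    x_hat = (x - mean[None, :, None, None]) / np.sqrt(var[None, :, None, None] + eps)
    return gamma[None, :, None, None] * x_hat + beta[None, :, None, None]

x = np.random.rand(8, 16, 4, 4)
g, b = np.ones(16), np.zeros(16)
rm, rv = np.zeros(16), np.ones(16)
y_train = batch_norm(x, g, b, rm, rv, training=True)    # uses batch statistics
y_test = batch_norm(x, g, b, rm, rv, training=False)    # uses rm/rv instead
```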

Instance normalization, on the other hand, acts as contrast normalization, as mentioned in this paper https://arxiv.org/abs/1607.08022 . The authors note that the output stylized images should not depend on the contrast of the input content image, and hence instance normalization helps.
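To make the "contrast normalization" point concrete, here is a toy NumPy check (an illustration, not the paper's code): rescaling an image's contrast leaves its instance-normalized output almost unchanged.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize each (image, channel) over its own spatial locations only."""
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.rand(4, 3, 8, 8)           # a tiny batch of images
low_contrast = 0.1 * x                    # same content, much lower contrast
diff = np.abs(instance_norm(x) - instance_norm(low_contrast)).max()
print(diff)                               # close to 0: the output ignores contrast
```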

But then should we not also use instance normalization for image classification, where the class label should not depend on the contrast of the input image? I have not seen any paper using instance normalization in place of batch normalization for classification. What is the reason for that? Also, can and should batch and instance normalization be used together? I am eager to get both an intuitive and a theoretical understanding of when to use which normalization.

+31
machine-learning computer-vision neural-network conv-neural-network batch-normalization
4 answers

Definition

Let's start with a strict definition of both:

Batch normalization:

$$y_{tijk} = \frac{x_{tijk} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}, \quad \mu_i = \frac{1}{HWT}\sum_{t=1}^{T}\sum_{l=1}^{W}\sum_{m=1}^{H} x_{tilm}, \quad \sigma_i^2 = \frac{1}{HWT}\sum_{t=1}^{T}\sum_{l=1}^{W}\sum_{m=1}^{H} (x_{tilm} - \mu_i)^2$$

Instance normalization:

$$y_{tijk} = \frac{x_{tijk} - \mu_{ti}}{\sqrt{\sigma_{ti}^2 + \epsilon}}, \quad \mu_{ti} = \frac{1}{HW}\sum_{l=1}^{W}\sum_{m=1}^{H} x_{tilm}, \quad \sigma_{ti}^2 = \frac{1}{HW}\sum_{l=1}^{W}\sum_{m=1}^{H} (x_{tilm} - \mu_{ti})^2$$

Here $t$ indexes the image within a batch of size $T$, $i$ the feature channel, and $j,k$ (or $l,m$) the spatial location within an $H \times W$ feature map.

As you can see, they do the same thing, except for the number of input tensors that are normalized jointly. The batch version normalizes all images across the batch and the spatial locations (this is the CNN case; the ordinary fully-connected case is slightly different); the instance version normalizes each element of the batch independently, i.e., across spatial locations only.

In other words, where batch norm computes one mean and standard deviation per channel (making the distribution of the whole layer look Gaussian), instance norm computes T of them (one per image), making each individual image's distribution look Gaussian, but not jointly.
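A small NumPy illustration of which axes the two formulas average over (the tensor shape is arbitrary):

```python
import numpy as np

x = np.random.rand(16, 32, 28, 28)    # (T images, C channels, H, W)

bn_mean = x.mean(axis=(0, 2, 3))      # batch norm: one mean per channel -> shape (32,)
in_mean = x.mean(axis=(2, 3))         # instance norm: T means per channel -> shape (16, 32)

print(bn_mean.shape, in_mean.shape)
```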

A simple analogy: at the stage of data preprocessing, you can normalize the data for each image or normalize the entire data set.

Credit: formulas from here .

Which normalization is better?

The answer depends on the network architecture, in particular on what is done after the normalization layer. Image classification networks usually stack the feature maps together and wire them into an FC layer, which shares its weights across the batch (the modern way is to use a CONV layer instead of FC, but the argument still applies).

This is where the distribution nuances start to matter: the same neuron is going to receive input from all images. If the variance across the batch is high, the gradient from the small activations will be completely suppressed by the high activations, which is exactly the problem that batch norm tries to solve. Because of this, it is quite possible that per-instance normalization will not improve network convergence at all.

On the other hand, batch normalization adds extra noise to the training, because the result for a particular instance depends on the neighbouring instances. As it turns out, this kind of noise can be either good or bad for the network. This is well explained in the "Weight Normalization" paper by Tim Salimans et al., which names recurrent neural networks and reinforcement-learning DQNs as noise-sensitive applications. I'm not entirely sure, but I think the same noise sensitivity was the main issue in the stylization task, which instance norm tried to fight. It would be interesting to check whether weight norm performs better for that particular task.

Can you combine batch and instance normalization?

Though it makes a valid neural network, there is no practical use for it. Batch normalization noise either helps the learning process (in which case it is preferable) or hurts it (in which case it is better to omit it). In both cases, leaving the network with only one type of normalization is likely to improve performance.

+45

Great question, and it has already been answered well. Just to add: I found this visualization from Kaiming He's Group Normalization paper helpful. It shows, for an (N, C, H, W) feature tensor, which axes each method normalizes over: Batch Norm over (N, H, W), Layer Norm over (C, H, W), Instance Norm over (H, W), and Group Norm over (H, W) within groups of channels.

Source: an article comparing the different normalization methods.

+23

I would like to add more information to this question, since there are some more recent works in this area. Your intuition

use instance normalization to classify images where the class label should not depend on the contrast of the input image

is partly correct. I would say that a pig in broad daylight is still a pig whether the picture is taken at night or at dawn. However, this does not mean that using instance normalization across the network will give you a better result. Here are a few reasons:

  1. Color distribution still plays a role. An image is more likely to be an apple than an orange if it has a lot of red in it.
  2. In later layers, you can no longer imagine instance normalization acting as contrast normalization. Class-specific details emerge in deeper layers, and normalizing them per instance will hurt the model's performance significantly.

IBN-Net uses both batch normalization and instance normalization in its model. They only put instance normalization in the early layers and achieved improvements in both accuracy and the ability to generalize. They have open-sourced the code here .
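For reference, here is a rough PyTorch sketch of the channel-splitting idea (instance norm on half the channels, batch norm on the rest); this is only an illustration of the concept, not the exact block from the IBN-Net paper:

```python
import torch
import torch.nn as nn

class HalfINHalfBN(nn.Module):
    """Apply instance norm to half the channels and batch norm to the rest.

    A rough sketch of the channel-splitting idea used in early layers;
    not the authors' exact IBN-Net block.
    """
    def __init__(self, channels):
        super().__init__()
        self.half = channels // 2
        self.instance_norm = nn.InstanceNorm2d(self.half, affine=True)
        self.batch_norm = nn.BatchNorm2d(channels - self.half)

    def forward(self, x):
        a, b = torch.split(x, [self.half, x.size(1) - self.half], dim=1)
        return torch.cat([self.instance_norm(a), self.batch_norm(b)], dim=1)

layer = HalfINHalfBN(64)
out = layer(torch.randn(8, 64, 32, 32))   # (N, C, H, W) -> same shape
```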


+6

IN provides visual and appearance invariance, while BN accelerates training and preserves discriminative features. IN is preferred in shallow layers (the early layers of a CNN), where appearance variation should be removed, while BN is preferred in deep layers (the last CNN layers), where discriminative information must be preserved.

0
