Batch normalization in a convolutional neural network

I am new to convolutional neural networks and only have an idea of feature maps and how convolution is performed on images to extract features. I would be glad to know some details about applying batch normalization in a CNN.

I read this paper https://arxiv.org/pdf/1502.03167v3.pdf and was able to understand the BN algorithm applied to data, but at the end they mention that a slight modification is required when applying it to CNNs:

For convolutional layers, we additionally want the normalization to obey the convolutional property, so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a mini-batch, over all locations. In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations, so for a mini-batch of size m and feature maps of size p × q, we use the effective mini-batch of size m' = |B| = m · pq. We learn a pair of parameters γ(k) and β(k) per feature map, rather than per activation. Alg. 2 is modified similarly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.

I am confused by the part where they say, "so that different elements of the same feature map, at different locations, are normalized in the same way".

I know what feature maps mean, and the different elements are the weights of every feature map. But I could not understand what location or spatial location means.

I could not understand the following sentence at all: "In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations."

I would be grateful if someone could break this down and explain it to me in much simpler terms.

deep-learning machine-learning computer-vision conv-neural-network batch-normalization
3 answers

Let's start with the terms. Remember that the output of a convolutional layer is a rank-4 tensor [B, H, W, C], where B is the batch size, (H, W) is the feature map size, and C is the number of channels. An index (x, y), where 0 <= x < H and 0 <= y < W, is a spatial location.
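For concreteness, here is a tiny NumPy illustration of that layout (the shapes are chosen arbitrarily):

    import numpy as np

    t = np.random.randn(32, 28, 28, 64)  # [B, H, W, C]
    # all C channel values at spatial location (x, y) = (10, 20) of example 0:
    pixel = t[0, 10, 20, :]              # shape [C]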

Usual batchnorm

Now, here is how batchnorm is applied in the usual way (in pseudocode):

    # t is the incoming tensor of shape [B, H, W, C]
    # mean and stddev are computed along the 0 axis and have shape [H, W, C]
    mean = mean(t, axis=0)
    stddev = stddev(t, axis=0)
    for i in 0..B-1:
        out[i,:,:,:] = norm(t[i,:,:,:], mean, stddev)

Basically, it computes H*W*C means and H*W*C standard deviations across the B elements. You may notice that different elements at different spatial locations have their own mean and variance, each gathered over only B values.
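As a concrete (non-pseudo) version, here is a minimal NumPy sketch of the same computation, with arbitrary shapes and without the scale/shift parameters:

    import numpy as np

    t = np.random.randn(32, 28, 28, 64)  # [B, H, W, C]
    mean = t.mean(axis=0)                # shape [H, W, C]
    stddev = t.std(axis=0)               # shape [H, W, C]
    out = (t - mean) / (stddev + 1e-5)   # broadcasts over the batch axis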

Batchnorm in a conv layer

This way is entirely possible. But the convolutional layer has a special property: filter weights are shared across the input image (you can read about it in detail in this post). That is why it is reasonable to normalize the output in the same way, so that each output value takes the mean and variance of B*H*W values, at different locations.

Here is the code for this case (again, pseudocode):

    # t is still the incoming tensor of shape [B, H, W, C]
    # but mean and stddev are computed along the (0, 1, 2) axes and have just the [C] shape
    mean = mean(t, axis=(0, 1, 2))
    stddev = stddev(t, axis=(0, 1, 2))
    for i in 0..B-1, x in 0..H-1, y in 0..W-1:
        out[i,x,y,:] = norm(t[i,x,y,:], mean, stddev)

In total, there are only C means and standard deviations, and each of them is computed over B*H*W values. That is what they mean by "effective mini-batch": the difference between the two versions is only in the choice of axis (or, equivalently, the choice of "mini-batch").
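And the corresponding NumPy sketch for the convolutional case (same assumptions as the one above):

    import numpy as np

    t = np.random.randn(32, 28, 28, 64)  # [B, H, W, C]
    mean = t.mean(axis=(0, 1, 2))        # shape [C]
    stddev = t.std(axis=(0, 1, 2))       # shape [C]
    out = (t - mean) / (stddev + 1e-5)   # one (mean, std) pair per channel

    # sanity check: each channel now has mean ~0 and std ~1
    print(out.mean(axis=(0, 1, 2)).round(3))
    print(out.std(axis=(0, 1, 2)).round(3))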


I am only 70% sure of what I am saying, so if it does not make sense, please edit or mention it before downvoting.

About location or spatial location: they mean the position of a pixel in an image or feature map. A feature map is comparable to a sparse, modified version of the image in which concepts are represented.

About so that different elements of the same feature map, at different locations, are normalized in the same way: some normalization algorithms are local, so they depend only on their close surroundings (location) and not on things far apart in the image. They probably mean that every pixel, regardless of its location, is treated just like an element of a set, independently of its direct spatial surroundings.

About In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations: they get a flat list of all the values of every training example in the mini-batch, and this list combines the values whatever their location is on the feature map.
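In code (my own sketch, assuming NumPy), the set B from the paper for one feature map k is just all m*p*q values flattened together:

    import numpy as np

    m, p, q, C = 32, 28, 28, 64
    t = np.random.randn(m, p, q, C)  # one mini-batch of feature maps
    k = 0                            # pick a single feature map (channel)
    B_set = t[:, :, :, k].ravel()    # flatten across batch AND spatial locations
    assert B_set.size == m * p * q   # the effective mini-batch size m' = m*pq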


Some clarification on Maxim's answer.

I was puzzled to see in Keras that the axis you specify is the channels axis, since it does not make sense to normalize over the channels: every channel in a conv net is considered a separate "feature". That is, normalizing over all channels is equivalent to normalizing the number of bedrooms together with size in square feet (the multivariate regression example from Andrew Ng's ML course). This is usually not what you want; instead you normalize every feature by itself. That is, you normalize the number of bedrooms across all examples to have mu=0 and std=1, and you normalize square feet across all examples to have mu=0 and std=1.
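As a quick sketch of that per-feature normalization (my own toy numbers, assuming NumPy), with bedrooms in column 0 and square feet in column 1:

    import numpy as np

    X = np.array([[3.0, 1400.0],
                  [2.0,  900.0],
                  [4.0, 2000.0]])  # rows: houses, columns: (bedrooms, square feet)

    # normalize each feature (column) separately, never across features
    X_norm = (X - X.mean(axis=0)) / X.std(axis=0)  # each column now has mu=0, std=1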

That is why you want C means and stds: you want a mean and std per channel/feature.

The reason I checked and tested this myself: there is some confusion/misconception here. The axis you specify in Keras is actually the axis that is not in the calculations; that is, you get the average over every axis except the one specified by this argument. This is confusing, as it is the exact opposite of how NumPy works, where the specified axis is the one you perform the operation on (e.g. np.mean, np.std, etc.).

I actually built a toy model with only BN, and then calculated BN manually: I took the mean and standard deviation across all 3 first dimensions [m, n_W, n_H], got n_C results, calculated (X-mu)/std (using broadcasting), and got results nearly identical to Keras's.
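For anyone who wants to reproduce that check, here is a minimal sketch of it (my own code, assuming TensorFlow 2.x Keras; the layer is called with training=True so it uses the batch statistics rather than its moving averages):

    import numpy as np
    import tensorflow as tf

    x = np.random.randn(8, 5, 5, 3).astype("float32")  # [m, n_H, n_W, n_C]

    # axis=-1 means "normalize per channel", i.e. reduce over all other axes
    bn = tf.keras.layers.BatchNormalization(axis=-1, epsilon=1e-3)
    keras_out = bn(x, training=True).numpy()

    # manual BN: one (mean, var) pair per channel, computed over [m, n_H, n_W]
    mu = x.mean(axis=(0, 1, 2), keepdims=True)   # shape [1, 1, 1, n_C]
    var = x.var(axis=(0, 1, 2), keepdims=True)
    manual_out = (x - mu) / np.sqrt(var + 1e-3)  # broadcasting; gamma=1, beta=0 at init

    print(np.abs(keras_out - manual_out).max())  # should be tiny, ~1e-6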

Hope this helps anyone who was as confused as I was.
