Let's start with the terms. Recall that the output of a convolutional layer is a rank-4 tensor [B, H, W, C], where B is the batch size, (H, W) is the size of the feature map, and C is the number of channels. An index (x, y), where 0 <= x < H and 0 <= y < W, is a spatial location.
Usual batchnorm
Now, here is how batchnorm is applied in the usual way (in pseudocode):
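The original pseudocode did not survive translation, so here is a minimal NumPy sketch of the per-location version (the function name and `eps` are my own; a real batchnorm layer also has learnable scale and shift parameters, omitted here):

```python
import numpy as np

def batchnorm_per_location(t, eps=1e-5):
    """Normalize a [B, H, W, C] tensor over the batch axis only:
    each spatial location (x, y) and channel c gets its own mean and
    variance, computed from the B values at that position."""
    mean = t.mean(axis=0)   # shape [H, W, C]
    var = t.var(axis=0)     # shape [H, W, C]
    return (t - mean) / np.sqrt(var + eps)
```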
Basically, it computes H*W*C means and H*W*C standard deviations, each over the B elements of the batch. Note that elements at different spatial locations get their own mean and variance, gathered over only B values.
Batchnorm in a conv layer
This way is entirely possible. But a convolutional layer has a special property: its filter weights are shared across the input image (you can read about it in detail in this post). It is therefore reasonable to normalize the output in the same way, so that each output value takes the mean and variance over the B*H*W values at different locations.
Here is the code for this case (again, pseudocode):
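Again the original snippet is missing, so here is the same sketch adapted to the convolutional convention (same hypothetical names and `eps` as before, learnable parameters still omitted):

```python
import numpy as np

def batchnorm_conv(t, eps=1e-5):
    """Normalize a [B, H, W, C] tensor over the batch AND spatial axes:
    only C means and variances, each computed from B*H*W values."""
    mean = t.mean(axis=(0, 1, 2))   # shape [C]
    var = t.var(axis=(0, 1, 2))     # shape [C]
    return (t - mean) / np.sqrt(var + eps)
```

The only change from the previous version is the set of axes being reduced over.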
In total there are only C means and standard deviations, each computed over B*H*W values. This is what is meant by the "effective mini-batch": the difference between the two versions is only the choice of axes to reduce over (or, what is the same thing, the choice of "mini-batch").
Maxim