In many papers people use the order conv -> pooling -> non-linearity. This does not mean that you cannot use a different order and still get reasonable results. In the case of a max-pooling layer and ReLU, the order does not matter (both compute the same thing):

MaxPool(ReLU(x)) = ReLU(MaxPool(x))

You can prove that this holds by recalling that ReLU is an element-wise operation and a non-decreasing function, therefore

ReLU(max(x_1, ..., x_n)) = max(ReLU(x_1), ..., ReLU(x_n))

The same thing holds for almost every activation function (most of them are non-decreasing). But it does not work for a general pooling layer, e.g. average pooling.
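
As a quick illustration (my own example, not from the original answer), take the two inputs -2 and 2 with an average-pooling window that covers both:

ReLU(avg(-2, 2)) = ReLU(0) = 0
avg(ReLU(-2), ReLU(2)) = avg(0, 2) = 1

so with average pooling the two orders generally give different results.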
Although both orders give the same result for max-pooling and ReLU, Activation(MaxPool(x)) is significantly faster because it performs fewer operations: for a pooling layer of size k, it makes k^2 times fewer calls to the activation function.
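
To make both the equivalence and the saving concrete, here is a minimal NumPy sketch (my own illustration; the 2x2 pooling helper and the 4x4 input are arbitrary choices, not anything from the original answer):

```python
import numpy as np

def relu(x):
    # Element-wise ReLU.
    return np.maximum(x, 0)

def max_pool_2x2(x):
    # Non-overlapping 2x2 max pooling on an (H, W) array with even H and W.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.random.randn(4, 4)

a = relu(max_pool_2x2(x))   # pool first: ReLU touches only 4 values
b = max_pool_2x2(relu(x))   # ReLU first: ReLU touches all 16 values

print(np.allclose(a, b))    # True: both orders agree for max pooling + ReLU
```

The last line prints True, and the version that pools first applies ReLU to a quarter of the values (k^2 = 4 times fewer for k = 2).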
Unfortunately, this optimization is negligible for a CNN, since most of the time is spent in the convolutional layers anyway.
Salvador Dali