This is called data augmentation. By applying transformations to the training data, you add synthetic data points. This gives the model additional examples without the cost of collecting and annotating more data. It tends to reduce overfitting and improve the model's generalization.
The intuition behind flipping is that an object should be equally recognizable as its mirror image. Note that horizontal flipping is the commonly used kind; vertical flipping does not always make sense, but that depends on the data.
The idea behind cropping is to reduce the contribution of the background to the CNN's decision. This is useful if you have labels locating your object: it lets you use the surrounding regions as negative examples and build a better detector. Random cropping can also act as a regularizer, basing the classification on the presence of parts of the object rather than hinging everything on a very distinctive feature that may not always be present.
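As a minimal sketch of these two augmentations, assuming torchvision is available (the crop size and padding below are illustrative choices, not from the answer above):

```python
# Minimal augmentation sketch: random flips and crops as described above.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),  # mirror-image intuition: flip half the time
    transforms.RandomCrop(224, padding=8),   # random crops expose parts of the object
    transforms.ToTensor(),                   # PIL image -> tensor for training
])
# Applied on the fly during training, so each epoch sees different synthetic variants.
```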
Why do people always crop a square area?
This is not a limitation of CNNs. It may be a limitation of a particular implementation, or a design choice, since accepting square input can allow an implementation optimized for speed. I would not read too much into it.
CNNs with variable-size input versus fixed-size input:
This does not address square cropping specifically, but more generally why the input is sometimes resized / cropped / warped before being fed into a CNN:
Something to keep in mind is that designing a CNN involves deciding whether or not to support variable-size input. Convolution, pooling, and nonlinearity operations work for any input dimensions. However, when you use a CNN for image classification, you usually attach fully connected layer(s), such as a logistic regression or an MLP. A fully connected layer is how the CNN produces a fixed-size output vector, and that fixed-size output in turn constrains the CNN to a fixed-size input.
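To make that constraint concrete, here is a small PyTorch sketch (the layer sizes are illustrative): convolution and pooling run on any spatial size, while a fully connected head hard-codes the flattened dimension.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
pool = nn.MaxPool2d(2)

# Convolution and pooling accept any spatial dimensions:
for h, w in [(64, 64), (128, 96)]:
    feats = pool(conv(torch.randn(1, 3, h, w)))
    print(feats.shape)  # spatial size tracks the input

# A fully connected head hard-codes the flattened feature size,
# so it only works for one input size (here 64x64 -> 32x32 after pooling):
fc = nn.Linear(16 * 32 * 32, 10)
```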
There are definitely workarounds that let you take variable-size input and still produce a fixed-size output. The simplest is to use the convolutional layers to classify regularly spaced patches of the image. This idea has been around for a while: it was used to detect multiple occurrences of an object in an image and classify each occurrence. The earliest example I can think of is the work of Yann LeCun's group in the 1990s on simultaneously classifying and localizing digits in a string. This is called turning a CNN with fully connected layers into a fully convolutional network. The most recent examples of fully convolutional networks are used for semantic segmentation, classifying each pixel in an image; there, an output matching the size of the input is required.

Another solution is to use global pooling at the end of the CNN to turn variable-size feature maps into a fixed-size output: the size of the pooling window is set equal to the feature map computed from the last convolutional layer.
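Here is a sketch of that global-pooling workaround in the same PyTorch style (the architecture is illustrative): a 1x1 convolution plays the role of the classifier, and a global pooling layer collapses whatever feature-map size arrives, so variable-size inputs yield a fixed-size output.

```python
import torch
import torch.nn as nn

# Global pooling sketch: the pooling window spans the whole final feature map,
# so the output size no longer depends on the input size.
net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 10, 1),        # 1x1 conv: class scores at every spatial location
    nn.AdaptiveAvgPool2d(1),     # global pooling over the entire feature map
    nn.Flatten(),                # -> (batch, 10) for any input size
)

for h, w in [(64, 64), (200, 150)]:
    print(net(torch.randn(1, 3, h, w)).shape)  # torch.Size([1, 10]) both times
```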