Each approach has its advantages and disadvantages. Let me go through the ones you listed, and then some others, to find the best approach:
LSTM : Among their biggest advantages is the ability to learn long-term dependency patterns in your data. They were designed to analyze long sequences, such as speech or text. This can also cause problems, because the number of parameters can be really high. Other typical recurrent network architectures, such as GRU , can mitigate this issue. The main disadvantage is that in their standard (sequential) implementation it is infeasible to fit them to video data, for the same reason dense layers are bad for image data: loads of temporal and spatial invariances would have to be learned by a topology that is completely unsuited to capturing them efficiently. Shifting a video by one pixel to the right can completely change the output of your network.
Another thing worth mentioning is that training an LSTM resembles balancing two rival processes: finding good weights for the dense-like output computations and finding good internal-memory dynamics for processing sequences. Finding this equilibrium can take a very long time, but once found, it is usually quite stable and produces really good results.
Conv3D : Among their biggest advantages is the ability to capture spatial and temporal invariances in the same way Conv2D does for images. This makes the curse of dimensionality much less harmful. On the other hand, just as Conv1D may not produce good results on longer sequences, the lack of any memory can make learning long sequences harder.
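As a rough illustration, a small Conv3D video classifier might look like this. The frame count, layer sizes, and class count below are illustrative assumptions, not part of the answer:

```python
# Minimal Conv3D video classifier sketch; all sizes are illustrative.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(16, 64, 64, 3)),       # (frames, height, width, channels)
    layers.Conv3D(16, kernel_size=(3, 3, 3), activation="relu"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),  # pool only spatially, keep all frames
    layers.Conv3D(32, kernel_size=(3, 3, 3), activation="relu"),
    layers.GlobalAveragePooling3D(),           # collapse (frames, h, w) into one vector
    layers.Dense(10, activation="softmax"),    # 10 classes, purely as an example
])
```

Note how the 3D kernels slide over both the spatial and the temporal axes, which is exactly how the spatio-temporal invariances are captured.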
Of course, you can also use other approaches, such as:
TimeDistributed + Conv2D : using the TimeDistributed wrapper, you can apply a pretrained convolutional network, for example Inception , frame by frame and then analyze the feature maps sequentially. A truly huge advantage of this approach is the possibility of transfer learning. As a drawback, you can think of it as Conv2.5D : it lacks temporal analysis of your data.
ConvLSTM : this architecture is not yet supported by the latest version of Keras (as of March 6, 2017), but as you can see here it should be provided in the future. It is a mixture of LSTM and Conv2D , and it is believed to be better than stacking Conv2D and LSTM .
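For reference, later Keras versions do ship a ConvLSTM2D layer. A minimal sketch, where the filter count, input shape, and class count are illustrative assumptions:

```python
# ConvLSTM2D sketch: convolutional state transitions over a frame sequence.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(None, 64, 64, 3)),   # variable-length frame sequence
    layers.ConvLSTM2D(32, kernel_size=(3, 3), padding="same",
                      return_sequences=False),  # keep only the final hidden state
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation="softmax"),     # 10 classes as an example
])
```

Unlike LSTM applied to flattened frames, the recurrent transitions here are convolutions, so the spatial structure of each frame is preserved inside the memory.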
Of course, these are not the only ways to solve this problem; let me mention one more that may be useful:
- Stacking: You can easily stack the approaches above to build your final solution. For example, you can build a network where the video is first transformed with TimeDistributed(ResNet) , then the output is fed into a Conv3D with multiple aggressive spatial pooling layers, and finally transformed by a GRU/LSTM layer.
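That stacking pipeline could be sketched as follows. The frame count, filter sizes, and class count are illustrative, and weights=None is used here only so the sketch does not depend on downloading pretrained weights (for actual transfer learning you would load ImageNet weights):

```python
# Stacking sketch: TimeDistributed(ResNet) -> Conv3D -> GRU. Sizes illustrative.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

frames, h, w = 8, 224, 224
base = ResNet50(include_top=False, weights=None,   # weights=None to skip the download
                input_shape=(h, w, 3))

inp = layers.Input(shape=(frames, h, w, 3))
x = layers.TimeDistributed(base)(inp)              # per-frame feature maps: (frames, 7, 7, 2048)
x = layers.Conv3D(64, kernel_size=(3, 3, 3), padding="same")(x)
x = layers.MaxPooling3D(pool_size=(1, 7, 7))(x)    # aggressive spatial pooling
x = layers.Reshape((frames, 64))(x)                # one descriptor per frame
x = layers.GRU(32)(x)                              # temporal analysis at the end
out = layers.Dense(10, activation="softmax")(x)    # 10 classes as an example
model = models.Model(inp, out)
```

Each stage plays to its strengths: the 2D backbone extracts per-frame features, Conv3D mixes neighbouring frames, and the GRU handles the long-range temporal dependencies.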
PS:
Another thing worth mentioning is that the shape of video data is actually 4D: (frames, width, height, channels) .
PS2:
In case your data is actually 3D with shape (frames, width, height) , you could use the classic Conv2D (by treating the frames as channels) to analyze this data, which may actually be more computationally efficient. For transfer learning, you must add an extra dimension, because most CNN models were trained on data with shape (width, height, 3) . You may notice that your data does not have 3 channels. In that case, a commonly used technique is to repeat the spatial matrix three times.
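The channel-repetition trick is a one-liner in NumPy. The function name and the shapes below are illustrative:

```python
import numpy as np

def to_three_channels(frames):
    """Turn (frames, width, height) grayscale video into (frames, width, height, 3)
    by repeating the spatial matrix three times along a new channel axis."""
    return np.repeat(frames[..., np.newaxis], 3, axis=-1)

video = np.random.rand(16, 64, 64)     # 16 grayscale frames, as an example
rgb_like = to_three_channels(video)
print(rgb_like.shape)                  # (16, 64, 64, 3)
```

All three channels are identical copies, which is enough to feed the data into a backbone pretrained on RGB images.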
PS3:
An example of this 2.5D approach is:
# Imports and example values added; the original snippet left them implicit.
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Input, TimeDistributed, Conv3D, Flatten, Dense

input_shape = (10, 224, 224, 3)     # (frames, width, height, channels), example values
nb_of_filters, nb_of_classes = 32, 10

input = Input(shape=input_shape)
base_cnn_model = InceptionV3(include_top=False)
temporal_analysis = TimeDistributed(base_cnn_model)(input)
conv3d_analysis = Conv3D(nb_of_filters, kernel_size=(3, 3, 3))(temporal_analysis)
conv3d_analysis = Conv3D(nb_of_filters, kernel_size=(3, 3, 3))(conv3d_analysis)
output = Flatten()(conv3d_analysis)
output = Dense(nb_of_classes, activation="softmax")(output)
Marcin Możejko