Each approach has its advantages and disadvantages. Let me go through the ones you listed, and then some others, to find the best approach:
LSTM : Among their biggest advantages is the ability to learn long-term dependency patterns in your data. They were designed to analyze long sequences, such as speech or text. This can also cause problems, because the number of parameters can be really high. Other typical recurrent network architectures, such as GRU , can mitigate this issue. The main disadvantage is that in their standard (sequential) implementation it is infeasible to fit them to video data, for the same reason dense layers are bad for image data: loads of temporal and spatial invariances would have to be learned by a topology that is completely unsuited to capturing them efficiently. Shifting a video by one pixel to the right can completely change the output of your network.
Another thing worth mentioning is that training an LSTM resembles balancing two rival processes: finding good weights for the dense-like output computations and finding good internal-memory dynamics for processing sequences. Finding this equilibrium can take a very long time, but once found, it is usually quite stable and produces really good results.
Conv3D : Among their biggest advantages is the ability to capture spatial and temporal invariances in the same way Conv2D does for images. This makes the curse of dimensionality much less harmful. On the other hand, just as Conv1D may not produce good results on longer sequences, the lack of any memory can make learning long sequences harder.
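As a rough illustration, a small Conv3D video classifier might look like this. The frame count, layer sizes, and class count below are illustrative assumptions, not part of the answer:

```python
# Minimal Conv3D video classifier sketch; all sizes are illustrative.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(16, 64, 64, 3)),       # (frames, height, width, channels)
    layers.Conv3D(16, kernel_size=(3, 3, 3), activation="relu"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),  # pool only spatially, keep all frames
    layers.Conv3D(32, kernel_size=(3, 3, 3), activation="relu"),
    layers.GlobalAveragePooling3D(),           # collapse (frames, h, w) into one vector
    layers.Dense(10, activation="softmax"),    # 10 classes, purely as an example
])
```

Note how the 3D kernels slide over both the spatial and the temporal axes, which is exactly how the spatio-temporal invariances are captured.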
Of course, you can also use other approaches, such as:
TimeDistributed + Conv2D : using the TimeDistributed wrapper, you can apply a pretrained convolutional network, for example Inception , frame by frame and then analyze the feature maps sequentially. A truly huge advantage of this approach is the possibility of transfer learning. As a drawback, you can think of it as Conv2.5D : it lacks temporal analysis of your data.
ConvLSTM : this architecture is not yet supported by the latest version of Keras (as of March 6, 2017), but as you can see here it should be provided in the future. It is a mixture of LSTM and Conv2D , and it is believed to be better than stacking Conv2D and LSTM .
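For reference, later Keras versions do ship a ConvLSTM2D layer. A minimal sketch, where the filter count, input shape, and class count are illustrative assumptions:

```python
# ConvLSTM2D sketch: convolutional state transitions over a frame sequence.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(None, 64, 64, 3)),   # variable-length frame sequence
    layers.ConvLSTM2D(32, kernel_size=(3, 3), padding="same",
                      return_sequences=False),  # keep only the final hidden state
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation="softmax"),     # 10 classes as an example
])
```

Unlike LSTM applied to flattened frames, the recurrent transitions here are convolutions, so the spatial structure of each frame is preserved inside the memory.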
Of course, these are not the only ways to solve this problem; let me mention one more that may be useful:
- Stacking: You can easily stack the approaches above to build your final solution. For example, you can build a network where the video is first transformed with TimeDistributed(ResNet) , then the output is fed into a Conv3D with multiple aggressive spatial pooling layers, and finally transformed by a GRU/LSTM layer.
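That stacking pipeline could be sketched as follows. The frame count, filter sizes, and class count are illustrative, and weights=None is used here only so the sketch does not depend on downloading pretrained weights (for actual transfer learning you would load ImageNet weights):

```python
# Stacking sketch: TimeDistributed(ResNet) -> Conv3D -> GRU. Sizes illustrative.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

frames, h, w = 8, 224, 224
base = ResNet50(include_top=False, weights=None,   # weights=None to skip the download
                input_shape=(h, w, 3))

inp = layers.Input(shape=(frames, h, w, 3))
x = layers.TimeDistributed(base)(inp)              # per-frame feature maps: (frames, 7, 7, 2048)
x = layers.Conv3D(64, kernel_size=(3, 3, 3), padding="same")(x)
x = layers.MaxPooling3D(pool_size=(1, 7, 7))(x)    # aggressive spatial pooling
x = layers.Reshape((frames, 64))(x)                # one descriptor per frame
x = layers.GRU(32)(x)                              # temporal analysis at the end
out = layers.Dense(10, activation="softmax")(x)    # 10 classes as an example
model = models.Model(inp, out)
```

Each stage plays to its strengths: the 2D backbone extracts per-frame features, Conv3D mixes neighbouring frames, and the GRU handles the long-range temporal dependencies.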
PS:
Another thing worth mentioning is that the shape of video data is actually 4D: (frames, width, height, channels) .
PS2:
In case your data is actually 3D with shape (frames, width, height) , you could use the classic Conv2D (by treating the frames as channels) to analyze this data, which may actually be more computationally efficient. For transfer learning, you must add an extra dimension, because most CNN models were trained on data with shape (width, height, 3) . You may notice that your data does not have 3 channels. In that case, a commonly used technique is to repeat the spatial matrix three times.
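The channel-repetition trick is a one-liner in NumPy. The function name and the shapes below are illustrative:

```python
import numpy as np

def to_three_channels(frames):
    """Turn (frames, width, height) grayscale video into (frames, width, height, 3)
    by repeating the spatial matrix three times along a new channel axis."""
    return np.repeat(frames[..., np.newaxis], 3, axis=-1)

video = np.random.rand(16, 64, 64)     # 16 grayscale frames, as an example
rgb_like = to_three_channels(video)
print(rgb_like.shape)                  # (16, 64, 64, 3)
```

All three channels are identical copies, which is enough to feed the data into a backbone pretrained on RGB images.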
PS3:
An example of this 2.5D approach is:
# Imports and example values added; the original snippet left them implicit.
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Input, TimeDistributed, Conv3D, Flatten, Dense

input_shape = (10, 224, 224, 3)     # (frames, width, height, channels), example values
nb_of_filters, nb_of_classes = 32, 10

input = Input(shape=input_shape)
base_cnn_model = InceptionV3(include_top=False)
temporal_analysis = TimeDistributed(base_cnn_model)(input)
conv3d_analysis = Conv3D(nb_of_filters, kernel_size=(3, 3, 3))(temporal_analysis)
conv3d_analysis = Conv3D(nb_of_filters, kernel_size=(3, 3, 3))(conv3d_analysis)
output = Flatten()(conv3d_analysis)
output = Dense(nb_of_classes, activation="softmax")(output)
Marcin Możejko