How to handle dynamic input sizes for audio fed to a CNN?

Many articles use CNNs to extract audio features. The input data is a spectrogram with two dimensions, time and frequency.

When creating a spectrogram, you need to specify the exact size of both dimensions, but they are usually not fixed. The frequency dimension is determined by the window size, but what about the time dimension? The audio samples have different lengths, while the CNN input size must be fixed.
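To make the problem concrete, here is a minimal sketch (the sample rate, window, and hop sizes are illustrative assumptions, not values from the question) showing that the frequency axis of a spectrogram is fixed by the window size while the time axis grows with the clip length:

```python
import numpy as np

# Assumed parameters (hypothetical): 16 kHz sample rate, 25 ms window, 10 ms hop.
SR, WIN, HOP = 16000, 400, 160  # all in samples

def num_frames(n_samples, win=WIN, hop=HOP):
    """Number of STFT frames for a signal of n_samples (no padding)."""
    return 1 + (n_samples - win) // hop

def spectrogram(signal, win=WIN, hop=HOP):
    """Magnitude spectrogram with shape (frames, win // 2 + 1)."""
    frames = np.stack([signal[i * hop : i * hop + win]
                       for i in range(num_frames(len(signal), win, hop))])
    return np.abs(np.fft.rfft(frames * np.hanning(win), axis=1))

short = np.random.randn(1 * SR)   # 1 s clip
long_ = np.random.randn(8 * SR)   # 8 s clip
# Frequency axis (201 bins) is fixed; the time axis is not (98 vs 798 frames).
print(spectrogram(short).shape)   # (98, 201)
print(spectrogram(long_).shape)   # (798, 201)
```

So for clips between 1 s and 8 s, the time dimension varies by roughly a factor of eight, which is exactly the mismatch the question is about.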

In my datasets, audio lengths range from 1 s to 8 s. Padding or cropping always distorts the results too much.

So, I would like to know how this is usually handled.

1 answer

One option is to run the CNN on fixed-size frame windows. You take, say, 30 surrounding frames and feed them to the CNN to classify the center frame. In that case you need frame-level labels, which you can obtain from other speech recognition tools (via forced alignment).
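A minimal sketch of this frame-window idea (the context size and feature dimension here are illustrative): each time step gets a fixed-size patch of surrounding frames, so the CNN input shape is constant regardless of utterance length.

```python
import numpy as np

def frame_windows(spec, context=15):
    """For each time step, extract a fixed window of 2*context + 1 frames.
    Edge frames are handled by replicating the first/last frame."""
    padded = np.pad(spec, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t : t + 2 * context + 1]
                     for t in range(spec.shape[0])])

spec = np.random.randn(98, 40)   # e.g. 98 frames x 40 filterbank features
windows = frame_windows(spec)
print(windows.shape)             # (98, 31, 40): one fixed-size CNN input per frame
```

Each of the 98 windows can then be classified independently, given a label per frame.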

If you want pure neural-network decoding, you are better off with a recurrent neural network (RNN), since RNNs accept inputs of arbitrary length. To improve accuracy, you should also add a CTC layer, which lets you train without frame-level alignments.

If you are interested in the topic, you can try https://github.com/srvk/eesen , a toolkit for end-to-end speech recognition with recurrent neural networks.


