CNNs are computed based on the frame windows. You take, say, 30 surrounding frames and send CNN to classify them. In this case, you need to have frame labels that you can get from other speech recognition tools.
If you want to have a pure neural network decoding, you better prepare a recurrent neural network (RNN), they will allow the use of arbitrary lengths. To improve the accuracy of RNN, you are also better off having a CTC layer that allows you to configure stateless alignment without a network.
If you are interested in the topic, you can try https://github.com/srvk/eesen , a set of tools designed for end-to-end speech recognition with repeating neural networks.
MFCC