I would reconsider using an existing speech recognition library, for example CMU Sphinx or the Microsoft speech recognizer. Unfortunately, doing this yourself is not an easy task. One fairly typical approach to what you are trying to do is as follows:
1) Split the sample into short segments (frames of a few milliseconds to a few tens of milliseconds each)
2) Apply a Fourier transform to each segment and keep the main coefficients
3) Use a hidden Markov model to find the most likely sequence of phonemes given your sequence of coefficients
4) Look the phoneme sequence up in a dictionary that maps phonemes to words (you can use the Sphinx dictionary as a guide). A small vocabulary like yours should give excellent results.
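Steps 1 and 2 can be sketched roughly as follows. This is only an illustration in plain Python, not how Sphinx does it: the function names, the 64-sample frame, the 32-sample hop, the Hann window and the 8 kHz synthetic tone are all assumptions chosen for the example, and real systems usually derive richer features (such as MFCCs) from the spectrum.

```python
import cmath
import math


def split_into_frames(samples, frame_len, hop):
    """Step 1: cut the signal into overlapping fixed-length frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def dft_magnitudes(frame):
    """Step 2: magnitude of each DFT bin from 0 up to the Nyquist frequency."""
    n = len(frame)
    # A Hann window reduces spectral leakage at the frame edges.
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    # Naive O(n^2) DFT, fine for a sketch; a real FFT is much faster.
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]


def main_coefficients(frame, keep=3):
    """Indices of the `keep` strongest frequency bins, strongest first."""
    mags = dft_magnitudes(frame)
    return sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:keep]


# Synthetic input (assumed values): a 1 kHz sine sampled at 8 kHz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(800)]

# 64 samples at 8 kHz is an 8 ms frame; each bin is 8000 / 64 = 125 Hz wide.
frames = split_into_frames(tone, frame_len=64, hop=32)
features = [main_coefficients(f) for f in frames]

strongest_bin = features[0][0]
print("frames:", len(frames))
print("strongest bin:", strongest_bin, "=", strongest_bin * SAMPLE_RATE / 64, "Hz")
```

The per-frame lists in `features` are the kind of coefficient sequence that step 3 would then feed to the hidden Markov model.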
If you want to simplify this a bit, you can try taking the coefficients at specific timestamps and feeding them to an SVM or a neural network. I have not tried this myself, but you could get reasonable results with some tweaking.
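A rough sketch of that simplified idea: sample the coefficients at a few fixed frame positions, concatenate them into one fixed-length vector, and hand that vector to a classifier. To keep the example dependency-free, a nearest-centroid classifier stands in for the SVM or neural network here, so treat it as a placeholder rather than a recommendation. The words, the coefficient values and all function names are made up for illustration.

```python
import math


def fixed_length_vector(frame_features, timestamps):
    """Concatenate the coefficient vectors found at the chosen frame indices."""
    vector = []
    for t in timestamps:
        vector.extend(frame_features[t])
    return vector


def train_centroids(examples):
    """examples: list of (label, vector) pairs. Returns label -> mean vector."""
    sums, counts = {}, {}
    for label, vec in examples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}


def classify(centroids, vec):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


# Made-up per-frame coefficient vectors for two hypothetical words.
yes_a = [[1.0, 0.1], [0.9, 0.2], [0.2, 0.8], [0.1, 0.9]]
yes_b = [[0.9, 0.2], [1.0, 0.1], [0.1, 0.9], [0.2, 1.0]]
no_a = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [1.0, 0.2]]
no_b = [[0.2, 1.0], [0.1, 0.9], [1.0, 0.2], [0.9, 0.1]]

TIMESTAMPS = [0, 3]  # frame indices sampled from every utterance (assumed)

training = [(label, fixed_length_vector(utterance, TIMESTAMPS))
            for label, utterance in [("yes", yes_a), ("yes", yes_b),
                                     ("no", no_a), ("no", no_b)]]
centroids = train_centroids(training)

unknown = [[0.8, 0.3], [0.7, 0.2], [0.3, 0.7], [0.2, 0.8]]
predicted = classify(centroids, fixed_length_vector(unknown, TIMESTAMPS))
print(predicted)
```

This only works if every utterance is aligned and roughly the same length, which is the main thing the hidden Markov model handles for you in the full approach.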