There is a general technique called time-scale modification (TSM) that can do this. One tool, which I have not evaluated, is available here: http://sourceforge.net/projects/mffmtimescale/ .
If you look at the audio on a timeline, it resembles an old heart rate monitor: wiggly patterns of peaks and valleys. For vowels, the waveform is quasi-stationary, which roughly means it repeats itself, like the pulse of a healthy heart. A single vowel sound such as "ahhhh" can repeat its pattern 3-7 times in ordinary speech. To speed audio up, a TSM algorithm removes some of these repetitions and has to filter out the artifacts introduced by cutting and joining imperfect repetitions. Silent gaps can also be shortened, but care must be taken not to remove all of the silence. In English, the word "football" actually has a gap between "foot" and "ball" (say it slowly out loud). TSM can also do the opposite and slow audio down, inserting silence in the right spots or adding repetitions of the pitch period to vowels. All of this is rather complex and somewhat language-dependent, and it requires a lot of tuning, which for most applications means that you do not want to develop your own.
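To make the "remove a repetition and smooth the join" idea concrete, here is a minimal sketch in plain Python. It is not how the tool linked above works (I have not looked at its code), and the function names `estimate_period` and `drop_one_period` are made up for illustration. It assumes a clean, strongly periodic signal, estimates the pitch period by brute-force autocorrelation, and removes one period with a linear crossfade. A real implementation would also need voiced/unvoiced detection, silence handling, and a lot of tuning.

```python
import math


def estimate_period(signal, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the highest mean autocorrelation."""
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        n = len(signal) - lag
        score = sum(signal[i] * signal[i + lag] for i in range(n)) / n
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def drop_one_period(signal, start, period):
    """Remove one period beginning at `start`, crossfading across the join.

    The removed period is faded out while the following period is faded in,
    so small differences between the two repetitions do not produce a click.
    """
    if start < 0 or start + 2 * period > len(signal):
        raise ValueError("need two full periods after `start`")
    blended = []
    for k in range(period):
        w = k / period  # weight ramps from 0 toward 1 over one period
        a = signal[start + k]           # sample from the period being dropped
        b = signal[start + period + k]  # matching sample from the next period
        blended.append((1.0 - w) * a + w * b)
    return signal[:start] + blended + signal[start + 2 * period:]


# Hypothetical stand-in for a vowel: a pure tone repeating every 50 samples.
tone = [math.sin(2 * math.pi * i / 50) for i in range(400)]
period = estimate_period(tone, 20, 80)
shorter = drop_one_period(tone, 100, period)
```

The lag search range has to exclude multiples of the true period, otherwise the autocorrelation peaks at the period and at twice the period are indistinguishable. Calling `drop_one_period` repeatedly at different positions shortens the sound without changing its pitch; repeating a period instead of dropping one would be the slow-down counterpart.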