webrtcvad is a Python wrapper around Google's excellent WebRTC Voice Activity Detection (VAD) implementation. It does the best job of any VAD I've used at correctly classifying human speech, even with noisy audio.
To use it for your purpose, you would do something like this:
- Convert the file to 8 kHz or 16 kHz, 16-bit mono. This is required by the WebRTC code.
- Create a VAD object:
vad = webrtcvad.Vad()
- Split the audio into 30 millisecond chunks.
- Check each chunk to see if it contains speech:
vad.is_speech(chunk, sample_rate)
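Here is a minimal sketch of those steps, assuming you have already converted your audio to a 16 kHz (or 8 kHz), 16-bit mono WAV file. The filename "speech.wav" and the aggressiveness setting are just placeholders:

    import wave
    import webrtcvad

    # Read a 16-bit mono WAV file. "speech.wav" is a placeholder --
    # convert your file first, e.g. with ffmpeg or sox.
    wf = wave.open("speech.wav", "rb")
    sample_rate = wf.getframerate()  # must be 8000 or 16000 here
    pcm = wf.readframes(wf.getnframes())
    wf.close()

    vad = webrtcvad.Vad(2)  # aggressiveness 0-3; 2 is a middle ground

    # 30 ms of 16-bit mono audio = (sample_rate * 0.03) samples * 2 bytes each
    chunk_bytes = int(sample_rate * 0.03) * 2

    for offset in range(0, len(pcm) - chunk_bytes + 1, chunk_bytes):
        chunk = pcm[offset:offset + chunk_bytes]
        t = offset / (2.0 * sample_rate)  # start time of this chunk in seconds
        if vad.is_speech(chunk, sample_rate):
            print("%.2fs: speech" % t)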
The VAD output can be "noisy," and if it classifies a single 30 millisecond chunk of audio as speech you don't really want to output a time for that. You probably want to look over the past 0.3 seconds (or so) of audio and see if the majority of 30 millisecond chunks in that period are classified as speech. If they are, you output the start time of that 0.3 second period as the beginning of the speech. Then you do something similar to detect when the speech ends: wait for a 0.3 second period of audio where the majority of 30 millisecond chunks are not classified as speech by the VAD; when that happens, output the end time as the end of the speech.
You may have to tweak the timing a little to get good results for your purposes; maybe you decide that you need 0.2 seconds of audio where more than 30% of the chunks are classified as speech by the VAD before you trigger, and 1.0 second of audio with more than 50% of the chunks classified as non-speech before you de-trigger.
A ring buffer (collections.deque in Python) is a useful data structure for keeping track of the last N chunks of audio and their classifications.
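Here's one way that smoothing might look in code. This is a sketch, not a drop-in implementation: the window size (10 chunks = 0.3 seconds) and the 90% ratio are just example thresholds, and speech_segments is a hypothetical helper that takes the 30 millisecond chunks from the loop above:

    import collections

    def speech_segments(chunks, vad, sample_rate, window=10, ratio=0.9):
        # Track the last `window` chunks (10 x 30 ms = 0.3 s) in a ring
        # buffer, each stored as a (chunk_index, is_speech) pair.
        ring = collections.deque(maxlen=window)
        triggered = False
        start = None
        index = -1
        for index, chunk in enumerate(chunks):
            ring.append((index, vad.is_speech(chunk, sample_rate)))
            voiced = sum(1 for _, is_speech in ring if is_speech)
            if not triggered and voiced >= ratio * window:
                # Most of the last 0.3 s is speech: speech started around
                # the beginning of the window.
                triggered = True
                start = ring[0][0]
            elif triggered and (len(ring) - voiced) >= ratio * window:
                # Most of the last 0.3 s is non-speech: the segment ended
                # around the beginning of the window.
                triggered = False
                yield (start, ring[0][0])
        if triggered and start is not None:
            yield (start, index)  # audio ended mid-speech

Each yielded pair is a (start, end) pair of chunk indices; multiply by 0.03 to convert to seconds. To implement the looser trigger/de-trigger thresholds from the previous paragraph, you would use different window lengths and ratios in the two branches.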
John Wiseman