Is there a quick way to find (not necessarily recognize) human speech in an audio file?

I want to write a program that automatically synchronizes out-of-sync subtitles. One of the solutions I was considering was to somehow find the human speech in the audio and align the subtitles to it. The APIs I found (Google Speech API, Yandex SpeechKit) work with servers (which is not very convenient for me) and (possibly) do a lot of unnecessary work determining what exactly was said, while I only need to know that something was said.

In other words, I want to give it an audio file and get something like this:

[(00:12, 00:26), (01:45, 01:49) ... , (25:21, 26:11)] 

Is there a solution (preferably in python) that only finds human speech and works on the local machine?

+9
python voice recognition
3 answers

The technical term for what you are trying to do is Voice Activity Detection (VAD). There is a Python library called SPEAR that does this (among other things).

+7

You can run a window across your audio file and try to extract what fraction of the total signal power falls in the human vocal range (the fundamental frequencies lie between 50 and 300 Hz). The following is meant to give intuition and has not been tested on real audio.

import scipy.fftpack as sf
import numpy as np

def hasHumanVoice(X, threshold, F_sample, Low_cutoff=50, High_cutoff=300):
    """Search for the presence of vocal-range frequencies in a real signal using the FFT.

    X: 1-D numpy array, the real time-domain audio signal (single-channel time series)
    Low_cutoff: float, frequency components below this frequency are ignored (Hz)
    High_cutoff: float, frequency components above this frequency are ignored (Hz)
    F_sample: float, the sampling frequency of the signal (Hz)
    threshold: float, fraction of total power that must fall in the vocal band;
               has to be calibrated once against real vocal recordings.
    """
    M = X.size  # length of the time series
    Spectrum = sf.rfft(X, n=M)
    Low_cutoff, High_cutoff, F_sample = map(float, [Low_cutoff, High_cutoff, F_sample])

    # Convert the cutoff frequencies into indices on the spectrum
    Low_point, High_point = map(lambda F: int(F / F_sample * M), [Low_cutoff, High_cutoff])

    # Fraction of the total power that lies in the vocal band
    totalPower = np.sum(np.abs(Spectrum))
    fractionPowerInSignal = np.sum(np.abs(Spectrum[Low_point:High_point])) / totalPower

    return 1 if fractionPowerInSignal > threshold else 0

voiceVector = []
for window in fullAudio:  # run a window of appropriate length across the audio file
    voiceVector.append(hasHumanVoice(window, threshold, samplingRate))
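
The loop at the end assumes a fullAudio iterable of windows, plus a threshold and a samplingRate, none of which are defined above. Here is a minimal sketch of how they might be set up before that loop, assuming a mono WAV file and an arbitrary half-second window (the file name, window length and threshold are placeholders, not tested values):

from scipy.io import wavfile

samplingRate, audio = wavfile.read("movie_audio.wav")  # hypothetical file; must be single channel
windowSize = samplingRate // 2                          # 0.5 second windows (arbitrary choice)
threshold = 0.3                                         # needs calibration on real recordings

# Split the signal into consecutive, non-overlapping windows
fullAudio = [audio[i:i + windowSize].astype(np.float64)
             for i in range(0, len(audio) - windowSize + 1, windowSize)]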
+3

webrtcvad is a Python wrapper around Google's excellent WebRTC Voice Activity Detection (VAD) implementation. It does the best job of any VAD I have used at correctly classifying human speech, even with noisy audio.

To use it for your purpose, you would do something like this:

  • Convert the file to 8 kHz or 16 kHz, 16-bit mono. The WebRTC code requires this.
  • Create a VAD object: vad = webrtcvad.Vad()
  • Split the audio into 30-millisecond chunks.
  • Check each chunk to see whether it contains speech: vad.is_speech(chunk, sample_rate). A minimal sketch of these steps follows the list.
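
Here is a minimal sketch of those four steps; the file name and the aggressiveness mode passed to Vad() are placeholders, and the input is assumed to already be a 16 kHz, 16-bit mono WAV:

import wave
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0-3; 2 is an arbitrary middle setting

with wave.open("audio_16k_mono.wav", "rb") as wf:  # hypothetical file: 16 kHz, 16-bit, mono
    sample_rate = wf.getframerate()
    pcm = wf.readframes(wf.getnframes())

frame_ms = 30
frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per 16-bit sample

flags = []  # one True/False speech decision per 30 ms chunk
for offset in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
    chunk = pcm[offset:offset + frame_bytes]
    flags.append(vad.is_speech(chunk, sample_rate))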

The VAD output can be "noisy", and if it classifies a single 30-millisecond chunk of audio as speech, you don't really want to output a time for that. You probably want to look over the last 0.3 seconds (or so) of audio and check whether the majority of the 30-millisecond chunks in that period are classified as speech. If they are, you output the start time of that 0.3-second period as the start of speech. Then you do something similar to detect when the speech ends: wait for a 0.3-second period of audio in which the majority of the 30-millisecond chunks are not classified as speech by the VAD, and when that happens, output the end time as the end of speech.

You may need to tweak the timing a little to get good results for your purposes. For example, you might decide that you need 0.2 seconds of audio where more than 30% of the chunks are classified as speech by the VAD before you trigger, and 1.0 second of audio where more than 50% of the chunks are classified as non-speech before you de-trigger.

A ring buffer (collections.deque in Python) is a useful data structure for keeping track of the last N chunks of audio and their classifications.
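
A rough sketch of that smoothing pass over the per-chunk flags produced above, using collections.deque; the 0.3-second window and the 90%/10% majorities are illustrative values that would need tuning:

import collections

frame_ms = 30
window_frames = 10  # roughly 0.3 s of 30 ms chunks
ring = collections.deque(maxlen=window_frames)

segments = []       # list of (start_seconds, end_seconds) speech intervals
triggered = False
start_time = 0.0

for i, is_speech in enumerate(flags):  # flags: per-chunk booleans from vad.is_speech()
    ring.append(is_speech)
    voiced = sum(ring)
    t = i * frame_ms / 1000.0          # start time of the current chunk
    if not triggered and len(ring) == ring.maxlen and voiced > 0.9 * ring.maxlen:
        triggered = True
        start_time = t - (window_frames - 1) * frame_ms / 1000.0  # start of the buffered window
    elif triggered and voiced < 0.1 * ring.maxlen:
        triggered = False
        segments.append((start_time, t))

if triggered:  # speech ran to the end of the file
    segments.append((start_time, len(flags) * frame_ms / 1000.0))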

+2
