FFT algorithm: what goes IN / OUT? (re: real-time pitch detection)

I am trying to extract pitch data from an audio stream. From what I see, it seems that FFT is the best algorithm to use.

Instead of delving right into the math, can someone help me figure out what this FFT algorithm does?

Please do not say anything obvious like "FFT extracts frequency data from a raw signal." I need the next level of detail.

What do I pass in, and what do I get out?

Once I understand the interface explicitly, this will help me understand the implementation.

I assume I need to pass in an audio buffer and say how many bytes to use for each computation (say, the most recent 1024 bytes of that buffer), and maybe I need to specify the range of pitches I want to detect. Now what is it going to pass back? An array of frequency bins? What are those?

(Edit:) I found a C++ algorithm to use (if only I can figure it out).

Performous extracts the pitch from the microphone, and its code is open source. Here is a description of what the algorithm does, from the guy who coded it.

  • PCM input (buffered)
  • FFT (1024 samples at a time, then removes 200 samples from the front of the buffer)
  • Reassignment method (comparing against the previous FFT, taken 200 samples earlier)
  • Peak filtering (this part could be done much better, or even left out)
  • Combining peaks into sets of harmonics (we call the combination a tone)
  • Temporal filtering of tones (update the set of previously detected tones instead of simply using the newly detected ones)
  • Pick the best vocal tone (frequency limits, weighting; the harmonic array could be used too, but I don't think we do)
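To make the buffering in the first two steps concrete, here is a minimal Python sketch. It is only an illustration of "1024 samples at a time, then drop 200 from the front", not the Performous code; the function name `frames` and its parameters are made up for this example.

```python
def frames(samples, frame_size=1024, hop=200):
    """Yield successive analysis frames of `frame_size` samples.

    After each frame is handed out, `hop` samples are dropped from the
    front of the buffer, so consecutive frames overlap by
    frame_size - hop samples (824 with the defaults).
    """
    buffer = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) == frame_size:
            yield list(buffer)   # copy, so the caller may keep the frame
            del buffer[:hop]     # discard the oldest `hop` samples


# Stand-in for incoming PCM data: the values are just sample indices,
# which makes it easy to see where each frame starts and ends.
pcm = list(range(1500))
for frame in frames(pcm):
    print("frame covers samples", frame[0], "to", frame[-1])
```

Each frame would then go through the FFT. Because consecutive frames start only 200 samples apart, the same sound shows up in two successive FFTs, which is what lets a later step compare one FFT against the previous one.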

But can anyone help me understand how this works? What is sent from the FFT to the reassignment method?

+4
3 answers

There is an element of choice here. The simplest version to implement takes 2^n complex numbers in and gives 2^n complex numbers out, so maybe that is the one to start with.
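As a rough sketch of that simplest interface (2^n complex numbers in, 2^n complex numbers out), here is a textbook recursive radix-2 FFT in Python. It is written for readability rather than speed, and the function name `fft` is just a label for this example.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT.

    `x` is a sequence of 2**n numbers (real or complex); the result is a
    new list of 2**n complex values. The input is not modified.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])          # transform of the even-indexed samples
    odd = fft(x[1::2])           # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


# A constant signal puts all of its energy in bin 0.
spectrum = fft([1, 1, 1, 1])
```

Real-valued audio samples can be fed in directly, since a real number is just a complex number with a zero imaginary part.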

In the special case of the DCT (discrete cosine transform), what typically goes in is 2^n real samples (often floats), and what comes out is 2^n real values, again often floats. The DCT is closely related to the FFT (and can be computed with one), but it takes only real values and analyzes the function in terms of cosines.
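For comparison, here is a minimal sketch of a DCT with the "real samples in, real values out" interface. It uses the unnormalized DCT-II definition evaluated directly in O(N^2), which is an assumption on my part: libraries differ in which DCT variant and which scaling they use.

```python
import math


def dct2(x):
    """Unnormalized DCT-II: N real samples in, N real coefficients out.

    Coefficient k measures how much of a cosine with k half-cycles
    across the frame is present in the input.
    """
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]


# A constant input has only a k = 0 (DC) component.
coefficients = dct2([1.0, 1.0, 1.0, 1.0])
```

Note that there are no complex numbers anywhere, so there is no phase information in the output either.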

It is smart (but commonly overlooked) to define a struct for handling the complex values. Traditionally FFTs are done in place, but it works fine if you don't.

It can also be useful to create an instance of a class that holds the work buffer for the FFT (if you do not want to do the FFT in place) and to reuse it for multiple FFTs.
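Here is one way that reusable-object idea might look, sketched in Python. The class name `FFTPlan` and its layout are hypothetical. It precomputes the twiddle factors and the bit-reversed index order once, then reuses a single work buffer for every transform, leaving the caller's input untouched (so it is not in place). Python's built-in `complex` plays the role of the struct.

```python
import cmath


def _bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the integer `i`."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


class FFTPlan:
    """Reusable state for repeated FFTs of one fixed power-of-two size."""

    def __init__(self, n):
        if n < 1 or n & (n - 1):
            raise ValueError("n must be a power of two")
        self.n = n
        bits = n.bit_length() - 1
        self.order = [_bit_reverse(i, bits) for i in range(n)]
        self.twiddles = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
        self.work = [0j] * n          # overwritten by every transform

    def transform(self, samples):
        """Return the FFT of `samples` (length n) in the shared work buffer."""
        n, work, tw = self.n, self.work, self.twiddles
        for i, j in enumerate(self.order):
            work[i] = complex(samples[j])   # copy in bit-reversed order
        size = 2
        while size <= n:                    # iterative butterflies
            half, step = size // 2, n // size
            for start in range(0, n, size):
                for k in range(half):
                    a = work[start + k]
                    b = work[start + k + half] * tw[k * step]
                    work[start + k] = a + b
                    work[start + k + half] = a - b
            size *= 2
        return work


plan = FFTPlan(8)
first = list(plan.transform([1, 0, 0, 0, 0, 0, 0, 0]))   # copy before reuse
second = plan.transform([1, 1, 1, 1, 1, 1, 1, 1])
```

The trade-off is that the returned list is the shared buffer, so a result that must outlive the next call has to be copied, as `first` is above.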

+2

The FFT is just one building block in this process, and it may not even be the best approach for pitch detection. Read up on pitch detection and decide which algorithm you want to use first (that will depend on what you are trying to measure the pitch of: speech, a single musical instrument, other types of sound, etc.). Get that right before getting into low-level details such as the FFT (some, but not all, pitch detection algorithms use the FFT internally).
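As an illustration of a pitch estimator that does not use the FFT at all, here is a bare-bones time-domain autocorrelation sketch in Python. The function name, the default 80 to 1000 Hz search range, and the test tone are assumptions made for this example; a usable detector would need normalization, a voicing threshold, and protection against octave errors.

```python
import math


def autocorrelation_pitch(samples, sample_rate, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency of `samples` in Hz.

    Tries every lag between sample_rate/fmax and sample_rate/fmin and
    picks the one where the signal best matches a delayed copy of
    itself. Returns None if no lag gives a positive correlation.
    """
    n = len(samples)
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(n - 1, int(sample_rate / fmin))
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None


# 200 Hz sine sampled at 8 kHz: one period is exactly 40 samples.
rate = 8000
tone = [math.sin(2 * math.pi * 200 * i / rate) for i in range(1024)]
estimate = autocorrelation_pitch(tone, rate)
```

Here the interface is "samples and a pitch range in, one frequency out", which is quite different from what an FFT gives you.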

There are many similar questions on SO already, e.g. "Real-time pitch detection using FFT" and "Pitch detection using FFT for trumpet", and there is good overview material on Wikipedia, etc. Read these and then decide whether you want to roll your own FFT-based solution or perhaps use an existing library that suits your specific application.

+3

In go N PCM samples (purely real complex numbers). Out come N frequency-domain bins (each bin corresponding to a 1/N slice of the sample rate). Each bin is a complex number. For real input, the bins above N/2 mirror the ones below as complex conjugates, so only bins 0 to N/2 carry independent information. Instead of real and imaginary parts, these values should usually be handled in polar form (absolute value and argument). The absolute value tells you the amount of sound near the bin's center frequency, while the argument tells you the phase (where in its cycle the sine wave is).
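To show what those bins look like, here is a small Python sketch. It uses a naive O(N^2) DFT, which produces the same numbers an FFT would, only more slowly; the helper names and the 8 kHz / 64-sample test tone are choices made for this example.

```python
import cmath
import math


def dft(samples):
    """Naive DFT: N real samples in, N complex bins out."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def describe_bins(samples, sample_rate):
    """Return (center_frequency_hz, magnitude, phase_radians) for bins 0..N/2."""
    n = len(samples)
    result = []
    for k, value in enumerate(dft(samples)[: n // 2 + 1]):
        magnitude, phase = cmath.polar(value)   # polar form of the bin
        result.append((k * sample_rate / n, magnitude, phase))
    return result


sample_rate = 8000
n = 64
# Cosine placed exactly on bin 8: 8 * 8000 / 64 = 1000 Hz.
tone = [math.cos(2 * math.pi * 1000 * t / sample_rate) for t in range(n)]
bins = describe_bins(tone, sample_rate)
loudest = max(bins, key=lambda b: b[1])   # the bin with the largest magnitude
```

Bin k sits at k * sample_rate / N hertz, so with these numbers neighboring bins are 125 Hz apart. A tone that does not land exactly on a bin center spreads its energy over the neighboring bins.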

Most often, coders use only the magnitude (absolute value) and throw away the phase angle (argument).
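A magnitude-only approach could be sketched like this in Python: window the frame, take the absolute value of each bin, and report the center frequency of the largest one. The Hann window, the naive DFT, and the function names are choices made for this example, not something taken from a particular library.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Magnitudes of bins 0..N/2 of a Hann-windowed frame (naive DFT)."""
    n = len(samples)
    windowed = [
        s * (0.5 - 0.5 * math.cos(2 * math.pi * t / n))
        for t, s in enumerate(samples)
    ]
    return [
        abs(sum(windowed[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def peak_frequency(samples, sample_rate):
    """Center frequency (Hz) of the strongest bin, ignoring the DC bin."""
    mags = magnitude_spectrum(samples)
    k = max(range(1, len(mags)), key=mags.__getitem__)
    return k * sample_rate / len(samples)


rate = 8000
tone = [math.sin(2 * math.pi * 440 * t / rate) for t in range(256)]
estimate = peak_frequency(tone, rate)
```

With 256 samples at 8 kHz the bins are 31.25 Hz apart, so the 440 Hz tone is reported as 437.5 Hz, the center of the nearest bin. Discarding the phase limits the estimate to the bin spacing, which is too coarse for musical pitch at low frequencies.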

+1
