For my final-year project, I am trying to identify sounds (a dog bark / a bird) in real time from recorded sound clips. I am using MFCCs as the audio features. Initially, I extracted a total of 12 MFCC values from a sound clip using the jAudio library. Now I am trying to train a machine learning algorithm (I have not settled on an algorithm yet, but it will most likely be SVM). Each sound clip is approximately 3 seconds long. I need to clarify a few points about this process:
Should I train the algorithm on frame-level MFCCs (12 coefficients per frame) or on clip-level MFCCs (12 coefficients for the whole clip)?
To train the algorithm, should I treat the 12 MFCCs as 12 separate attributes, or as one single attribute?
These are the clip-level MFCCs for one clip:
-9.598802712290967 -21.644963856237265 -7.405551798816725 -11.638107212413201 -19.441831623156144 -2.780967392843105 -0.5792847321137902 -13.14237288849559 -4.920408873192934 -2.7111507999281925 -7.336670942457227 2.4687330348335212
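For reference, this is how I am currently thinking of collapsing frame-level MFCCs into a single clip-level feature vector, where each of the 12 averaged coefficients would be its own attribute for the SVM. This is only a minimal sketch in plain Java; the frames x 12 array shape is my assumption about how jAudio returns the data, and the class and method names are just placeholders:

```java
// Minimal sketch, no external dependencies: average each MFCC coefficient
// across all frames to get one 12-dimensional clip-level feature vector.
// Assumption: frameMfccs has shape [numFrames][12].
public class MfccAggregator {

    /** Mean of each coefficient across all frames of the clip. */
    public static double[] clipLevelMfcc(double[][] frameMfccs) {
        int numCoeffs = frameMfccs[0].length;   // 12 in this case
        double[] mean = new double[numCoeffs];
        for (double[] frame : frameMfccs) {
            for (int c = 0; c < numCoeffs; c++) {
                mean[c] += frame[c];
            }
        }
        for (int c = 0; c < numCoeffs; c++) {
            mean[c] /= frameMfccs.length;
        }
        return mean;   // one training instance with 12 attributes
    }

    public static void main(String[] args) {
        // Toy data: 3 frames x 12 coefficients (values are made up).
        double[][] frames = new double[3][12];
        for (int f = 0; f < 3; f++)
            for (int c = 0; c < 12; c++)
                frames[f][c] = f + 0.1 * c;

        double[] features = clipLevelMfcc(frames);
        // Each of the 12 values printed here would be a separate
        // attribute/column in the SVM training data.
        System.out.println(java.util.Arrays.toString(features));
    }
}
```

Is averaging over frames like this a sensible way to get clip-level features, or should the per-frame vectors be kept separate?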
Any help with these questions would be truly appreciated. I could not find good guidance on Google. :)