I don't think this is feasible as a one-person thesis if you try to build everything from scratch. It can be feasible if you combine existing pieces.
First, I would look for open source libraries and try them out as they are. That may restrict what you can do, but that is fine, because the whole task is quite large. It may be advisable to get a quick and dirty solution working first, for example taking a recorded sound file and using the library to recognize sounds. Then add integration with the other parts, fancy output, audio recording, etc.
I mean something like this: https://dsp.stackexchange.com/a/2462
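As a rough illustration of the "quick and dirty first" step, here is a minimal Python sketch that reads a recorded WAV file and hands the samples to a recognizer. It does not reproduce the linked answer. The function names (`read_wav_mono`, `recognize`, `write_test_tone`) are made up for this example, and `recognize` is only a placeholder that estimates a frequency from zero crossings; it stands in for whatever call the library you pick actually provides. The sketch also assumes 16-bit PCM input.

```python
import math
import struct
import wave


def read_wav_mono(path):
    """Return (sample_rate, samples) from a 16-bit PCM WAV file, first channel only."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    values = struct.unpack("<%dh" % (len(raw) // 2), raw)
    # Frames are interleaved by channel; keep every channels-th sample.
    return rate, list(values[::channels])


def recognize(samples, rate):
    """Hypothetical stand-in for the library call.

    Counts upward zero crossings per second, which roughly approximates
    the frequency of a clean, single tone.
    """
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    duration = len(samples) / rate
    return crossings / duration if duration else 0.0


def write_test_tone(path, freq=440.0, rate=8000, seconds=1.0):
    """Write a sine tone so the pipeline can be tried without a real recording."""
    count = int(rate * seconds)
    frames = b"".join(
        struct.pack("<h", int(20000 * math.sin(2 * math.pi * freq * i / rate)))
        for i in range(count)
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(frames)


if __name__ == "__main__":
    write_test_tone("tone.wav")
    rate, samples = read_wav_mono("tone.wav")
    print("estimated frequency (Hz):", recognize(samples, rate))
```

Once this runs end to end, the body of `recognize` can be replaced with the real library call, and recording, output formatting and the rest can be added around it one piece at a time.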
Open material may be scarce, since commercial interest in this area seems high.