Writing a speech recognition engine

Like many others, I decided to create my own speech recognition engine. As it turns out, this is not easy, and it is especially difficult for English, because there is a dramatic difference between how a word is written and how it is pronounced. Being from Georgia, I decided to write a speech recognizer for the Georgian language. In Georgian, you pronounce words EXACTLY as you write them; the spelling works like a transcription. Does this make my task much easier? Or are there even harder difficulties? :D

2 answers

Speech recognition is a complex domain with many specific algorithms, tools and methods. To create your own engine, you can start with the CMUSphinx open source speech recognition toolkit, which allows you to:

  • Collect and process the data needed to support the Georgian language.
  • Create models for Georgian.
  • Set up a speech recognition engine for Georgian.
  • Use the engine to create a speech recognition application that runs on the desktop, on a server, or on the iPhone (via OpenEars).

CMUSphinx already supports English, German, Spanish, French, Dutch, Russian, Mandarin, Icelandic, Italian and many other languages. Adding a new one is fairly straightforward. For newcomers, it usually takes a month or two of concentrated work to complete the required process.
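One of the resources needed for a new language is a phonetic dictionary, which in CMUSphinx is a plain text file with one word per line followed by its space-separated phones. As a rough illustration of why a close letter-to-sound correspondence helps, here is a minimal Python sketch that builds such entries by mapping each Georgian letter to one phone. The phone labels (`KQ`, `TSQ`, and so on) and the function names are hypothetical, chosen only for this example, and the one-letter-one-phone rule is an assumption that gives a first-pass dictionary, not a description of how the language is actually pronounced in running speech.

```python
# Hypothetical ASCII phone labels for the 33 letters of the modern Georgian
# alphabet. The labels are illustrative assumptions, not an official phone set.
LETTER_TO_PHONE = {
    "ა": "A", "ბ": "B", "გ": "G", "დ": "D", "ე": "E", "ვ": "V",
    "ზ": "Z", "თ": "T", "ი": "I", "კ": "KQ", "ლ": "L", "მ": "M",
    "ნ": "N", "ო": "O", "პ": "PQ", "ჟ": "ZH", "რ": "R", "ს": "S",
    "ტ": "TQ", "უ": "U", "ფ": "P", "ქ": "K", "ღ": "GH", "ყ": "QQ",
    "შ": "SH", "ჩ": "CH", "ც": "TS", "ძ": "DZ", "წ": "TSQ",
    "ჭ": "CHQ", "ხ": "KH", "ჯ": "J", "ჰ": "H",
}


def word_to_phones(word):
    """Map each letter of a word to exactly one phone (assumed 1:1 rule)."""
    try:
        return [LETTER_TO_PHONE[ch] for ch in word]
    except KeyError as err:
        raise ValueError(f"no phone defined for character {err.args[0]!r}")


def dictionary_lines(words):
    """Return 'word PH1 PH2 ...' lines, one per unique word, sorted."""
    return [f"{w} {' '.join(word_to_phones(w))}" for w in sorted(set(words))]


if __name__ == "__main__":
    for line in dictionary_lines(["მამა", "დედა"]):
        print(line)
```

A dictionary generated this way is only a starting point: entries that the recordings show to be pronounced differently would still need to be corrected by hand.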

To get started, visit the homepage:

http://cmusphinx.sourceforge.net

and read the tutorial

http://cmusphinx.sourceforge.net/wiki/tutorial

If you have any questions, please post them on the forums or here!

Also, it is a very common misconception that when you speak Georgian you simply pronounce the sounds as they are written. That is not the case for most languages in the world. To test the hypothesis, record some audio, open it in an audio editor, and check which sounds are actually being pronounced. You will be surprised. The tutorial covers this issue in detail.


Do all people from Georgia sound exactly the same? I think not. Many of the serious problems in speech recognition are not directly related to the language itself:

  • different people (women, men, children, the elderly, etc.) have different voices
  • sometimes the same person sounds different, for example, when they have a cold
  • various background noises
  • everyday speech sometimes contains words from other languages (for example, the German word Kindergarten in US English)
  • some people learned the language as non-native speakers (they usually sound different)
  • some people speak faster, others speak slower
  • microphone quality
    etc.

Solving these problems is always quite difficult. On top of that, you have the language and its pronunciation to take care of. I don't know Georgian, but what you describe may make the task a little easier; it will still be a difficult task.

EDIT - according to the comments:

Using good libraries can shorten the time frame and even improve quality, but not every library is good for speech recognition, even if it is brilliant at other audio-related problems.

For reference, see the Wikipedia article http://en.wikipedia.org/wiki/Speech_recognition. It gives a good overview, including links and book references that are a good starting point.

For how to design such an API, see, for example, http://java.sun.com/products/java-media/speech/forDevelopers/jsapi-guide/Recognition.html
