This is not a trivial task. Building a language model takes time and resources.
If you want a “good” language model, you will need a large or very large text corpus to train it (think on the order of several years' worth of verbatim transcripts).
“Good” means that the language model can generalize from its training data to new, previously unseen input.
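One common way to check generalization is to compare perplexity on the training text with perplexity on held-out text. The sketch below is only an illustration: it uses a toy add-one-smoothed unigram model, and the sentences and the function names `train_unigram` and `perplexity` are made up for this example, not taken from Sphinx or HTK.

```python
import math
from collections import Counter

def train_unigram(tokens):
    """Count word occurrences in the training tokens."""
    return Counter(tokens), len(tokens)

def perplexity(model, vocab_size, tokens):
    """Perplexity of `tokens` under an add-one-smoothed unigram model."""
    counts, total = model
    log_prob = 0.0
    for word in tokens:
        # Add-one smoothing gives unseen words a small non-zero probability.
        p = (counts[word] + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(tokens))

# Toy data (hypothetical); a real model needs a far larger corpus.
train = "the cat sat on the mat".split()
held_out = "the dog sat on the rug".split()
vocab = set(train) | set(held_out)

model = train_unigram(train)
ppl_train = perplexity(model, len(vocab), train)
ppl_held = perplexity(model, len(vocab), held_out)
print(f"train perplexity:    {ppl_train:.2f}")
print(f"held-out perplexity: {ppl_held:.2f}")
```

A large gap between the two numbers suggests the model has memorized its training text rather than generalized, which is what a bigger corpus helps with.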
You should read the documentation for the Sphinx and HTK language modeling tools.
http://cmusphinx.sourceforge.net/wiki/tutoriallm
Also check out these two threads:
Creating a compatible openears language model
Ruby Text Analysis
You can also take a more general language model trained on a larger corpus and interpolate your smaller language model with it, for example as a back-off language model ... but this is not a trivial task either.
See: Katz's back-off model
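To make the interpolation idea concrete, here is a minimal sketch of plain linear interpolation between a small domain model and a larger general one. Note that this is not Katz back-off, which discounts counts and falls back to lower-order n-grams; it is the simpler weighted mix. The `BigramModel` class, the toy sentences, and the weight `lam = 0.7` are all assumptions for illustration. In practice the weight is usually tuned on held-out data.

```python
from collections import Counter

class BigramModel:
    """Toy add-one-smoothed bigram model over a fixed, shared vocabulary."""

    def __init__(self, tokens, vocab):
        self.vocab = vocab
        self.bigrams = Counter(zip(tokens, tokens[1:]))
        self.history = Counter(tokens[:-1])

    def prob(self, word, prev):
        # P(word | prev) with add-one smoothing.
        return (self.bigrams[(prev, word)] + 1) / (self.history[prev] + len(self.vocab))

def interpolated_prob(word, prev, small, large, lam):
    """Linear interpolation: lam * P_small + (1 - lam) * P_large."""
    return lam * small.prob(word, prev) + (1 - lam) * large.prob(word, prev)

# Hypothetical corpora: a tiny domain-specific one and a "general" one.
domain = "call mom now call dad now".split()
general = "please call me now and then call me later".split()
vocab = set(domain) | set(general)

small = BigramModel(domain, vocab)
large = BigramModel(general, vocab)
lam = 0.7  # weight on the domain model; an assumed value

p_mix = interpolated_prob("me", "call", small, large, lam)
total = sum(interpolated_prob(w, "call", small, large, lam) for w in vocab)
print(f"P(me | call) = {p_mix:.3f}, sum over vocabulary = {total:.3f}")
```

Because both component models are proper distributions over the same vocabulary, the mixture is one as well, so the domain model can stay small while the general model covers word sequences it has never seen.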
Tilo