This is from the LingPipe tutorial on building a language model, but I only partially understand the theory underlying it.
In particular, I do not understand how the base probability is computed.


For example, how is the base P(d) obtained? Suppose the following are some of the tokens and their frequencies in the unigram file:
ab 20, aba 3, abd 2, abef 2, abkk 3
Given these counts, what are lambda(), 1 - lambda(), extCount, numExtensions, and the base P(ab)? I am asking this as a single question because these quantities are all linked in a chain.
Thank you very much.
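
To make the chain between these quantities concrete, here is a minimal sketch, assuming the Witten-Bell style formula lambda(ctx) = extCount(ctx) / (extCount(ctx) + K * numExtensions(ctx)) as I understand it from the LingPipe language model documentation. It also assumes the listed entries are counts of character sequences and that they are the only data, so the extensions of "ab" are found by looking at the next character of each longer sequence. The function names and the value of K are hypothetical and are not LingPipe's API.

```python
from collections import defaultdict

# Counts copied from the question. Treating them as character-sequence
# counts (rather than whole-token counts) is an assumption of this sketch.
counts = {"ab": 20, "aba": 3, "abd": 2, "abef": 2, "abkk": 3}


def extension_stats(counts, context):
    """Return (ext_count, num_extensions) for the given context.

    ext_count is the summed count of all listed sequences that continue
    the context; num_extensions is the number of distinct next characters.
    """
    next_char_counts = defaultdict(int)
    for seq, n in counts.items():
        if seq.startswith(context) and len(seq) > len(context):
            next_char_counts[seq[len(context)]] += n
    return sum(next_char_counts.values()), len(next_char_counts)


def witten_bell_lambda(ext_count, num_extensions, k):
    """Interpolation weight: ext_count / (ext_count + k * num_extensions).

    k stands in for the smoothing hyperparameter; its value here is a
    placeholder, not a documented default.
    """
    if ext_count == 0:
        return 0.0  # unseen context: all weight goes to the lower-order estimate
    return ext_count / (ext_count + k * num_extensions)


ext_count, num_extensions = extension_stats(counts, "ab")
lam = witten_bell_lambda(ext_count, num_extensions, k=1.0)
print(ext_count, num_extensions, lam, 1.0 - lam)
```

Under these assumptions the context "ab" has extCount = 3 + 2 + 2 + 3 = 10 and numExtensions = 4 (the next characters a, d, e, k). The weight lambda then multiplies the maximum-likelihood estimate for the full context, and 1 - lambda multiplies the estimate from the shorter context. The base probability is not computed here because it depends on how the lowest-order model is defined.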