First of all, as pointed out in the comments on your question, if you need a model you can train and run in production, KenLM is probably the best choice. nltk.model is currently designed primarily for teaching/prototyping purposes, so it is not fast.
Second, the NgramModel class has been removed from NLTK releases because of long-standing bugs, and the model module is currently being rewritten. The description below is based on that rewrite, so the API may still change.

With that caveat, training an ngram language model breaks down into 3 steps.
1. Vocabulary
Every ngram model needs a way to deal with words it has not seen before. This is the job of the "vocabulary" (lexicon): it defines which words the model knows. During training, any word that is not in the vocabulary is replaced by a special UNKNOWN token, and at scoring time out-of-vocabulary words are mapped to UNKNOWN as well. The vocabulary also supports a count cutoff: words that occur fewer times than the cutoff are likewise treated as UNKNOWN. This matters when training on large corpora such as Gigaword or the Wall Street Journal.
This step is handled by nltk.model.build_vocabulary.
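To make the idea concrete, here is a plain-Python sketch of the cutoff/UNKNOWN mechanism described above. It is not the nltk.model.build_vocabulary API (whose exact signature may differ in the rewrite); the function names and the `<UNK>` token string are my own choices for illustration:

```python
from collections import Counter

# Hypothetical names for illustration; the real NLTK API may differ.
UNK = "<UNK>"

def build_vocabulary(cutoff, words):
    """Keep only words whose count reaches the cutoff."""
    counts = Counter(words)
    return {w for w, c in counts.items() if c >= cutoff}

def mask_oov(words, vocab):
    """Replace out-of-vocabulary words with the UNKNOWN token."""
    return [w if w in vocab else UNK for w in words]

corpus = "a a a b b c".split()
vocab = build_vocabulary(2, corpus)        # {'a', 'b'}
print(mask_oov("a b c d".split(), vocab))  # ['a', 'b', '<UNK>', '<UNK>']
```

With a cutoff of 2, the rare word "c" drops out of the vocabulary, so it is scored the same way as the entirely unseen word "d".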
2. Counting ngrams up to order = N
The next question is: given an order N, which ngrams do we count? It is not enough to count only the ngrams of the highest order; smoothing and backoff methods also need the counts of all lower-order ngrams, so those are collected as well. This is handled by nltk.model.count_ngrams, which returns an NgramCounter object storing the counts of ngrams of every order.
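The following sketch shows what "counting all orders up to N" means in plain Python. It is not the NgramCounter implementation, just an assumed-equivalent structure for illustration:

```python
from collections import Counter

def count_ngrams(order, tokens):
    """Count ngrams of every order from 1 up to `order`,
    so lower-order counts are available for smoothing/backoff."""
    counts = {n: Counter() for n in range(1, order + 1)}
    for n in range(1, order + 1):
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts

tokens = "the cat sat on the mat".split()
counts = count_ngrams(2, tokens)
print(counts[1][("the",)])        # 2
print(counts[2][("the", "cat")])  # 1
```

Note that for a trigram model (N = 3) you would get unigram, bigram, and trigram tables, all from one pass over the data.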
3. Turning Counts into Scores (probabilities)
Finally, the raw counts have to be turned into scores (probabilities). The rewrite ships with classes for plain MLE and for Lidstone smoothing, each documented with doctests. You can also implement your own scoring by subclassing the base model class and overriding its score method. If you come up with something useful, NLTK welcomes contributions!
Example:

The following is based on ngram.py. Here is what an MLE model looks like:
```python
from nltk.model import BaseNgramModel

class MLENgramModel(BaseNgramModel):
    def score(self, context, word):
        # Count of the full ngram: context followed by word.
        ngram_count = self.ngrams[context][word]
        # Total count of all ngrams sharing this context.
        # (The original snippet had a typo here: self.ngram.)
        context_count = self.ngrams[context].N()
        if context_count == 0:
            return 0
        return ngram_count / context_count
```
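The same MLE formula, score(word | context) = count(context + word) / count(context), can be demonstrated without NLTK at all. This is a standalone sketch using plain Counters, not the BaseNgramModel machinery:

```python
from collections import Counter

def mle_score(bigram_counts, context, word):
    """MLE probability of `word` given `context` from raw bigram counts."""
    context_total = sum(c for (ctx, _), c in bigram_counts.items() if ctx == context)
    if context_total == 0:
        return 0.0  # unseen context, same fallback as the class above
    return bigram_counts[(context, word)] / context_total

# Build bigram counts keyed by (context_tuple, word).
bigram_counts = Counter()
tokens = "the cat sat on the mat".split()
for a, b in zip(tokens, tokens[1:]):
    bigram_counts[((a,), b)] += 1

print(mle_score(bigram_counts, ("the",), "cat"))  # 0.5
```

"the" is followed twice in the corpus (by "cat" and by "mat"), so P(cat | the) = 1/2, matching what MLENgramModel.score would compute from the same counts.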