Training an NgramModel in Python

I am using Python 3.5, installed and managed with Anaconda. I want to train an NgramModel (from nltk) on some text, but my installation cannot find the nltk.model module.

There are several possible answers to this question (choose the right one and explain how to do it):

  • A different version of nltk that contains the model module can be installed using conda. This is not simply an older release (it would have to be too old), but a different build containing the model (or model2) branch of current nltk development.
  • The nltk version mentioned in the previous point cannot be installed using conda, but can be installed using pip.
  • nltk.model is deprecated; it is better to use a different package (explain which package).
  • There are better options than nltk for training an ngram model, using some other library (explain which library).
  • None of the above; the best option for training an ngram model is something else (explain what).
1 answer

First of all, as pointed out in the comments on your question, if training or running speed is a concern for you, KenLM is probably the better choice. At this point nltk.model is designed primarily for educational and prototyping purposes; it is not fast.

If you still decide to stick with NLTK, read on. The NgramModel code discussed here is the one in the model branch.

Training an ngram model takes 3 steps.

1. Building the vocabulary

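An ngram model needs a fixed set of words that it knows. When an ngram is scored, any word outside that vocabulary is treated as a special token, UNKNOWN, so all unseen words are counted as UNKNOWN.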

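The vocabulary does not have to come from the training text itself. For example, it could be built from Gigaword while the model is trained on the Wall Street Journal.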

This step is handled by nltk.model.build_vocabulary.
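
To make the vocabulary step concrete, here is a minimal sketch in plain Python. It is not the nltk.model implementation: the function names, the `cutoff` parameter and the `"<UNK>"` marker string are all assumptions chosen for illustration.

```python
from collections import Counter

UNKNOWN = "<UNK>"  # hypothetical marker string, not necessarily what NLTK uses


def build_vocab(tokens, cutoff=2):
    """Return the set of tokens seen at least `cutoff` times (assumed rule)."""
    counts = Counter(tokens)
    return {tok for tok, n in counts.items() if n >= cutoff}


def mask_oov(tokens, vocab):
    """Replace every out-of-vocabulary token with the UNKNOWN marker."""
    return [tok if tok in vocab else UNKNOWN for tok in tokens]


text = "the cat sat on the mat the cat ran".split()
vocab = build_vocab(text, cutoff=2)              # only "the" and "cat" survive
masked = mask_oov("the dog sat".split(), vocab)  # "dog" and "sat" become UNKNOWN
```

Rare words are folded into UNKNOWN here so that the model has some count for words it has never seen; the right cutoff depends on the size of the corpus.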

2. Counting ngrams

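Once the vocabulary is fixed, the ngrams in the training text have to be counted, since the model is built from those counts.

nltk.model.count_ngrams does this for you. The result is an NgramCounter, which stores the ngram counts and lets you look them up.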
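
The counting step can be sketched without NLTK as follows. This only illustrates the idea and is not the NgramCounter class: the function name `count_context_words`, the padding symbols and the dict-of-Counter layout are assumptions.

```python
from collections import Counter, defaultdict


def count_context_words(order, sentences):
    """Map each context (a tuple of order-1 tokens) to a Counter of next words."""
    counts = defaultdict(Counter)
    for sent in sentences:
        # Pad so that the first real word also has a full-length context.
        padded = ["<s>"] * (order - 1) + list(sent) + ["</s>"]
        for i in range(len(padded) - order + 1):
            context = tuple(padded[i:i + order - 1])
            word = padded[i + order - 1]
            counts[context][word] += 1
    return counts


sentences = [["a", "b", "a", "b"], ["a", "c"]]
bigrams = count_context_words(2, sentences)
# bigrams[("a",)] now holds how often each word followed "a"
```

Grouping counts by context keeps the later scoring step simple: the count of a context is just the sum of the counts stored under it.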

3. From counts to scores

The last step turns the counts into scores.

Estimators such as MLE and Lidstone are covered in the doctest.
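
As a rough illustration of how MLE and Lidstone scores differ, the sketch below computes both from the counts for a single context. The function names, the `gamma` default and the example vocabulary size are assumptions, and this is not the NLTK code.

```python
from collections import Counter


def mle_score(counts, word):
    """Relative frequency of `word` in this context; 0.0 for an unseen context."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


def lidstone_score(counts, word, vocab_size, gamma=0.5):
    """Add `gamma` to every count so that unseen words get a non-zero score."""
    total = sum(counts.values())
    return (counts[word] + gamma) / (total + gamma * vocab_size)


after_a = Counter({"b": 2, "c": 1})  # words observed after the context "a"
mle_seen = mle_score(after_a, "b")                      # 2 / 3
mle_unseen = mle_score(after_a, "d")                    # 0 / 3
lid_seen = lidstone_score(after_a, "b", vocab_size=4)   # (2 + 0.5) / (3 + 2)
lid_unseen = lidstone_score(after_a, "d", vocab_size=4) # (0 + 0.5) / (3 + 2)
```

MLE gives zero to anything not seen in training, which is why smoothed estimators such as Lidstone are usually preferred when the model is evaluated on new text.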

For a concrete example, see ngram.py. An MLE model looks like this:

from nltk.model import BaseNgramModel

class MLENgramModel(BaseNgramModel):

    def score(self, context, word):
        # how many times word occurs with context
        ngram_count = self.ngrams[context][word]
        # how many times the context itself occurred; we take advantage of
        # the fact that self.ngrams[context] is a FreqDist and has a method
        # FreqDist.N() which counts all the samples in it.
        context_count = self.ngrams[context].N()

        # In case context_count is 0 we shouldn't be dividing by it
        # and just return 0
        if context_count == 0:
            return 0
        # otherwise we can return the standard MLE score
        return ngram_count / context_count