NLTK named entity recognition for Dutch

I am trying to extract named entities from Dutch text. I used nltk-trainer to train a tagger and a chunker on the Dutch portion of the conll2002 corpus. However, the chunker's parse method does not detect any named entities. Here is my code:

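    import nltk

    # 'text' instead of 'str', to avoid shadowing the built-in type
    text = 'Christiane heeft een lam.'

    # Load the tagger and chunker trained with nltk-trainer
    tagger = nltk.data.load('taggers/dutch.pickle')
    chunker = nltk.data.load('chunkers/dutch.pickle')

    # Tokenize and POS-tag the sentence, then chunk the tagged tokens
    str_tags = tagger.tag(nltk.word_tokenize(text))
    print(str_tags)
    str_chunks = chunker.parse(str_tags)
    print(str_chunks)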

And the output of this program:

    [('Christiane', u'N'), ('heeft', u'V'), ('een', u'Art'), ('lam', u'Adj'), ('.', u'Punc')]
    (S Christiane/N heeft/V een/Art lam/Adj ./Punc)

I was expecting Christiane to be recognized as a named entity. Any help?

1 answer

The conll2002 corpus contains both Spanish and Dutch text, so you should restrict training to the Dutch files with the --fileids parameter, as in python train_chunker.py conll2002 --fileids ned.train . Training on Spanish and Dutch together will give poor results.

The default algorithm is a tagger-based chunker, which does not work on conll2002. Instead, use a classifier-based chunker such as NaiveBayes, so the full command might look like this (and I confirmed that the resulting chunker recognizes "Christiane" as "PER"):

python train_chunker.py conll2002 --fileids ned.train --classifier NaiveBayes --filename ~/nltk_data/chunkers/conll2002_ned_NaiveBayes.pickle
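
To check what a retrained chunker actually found, it can help to flatten its output into IOB triples, the tagging scheme conll2002 uses (B-PER, I-PER, O and so on), and group them back into entities. The following is only a minimal sketch: the helper name entities_from_iob and the hand-written triples are illustrative assumptions, not output captured from a real chunker.

```python
def entities_from_iob(triples):
    """Group (word, pos, iob) triples into (label, phrase) pairs.

    Hypothetical helper for inspecting chunker output; not part of NLTK.
    """
    entities = []
    current_label = None
    current_words = []
    for word, _pos, iob in triples:
        # 'B-PER' -> ('B', '-', 'PER'); 'O' -> ('O', '', '')
        prefix, _, label = iob.partition('-')
        # A B- tag, or an I- tag with a different label, starts a new entity.
        starts_new = prefix == 'B' or (prefix == 'I' and label != current_label)
        # Close the open entity when we leave it or a new one begins.
        if current_words and (prefix == 'O' or starts_new):
            entities.append((current_label, ' '.join(current_words)))
            current_label, current_words = None, []
        if prefix in ('B', 'I'):
            current_label = label
            current_words.append(word)
    # Flush an entity that runs to the end of the sentence.
    if current_words:
        entities.append((current_label, ' '.join(current_words)))
    return entities


# Hand-written triples, assuming the chunker tags 'Christiane' as PER.
triples = [
    ('Christiane', 'N', 'B-PER'),
    ('heeft', 'V', 'O'),
    ('een', 'Art', 'O'),
    ('lam', 'Adj', 'O'),
    ('.', 'Punc', 'O'),
]
print(entities_from_iob(triples))
```

With NLTK installed, real triples could presumably be obtained by passing the tree returned by chunker.parse to nltk.chunk.tree2conlltags, after loading the pickle written by the command above.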
