How to recognize a lowercase named entity, like kobe bryant, with CoreNLP?

I ran into a problem: CoreNLP only recognizes a named entity such as Kobe Bryant when it starts with a capital letter, and does not recognize kobe bryant as a person. How can I get CoreNLP to recognize named entities that start with a lowercase letter? Thanks!

+1
3 answers

First, you have to accept that it is harder to get named entities right in lowercase or inconsistently cased English text than in formal text, where capital letters are a great clue. (This is also one of the reasons why Chinese NER is harder than English NER.) Nevertheless, there are things you must do to get CoreNLP working reasonably well with lowercase text, because the default models are trained to work well on well-edited text.

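If your text is well edited, use the default English models. If it is (mostly) lowercase, use one of the two approaches below. If the casing is mixed, you can use the truecaser or give NER several models at once (as a comma-separated list in ner.model).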

Approach 1: Use caseless models. These are English models that ignore case information.

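Approach 2: Use a truecaser. The truecase annotator tries to restore standard capitalization. Run it first, then use the regular annotators.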


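Important: the extra components used below require the English models jar in addition to the default models jar.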

Here is an example. Start with this sample text:

% cat lakers.txt
lonzo ball talked about kobe bryant after the lakers game.

With the default models, no entities are found, and the names are all tagged as common nouns:

% java edu.stanford.nlp.pipeline.StanfordCoreNLP -file lakers.txt -outputFormat conll -annotators tokenize,ssplit,pos,lemma,ner
% cat lakers.txt.conll 
1   lonzo   lonzo   NN  O   _   _
2   ball    ball    NN  O   _   _
3   talked  talk    VBD O   _   _
4   about   about   IN  O   _   _
5   kobe    kobe    NN  O   _   _
6   bryant  bryant  NN  O   _   _
7   after   after   IN  O   _   _
8   the     the     DT  O   _   _
9   lakers  laker   NNS O   _   _
10  game    game    NN  O   _   _
11  .   .   .   O   _   _

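Below, we specify the caseless models for POS tagging and NER, and things look much better: the name words are now tagged as proper nouns, and both person names are recognized. The team name is still missed.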

% java edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat conll -annotators tokenize,ssplit,pos,lemma,ner -file lakers.txt -pos.model edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger -ner.model edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz,edu/stanford/nlp/models/ner/english.muc.7class.caseless.distsim.crf.ser.gz,edu/stanford/nlp/models/ner/english.conll.4class.caseless.distsim.crf.ser.gz
% cat lakers.txt.conll 
1   lonzo   lonzo   NNP PERSON  _   _
2   ball    ball    NNP PERSON  _   _
3   talked  talk    VBD O   _   _
4   about   about   IN  O   _   _
5   kobe    kobe    NNP PERSON  _   _
6   bryant  bryant  NNP PERSON  _   _
7   after   after   IN  O   _   _
8   the     the     DT  O   _   _
9   lakers  lakers  NNPS    O   _   _
10  game    game    NN  O   _   _
11  .   .   .   O   _   _
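
The same caseless configuration can also be set up from Java code rather than on the command line. The following is a minimal sketch, not an official recipe: the class and method names (CaselessConfig, caselessProperties) are made up for illustration, while the property names and model paths are copied from the command above. It only builds the java.util.Properties object; with CoreNLP and the English models jar on the classpath, that object would be passed to the StanfordCoreNLP constructor.

```java
import java.util.Properties;

public class CaselessConfig {
    // Model paths copied verbatim from the command line above.
    static final String POS_MODEL =
        "edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger";
    static final String NER_MODELS = String.join(",",
        "edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz",
        "edu/stanford/nlp/models/ner/english.muc.7class.caseless.distsim.crf.ser.gz",
        "edu/stanford/nlp/models/ner/english.conll.4class.caseless.distsim.crf.ser.gz");

    /** Builds the same settings as the command-line flags (hypothetical helper). */
    static Properties caselessProperties() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        props.setProperty("pos.model", POS_MODEL);
        props.setProperty("ner.model", NER_MODELS); // comma-separated list of models
        return props;
    }

    public static void main(String[] args) {
        // With CoreNLP on the classpath, pass the result to:
        //   new edu.stanford.nlp.pipeline.StanfordCoreNLP(props)
        caselessProperties().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```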

Alternatively, we can run truecasing before POS tagging and NER:

% java edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat conll -annotators tokenize,ssplit,truecase,pos,lemma,ner -file lakers.txt -truecase.overwriteText
% cat lakers.txt.conll 
1   Lonzo   Lonzo   NNP PERSON  _   _
2   ball    ball    NN  O   _   _
3   talked  talk    VBD O   _   _
4   about   about   IN  O   _   _
5   Kobe    Kobe    NNP PERSON  _   _
6   Bryant  Bryant  NNP PERSON  _   _
7   after   after   IN  O   _   _
8   the     the     DT  O   _   _
9   Lakers  Lakers  NNPS    ORGANIZATION    _   _
10  game    game    NN  O   _   _
11  .   .   .   O   _   _

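Now Lakers is recognized as an organization, and the other name words are tagged as proper nouns with the correct entity label, but ball is missed and stays a common noun. That is a hard word to get right in lowercase text, since ball is also a frequent common noun.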
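
For use from Java, the truecasing run above might be configured as follows. This is a sketch under assumptions: TruecaseConfig and truecaseProperties are hypothetical names, and it assumes that the bare -truecase.overwriteText flag corresponds to setting that property to "true". The annotator list is copied from the command above.

```java
import java.util.Properties;

public class TruecaseConfig {
    /** Mirrors the truecase command line as a Properties object (hypothetical helper). */
    static Properties truecaseProperties() {
        Properties props = new Properties();
        // truecase must come before pos and ner so they see the recased tokens.
        props.setProperty("annotators", "tokenize,ssplit,truecase,pos,lemma,ner");
        // Assumption: the bare command-line flag is equivalent to the value "true".
        props.setProperty("truecase.overwriteText", "true");
        return props;
    }

    public static void main(String[] args) {
        // With CoreNLP and the English models jar on the classpath, pass this to:
        //   new edu.stanford.nlp.pipeline.StanfordCoreNLP(props)
        truecaseProperties().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```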

+6

Since NER relies on capitalization, one option is the truecase annotator in Stanford NLP: run truecase first so that the text is recased before NER sees it.


0

See this paper (EMNLP 2019): https://arxiv.org/abs/1903.11222

In this paper, we experiment with several different ways to solve this exact problem (including the two mentioned by @christopher-manning above). TL;DR, the main findings:

  1. Using a truecaser on the test data is a bad idea, because truecasers work worse than you might think.
  2. Caseless models work pretty well.
  3. But in general, the best option is to augment the original training data with a lowercased copy (simply train_data.lower()) and retrain the model.
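
The augmentation in point 3 can be sketched as follows. This is an illustration only, not the paper's code: it assumes training data as CoNLL-style lines of the form token, tab, label, with blank lines between sentences, and the names LowercaseAugment and augment are hypothetical. Only the token column is lowercased; the labels are left untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class LowercaseAugment {
    /**
     * Returns the original lines, a blank separator line, and then a copy
     * of the data in which only the token column is lowercased.
     * Assumed input format: "token<TAB>label", blank line between sentences.
     */
    static List<String> augment(List<String> lines) {
        List<String> out = new ArrayList<>(lines);
        out.add(""); // keep the copy from merging into the last original sentence
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                out.add(line); // sentence boundary or malformed line: copy as-is
                continue;
            }
            String token = line.substring(0, tab).toLowerCase(Locale.ROOT);
            out.add(token + line.substring(tab)); // label column keeps its case
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> train = Arrays.asList("Kobe\tPERSON", "Bryant\tPERSON", "talked\tO");
        augment(train).forEach(System.out::println);
    }
}
```

The model is then retrained on the combined data, so it sees each sentence both with and without capitalization.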
0