Entities in my gazette are not recognized

I would like to create a custom NER model. What I've done:

TRAINING DATA (stanford-ner.tsv):

    Hello	O
    !	O
    My	O
    name	O
    is	O
    Damiano	PERSON
    .	O

PROPERTIES (stanford-ner.prop):

    trainFile = stanford-ner.tsv
    serializeTo = ner-model.ser.gz
    map = word=0,answer=1
    maxLeft=1
    useClassFeature=true
    useWord=true
    useNGrams=true
    noMidNGrams=true
    maxNGramLeng=6
    usePrev=true
    useNext=true
    useDisjunctive=true
    useSequences=true
    usePrevSequences=true
    useTypeSeqs=true
    useTypeSeqs2=true
    useTypeySequences=true
    wordShape=chris2useLC
    useGazettes=true
    gazette=gazzetta.txt
    cleanGazette=true

GAZETTE (gazzetta.txt):

    PERSON John
    PERSON Andrea

I build the model from the command line with:

 java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -prop stanford-ner.prop 

And check with:

 java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile test.txt 

I did two tests with the following texts:

>> TEST 1 <<

  • TEXT: Hello! My name is Damiano and this is a fake text to test.

  • OUTPUT: Hello/O !/O My/O name/O is/O Damiano/PERSON and/O this/O is/O a/O fake/O text/O to/O test/O ./O

>> TEST 2 <<

  • TEXT: Hello! My name is John and this is a fake text to test.

  • OUTPUT: Hello/O !/O My/O name/O is/O John/O and/O this/O is/O a/O fake/O text/O to/O test/O ./O

As you can see, only the entity "Damiano" is found. That entity is in my training data, whereas "John" (second test) is in the gazette. So the question is:

Why is the entity John not recognized?

Thank you so much in advance.

+5
3 answers

As the Stanford NER FAQ says:

If a gazette is used, this does not guarantee that words in the gazette are always used as a member of the intended class, and it does not guarantee that words outside the gazette will not be chosen. It just provides another feature for the CRF to train against. If the CRF has higher weights for other features, the gazette features may be overwhelmed.

If you want something that recognizes text as a member of a class if and only if it is in a list of words, you may prefer either the regexner or the tokensregex tools included in Stanford CoreNLP. The CRF NER is not guaranteed to accept all words in the gazette as part of the expected class, and it may also accept words outside the gazette as part of the class.
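To make the "if and only if it is in a list" behaviour concrete, here is a minimal Python sketch of a pure list lookup. It is not regexner or tokensregex, and it does not call Stanford CoreNLP; the names `PERSON_NAMES`, `tokenize`, and `tag_by_list` are made up for illustration, and the tokenizer is a crude stand-in for a real one.

```python
import re

# Hypothetical name list, mirroring the gazette entries from the question.
PERSON_NAMES = {"John", "Andrea"}


def tokenize(text):
    """Split text into word and punctuation tokens (crude stand-in for a real tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text)


def tag_by_list(text, names=PERSON_NAMES):
    """Label a token PERSON if and only if it appears in `names`; everything else is O."""
    return [(tok, "PERSON" if tok in names else "O") for tok in tokenize(text)]


if __name__ == "__main__":
    tagged = tag_by_list("Hello! My name is John and this is a fake text to test.")
    print(" ".join(f"{tok}/{label}" for tok, label in tagged))
```

With this kind of lookup, "John" is always tagged and "Damiano" never is, which is the opposite trade-off from the CRF: fully predictable, but with no ability to generalize beyond the list.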

By the way, it is not good practice to test machine learning pipelines in "unit test" fashion, i.e. with only one or two examples, since they are meant to work on much larger amounts of data and, more importantly, they are probabilistic.

If you want to check whether your gazette file is really being used, it may be better to take the existing examples (see the bottom of the page above for austen.gaz.prop and austen.gaz.txt), replace several names with your own, and then check. If this fails, try changing your test, e.g. add more names, rephrase the text, etc.
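One way to get beyond a single training sentence is to generate rows from a few sentence templates. The following Python sketch is only an illustration: `NAMES`, `TEMPLATES`, and `make_rows` are hypothetical, the templates are pre-tokenized by spaces, names are assumed to be single tokens, and separating sentences with a blank line is an assumption about how you want the file laid out. It writes rows in the two-column `word<TAB>answer` layout implied by `map = word=0,answer=1`.

```python
# Hypothetical names and templates; replace them with realistic data.
NAMES = ["John", "Andrea", "Damiano"]
TEMPLATES = [
    "Hello ! My name is {name} .",
    "Yesterday {name} wrote a fake text .",
]


def make_rows(names, templates):
    """Return tab-separated 'token<TAB>label' rows, one sentence per (template, name) pair."""
    rows = []
    for template in templates:
        for name in names:
            for token in template.split():
                if token == "{name}":
                    rows.append(f"{name}\tPERSON")
                else:
                    rows.append(f"{token}\tO")
            rows.append("")  # blank line between sentences (assumed layout)
    return rows


if __name__ == "__main__":
    # Write to a separate file so the original stanford-ner.tsv is not overwritten.
    with open("stanford-ner-generated.tsv", "w", encoding="utf-8") as out:
        out.write("\n".join(make_rows(NAMES, TEMPLATES)) + "\n")
```

Template-generated text is very repetitive, so treat this as a way to check that the gazette is picked up, not as a substitute for real annotated data.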

+3

The gazette only helps extract additional features from the training data. If none of its words appear in your training data, or have any connection to labeled tokens, your model gets no benefit from it. One experiment I would suggest is adding Damiano to your gazette.
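The point can be illustrated with a toy count of how often each feature co-occurs with each label in the training data. This Python sketch is a deliberate simplification and not how the Stanford CRF computes its features or weights; the feature name `in_gazette=PERSON` and the helper functions are hypothetical.

```python
from collections import Counter

# Training tokens from the question, as (word, label) pairs.
TRAIN = [
    ("Hello", "O"), ("!", "O"), ("My", "O"), ("name", "O"),
    ("is", "O"), ("Damiano", "PERSON"), (".", "O"),
]


def features(word, gazette):
    """Toy feature extractor: word identity plus a gazette-membership flag."""
    feats = [f"word={word}"]
    if word in gazette:
        feats.append("in_gazette=PERSON")
    return feats


def feature_label_counts(train, gazette):
    """Count how often each (feature, label) pair is observed in the training data."""
    counts = Counter()
    for word, label in train:
        for feat in features(word, gazette):
            counts[(feat, label)] += 1
    return counts


original = feature_label_counts(TRAIN, {"John", "Andrea"})
extended = feature_label_counts(TRAIN, {"John", "Andrea", "Damiano"})

# With the original gazette the flag never fires during training, so there is
# no evidence linking it to PERSON; after adding Damiano it fires once.
print(original[("in_gazette=PERSON", "PERSON")])  # 0
print(extended[("in_gazette=PERSON", "PERSON")])  # 1
```

A learner can only assign a useful weight to a feature it has seen alongside a label, which is why the gazette entry for "John" has no effect when no gazette word occurs in the training file.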

+1

Why is the entity John not recognized?

It seems to me that your minimal example most likely needs "Damiano" added to the gazette under the PERSON category. At present, the training data lets the model learn that "Damiano" has the PERSON label, but I think this is not tied to the gazette categories (that is, having PERSON on both sides is not enough).

0
