How to train a French NER model based on the stanford-nlp conditional random field model?

I recently discovered the Stanford NLP tools and find them really interesting. I am a French data miner / data scientist, I love text analysis and would like to use your tools, but the lack of a French NER model is a real obstacle for me.

I would like to build my own French NER model, and perhaps contribute it to the package if it is considered good enough. So, can you tell me what is required to train a CRF-based French NER model with Stanford CoreNLP?

Thanks.

1 answer

NB: I am not a Stanford tool developer or an NLP expert, just an ordinary user who also needed this information at some point. Also note that some of the information below comes from the official FAQ: http://nlp.stanford.edu/software/crf-faq.shtml#a

Here are the steps I took to train my own NER:

  • Install Java 8
  • Create a training sample and a test sample. Each should be a .tsv file with one token per line, followed by a tab and its label:

      Venez	O
      découvrir	O
      lundi	DAY
      le	O
      nouvel	O
      espace	O
      de	O
      vente	O
      ODHOJS	ORGANISATION

    Depending on the format of your original text, you can build this sample with an SQL statement or with other NLP tools. Labeling is the hardest part: I do not know of any way to do it other than manually.

  • Train the model with this command:

     java -cp "stanford-ner.jar:lib/*" -mx4g edu.stanford.nlp.ie.crf.CRFClassifier -prop prop.txt 

    where prop.txt is the properties file, which is also described in the FAQ linked above.

    This should create a new serialized model file (ner-model.ser.gz in the commands below) containing the newly trained model.

  • Test the model's performance:

     java -cp "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -testFile test.tsv > test.res 

    The input test.tsv has the same format as train.tsv. The output in test.res has one extra column containing the predicted NER class. The final lines also show a summary in terms of precision, recall and F1.

  • Finally, you can use your NER model on real data:

     java -cp "stanford-ner.jar:lib/*" -mx5g edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile test.txt -outputFormat inlineXML > test.res 
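
To make the data preparation steps more concrete, here is a minimal Python sketch (not part of the Stanford tools) that writes a train.tsv in the token/label format shown above, together with a prop.txt. The sentence list and the helper names to_tsv and to_props are hypothetical. The property names are the ones I remember from the example file in the official FAQ, and the values are only a starting point, so check them against the FAQ before training. Separating sentences with a blank line is also an assumption on my part.

```python
from pathlib import Path

# Hypothetical hand-labelled data: each sentence is a list of (token, label) pairs.
SENTENCES = [
    [("Venez", "O"), ("découvrir", "O"), ("lundi", "DAY"), ("le", "O"),
     ("nouvel", "O"), ("espace", "O"), ("de", "O"), ("vente", "O"),
     ("ODHOJS", "ORGANISATION")],
]

# Property names as in the FAQ example file (to be verified there);
# the file names match the commands used in this answer.
PROPS = {
    "trainFile": "train.tsv",
    "serializeTo": "ner-model.ser.gz",
    "map": "word=0,answer=1",      # column 0 is the token, column 1 the label
    "maxLeft": "1",
    "useClassFeature": "true",
    "useWord": "true",
    "useNGrams": "true",
    "noMidNGrams": "true",
    "maxNGramLeng": "6",
    "usePrev": "true",
    "useNext": "true",
    "useSequences": "true",
    "usePrevSequences": "true",
    "useTypeSeqs": "true",
    "useTypeSeqs2": "true",
    "useTypeySequences": "true",
    "wordShape": "chris2useLC",
}


def to_tsv(sentences):
    """Return one 'token<TAB>label' line per token, with a blank line between sentences."""
    blocks = [
        "\n".join(f"{token}\t{label}" for token, label in sentence)
        for sentence in sentences
    ]
    return "\n\n".join(blocks) + "\n"


def to_props(props):
    """Return the properties as 'key = value' lines."""
    return "".join(f"{key} = {value}\n" for key, value in props.items())


if __name__ == "__main__":
    # Write both files to the current directory.
    Path("train.tsv").write_text(to_tsv(SENTENCES), encoding="utf-8")
    Path("prop.txt").write_text(to_props(PROPS), encoding="utf-8")
```

Running the script writes train.tsv and prop.txt to the current directory, after which the training command above can be used unchanged. A test.tsv can be produced the same way from a separate set of labelled sentences.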

Hope this helps.
