What can I do to speed up Stanford CoreNLP (dcoref / ner)?

I process a large number of documents using the Stanford CoreNLP library along with the Stanford CoreNLP Python Wrapper. I use the following annotators:

tokenize, ssplit, pos, lemma, ner, entitymentions, parse, dcoref 

along with the shift-reduce parser model englishSR.ser.gz. I mainly use CoreNLP for its coreference resolution / named entity recognition, and as far as I know, I use a minimal set of annotators for this purpose.
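For reference, this is roughly what ends up being sent to the server over HTTP. It's a minimal sketch, assuming a CoreNLP server is already running on localhost:9000 (started by the wrapper); the model path is the usual location of englishSR.ser.gz inside the shift-reduce models jar, so adjust it if your setup differs:

```python
import json
import requests

# Properties mirroring the annotator list above; the parse.model path is an assumption
# based on the standard shift-reduce models jar layout.
properties = {
    "annotators": "tokenize,ssplit,pos,lemma,ner,entitymentions,parse,dcoref",
    "parse.model": "edu/stanford/nlp/models/srparser/englishSR.ser.gz",
    "outputFormat": "json",
}

# Assumed: a CoreNLP server is already running on localhost:9000.
response = requests.post(
    "http://localhost:9000",
    params={"properties": json.dumps(properties)},
    data="Barack Obama visited Stanford. He gave a talk there.".encode("utf-8"),
)
annotation = response.json()
```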

What methods can be used to speed up annotation of documents?

Other SO answers all suggest not reloading the models for each document, but I already avoid that (the wrapper starts the server once and then passes documents / results back and forth).

The documents that I process are 20 sentences long on average; some run to 400 sentences, and some are just 1. The average parsing time per sentence is 1 second. I can parse ~2500 documents per day with a single single-threaded process running on the same machine, but I would like to at least double that.

2 answers

Try setting up a Stanford CoreNLP server rather than loading the annotators every time you start. That way you load the annotators once and process documents faster. The first request will be slower, but everything after that will be much faster. See the Stanford CoreNLP Server documentation for more information.
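As a rough sketch of what that looks like (the launch command, memory setting, port, and example sentences below are placeholders to adjust for your setup): the first request pays the model-loading cost, and later requests reuse the already-loaded models.

```python
import json
import time
import requests

# Start the server once in a separate shell (standard launch command; adjust memory/port):
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000

URL = "http://localhost:9000"
PROPS = json.dumps({
    "annotators": "tokenize,ssplit,pos,lemma,ner,parse,dcoref",
    "outputFormat": "json",
})

def annotate(text):
    resp = requests.post(URL, params={"properties": PROPS}, data=text.encode("utf-8"))
    resp.raise_for_status()
    return resp.json()

# The first call triggers lazy loading of the annotator models on the server.
start = time.time()
annotate("This warm-up sentence loads the models.")
print("first request: %.1fs" % (time.time() - start))

# Subsequent calls reuse the loaded models and are much faster.
start = time.time()
annotate("Barack Obama was born in Hawaii. He was elected president in 2008.")
print("second request: %.1fs" % (time.time() - start))
```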

Having said that, this is often a trade-off between accuracy and speed, so you may want to do some due diligence with other tools like NLTK and spaCy to find out what works best for you.
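If you do compare, here is a minimal spaCy sketch for the NER side (the model name en_core_web_sm is an assumption; note that spaCy itself does not ship coreference resolution, so check it actually covers what you need):

```python
import spacy

# Assumed model name; install with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Barack Obama visited Stanford. He gave a talk there.")

# Named entities, for a quick accuracy/speed comparison against the CoreNLP output.
for ent in doc.ents:
    print(ent.text, ent.label_)
```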


It should be noted that sentence length has a very large impact on the parsing time of some parts of the CoreNLP library. I would recommend not trying to parse sentences that contain more than 100 tokens.
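If I remember the property names correctly, the parser annotator can be told to skip full parsing of over-long sentences via parse.maxlen; verify the exact behaviour against your CoreNLP version. A sketch of adding it to the request properties:

```python
# Hedged: parse.maxlen should make the parser annotator skip (flat-parse) sentences
# longer than the given token count; confirm this for your CoreNLP version.
properties = {
    "annotators": "tokenize,ssplit,pos,lemma,ner,entitymentions,parse,dcoref",
    "parse.model": "edu/stanford/nlp/models/srparser/englishSR.ser.gz",
    "parse.maxlen": "100",
    "outputFormat": "json",
}
```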

One way to approach this is to run two different pipelines: a tokenizer / sentence splitter, and then the full pipeline. The sentence-splitting pipeline can tell you how long each sentence is, and you can then decide whether to reduce its length somehow (for example, by skipping the sentence or splitting it into several shorter ones). The full pipeline then only runs on documents / sentences that are below the maximum allowed length.
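Here is a rough sketch of that two-pass idea against a running server (the URL, port, and the 100-token cutoff are assumptions, and the JSON field names follow the server's usual JSON output; treat the details as things to verify):

```python
import json
import requests

URL = "http://localhost:9000"
MAX_TOKENS = 100  # assumed cutoff, per the recommendation above

LIGHT_PROPS = json.dumps({"annotators": "tokenize,ssplit", "outputFormat": "json"})
FULL_PROPS = json.dumps({
    "annotators": "tokenize,ssplit,pos,lemma,ner,entitymentions,parse,dcoref",
    "outputFormat": "json",
})

def annotate(text, props):
    resp = requests.post(URL, params={"properties": props}, data=text.encode("utf-8"))
    resp.raise_for_status()
    return resp.json()

def process(document):
    # Pass 1: cheap tokenize/ssplit only, to check sentence lengths.
    light = annotate(document, LIGHT_PROPS)
    too_long = any(len(sent["tokens"]) > MAX_TOKENS for sent in light["sentences"])
    if too_long:
        # Decide how to handle it: skip the document, drop the long sentences,
        # or split them further before re-submitting.
        return None
    # Pass 2: full pipeline, only for documents whose sentences are all short enough.
    return annotate(document, FULL_PROPS)
```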

Although this approach does not speed up the average case, it can significantly improve worst-case performance. The trade-off is that there will be legitimate sentences that are longer than you expected.

