I process a large number of documents using the Stanford CoreNLP library together with a Stanford CoreNLP Python wrapper. I use the following annotators:
tokenize, ssplit, pos, lemma, ner, entitymentions, parse, dcoref
along with the shift-reduce parser model englishSR.ser.gz. I mainly use CoreNLP for its coreference resolution and named entity recognition, and as far as I know this is a minimal set of annotators for that purpose.
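For reference, here is a minimal sketch of how such a setup typically talks to a CoreNLP server over HTTP with exactly this annotator set. The host/port and the exact wrapper API are assumptions; the properties names (`annotators`, `parse.model`, `outputFormat`) are standard CoreNLP server parameters.

```python
import json
import urllib.parse

# Annotator set from the question, passed to the CoreNLP server as properties.
ANNOTATORS = "tokenize,ssplit,pos,lemma,ner,entitymentions,parse,dcoref"

def build_request_url(host="http://localhost:9000"):
    """Build the annotation URL for a (hypothetical) locally running server.

    The server is assumed to have been started once, so models are loaded
    a single time and documents are POSTed to this URL afterwards.
    """
    props = {
        "annotators": ANNOTATORS,
        # Shift-reduce parser model, as used in the question.
        "parse.model": "edu/stanford/nlp/models/srparser/englishSR.ser.gz",
        "outputFormat": "json",
    }
    return host + "/?properties=" + urllib.parse.quote(json.dumps(props))

url = build_request_url()
```

A document would then be sent with an HTTP POST of the raw text to `url`, and the JSON response parsed on the Python side.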
What methods can be used to speed up annotation of documents?
Other SO answers all suggest not reloading the models for each document, but I already avoid that: the wrapper starts the server once and then passes documents and results back and forth.
The documents I process are 20 sentences long on average, ranging from 1 up to about 400 sentences. Average parsing time is about 1 second per sentence. With a single-threaded process running on one machine I can parse ~2500 documents per day, but I would like to at least double that.
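A quick sanity check of these numbers (taken directly from the paragraph above) shows the single process is busy parsing for roughly 14 hours a day, so there is limited headroom without parallelism or a faster per-sentence time:

```python
# Back-of-envelope throughput check using the figures from the question.
sentences_per_doc = 20      # average document length
seconds_per_sentence = 1.0  # average parse time
docs_per_day = 2500         # observed throughput

busy_seconds = docs_per_day * sentences_per_doc * seconds_per_sentence
busy_hours = busy_seconds / 3600
print(busy_hours)  # ~13.9 hours of pure parsing per day
```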