I am trying to apply the Spacy NLP (Natural Language Processing) pipeline to a large text file, such as a Wikipedia dump. Here is my code, based on the Spacy example:
from spacy.en import English

with open("big_file.txt") as f:  # avoid shadowing the built-in `input`
    big_text = f.read()

nlp = English()

# one huge text; decode to unicode, ignoring undecodable bytes
out = nlp.pipe([unicode(big_text, errors='ignore')], n_threads=-1)
doc = next(out)  # next(out) works in Python 2 and 3; out.next() is Python 2 only
Spacy applies all NLP operations, such as POS tagging, lemmatization, and parsing, in one pass. It is like a conveyor belt for NLP that takes care of everything you need in one step. The pipe method is supposed to make this much faster by multithreading the expensive parts of the pipeline. But I don't see any big improvement in speed, and my CPU usage sits at about 25% (only one of the 4 cores is working). I also tried reading the file in several chunks and increasing the batch of input texts:
out = nlp.pipe([part1, part2, ..., part4], n_threads=-1)
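For reference, here is a fuller sketch of that chunked attempt. The chunk size, the line-boundary splitting, and the batch_size value are my own guesses, not anything prescribed by the Spacy docs:

from spacy.en import English

def read_chunks(path, chunk_chars=1000000):
    # Accumulate lines until roughly chunk_chars characters, then yield,
    # so each piece breaks on a line boundary rather than mid-sentence.
    buf, size = [], 0
    with open(path) as f:
        for line in f:
            buf.append(line)
            size += len(line)
            if size >= chunk_chars:
                yield unicode("".join(buf), errors='ignore')
                buf, size = [], 0
    if buf:
        yield unicode("".join(buf), errors='ignore')

nlp = English()

# batch_size groups texts into work units (4 is a guess for ~1 MB chunks);
# n_threads=-1 asks Spacy to pick the thread count for the parallel parts.
for doc in nlp.pipe(read_chunks("big_file.txt"), batch_size=4, n_threads=-1):
    print(len(doc))  # placeholder: replace with real per-Doc processing

This behaves the same for me: still roughly one busy core.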
What am I doing wrong? Is OpenMP actually used by Spacy here? By the way, I am working on Windows, in case that matters.