The previous answer is good; however, I would like to expand the example, given that the HashingVectorizer will be deprecated. I give here a separate example where you can see the two steps. Basically, after fitting a vectorizer (which is difficult to parallelize), you can transform the data (this part is easier to parallelize).
You have something like this to fit the model:
```python
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer

print("Extracting tf-idf features...")
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
t0 = time()
tfidf_vectorizer.fit(data_pd['text'])
print("done in %0.3fs." % (time() - t0))
```
You have something like this for data conversion:
```python
print("Transforming tf-idf features...")
t0 = time()
tfidf = tfidf_vectorizer.transform(data_pd['text'])
print("done in %0.3fs." % (time() - t0))
```
This is the part that you can parallelize. I recommend something like this:
```python
import multiprocessing
from multiprocessing import Pool

import numpy as np
import pandas as pd
import scipy.sparse as sp

num_cores = multiprocessing.cpu_count()
num_partitions = num_cores - 2  # leave a couple of cores free
```
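To show how those imports fit together, here is a minimal sketch of the parallel transform step. It assumes a vectorizer that has already been fitted; the helper names `transform_chunk` and `parallel_transform` are mine, not from the original answer:

```python
import multiprocessing
from functools import partial
from multiprocessing import Pool

import numpy as np
import pandas as pd
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer

num_partitions = max(1, multiprocessing.cpu_count() - 2)

def transform_chunk(vectorizer, texts):
    # Each worker transforms its own chunk with the already-fitted vectorizer.
    return vectorizer.transform(texts)

def parallel_transform(vectorizer, series, n_partitions=num_partitions):
    # Split the texts, transform the chunks in parallel, and stack the
    # per-chunk sparse matrices back into a single tf-idf matrix.
    chunks = np.array_split(series, n_partitions)
    with Pool(n_partitions) as pool:
        results = pool.map(partial(transform_chunk, vectorizer), chunks)
    return sp.vstack(results)
```

Fit once, then call `parallel_transform(tfidf_vectorizer, data_pd['text'])`. The row order of the result matches the input order because `Pool.map` returns the chunks in the order they were submitted.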
The solution above is adapted from here.
I hope this helps. In my case, it significantly reduced the time.
Rafael Valero