Sklearn TfidfVectorizer as parallel jobs

How can I run sklearn's TfidfVectorizer (and CountVectorizer) as parallel jobs? Something similar to the n_jobs=-1 parameter in other sklearn models.

+10
python scikit-learn
2 answers

This is not possible, because there is no way to parallelize/distribute access to the vocabulary that these vectorizers need.

To parallelize vectorization, use HashingVectorizer instead. The scikit-learn docs give an example of using this vectorizer to train (and evaluate) a classifier in batches. A similar workflow also works for parallelization, because input terms are mapped to the same vector indices without any communication between the parallel workers.
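A minimal sketch of that idea (not from the original answer; the document chunks, the hasher name and the n_features value are placeholders): because HashingVectorizer is stateless, each worker can vectorize its own chunk independently and the same term still lands in the same column.

    from sklearn.feature_extraction.text import HashingVectorizer

    # chunks that two independent workers might process (placeholder data)
    docs_worker_1 = ["the cat sat on the mat", "dogs and cats"]
    docs_worker_2 = ["the dog sat on the rug"]

    # stateless: no fit() needed, so no vocabulary has to be shared between workers;
    # alternate_sign=False keeps counts non-negative for later TF-IDF weighting
    hasher = HashingVectorizer(n_features=2**18, alternate_sign=False)

    X1 = hasher.transform(docs_worker_1)  # could run in worker 1
    X2 = hasher.transform(docs_worker_2)  # could run in worker 2, no communication needed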

Just compute the partial term-document matrices separately and combine them once all jobs are finished. At that point you can also run TfidfTransformer on the concatenated matrix.
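Continuing that hypothetical sketch, the partial matrices from the workers can be stacked and the TF-IDF weighting applied to the combined counts:

    import scipy.sparse as sp
    from sklearn.feature_extraction.text import TfidfTransformer

    # combine the partial term-document matrices once all jobs are finished
    X_counts = sp.vstack([X1, X2], format='csr')

    # apply IDF weighting to the concatenated matrix
    X_tfidf = TfidfTransformer().fit_transform(X_counts)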

The most significant drawback of not keeping a vocabulary of input terms is that it is difficult to find out which term maps to which column in the final matrix (i.e., the inverse transform). The only efficient mapping is to apply the hash function to a term and see which column/index it is assigned. For an inverse transform you would need to do this for all unique terms (i.e., your vocabulary).
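A rough sketch of that reverse lookup, assuming you have collected the unique terms yourself (the hasher does not store them) and reusing the hasher from the sketch above:

    # hash every known term once to build a column -> terms lookup;
    # note that distinct terms can collide on the same column
    unique_terms = ["cat", "dog", "mat", "rug", "sat"]  # your own collected vocabulary

    column_to_terms = {}
    for term in unique_terms:
        col = hasher.transform([term]).indices[0]  # column index this term is assigned to
        column_to_terms.setdefault(col, []).append(term)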

+11

The previous answer is good, but I would like to expand on it with a concrete example that keeps the regular TfidfVectorizer rather than switching to HashingVectorizer. Basically, after fitting the vectorizer (which is difficult to parallelize), you can parallelize the transform step (that bit is much easier).

You have something like this to fit the model:

 print("Extracting tf-idf features") tfidf_vectorizer = TfidfVectorizer(stop_words='english') t0 = time() tfidf = tfidf_vectorizer.fit(data_pd['text']) print("done in %0.3fs." % (time() - t0)) 

And you have something like this to transform the data:

 print("Transforming tf-idf features...") tfidf = tfidf_vectorizer.transform(data_pd['text']) print("done in %0.3fs." % (time() - t0)) 

This is the bit you can parallelize, and I recommend something like this:

    import multiprocessing
    from multiprocessing import Pool

    import numpy as np
    import pandas as pd
    import scipy.sparse as sp

    num_cores = multiprocessing.cpu_count()
    num_partitions = num_cores - 2  # I like to leave some cores free for other processes
    print(num_partitions)

    def parallelize_dataframe(df, func):
        # split the DataFrame into one chunk per partition
        chunks = np.array_split(df, num_partitions)
        del df
        pool = Pool(num_cores)
        # stack the sparse partial results into a single CSR matrix
        result = sp.vstack(pool.map(func, chunks), format='csr')
        pool.close()
        pool.join()
        return result

    def test_func(data):
        # transform one chunk with the already fitted vectorizer
        tfidf_matrix = tfidf_vectorizer.transform(data["text"])
        return tfidf_matrix

    tfidf_parallel = parallelize_dataframe(data_pd, test_func)

The solution above is adapted from here.
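One caveat, not covered in the original answer: test_func relies on the fitted tfidf_vectorizer being visible to the worker processes as a global, which works with the default fork start method on Linux but can break with spawn (e.g. on Windows). A variant using a hypothetical transform_chunk helper and functools.partial passes the fitted vectorizer to the workers explicitly:

    from functools import partial

    def transform_chunk(chunk, vectorizer):
        # each worker transforms its own slice of the DataFrame
        return vectorizer.transform(chunk["text"])

    # bind the fitted vectorizer; the partial is pickled and sent to each worker
    tfidf_parallel = parallelize_dataframe(
        data_pd, partial(transform_chunk, vectorizer=tfidf_vectorizer))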

I hope this helps. In my case it significantly reduced the runtime.

+2
