Counting with scipy.sparse

I use the Python scikit-learn library. I have 150,000 sentences.

I need an array-like object where each row corresponds to a sentence, each column corresponds to a word, and each element is the number of times that word occurs in that sentence.

For example: if the two sentences were "The dog ran" and "The boy ran", I need

[ [1, 1, 1, 0] , [0, 1, 1, 1] ] 

(the order of the columns does not matter and depends on which column is assigned to which word)

My array will be sparse (each sentence contains only a small fraction of the possible words), so I am using scipy.sparse.

    import re
    import scipy.sparse as sp
    from sklearn.naive_bayes import MultinomialNB

    def word_counts(texts, word_map):
        # one row per sentence, one column per word in word_map
        w_counts = sp.???_matrix((len(texts), len(word_map)))
        for n in range(len(texts)):
            for word in re.findall(r"[\w']+", texts[n]):
                index = word_map.get(word)
                if index is not None:
                    w_counts[n, index] += 1
        return w_counts

    ...

    nb = MultinomialNB()  # from sklearn
    words = features.word_list(texts)
    nb.fit(features.word_counts(texts, words), classes)
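(For context, here is a runnable version of this loop on made-up data; dok_matrix is used only as a placeholder so the snippet executes, and whether that is the right format is exactly my question:)

    import re
    import scipy.sparse as sp

    texts = ["the dog ran", "the boy ran"]                # made-up sample data
    word_map = {"the": 0, "dog": 1, "ran": 2, "boy": 3}   # word -> column index

    counts = sp.dok_matrix((len(texts), len(word_map)), dtype=int)
    for n, text in enumerate(texts):
        for word in re.findall(r"[\w']+", text):
            index = word_map.get(word)
            if index is not None:
                counts[n, index] += 1

    print(counts.toarray())   # [[1 1 1 0]
                              #  [1 0 1 1]]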

I want to know which sparse matrix type would be best here.

I tried using coo_matrix but got an error:

TypeError: 'coo_matrix' object has no attribute '__getitem__'

I looked at the documentation for COO, but was very confused by the following:

Sparse matrices can be used in arithmetic operations ...
Disadvantages of the COO format ... does not directly support: arithmetic operations

I tried dok_matrix and it worked, but I don't know whether it is the best choice in this case.
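A minimal sketch of the difference (illustrative only): dok_matrix supports per-element reads and writes, while coo_matrix does not, which is why the counting loop above fails with COO:

    import scipy.sparse as sp

    dok = sp.dok_matrix((2, 4), dtype=int)
    dok[0, 1] += 1        # DOK implements __getitem__/__setitem__, so this works
    dok[0, 1] += 1

    coo = sp.coo_matrix((2, 4), dtype=int)
    try:
        coo[0, 1] += 1    # COO does not support indexing; this raises TypeError
    except TypeError as exc:
        print(exc)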

Thanks in advance.

1 answer

Try either lil_matrix or dok_matrix; they are easy to build and inspect (though in the case of lil_matrix it is potentially very slow, since each insertion takes linear time). Scikit-learn estimators that accept sparse matrices will accept any format and convert it to an efficient one internally (usually csr_matrix). You can also do the conversion yourself with the tocoo, todok, tocsr, etc. methods on scipy.sparse matrices.
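To illustrate that build-then-convert pattern (my own sketch, not part of the original code):

    import scipy.sparse as sp

    # Build incrementally in a format that supports item assignment ...
    m = sp.lil_matrix((2, 4))
    m[0, 0] += 1
    m[0, 2] += 1
    m[1, 1] += 1

    # ... then convert once to CSR, the format scikit-learn typically uses internally.
    X = m.tocsr()
    print(X.toarray())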

Or simply use the CountVectorizer or DictVectorizer classes that scikit-learn provides for exactly this purpose. CountVectorizer takes whole documents as input:

    >>> from sklearn.feature_extraction.text import CountVectorizer
    >>> documents = ["The dog ran", "The boy ran"]
    >>> vectorizer = CountVectorizer(min_df=0)
    >>> X = vectorizer.fit_transform(documents)
    >>> X.toarray()
    array([[0, 1, 1, 1],
           [1, 0, 1, 1]])

... while DictVectorizer assumes you have already done the tokenization and counting, with the result stored in one dict per sample:

    >>> from sklearn.feature_extraction import DictVectorizer
    >>> documents = [{"the": 1, "boy": 1, "ran": 1}, {"the": 1, "dog": 1, "ran": 1}]
    >>> vectorizer = DictVectorizer()
    >>> X = vectorizer.fit_transform(documents)
    >>> X.toarray()
    array([[ 1.,  0.,  1.,  1.],
           [ 0.,  1.,  1.,  1.]])
    >>> vectorizer.inverse_transform(X[0])
    [{'ran': 1.0, 'boy': 1.0, 'the': 1.0}]

(The min_df argument to CountVectorizer was added a few versions ago. If you are using an older version, omit it, or better, upgrade.)
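To tie this back to the MultinomialNB call in the question: the sparse matrix returned by fit_transform can be passed to fit directly. A sketch with made-up labels:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    documents = ["The dog ran", "The boy ran"]
    classes = [0, 1]                            # hypothetical labels

    vectorizer = CountVectorizer(min_df=0)
    X = vectorizer.fit_transform(documents)     # sparse matrix, usually CSR

    nb = MultinomialNB()
    nb.fit(X, classes)                          # accepts the sparse matrix as-is
    print(nb.predict(vectorizer.transform(["The boy ran"])))   # [1]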

EDIT: Per the FAQ, I should disclose my affiliation, so here it is: I am the author of DictVectorizer, and I also wrote parts of CountVectorizer.
