Scikit-learn: fitting the data in chunks vs. fitting it all at once

I am using scikit-learn to build a classifier that works with (somewhat large) text files. For now I just need simple bag-of-words features, so I tried using TfidfVectorizer / HashingVectorizer / CountVectorizer to obtain the feature vectors.

However, processing all of the training data at once to obtain the feature vectors results in a memory error in numpy/scipy (depending on which vectorizer I use). So my question is:

When extracting text features from raw text: if I fit the vectorizer on the data in chunks, will the result be the same as fitting it on all the data at once?

To illustrate this with code: is the following

    vectoriser = CountVectorizer()  # or TfidfVectorizer/HashingVectorizer
    train_vectors = vectoriser.fit_transform(train_data)

different from the following:

    vectoriser = CountVectorizer()  # or TfidfVectorizer/HashingVectorizer
    start = 0
    while start < len(train_data):
        vectoriser.fit(train_data[start:(start + 500)])
        start += 500
    train_vectors = vectoriser.transform(train_data)

Thanks in advance, and sorry if this question is completely naive.

2 answers

It depends on the vectorizer you use.

CountVectorizer counts the occurrences of words in the documents. For each document it outputs an (n_words, 1) vector with the number of times each word appears in that document, where n_words is the total number of distinct words across the documents (i.e. the size of the vocabulary).
It also fits a vocabulary, so that you can introspect the model (see which words are important, etc.). You can take a look at it with vectorizer.get_feature_names() .
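
A minimal sketch of that, with a couple of made-up toy documents, might look like this:

    from sklearn.feature_extraction.text import CountVectorizer

    # Toy corpus, just for illustration
    docs = ["the cat sat on the mat", "the dog ate my homework"]

    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(docs)  # sparse matrix of shape (2, n_words)

    print(vectorizer.get_feature_names())  # the learned vocabulary
                                           # (get_feature_names_out in newer scikit-learn versions)
    print(counts.toarray())                # per-document word counts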

When you fit it on your first 500 documents, the vocabulary is built only from the words in those 500 documents. Say there are 30k of them; fit_transform then outputs a 500x30k sparse matrix.
Now you fit_transform again on the next 500 documents, but they contain only 29k distinct words, so you get a 500x29k matrix...
Now, how do you align your matrices so that all documents are represented consistently?
I can't think of an easy way to do this at the moment.
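
To see the mismatch concretely, a quick sketch (using the 20 newsgroups data that also appears later in this answer) could be:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer

    X = fetch_20newsgroups().data

    # Two fits on two different chunks learn two different vocabularies,
    # so the resulting matrices have different numbers of columns.
    chunk_a = CountVectorizer().fit_transform(X[:500])
    chunk_b = CountVectorizer().fit_transform(X[500:1000])

    print(chunk_a.shape)  # (500, vocabulary size of the first chunk)
    print(chunk_b.shape)  # (500, a different vocabulary size)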

With TfidfVectorizer you have an additional problem, namely the inverse document frequency: to compute document frequencies, you need to see all the documents at once.
However, a TfidfVectorizer is just a CountVectorizer followed by a TfidfTransformer , so if you manage to get the CountVectorizer output right, you can then apply a TfidfTransformer to that data.
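
As a rough sketch of that equivalence (toy documents again; with default settings the two pipelines should agree):

    from sklearn.feature_extraction.text import (CountVectorizer,
                                                 TfidfTransformer,
                                                 TfidfVectorizer)

    docs = ["the cat sat on the mat", "the dog ate my homework"]  # toy corpus

    # CountVectorizer followed by TfidfTransformer...
    counts = CountVectorizer().fit_transform(docs)
    tfidf_a = TfidfTransformer().fit_transform(counts)

    # ...should match TfidfVectorizer applied in one step.
    tfidf_b = TfidfVectorizer().fit_transform(docs)

    print((tfidf_a - tfidf_b).nnz)  # expected to be 0 (or near zero)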

With HashingVectorizer, everything is different: there is no vocabulary.

    In [51]: hvect = HashingVectorizer()

    In [52]: hvect.fit_transform(X[:1000])
    <1000x1048576 sparse matrix of type '<class 'numpy.float64'>'
        with 156733 stored elements in Compressed Sparse Row format>

Even though the first 1000 documents contain nowhere near 1M+ distinct words, the matrix we get here has 1M+ columns.
The HashingVectorizer does not store a vocabulary in memory. This makes it more memory efficient and guarantees that the returned matrices always have the same number of columns. So you do not run into the same problem as with the CountVectorizer here.
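
A short sketch of why this helps with chunked processing (assuming X is the list of raw documents from the session above):

    from sklearn.feature_extraction.text import HashingVectorizer

    hvect = HashingVectorizer()

    # HashingVectorizer is stateless, so each chunk can be transformed on its own,
    # and every chunk ends up with the same number of columns (2**20 by default).
    batch_1 = hvect.transform(X[:500])
    batch_2 = hvect.transform(X[500:1000])

    assert batch_1.shape[1] == batch_2.shape[1] == 2 ** 20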

This is probably the best option for the batch processing you described. There are a couple of drawbacks, namely that you cannot get the idf weighting and that you lose the mapping between words and features.

The HashingVectorizer documentation points to an example that does out-of-core text classification. It may be a little messy, but it does what you want to do.
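
The gist of that approach is to stream the documents in batches and feed them to an estimator that supports partial_fit. A minimal sketch, assuming train_data and train_labels are your lists of documents and labels, and picking SGDClassifier as one estimator that supports partial_fit:

    import numpy as np
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    hvect = HashingVectorizer()
    clf = SGDClassifier()
    all_classes = np.unique(train_labels)  # partial_fit needs the full label set up front

    batch_size = 500
    for start in range(0, len(train_data), batch_size):
        # Hash each batch of documents and update the classifier incrementally.
        X_batch = hvect.transform(train_data[start:start + batch_size])
        y_batch = train_labels[start:start + batch_size]
        clf.partial_fit(X_batch, y_batch, classes=all_classes)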

Hope this helps.

EDIT: If you have too much data, the HashingVectorizer is the way to go. If you still want to use the CountVectorizer , a possible workaround is to fit the vocabulary yourself and pass it to your vectorizer, so that you only need to call transform .

Here is an example that you can adapt:

    import re

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer

    news = fetch_20newsgroups()
    X, y = news.data, news.target

First, the approach that does not do what you want (fitting directly on a chunk):

    # Fitting directly:
    vect = CountVectorizer()
    vect.fit_transform(X[:1000])
    # <1000x27953 sparse matrix of type '<class 'numpy.int64'>'
    #     with 156751 stored elements in Compressed Sparse Row format>

Note the size of the matrix we get. Now, fitting the vocabulary manually:

    def tokenizer(doc):
        # Using default pattern from CountVectorizer
        token_pattern = re.compile('(?u)\\b\\w\\w+\\b')
        return [t for t in token_pattern.findall(doc)]

    stop_words = set()  # Whatever you want to have as stop words.

    vocabulary = set([word for doc in X for word in tokenizer(doc)
                      if word not in stop_words])

    vectorizer = CountVectorizer(vocabulary=vocabulary)
    X_counts = vectorizer.transform(X[:1000])
    # Now X_counts is:
    # <1000x155448 sparse matrix of type '<class 'numpy.int64'>'
    #     with 149624 stored elements in Compressed Sparse Row format>

    # X_tfidf = tfidf.transform(X_counts)

In your case, you will first need to build the entire X_counts matrix (for all documents) before applying the tf-idf transform.
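
A possible sketch of that last step, assuming vectorizer is the CountVectorizer with the full vocabulary from the snippet above: build X_counts chunk by chunk and stack the pieces (scipy.sparse.vstack is one way to do the stacking):

    from scipy.sparse import vstack
    from sklearn.feature_extraction.text import TfidfTransformer

    batch_size = 500
    count_batches = []
    for start in range(0, len(X), batch_size):
        # The vocabulary is fixed, so every chunk has the same number of columns.
        count_batches.append(vectorizer.transform(X[start:start + batch_size]))

    X_counts = vstack(count_batches)

    # The idf weights are then computed over all documents at once.
    X_tfidf = TfidfTransformer().fit_transform(X_counts)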


I am not an expert on text feature extraction, but based on the documentation and my own basic experience:

If I fit the vectorizer on chunks of the training data, will it be the same as fitting it on all the data at once?

You cannot directly combine the extracted features, because you would get different values, i.e. different weights, for the same token/word depending on which chunk it came from: each weight is computed relative to the other words in that particular chunk.

You can use any feature extraction method; how useful the result is depends, I think, on the task.

What you can do, however, is classify using the different feature sets separately. Once you have several outputs built from features extracted with the same method (or even with different extraction methods), you can use them as input to an ensembling mechanism such as bagging , boosting , etc. In fact, after this whole process, in most cases you will get a better final result than if you fed the complete file into one "fully featured" but otherwise simple classifier.
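
A very rough sketch of that idea, with hand-rolled majority voting over classifiers trained on separate chunks (train_data, train_labels and new_data are placeholders, and MultinomialNB is just one possible choice of base classifier):

    from collections import Counter
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    batch_size = 500
    models = []  # one (vectorizer, classifier) pair per chunk
    for start in range(0, len(train_data), batch_size):
        docs = train_data[start:start + batch_size]
        labels = train_labels[start:start + batch_size]
        vect = CountVectorizer()
        clf = MultinomialNB().fit(vect.fit_transform(docs), labels)
        models.append((vect, clf))

    # Combine the per-chunk classifiers with a simple majority vote on new documents.
    all_preds = [clf.predict(vect.transform(new_data)) for vect, clf in models]
    combined = [Counter(preds).most_common(1)[0][0] for preds in zip(*all_preds)]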

