It depends on the vectorizer you use.
CountVectorizer counts the occurrences of words in documents. For each document it produces a vector of shape (n_words, 1) containing the number of times each word appears in that document. n_words is the total number of distinct words across the documents (a.k.a. the size of the vocabulary).
It also fits a vocabulary, so you can introspect the model (see which words are important, etc.). You can have a look at it with vectorizer.get_feature_names() .
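A tiny illustration (the two toy documents below are made up just for the example; on recent scikit-learn versions the method is called get_feature_names_out() instead):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat", "the cat sat on the mat"]  # toy documents
    vect = CountVectorizer()
    counts = vect.fit_transform(docs)   # sparse matrix of shape (2, 5)

    print(vect.get_feature_names())     # ['cat', 'mat', 'on', 'sat', 'the']
    print(counts.toarray())             # [[1 0 0 1 1]
                                        #  [1 1 1 1 2]]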
When you fit it on your first 500 documents, the vocabulary is built only from the words of those 500 documents. Say there are 30 thousand of them; fit_transform then returns a 500x30k sparse matrix.
Now you call fit_transform again on the next 500 documents, but they only contain 29k distinct words, so you get a 500x29k matrix ...
Now, how do you align these matrices so that all documents have a consistent representation?
I can't think of an easy way to do this at the moment.
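To make the problem concrete, here is a sketch of what goes wrong when you re-fit per batch (X stands for your list of documents, as in the snippet further down; the exact shapes depend on your data):

    from sklearn.feature_extraction.text import CountVectorizer

    vect = CountVectorizer()
    batch1 = vect.fit_transform(X[:500])      # e.g. (500, 30000), vocabulary from batch 1
    batch2 = vect.fit_transform(X[500:1000])  # e.g. (500, 29000), a *different* vocabulary

    # The two matrices have different widths, and even columns that happen to share
    # an index refer to different words, so you cannot simply stack them.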
With TfidfVectorizer you have another problem: the inverse document frequency. To compute document frequencies you need to see all the documents at once.
However, a TfidfVectorizer is just a CountVectorizer followed by a TfidfTransformer , so if you manage to get the output of the CountVectorizer right, you can then apply a TfidfTransformer to the data.
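That is, roughly (a sketch of the equivalence, reusing the toy docs from above; with default parameters the two results match):

    from sklearn.feature_extraction.text import (CountVectorizer,
                                                 TfidfTransformer,
                                                 TfidfVectorizer)

    counts = CountVectorizer().fit_transform(docs)        # raw term counts
    tfidf_manual = TfidfTransformer().fit_transform(counts)

    tfidf_direct = TfidfVectorizer().fit_transform(docs)  # same thing in one step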
With HashingVectorizer, everything is different: there is no vocabulary.
    In [51]: hvect = HashingVectorizer()

    In [52]: hvect.fit_transform(X[:1000])
    <1000x1048576 sparse matrix of type '<class 'numpy.float64'>'
        with 156733 stored elements in Compressed Sparse Row format>
There are nowhere near 1M+ distinct words in these first 1000 documents, yet the matrix we get still has 1M+ columns (1048576 = 2**20, the default n_features of HashingVectorizer).
HashingVectorizer does not store the vocabulary in memory: it hashes each token directly to a column index. This makes it more memory efficient and guarantees that the returned matrices always have the same number of columns. So you don't have the alignment problem you have with CountVectorizer here.
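For instance (a sketch; X as before, and since there is nothing to fit you can call transform directly on each batch):

    from scipy.sparse import vstack
    from sklearn.feature_extraction.text import HashingVectorizer

    hvect = HashingVectorizer()            # 2**20 columns by default
    chunk1 = hvect.transform(X[:500])      # (500, 1048576)
    chunk2 = hvect.transform(X[500:1000])  # (500, 1048576) -- same width

    full = vstack([chunk1, chunk2])        # columns always line up, so stacking is safe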
It is therefore probably the best candidate for the batch processing you describe. There are a few drawbacks, namely that you cannot get the idf weighting and that you lose the mapping between words and features.
The HashingVectorizer documentation references an example that does out-of-core text classification. It may be a bit messy, but it does what you want to do.
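The gist of that example is a partial_fit loop over batches; a rough sketch (get_batches is a made-up helper here, the real example streams documents from disk, and SGDClassifier is just one of the classifiers that support partial_fit):

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    hvect = HashingVectorizer()
    clf = SGDClassifier()

    all_classes = sorted(set(y))            # partial_fit needs the full label set up front
    for X_batch, y_batch in get_batches(X, y, batch_size=1000):  # hypothetical helper
        features = hvect.transform(X_batch)  # no fitting needed, just transform
        clf.partial_fit(features, y_batch, classes=all_classes)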
Hope this helps.
EDIT : If you have too much data, HashingVectorizer is the way to go. If you still want to use the CountVectorizer , a possible workaround is to fit the vocabulary yourself and pass it to your vectorizer, so that you only need to call transform .
Here is an example that you can adapt:
    import re

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer

    news = fetch_20newsgroups()
    X, y = news.data, news.target
Now an approach that does not work:
    # Fitting directly:
    vect = CountVectorizer()
    vect.fit_transform(X[:1000])
    <1000x27953 sparse matrix of type '<class 'numpy.int64'>'
        with 156751 stored elements in Compressed Sparse Row format>
Note the size of the matrix we get. Now, fitting the vocabulary manually:
    def tokenizer(doc):
        # Using the default token pattern from CountVectorizer
        token_pattern = re.compile('(?u)\\b\\w\\w+\\b')
        return [t for t in token_pattern.findall(doc)]

    stop_words = set()  # Whatever you want to have as stop words.
    vocabulary = set([word for doc in X for word in tokenizer(doc)
                      if word not in stop_words])

    vectorizer = CountVectorizer(vocabulary=vocabulary)
    X_counts = vectorizer.transform(X[:1000])
    # Now X_counts is:
    # <1000x155448 sparse matrix of type '<class 'numpy.int64'>'
    #     with 149624 stored elements in Compressed Sparse Row format>

    # X_tfidf = tfidf.transform(X_counts)
In your case, you first need to build the entire X_counts matrix (for all documents) before applying the tfidf transform.
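For example, continuing the snippet above (a sketch: compute the counts for the whole corpus once, fit the idf weights on that, then transform batches as needed):

    from sklearn.feature_extraction.text import TfidfTransformer

    X_counts_all = vectorizer.transform(X)        # counts for *all* documents
    tfidf = TfidfTransformer().fit(X_counts_all)  # idf computed over the full corpus

    # Any batch can now be transformed consistently:
    X_tfidf = tfidf.transform(vectorizer.transform(X[:1000]))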