Try either lil_matrix or dok_matrix; they are easy to build and modify incrementally (though in the case of lil_matrix, inserting can be slow, since each insert into a row takes time linear in the row's length). Scikit-learn estimators that accept sparse matrices will take any format and convert it to an efficient one internally (usually csr_matrix). You can also do the conversion yourself using the methods tocoo, todok, tocsr, etc. on scipy.sparse matrices.
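As a minimal sketch of that workflow (the shape and values here are arbitrary, chosen just for illustration): build the matrix incrementally in a format that supports cheap inserts, then convert once at the end.

```python
import numpy as np
from scipy.sparse import dok_matrix

# dok_matrix stores entries in a dict, so random inserts are cheap.
M = dok_matrix((3, 4), dtype=np.float64)
M[0, 1] = 1.0
M[2, 3] = 2.0

# Convert to CSR once construction is done; CSR is efficient for
# arithmetic and row slicing, and is the format most scikit-learn
# estimators use internally anyway.
X = M.tocsr()
print(X.toarray())
```

The same pattern works with lil_matrix in place of dok_matrix; either way, do the tocsr() conversion once, after all inserts are finished.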
Or simply use the CountVectorizer or DictVectorizer classes that scikit-learn provides for exactly this purpose. CountVectorizer takes raw documents as input:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> documents = ["The dog ran", "The boy ran"]
>>> vectorizer = CountVectorizer(min_df=0, stop_words=[])
>>> X = vectorizer.fit_transform(documents)
>>> X.toarray()
array([[0, 1, 1, 1],
       [1, 0, 1, 1]])
... while DictVectorizer assumes that you have already done the tokenization and counting, and have the result in a dict per sample:
>>> from sklearn.feature_extraction import DictVectorizer
>>> documents = [{"the": 1, "boy": 1, "ran": 1}, {"the": 1, "dog": 1, "ran": 1}]
>>> vectorizer = DictVectorizer()
>>> X = vectorizer.fit_transform(documents)
>>> X.toarray()
array([[ 1.,  0.,  1.,  1.],
       [ 0.,  1.,  1.,  1.]])
>>> vectorizer.inverse_transform(X[0])
[{'ran': 1.0, 'boy': 1.0, 'the': 1.0}]
(The min_df argument for CountVectorizer was added several versions ago. If you are using an older version, omit it, or better yet, upgrade.)
EDIT: According to the FAQ, I have to disclose my affiliation, so here goes: I am the author of DictVectorizer, and I also wrote parts of CountVectorizer.