How to split raw text data into train / test sets with scikit-learn's cross-validation module?

I have a large body of opinions (2500) in raw text. I would like to use the scikit-learn library to split them into train / test sets. What would be the best approach to this with scikit-learn? Can someone provide an example of source code for splitting into train / test sets (I will probably use a tf-idf representation)?

scikit-learn machine-learning classification text-classification cross-validation
1 answer

Suppose your data is a list of strings, i.e.

data = ["....", "...", ] 

Then you can split it into a training set (80%) and a test set (20%) using train_test_split , e.g. by doing:

 from sklearn.model_selection import train_test_split

 train, test = train_test_split(data, train_size=0.8)

Before rushing ahead with this, read the documentation. 2500 samples is not a large dataset, and you probably want to do something like k-fold cross-validation rather than relying on a single train/test split.
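A minimal sketch of k-fold cross-validation combined with a tf-idf representation, since you mentioned classification. The texts, labels, and the choice of LogisticRegression below are placeholders, not part of the original question; substitute your own opinions and classes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical labeled opinions (replace with your 2500 documents).
data = ["great product", "terrible service", "love it", "awful",
        "really good", "not good at all", "fantastic", "horrible"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Put the vectorizer in a pipeline so tf-idf is fit only on each
# fold's training portion, avoiding leakage into the validation fold.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())

# 4-fold cross-validation: each fold is held out once for scoring.
scores = cross_val_score(model, data, labels, cv=4)
print(scores)
```

With cross-validation every document is used for both training and evaluation across the folds, which gives a much more stable accuracy estimate on a corpus this size than one 80/20 split.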

