How to split raw text data into train / test sets with scikit-learn's cross-validation module?

I have a large body of opinions (2500) in raw text. I would like to use the scikit-learn library to split them into train / test sets. What would be the best approach to this with scikit-learn? Can someone provide an example of source code for splitting into train / test sets (I will probably use a tf-idf representation)?

scikit-learn machine-learning classification text-classification cross-validation
1 answer

Suppose your data is a list of strings, i.e.

data = ["....", "...", ] 

Then you can split it into a training set (80%) and a test set (20%) using train_test_split , e.g. by doing:

 from sklearn.model_selection import train_test_split

 train, test = train_test_split(data, train_size=0.8)

Before rushing ahead with this, read the documentation. 2500 samples is not a large dataset, and you probably want to do something like k-fold cross-validation rather than relying on a single train/test split.
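A minimal sketch of k-fold cross-validation combined with a tf-idf representation, since you mentioned classification. The texts, labels, and the choice of LogisticRegression below are placeholders, not part of the original question; substitute your own opinions and classes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical labeled opinions (replace with your 2500 documents).
data = ["great product", "terrible service", "love it", "awful",
        "really good", "not good at all", "fantastic", "horrible"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Put the vectorizer in a pipeline so tf-idf is fit only on each
# fold's training portion, avoiding leakage into the validation fold.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())

# 4-fold cross-validation: each fold is held out once for scoring.
scores = cross_val_score(model, data, labels, cv=4)
print(scores)
```

With cross-validation every document is used for both training and evaluation across the folds, which gives a much more stable accuracy estimate on a corpus this size than one 80/20 split.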

