Python equivalent of R caTools 'sample.split'

Is there any Python (possibly pandas) equivalent to the following R code

 install.packages("caTools")
 library(caTools)
 set.seed(88)
 split = sample.split(df$col, SplitRatio = 0.75)

that will generate exactly the same split?


My current context for this is, for example, getting pandas data that exactly matches the R data frames (qualityTrain, qualityTest) created by:

 # https://courses.edx.org/c4x/MITx/15.071x/asset/quality.csv
 quality = read.csv("quality.csv")
 set.seed(88)
 split = sample.split(quality$PoorCare, SplitRatio = 0.75)
 qualityTrain = subset(quality, split == TRUE)
 qualityTest = subset(quality, split == FALSE)
4 answers

I think scikit-learn's train_test_split might work for you.

 import pandas as pd
 from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation

 url = 'https://courses.edx.org/c4x/MITx/15.071x/asset/quality.csv'
 quality = pd.read_csv(url)
 qualityTrain, qualityTest = train_test_split(quality, train_size=0.75, random_state=88)

Unfortunately, I do not get the same rows as the R function produces. I assume this is due to seeding, but I may be wrong.


Splitting with sample.split from the caTools library preserves the class distribution. Scikit-learn's train_test_split does not guarantee that (it simply splits the data set into random train and test subsets).
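As a side note, current versions of scikit-learn let train_test_split itself preserve the class distribution via its `stratify` parameter. A minimal sketch with synthetic data (the arrays and the ~25% class rate here are assumptions for illustration, not from the original question):

```python
# Sketch: train_test_split with stratify= keeps class proportions
# close to those of the full data set in both subsets.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(200, 3)                    # synthetic features
y = (rng.rand(200) < 0.25).astype(int)  # roughly 25% positive class

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=88
)

# Class proportions in train and test stay close to the overall proportion
print(y.mean(), y_train.mean(), y_test.mean())
```

This still will not reproduce R's exact row assignment, but it does give the stratification behavior of sample.split.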

You can get a result equivalent to R's caTools (with respect to class distribution) by using sklearn.cross_validation.StratifiedShuffleSplit instead:

 from sklearn.cross_validation import StratifiedShuffleSplit

 sss = StratifiedShuffleSplit(quality['PoorCare'], n_iter=1, test_size=0.25, random_state=0)
 for train_index, test_index in sss:
     qualityTrain = quality.iloc[train_index, :]
     qualityTest = quality.iloc[test_index, :]

I know this is an old thread, but I just found it while looking for a solution, because this problem comes up in many online statistics and machine learning classes that are taught in R when you want to use Python instead. The classes all say to call set.seed() in R and then use something like caTools' sample.split, and you are expected to get the same split, otherwise your later results will not match and you will not be able to answer the quiz or exercise questions correctly.

One of the main problems is that although both Python and R use the Mersenne Twister algorithm by default to generate pseudorandom numbers, I found by inspecting the random states of their respective PRNGs that they do not produce the same stream from the same seed. And one of them (I forget which) uses signed numbers and the other unsigned, so there seems to be little hope of finding a seed for Python that will produce the same series of numbers as R.
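To make the seeding point concrete: within Python, a given seed does reproduce the same stream; the mismatch described above is only across the two languages. A minimal sketch:

```python
# Sketch: Python's random module (Mersenne Twister) is reproducible
# within Python, but its stream for seed 88 differs from R's for set.seed(88).
import random

random.seed(88)
first_run = [random.random() for _ in range(3)]

random.seed(88)
second_run = [random.random() for _ in range(3)]

print(first_run == second_run)  # re-seeding reproduces the same stream
```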


A small correction to the above: StratifiedShuffleSplit is now part of sklearn.model_selection.

I have some data with X and Y in separate numpy arrays. The proportion of 1s versus 0s in my Y array is about 4.1%. If I use StratifiedShuffleSplit, it maintains this distribution in the train and test sets created afterwards. See below.

 from sklearn.model_selection import StratifiedShuffleSplit

 full_data_Y_np.sum() / len(full_data_Y_np)
 # 0.041006701187937859

 sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25)  # splitter definition (not shown in the original post)
 for train_index, test_index in sss.split(full_data_X_np, full_data_Y_np):
     X_train = full_data_X_np[train_index]
     Y_train = full_data_Y_np[train_index]
     X_test = full_data_X_np[test_index]
     Y_test = full_data_Y_np[test_index]

 Y_train.sum() / len(Y_train)
 # 0.041013925152306355
 Y_test.sum() / len(Y_test)
 # 0.040989847715736043
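A self-contained version of the same idea, using the modern import path. The arrays here are synthetic stand-ins (an assumption), since the original data is not shown:

```python
# Sketch: sklearn.model_selection.StratifiedShuffleSplit preserves the
# positive-class rate (~4.1% here) in both the train and test subsets.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.RandomState(0)
full_data_X_np = rng.rand(1000, 5)
full_data_Y_np = (rng.rand(1000) < 0.041).astype(int)  # ~4.1% positives

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
for train_index, test_index in sss.split(full_data_X_np, full_data_Y_np):
    X_train = full_data_X_np[train_index]
    Y_train = full_data_Y_np[train_index]
    X_test = full_data_X_np[test_index]
    Y_test = full_data_Y_np[test_index]

# The positive-class rate is nearly identical in both subsets
print(full_data_Y_np.mean(), Y_train.mean(), Y_test.mean())
```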

Source: https://habr.com/ru/post/1215952/
