Why does random sampling scale with the dataset size rather than the sample size? (pandas .sample() example)

When sampling from pandas Series of different sizes, I was surprised to notice that the execution time seems to scale mainly with the size of the dataset being sampled from, not with the number of values drawn. Example:

    import pandas as pd
    import numpy as np
    import time as tm

    # generate a small and a large dataset
    testSeriesSmall = pd.Series(np.random.randn(10000))
    testSeriesLarge = pd.Series(np.random.randn(10000000))

    sampleSize = 10
    tStart = tm.time()
    currSample = testSeriesLarge.sample(n=sampleSize).values
    print('sample %d from %d values: %.5f s' % (sampleSize, len(testSeriesLarge), (tm.time() - tStart)))
    tStart = tm.time()
    currSample = testSeriesSmall.sample(n=sampleSize).values
    print('sample %d from %d values: %.5f s' % (sampleSize, len(testSeriesSmall), (tm.time() - tStart)))

    sampleSize = 1000
    tStart = tm.time()
    currSample = testSeriesLarge.sample(n=sampleSize).values
    print('sample %d from %d values: %.5f s' % (sampleSize, len(testSeriesLarge), (tm.time() - tStart)))
    tStart = tm.time()
    currSample = testSeriesSmall.sample(n=sampleSize).values
    print('sample %d from %d values: %.5f s' % (sampleSize, len(testSeriesSmall), (tm.time() - tStart)))

Output:

    sample 10 from 10000 values: 0.00126 s
    sample 10 from 10000000 values: 1.10504 s
    sample 1000 from 10000 values: 0.00122 s
    sample 1000 from 10000000 values: 1.15000 s

This seems counterintuitive. Maybe I am missing something obvious, but the problem looks similar to generating a list of random indices, so I would expect the number of values drawn to matter and the size of the dataset to matter very little. I have tried one or two other implementations with similar results, and it feels like I am simply missing a fundamental point.

My questions are twofold: (1) Is this a fundamental limitation or an implementation quirk in pandas? (2) Is there a significantly faster approach for randomly sampling from large datasets in this way?

+8
python pandas random sampling
2 answers

pandas.Series.sample() in your case comes down to the following:

    rs = np.random.RandomState()
    locs = rs.choice(axis_length, size=n, replace=False)
    return self.take(locs)

The rs.choice() call is the slow part:

    %timeit rs.choice(100000000, size=1, replace=False)
    1 loop, best of 3: 9.43 s per loop

It takes about ten seconds to generate a single random number! If you divide the first argument by ten, it takes about one second: the time scales with the population size, not the sample size. That is slow!
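If you want to verify that scaling on your own machine, here is a minimal sketch (absolute times will differ between machines):

    import time

    import numpy as np

    rs = np.random.RandomState()

    # Time a single no-replacement draw for growing population sizes.
    # The cost should grow roughly linearly with the population,
    # even though we only ask for one value.
    for population in (10**6, 10**7, 10**8):
        t0 = time.perf_counter()
        rs.choice(population, size=1, replace=False)
        print('population %d: %.3f s' % (population, time.perf_counter() - t0))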

If you use replace=True, it is very fast. So one workaround, if you don't mind the possibility of duplicate entries in your results, is to sample with replacement.
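Applied to the question's code, that is just a matter of passing replace=True, which pandas forwards to NumPy; note the sample may then contain the same row more than once:

    # Fast, but the sample may contain duplicates:
    currSample = testSeriesLarge.sample(n=sampleSize, replace=True).values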

The NumPy documentation for choice(replace=False) states:

This is equivalent to np.random.permutation(np.arange(5))[:3]

Which largely explains the problem: it generates the full array of possible values, shuffles it, and then takes the first N. This is the root cause of the performance problem, and it has already been reported upstream in NumPy: https://github.com/numpy/numpy/pull/5158
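You can reproduce the cost directly with that documented equivalent form (a sketch; timings are machine-dependent, and note it materializes the entire population in memory):

    import time

    import numpy as np

    t0 = time.perf_counter()
    # The documented equivalent of choice(replace=False): build and shuffle
    # the entire population, then slice off the first 10 elements.
    sample = np.random.permutation(np.arange(100000000))[:10]
    print('permutation form: %.2f s' % (time.perf_counter() - t0))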

It seems to be difficult to fix in NumPy itself, because people rely on the result of choice() staying the same across NumPy versions for a given random seed.

Since your use case is pretty narrow, you can do something like this:

    def sample(series, n):
        # Oversample with replacement, then deduplicate; np.unique also
        # sorts the surviving indices. With n << len(series), collisions
        # are rare, so 2*n draws almost always leave at least n unique indices.
        locs = np.random.randint(0, len(series), n * 2)
        locs = np.unique(locs)[:n]
        assert len(locs) == n, "sample() assumes n << len(series)"
        return series.take(locs)

This gives much faster times:

    sample 10 from 10000 values: 0.00735 s
    sample 10 from 1000000 values: 0.00944 s
    sample 10 from 100000000 values: 1.44148 s
    sample 1000 from 10000 values: 0.00319 s
    sample 1000 from 1000000 values: 0.00802 s
    sample 1000 from 100000000 values: 0.01989 s
    sample 100000 from 1000000 values: 0.05178 s
    sample 100000 from 100000000 values: 0.93336 s
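The assert relies on n being much smaller than len(series). If that assumption might not hold, a slightly more defensive variant of the same idea (a hypothetical sample_retry helper, not from the original answer) grows the oversampling factor instead of failing:

    import numpy as np

    def sample_retry(series, n):
        # Same trick as sample() above, but instead of asserting, retry with
        # a larger batch of candidate indices until n unique ones survive.
        # Still assumes n <= len(series).
        factor = 2
        while True:
            locs = np.unique(np.random.randint(0, len(series), n * factor))[:n]
            if len(locs) == n:
                return series.take(locs)
            factor *= 2  # too many collisions: draw a bigger batch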
+7

This looks like an internal issue in NumPy. I believe the pandas sample method calls numpy.random.choice. Let's look at how NumPy performs for various array sizes and sample sizes.

Creating Arrays

    large = np.arange(1000000)
    small = np.arange(1000)

Sampling time without replacement

    %timeit np.random.choice(large, 10, replace=False)
    10 loops, best of 3: 27.4 ms per loop

    %timeit np.random.choice(small, 10, replace=False)
    10000 loops, best of 3: 41.4 µs per loop

Sampling time with replacement

    %timeit np.random.choice(large, 10, replace=True)
    100000 loops, best of 3: 11.7 µs per loop

    %timeit np.random.choice(small, 10, replace=True)
    100000 loops, best of 3: 12.2 µs per loop

Interestingly, when sampling without replacement, the large array takes almost three orders of magnitude longer, and it is also three orders of magnitude larger. This suggests that NumPy shuffles the entire array and then takes the first 10 elements.

When sampling with replacement, each value is selected independently, so the timings are identical.
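To see why the with-replacement case does not depend on the array size: drawing k values with replacement only needs k random indices, whatever the array length. Conceptually (though not in the exact random stream NumPy uses), it amounts to this sketch:

    import numpy as np

    large = np.arange(1000000)

    # Sampling 10 values with replacement boils down to 10 random index
    # draws; the cost depends on the sample size, not on len(large).
    idx = np.random.randint(0, len(large), 10)
    sample = large[idx]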

+4
