pandas.Series.sample() in your case comes down to the following:
    rs = np.random.RandomState()
    locs = rs.choice(axis_length, size=n, replace=False)
    return self.take(locs)
Timing the rs.choice() part on its own:
    %timeit rs.choice(100000000, size=1, replace=False)
    1 loop, best of 3: 9.43 s per loop
It takes about ten seconds to generate a single random number! Divide the first argument (the population size) by ten and it takes about one second: the cost scales with the population, not with the number of samples. That is slow!
With replace=True it is very fast. So one workaround, if you can live with duplicate entries in the result, is to pass replace=True.
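For comparison (my own quick check, not part of the original measurement), the with-replacement call on the same population returns in microseconds on typical hardware, because NumPy only has to draw size independent uniform integers:

    %timeit rs.choice(100000000, size=1, replace=True)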
The NumPy documentation for choice(replace=False) states:
This is equivalent to np.random.permutation(np.arange(5))[:3]
That largely explains the problem: it materializes a huge array of every possible index, shuffles it, and then takes the first N. This is the root cause of the slowdown, and it has already been reported against NumPy here: https://github.com/numpy/numpy/pull/5158
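As a rough model of what that documented equivalence implies (a sketch of the behavior, not NumPy's actual source), the cost in both time and memory is proportional to the whole population, no matter how few samples you ask for:

    import numpy as np

    def naive_choice_without_replacement(population_size, n):
        # Materialize every candidate index: ~800 MB of int64 for 10**8.
        pool = np.arange(population_size)
        # Full Fisher-Yates shuffle: O(population_size), regardless of n.
        np.random.shuffle(pool)
        return pool[:n]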
It seems to be hard to fix inside NumPy, because people rely on the result of choice() staying the same (across NumPy versions) for the same random seed.
Since your use case is pretty narrow, you can do something like this:
    def sample(series, n):
        # Draw 2*n indices with replacement, then deduplicate and keep n of them.
        locs = np.random.randint(0, len(series), n * 2)
        locs = np.unique(locs)[:n]
        assert len(locs) == n, "sample() assumes n << len(series)"
        return series.take(locs)
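Usage looks like this (the series here is made-up demo data):

    import numpy as np
    import pandas as pd

    series = pd.Series(np.arange(100000000))
    subset = sample(series, 1000)  # 1000 distinct values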
This gives a much faster time:
    sample 10     from 10000     values: 0.00735 s
    sample 10     from 1000000   values: 0.00944 s
    sample 10     from 100000000 values: 1.44148 s
    sample 1000   from 10000     values: 0.00319 s
    sample 1000   from 1000000   values: 0.00802 s
    sample 1000   from 100000000 values: 0.01989 s
    sample 100000 from 1000000   values: 0.05178 s
    sample 100000 from 100000000 values: 0.93336 s
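Two caveats about the function above: the assert fires whenever the 2*n draws contain fewer than n distinct indices, and np.unique returns a sorted array, so truncating to the first n slightly favors low positions. A variant of my own (a sketch, not part of the workaround above) retries with a bigger draw and shuffles before truncating, which avoids both:

    def sample_retry(series, n):
        factor = 2
        while True:
            locs = np.unique(np.random.randint(0, len(series), n * factor))
            if len(locs) >= n:
                # np.unique sorts its result; shuffle so the truncation
                # below doesn't systematically prefer low indices.
                np.random.shuffle(locs)
                return series.take(locs[:n])
            factor *= 2  # collisions this severe are rare when n << len(series)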