Random sample of a subset of a DataFrame in pandas

Question

Say I have a DataFrame with 100,000 entries and want to split it into 100 sections of 1,000 entries each.

How do I take a random sample of size 50 from just one of the 100 sections? The data set is already ordered so that the first 1,000 rows are the first section, the next 1,000 rows are the second section, and so on.

Many thanks.

+17

python pandas sample random-sample

Wgp Jun 28 '16 at 20:17

3 answers

Andy hayden · Answer 1 · 2016-06-28T20:39:54+0000

You can use the sample method*:

    In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])

    In [12]: df.sample(2)
    Out[12]:
       A  B
    0  1  2
    2  5  6

    In [13]: df.sample(2)
    Out[13]:
       A  B
    3  7  8
    0  1  2

* On one of the section DataFrames.

Note: if the sample size is larger than the DataFrame, this raises an error unless you sample with replacement.

    In [14]: df.sample(5)
    ValueError: Cannot take a larger sample than population when 'replace=False'

    In [15]: df.sample(5, replace=True)
    Out[15]:
       A  B
    0  1  2
    1  3  4
    2  5  6
    3  7  8
    1  3  4
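To apply this to the layout in the question, one option is to slice out a single section by position and call sample on that slice. The following is a minimal sketch, assuming contiguous sections of 1,000 rows; the names section_size and section_index and the value column are illustrative and do not come from the original post.

```python
import numpy as np
import pandas as pd

# Illustrative data: 100,000 rows, i.e. 100 contiguous sections of 1,000 rows.
df = pd.DataFrame({"value": np.arange(100_000)})

section_size = 1_000  # rows per section, as described in the question
section_index = 3     # hypothetical choice: the fourth section (zero-based)

# Slice the chosen section by position, then draw 50 rows without replacement.
start = section_index * section_size
section = df.iloc[start:start + section_size]
sample = section.sample(n=50, random_state=0)  # random_state only for repeatability

print(len(sample))
```

Dropping random_state gives a different sample on each call, which is usually what you want outside of tests.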

jpjandrade · Answer 2 · 2016-06-28T20:24:16+0000

One solution is to use the choice function from numpy.

Say you want 50 entries out of the first 1,000 rows; you can use:

    import numpy as np

    chosen_idx = np.random.choice(1000, replace=False, size=50)
    df_trimmed = df.iloc[chosen_idx]

This, of course, does not take your block structure into account. If you want a sample of 50 elements from block i (counting blocks from zero), for example, you can do:

    import numpy as np

    block_start_idx = 1000 * i
    chosen_idx = np.random.choice(1000, replace=False, size=50)
    df_trimmed_from_block_i = df.iloc[block_start_idx + chosen_idx]
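The block-offset idea can be wrapped in a small helper. This is only a sketch under the same assumption of equally sized, contiguous blocks; the function name sample_from_block and its parameters are hypothetical, and it uses NumPy's Generator API (np.random.default_rng) instead of the np.random.choice call in the answer so that a seed can be passed in.

```python
import numpy as np
import pandas as pd

def sample_from_block(df, block, block_size=1000, n=50, seed=None):
    """Return n distinct rows drawn at random from one contiguous block of df."""
    rng = np.random.default_rng(seed)
    # Offsets within the block, drawn without replacement.
    offsets = rng.choice(block_size, size=n, replace=False)
    # Shift the offsets to the block's starting position and select by position.
    return df.iloc[block * block_size + offsets]

# Illustrative data: 100 blocks of 1,000 rows each.
df = pd.DataFrame({"value": np.arange(100_000)})
picked = sample_from_block(df, block=7, seed=42)
print(len(picked))
```

Here block=7 selects from rows 7000 to 7999; leaving seed as None gives a fresh sample each time.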

Generalcode · Answer 3 · 2019-06-12T18:10:58+0000

This is a good place for recursion.

    import numpy as np

    def main2():
        rows = 8  # say you have 8 rows; with real data use the actual row count
        rands = []
        for i in range(rows):
            gen = fun(rands, rows)
            rands.append(gen)
        print(rands)  # now loop over these random row positions

    def fun(rands, rows):
        # Draw a position; if it was already used, recurse and draw again.
        gen = np.random.randint(0, rows)
        if gen in rands:
            return fun(rands, rows)
        return gen

    if __name__ == "__main__":
        main2()

Example output (varies from run to run): [6, 0, 7, 1, 3, 5, 4, 2]
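For comparison, the same kind of result (every row position exactly once, in random order) can be obtained without recursion. This sketch swaps in np.random.permutation, which is not the technique used in this answer; the variable names are illustrative.

```python
import numpy as np

rows = 8  # same illustrative row count as in the answer

# A random ordering of 0..rows-1 with no repeats.
order = np.random.permutation(rows)
print(order.tolist())

# To sample k distinct positions, take the first k entries of the ordering.
k = 3
first_k = order[:k]
print(first_k.tolist())
```

This avoids the repeated redraws, which become more frequent in the recursive version as the list of used positions fills up.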
