Python pandas: conditionally select a single sample from a data frame

Say I have a data frame like this:

    category1  category2  other_col  another_col  ....
    a          1
    a          2
    a          2
    a          3
    a          3
    a          1
    b          10
    b          10
    b          10
    b          11
    b          11
    b          11

I want to draw a sample from my data frame so that each value of category1 appears an equal number of times. I am assuming that each value of category1 has the same number of rows. I know that sampling can be done in pandas with DataFrame.sample(). However, I also want the sample to represent category2 as evenly as possible. So, for example, with a sample size of 5, I would like something like:

    a  1
    a  2
    b  10
    b  11
    b  10

I would not want something like:

    a  1
    a  1
    b  10
    b  10
    b  10

Although this is a valid random sample of n=5, it would not meet my requirements, since I want the category2 values to vary as much as possible.

Note that in the first example, because a was selected only twice, the value 3 from category2 is not represented. This is fine. The goal is to represent the data in the sample as evenly as possible.

If it helps to have a clearer example, suppose the categories are fruit, vegetables, meat, grains, junk. With a sample size of 10, I would like each category to be represented as evenly as possible. Ideally, that means 2 of each. Then the two rows selected for each category should have subcategories that are also represented as evenly as possible. So, for example, fruit could have the subcategories red_fruits, yellow_fruits, etc. For the 2 fruit rows selected out of the 10, red_fruits and yellow_fruits would both be represented in the sample. Of course, with a larger sample size, more subcategories of fruit would be included (green_fruits, blue_fruits, etc.).

1 answer

The trick is to build a balanced array of per-group sample sizes. I've provided a clumsy way to do this. Then iterate over the groupby object, sampling each group according to the balanced array.

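    import numpy as np
    import pandas as pd

    def rep_sample(df, col, n, *args, **kwargs):
        nu = df[col].nunique()   # number of distinct groups in col
        mpb = n // nu            # minimum number of samples per group
        mku = n - mpb * nu       # leftover samples still to hand out
        fills = np.zeros(nu)
        fills[:mku] = 1          # the first mku groups get one extra sample
        sample_sizes = (np.ones(nu) * mpb + fills).astype(int)

        gb = df.groupby(col)
        # extra args and kwargs are forwarded to DataFrame.sample
        sample = lambda sub_df, i: sub_df.sample(sample_sizes[i], *args, **kwargs)
        subs = [sample(sub_df, i) for i, (_, sub_df) in enumerate(gb)]
        return pd.concat(subs)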
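The balanced array that rep_sample builds can also be computed on its own, which makes it easier to check. Below is a minimal sketch using divmod; the helper name balanced_sizes is hypothetical and not part of the answer above.

```python
import numpy as np

def balanced_sizes(n, k):
    """Split n draws across k groups so that sizes differ by at most one.

    Hypothetical helper: equivalent in effect to the sample_sizes array
    computed inside rep_sample.
    """
    base, extra = divmod(n, k)
    sizes = np.full(k, base, dtype=int)
    sizes[:extra] += 1  # the first `extra` groups get one more draw
    return sizes

print(balanced_sizes(5, 2))   # [3 2]
print(balanced_sizes(10, 5))  # [2 2 2 2 2]
```

As in rep_sample, the extra draws always go to the first groups in sorted order, so with a sample size of 5 the group a gets 3 rows and b gets 2.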

Demonstration

 rep_sample(df, 'category1', 5) 
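Note that rep_sample balances only the column it is given. One possible way to also spread the draws over category2, as the question asks, is to apply the same balancing again inside each category1 group. The following is a rough sketch under assumptions, not the answer's method: nested_rep_sample and balanced_sizes are hypothetical names, the data frame is rebuilt from the question's table, and every subgroup is assumed to have at least as many rows as are drawn from it.

```python
import numpy as np
import pandas as pd

def balanced_sizes(n, k):
    # Split n draws across k groups so that sizes differ by at most one.
    base, extra = divmod(n, k)
    sizes = np.full(k, base, dtype=int)
    sizes[:extra] += 1
    return sizes

def nested_rep_sample(df, outer, inner, n, random_state=None):
    # Hypothetical two-level variant: balance over `outer`, then over
    # `inner` within each outer group.
    rng = np.random.RandomState(random_state)
    outer_groups = list(df.groupby(outer))
    outer_sizes = balanced_sizes(n, len(outer_groups))
    parts = []
    for (_, outer_df), n_outer in zip(outer_groups, outer_sizes):
        inner_groups = list(outer_df.groupby(inner))
        inner_sizes = balanced_sizes(int(n_outer), len(inner_groups))
        # Shuffle so the extra draws do not always go to the first subgroups.
        rng.shuffle(inner_sizes)
        for (_, inner_df), n_inner in zip(inner_groups, inner_sizes):
            if n_inner > 0:
                # Assumes the subgroup has at least n_inner rows.
                parts.append(inner_df.sample(int(n_inner), random_state=rng))
    return pd.concat(parts)

# Example data reconstructed from the question's table.
df = pd.DataFrame({
    "category1": list("aaaaaabbbbbb"),
    "category2": [1, 2, 2, 3, 3, 1, 10, 10, 10, 11, 11, 11],
})

sample = nested_rep_sample(df, "category1", "category2", 5, random_state=0)
print(sample.sort_index())
```

With a sample size of 5, a contributes 3 rows (one for each of 1, 2, 3) and b contributes 2 rows (one for each of 10 and 11); which specific rows are drawn depends on the seed.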
