Say I have a data frame like this:
category1  category2  other_col  another_col  ....
a          1
a          2
a          2
a          3
a          3
a          1
b          10
b          10
b          10
b          11
b          11
b          11
I want to draw a sample from my data frame in which each value of category1 appears an equal number of times. I assume that each value of category1 occurs equally often in the data. I know this can be done in pandas with DataFrame.sample(). However, I also want the values of category2 to be represented as evenly as possible in the sample. So, for example, with a sample size of 5, I would like something like:
a  1
a  2
b  10
b  11
b  10
I would not want something like:
a  1
a  1
b  10
b  10
b  10
Although this is a valid random sample of n=5, it does not meet my requirements, since I want the category2 values to vary as much as possible.
Note that in the first example, because a was selected only twice, the value 3 from category2 is not represented. This is fine. The goal is for the sample to represent the data as evenly as possible.
If it helps, here is a clearer example: suppose the categories are fruit, vegetables, meat, grains, and junk. With a sample size of 10, I would like each category to be represented as evenly as possible, so ideally 2 rows of each. Then the 2 rows selected for each category should have subcategories that are also represented as evenly as possible. For example, fruit could have subcategories red_fruits, yellow_fruits, etc. For the 2 fruit rows selected out of the 10, red_fruits and yellow_fruits would both be represented in the sample. Of course, with a larger sample size, more fruit subcategories (green_fruits, blue_fruits, etc.) would be included.
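To make the requirement concrete, here is a minimal sketch of one way it might be met: a two-level round-robin that cycles over category1 first and, within each category1 value, over category2. The function name balanced_sample, its parameters, and the helper columns _inner_rank and _outer_rank are hypothetical names made up for this illustration; the data frame is the example from above with only the two category columns.

```python
import numpy as np
import pandas as pd


def balanced_sample(df, n, outer="category1", inner="category2", seed=None):
    """Pick n rows, round-robin over `outer`, then over `inner` within it."""
    rng = np.random.default_rng(seed)
    # Shuffle first so that every later "take the first rows" step is random.
    out = df.iloc[rng.permutation(len(df))]
    # 0 for the first row of each (outer, inner) pair, 1 for the second, ...
    inner_rank = out.groupby([outer, inner]).cumcount().to_numpy()
    # Within each outer group, rows with a not-yet-used inner value come first.
    out = out.assign(_inner_rank=inner_rank).sort_values(
        "_inner_rank", kind="mergesort"  # mergesort is stable
    )
    # Position of each row inside its outer group under that ordering.
    outer_rank = out.groupby(outer).cumcount().to_numpy()
    # Taking the lowest positions cycles through the outer groups in turn.
    out = out.assign(_outer_rank=outer_rank).sort_values(
        "_outer_rank", kind="mergesort"
    )
    return out.head(n).drop(columns=["_inner_rank", "_outer_rank"])


# The example data from the question (hypothetical reconstruction).
df = pd.DataFrame({
    "category1": list("aaaaaabbbbbb"),
    "category2": [1, 2, 2, 3, 3, 1, 10, 10, 10, 11, 11, 11],
})
sample = balanced_sample(df, n=5, seed=0)
print(sample.sort_values(["category1", "category2"]))
```

Note that this is not a uniform random sample: when n does not divide evenly, the leftover slot goes to a category1 value that still has unused category2 values, which favours variety over strict randomness.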