Divide a set into subsets with an equal number of elements

To conduct a psychological experiment, I have to divide a set of 240 images, each described by 4 features (real numbers), into 3 subsets with the same number of elements in each (240/3 = 80), in such a way that all the subsets are approximately balanced with respect to these features (in terms of mean and standard deviation).

Can anyone suggest an algorithm to automate this? Are there any packages or modules in Python or R that I could use? Where should I begin?

5 answers

If I understand your problem correctly, you can use random.sample() in Python:

    import random

    pool = set(["foo", "bar", "baz", "123", "456", "789"])  # your 240 elements here
    slen = len(pool) // 3                                   # we need 3 subsets
    # random.sample() needs a sequence in Python 3.11+, hence sorted(pool)
    set1 = set(random.sample(sorted(pool), slen))           # 1st random subset
    pool -= set1
    set2 = set(random.sample(sorted(pool), slen))           # 2nd random subset
    pool -= set2
    set3 = pool                                             # 3rd random subset

I would solve it as follows:

  • Divide the items into 3 equal subsets.
  • Compute the mean and variance of each subset, and build a measure of "unevenness" from them.
  • Compare each pair of elements from different subsets; if swapping them would reduce the "unevenness", swap them. Continue until no more pairs can be improved, or until the overall unevenness drops below some arbitrary "good enough" threshold (see the sketch below).
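
A minimal sketch of this heuristic in Python/NumPy. The concrete "unevenness" measure (spread of per-subset means and standard deviations) and all function names are my own assumptions, since the answer leaves them open:

    import numpy as np

    def unevenness(X, labels, k=3):
        # Hypothetical measure: spread of per-subset means plus spread of
        # per-subset standard deviations, summed over the features.
        means = np.array([X[labels == g].mean(axis=0) for g in range(k)])
        stds = np.array([X[labels == g].std(axis=0) for g in range(k)])
        return np.ptp(means, axis=0).sum() + np.ptp(stds, axis=0).sum()

    def balance_by_swaps(X, k=3, seed=0):
        rng = np.random.default_rng(seed)
        n = len(X)                              # assumes n is divisible by k
        labels = np.repeat(np.arange(k), n // k)
        rng.shuffle(labels)                     # start from k random equal subsets
        cost = unevenness(X, labels, k)
        improved = True
        while improved:                         # keep sweeping until no swap helps
            improved = False
            for i in range(n):
                for j in range(i + 1, n):
                    if labels[i] == labels[j]:
                        continue
                    labels[i], labels[j] = labels[j], labels[i]
                    new_cost = unevenness(X, labels, k)
                    if new_cost < cost:
                        cost, improved = new_cost, True
                    else:                       # undo swaps that do not help
                        labels[i], labels[j] = labels[j], labels[i]
        return labels

Each sweep checks about n²/2 candidate swaps (~29,000 for n = 240), so a few sweeps should finish in seconds.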

You can easily do this using the plyr library in R. Here is the code.

    require(plyr)

    # CREATE DUMMY DATA
    mydf <- data.frame(feature = sample(LETTERS[1:4], 240, replace = TRUE))

    # SPLIT BY FEATURE AND DIVIDE EACH GROUP INTO THREE EQUAL SUBSETS
    # (rep() + sample() guarantees equal counts; sample(1:3, ..., replace = TRUE) would not)
    mydf <- ddply(mydf, .(feature), mutate,
                  sub = sample(rep(1:3, length.out = length(feature))))

    # CHECK THE ASSIGNMENT
    table(mydf$feature, mydf$sub)

In case you are still interested in the exhaustive-search question: you have C(240, 80) choices for the first set, and then C(160, 80) choices for the second set, after which the third set is fixed. In total, this gives you:

120554865392512357302183080835497490140793598233424724482217950647 * 92045125813734238026462263037378063990076729140
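
For the record, these binomial coefficients can be computed directly with math.comb (available since Python 3.8):

    from math import comb

    # choose the first subset, then the second; the third subset is then fixed
    total = comb(240, 80) * comb(160, 80)
    print(total)  # on the order of 10**112 possibilities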

Clearly this is not an option :)


Order your items by decreasing Mahalanobis distance from the mean; they will then run from the most unusual to the most ordinary, taking into account the effects of any correlations between the measures.

Assign X[3*i], X[3*i+1], X[3*i+2] to subsets A, B, C, choosing for each i the A/B/C assignment that minimizes your measure of imbalance.

Why decreasing order? The statistically heaviest objects get assigned first, and the more rounds that remain afterwards, the more chances the permutation choices have to even out any initial imbalance.

The point of this procedure is to maximize the likelihood that any outliers in the dataset will be assigned to separate subsets.
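
A sketch of this procedure in Python/NumPy. The imbalance measure (spread of per-subset means and standard deviations) and the function names are my own assumptions, as the answer leaves them unspecified:

    import numpy as np
    from itertools import permutations

    def imbalance(X, subsets):
        # Hypothetical measure: spread of per-subset means plus spread of
        # per-subset standard deviations, summed over the features.
        means = np.array([X[s].mean(axis=0) for s in subsets])
        stds = np.array([X[s].std(axis=0) for s in subsets])
        return np.ptp(means, axis=0).sum() + np.ptp(stds, axis=0).sum()

    def assign_by_mahalanobis(X, k=3):
        # Order items by decreasing Mahalanobis distance from the mean.
        mu = X.mean(axis=0)
        cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
        d = np.einsum('ij,jk,ik->i', X - mu, cov_inv, X - mu)
        order = np.argsort(-d)                  # most unusual items first

        # Assign each consecutive group of k items using the permutation that
        # minimizes the imbalance measure (assumes len(X) % k == 0).
        subsets = [[] for _ in range(k)]
        for t in range(0, len(order), k):
            group = order[t:t + k]
            best = min(permutations(range(k)),
                       key=lambda p: imbalance(
                           X, [s + [group[p[j]]] for j, s in enumerate(subsets)]))
            for j, s in enumerate(subsets):
                s.append(group[best[j]])
        return subsets

For the 240 × 4 data in the question, assign_by_mahalanobis(X) would return three lists of 80 row indices each.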

