Find unique item combinations

I am trying to find all the unique groupings of a list of items. Below is the code:

x <- c("Dominion", "progress", "scarolina", "tampa", "tva",
       "TminKTYS", "TmaxKTYS", "TminKBNA", "TmaxKBNA", "TminKMEM", "TmaxKMEM",
       "TminKCRW", "TmaxKCRW", "TminKROA", "TmaxKROA", "TminKCLT", "TmaxKCLT",
       "TminKCHS", "TmaxKCHS", "TminKATL", "TmaxKATL", "TminKCMH", "TmaxKCMH",
       "TminKJAX", "TmaxKJAX", "TminKLTH", "TmaxKLTH", "TminKMCO", "TmaxKMCO",
       "TminKMIA", "TmaxKMIA", "TminKPTA", "TmaxKTPA", "TminKPNS", "TmaxKPNS",
       "TminKLEX", "TmaxKLEX", "TminKSDF", "TmaxKSDF")
zz <- sapply(seq_along(x), function(y) combn(x, y))  # Generates a list of the combinations
sapply(zz, function(z) t(unique(t(z))))              # Filter out all the duplicates

However, the code causes my machine to run out of memory. Is there a better way to do this? I understand that I have a long list. Thanks.

+4
2 answers

To compute all unique subsets, you simply create all binary vectors whose length equals the cardinality of the original set of elements. If there are 39 elements, you look at all binary vectors of length 39. Each entry of a vector says yes or no to whether the corresponding element is in that subset.
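The binary-vector idea can be sketched on a small set. The three-item vector `items` below is a hypothetical stand-in for the 39 names in the question, and `intToBits` limits this sketch to at most 31 items; it is an illustration of the encoding, not a way around the size problem.

```r
# Hypothetical small set standing in for the 39 names in the question.
items <- c("a", "b", "c")
n <- length(items)

# Each integer k from 1 to 2^n - 1 encodes one binary vector of length n:
# bit i of k set means items[i] is in the subset. Starting at 1 skips the
# empty set (the all-0 vector).
subsets <- lapply(seq_len(2^n - 1), function(k) {
  in_subset <- as.integer(intToBits(k))[seq_len(n)] == 1L
  items[in_subset]
})

length(subsets)  # number of non-empty subsets, 2^n - 1
subsets[[5]]     # k = 5 is binary 101, so items 1 and 3
```

Every subset appears exactly once because every integer has exactly one binary representation, so no `unique()` pass is needed afterwards.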

Since there are 39 elements, and each of them can be either in or not in a given subset, there are 2^39 possible subsets. Excluding the empty set, i.e. the all-0 vector, you have 2^39 - 1 possible subsets.

That is, as @joran said, there are about 549 billion vectors. Even if you store them as binary vectors, which is the most compact representation (i.e., without strings), you will need 549B * 39 bits to hold all the subsets. I don't think you want to store this: it is about 2.68E12 bytes, or roughly 2.7 terabytes. If you insist on using character strings, you will probably be in the tens of terabytes.
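The arithmetic behind those figures can be checked directly in R. This is only a back-of-the-envelope sketch: it assumes one bit per element per subset and ignores any per-object overhead R would add, so real storage would be larger.

```r
n <- 39                   # number of items in the question
n_subsets <- 2^n - 1      # non-empty subsets
total_bits <- n_subsets * n
total_bytes <- total_bits / 8

format(n_subsets, big.mark = ",")  # subset count, about 5.5e11
total_bytes / 1e12                 # size in terabytes (10^12 bytes)
```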

Of course, it is possible to buy a system that can handle this, but it is not very cost-effective.

On a meta level, it is very likely, as @JD said, that this is not the route you really need to take. I recommend posting a new question that clarifies what you are actually trying to do, either here or on the SE site dedicated to statistics.

+3

You can try using expand.grid.

Create a data frame from all combinations of the supplied vectors or factors. See the description of the return value for precise details of the way this is done.
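One way this suggestion might be applied is to give `expand.grid` one TRUE/FALSE column per item, so that each row is an in/out pattern. The sketch below uses a hypothetical three-item vector `items`; note that the number of rows still grows as 2^n, so for the 39 items in the question this would run into the same memory limit.

```r
# Hypothetical small set standing in for the 39 names in the question.
items <- c("a", "b", "c")

# One FALSE/TRUE choice per item; expand.grid builds every combination.
flags <- rep(list(c(FALSE, TRUE)), length(items))
names(flags) <- items
keep <- as.matrix(expand.grid(flags))

# Drop the all-FALSE row, which corresponds to the empty set.
keep <- keep[rowSums(keep) > 0, , drop = FALSE]

# Turn each row of flags back into the item names it selects.
subsets <- lapply(seq_len(nrow(keep)), function(i) items[keep[i, ]])

length(subsets)  # 2^3 - 1 non-empty subsets
```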

0
