Creating a very large matrix of string combinations using combn() and the bigmemory package

I have a vector x of 1,344 unique strings. I want to create a matrix that gives me all possible groups of three values, regardless of order, and exports them to csv.

I am running R on an EC2 m1.large instance with 64-bit Ubuntu. When I call combn(x, 3), I get an out-of-memory error:

Error: cannot allocate vector of size 9.0 Gb 

The resulting matrix has C(1344, 3) = 403,716,544 rows and three columns, which is the transpose of what the combn() function returns.
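As a rough check on those numbers, the row count and the size in the error message can be reproduced with base R. This is a back-of-the-envelope sketch: the 8 bytes per cell is an assumption (one pointer per string in a character matrix on 64-bit R), and it ignores the storage for the strings themselves.

```r
# Number of unordered triples that can be drawn from 1,344 items.
n_combos <- choose(1344, 3)

# Assumption: each cell of the character matrix costs 8 bytes
# (one pointer per string on a 64-bit build of R).
bytes_per_cell <- 8
size_gib <- n_combos * 3 * bytes_per_cell / 1024^3

print(n_combos)
print(round(size_gib, 1))
```

Under that assumption the estimate agrees with the 9.0 Gb that R reports it cannot allocate.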

I thought about using the bigmemory package to create a file-backed big.matrix, so that I could then assign the results of combn() to it. I can create the preallocated big matrix:

    library(bigmemory)
    x <- as.character(1:1344)
    combos <- 403716544
    test <- filebacked.big.matrix(nrow = combos, ncol = 3,
                                  init = 0, backingfile = "test.matrix")

But when I try to assign the values with test <- combn(x, 3), I still get the same thing: Error: cannot allocate vector of size 9.0 Gb

I even tried to coerce the result of combn(x, 3), but I think that because combn() itself fails with an error, as.big.matrix() never gets a chance to run.

    test <- as.big.matrix(matrix(combn(x, 3)), backingfile = "abc")
    Error: cannot allocate vector of size 9.0 Gb
    Error in as.big.matrix(matrix(combn(x, 3)), backingfile = "abc") :
      error in evaluating the argument 'x' in selecting a method for function 'as.big.matrix'

Is there a way to combine these two functions together to get what I need? Are there other ways to achieve this? Thanks.

+6
r combinatorics bigdata
3 answers

You could first find all the 2-way combinations, and then combine them with the third value, writing the results out at each step. This takes a lot less memory:

    combn.mod <- function(x, fname){
      tmp <- combn(x, 2, simplify = FALSE)
      n <- length(x)
      for (i in x[-c(n, n - 1)]){
        # Drop all combinations that contain value i
        id <- which(!unlist(lapply(tmp, function(t) i %in% t)))
        tmp <- tmp[id]
        # add i to all other combinations and write to file
        out <- do.call(rbind, lapply(tmp, c, i))
        write(t(out), file = fname, ncolumns = 3, append = TRUE, sep = ",")
      }
    }

    combn.mod(x, "F:/Tmp/Test.txt")

This is not as general as Joshua's answer; it is specific to your case. I guess it is faster for this particular case, but I did not make the comparison. The function works on my computer using a little over 50 MB (roughly estimated) when applied to your x.

EDIT

On a side note: if this is for simulation purposes, I find it hard to believe that any scientific application needs 400+ million simulation runs. You might be asking for the right answer to the wrong question here...

PROOF OF CONCEPT:

I changed the write line to tt[[i]] <- out, added tt <- list() before the loop and return(tt) after it. Then:

    > do.call(rbind,combn.mod(letters[1:5]))
          [,1] [,2] [,3]
     [1,] "b"  "c"  "a"
     [2,] "b"  "d"  "a"
     [3,] "b"  "e"  "a"
     [4,] "c"  "d"  "a"
     [5,] "c"  "e"  "a"
     [6,] "d"  "e"  "a"
     [7,] "c"  "d"  "b"
     [8,] "c"  "e"  "b"
     [9,] "d"  "e"  "b"
    [10,] "d"  "e"  "c"
+3

Here's a function I wrote in R, which currently finds its (unexported) home in the LSPM package. You give it the total number of items n, the number of items to select r, and the index of the combination you want i; it returns the values in 1:n corresponding to combination i.

    ".combinadic" <- function(n, r, i) {

      # http://msdn.microsoft.com/en-us/library/aa289166(VS.71).aspx
      # http://en.wikipedia.org/wiki/Combinadic

      if (i < 1 || i > choose(n, r)) stop("'i' must satisfy 1 <= i <= choose(n, r)")

      # Largest v with choose(v, r) < i, searching downward from n
      largestV <- function(n, r, i) {
        #v <- n-1
        v <- n                                  # Adjusted for one-based indexing
        #while(choose(v,r) > i) v <- v-1
        while (choose(v, r) >= i) v <- v - 1    # Adjusted for one-based indexing
        return(v)
      }

      res <- rep(NA, r)
      for (j in 1:r) {
        res[j] <- largestV(n, r, i)
        i <- i - choose(res[j], r)
        n <- res[j]
        r <- r - 1
      }
      res <- res + 1
      return(res)
    }

It allows you to generate each combination from the value of its lexicographic index:

    > .combinadic(1344, 3, 1)
    [1] 3 2 1
    > .combinadic(1344, 3, 2)
    [1] 4 2 1
    > .combinadic(1344, 3, 403716544)
    [1] 1344 1343 1342

So you just need to loop over 1:403716544 and append the results to a file. It may take a while, but it is at least feasible (see Dirk's answer). You may also need to do it in several loops, since the vector 1:403716544 will not fit in memory on my machine.

Or you could just port the R code to C/C++ and do the looping and writing there, since that would be much faster.
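The chunked loop described above might look roughly like this. It is a minimal sketch rather than the answer's own code: unrank_combo() is a compact restatement of .combinadic() included only so the block runs on its own, and write_combos_chunked() and its chunk_size parameter are hypothetical names chosen for illustration.

```r
# Compact restatement of the answer's .combinadic(), so this sketch runs alone.
unrank_combo <- function(n, r, i) {
  res <- numeric(r)
  for (j in seq_len(r)) {
    v <- n
    while (choose(v, r) >= i) v <- v - 1
    res[j] <- v
    i <- i - choose(v, r)
    n <- v
    r <- r - 1
  }
  res + 1
}

# Hypothetical helper: walk the index range in chunks so that only
# chunk_size combinations are held in memory at any time.
write_combos_chunked <- function(x, r, file, chunk_size = 1e5) {
  n <- length(x)
  total <- choose(n, r)
  start <- 1
  while (start <= total) {
    end <- min(start + chunk_size - 1, total)
    # One column per combination; transpose to one row per combination.
    idx <- t(vapply(seq(start, end),
                    function(i) unrank_combo(n, r, i),
                    numeric(r)))
    chunk <- matrix(x[as.vector(idx)], ncol = r)
    # The first chunk overwrites the file, later chunks append to it.
    write.table(chunk, file = file, sep = ",", append = start > 1,
                row.names = FALSE, col.names = FALSE, quote = FALSE)
    start <- end + 1
  }
  invisible(total)
}

# Small usage example on five letters, written to a temporary file.
out_file <- tempfile(fileext = ".csv")
write_combos_chunked(letters[1:5], 3, out_file, chunk_size = 4)
combo_lines <- readLines(out_file)
```

For the full 1,344-element vector the same call would take a long time in pure R, which is why porting the inner loop to C/C++ may still be worthwhile.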

+5

To a first approximation, every algorithm trades off storage for speed.

You have hit a boundary trying to preallocate your fully enumerated combination matrix. So maybe you should not try to preallocate this matrix, but instead try, say,

  • If you think you need the combinations, calculate them somewhere else and store them in a simple db (or, heck, a flat file) and look them up: 9 Gb saved.

  • Take advantage of open source: read the code of combn() and modify it into a client-server style function that, when called with index number N, loops and returns the Nth entry. Not efficient, but possibly more easily feasible.
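The second suggestion can be sketched without touching the source of combn() at all. The code below is an illustration under that assumption, not a modification of combn(): next_combo() and nth_combo() are hypothetical helper names, and the stepping rule is the standard lexicographic successor for increasing index vectors, which matches the column order combn() produces.

```r
# Advance an increasing index vector idx (a combination of 1:n) to its
# lexicographic successor; returns NULL once the last combination is passed.
next_combo <- function(idx, n) {
  r <- length(idx)
  j <- r
  # Find the rightmost position that can still be incremented.
  while (j >= 1 && idx[j] == n - r + j) j <- j - 1
  if (j < 1) return(NULL)
  idx[j] <- idx[j] + 1
  # Reset everything to the right of j to the smallest valid values.
  if (j < r) idx[(j + 1):r] <- idx[j] + seq_len(r - j)
  idx
}

# Return the Nth combination of x taken r at a time by stepping N - 1 times.
# Only one combination is held in memory at any moment.
nth_combo <- function(x, r, N) {
  idx <- seq_len(r)
  for (k in seq_len(N - 1)) {
    idx <- next_combo(idx, length(x))
    if (is.null(idx)) stop("N exceeds choose(length(x), r)")
  }
  x[idx]
}

first_combo  <- nth_combo(letters[1:5], 3, 1)
fourth_combo <- nth_combo(letters[1:5], 3, 4)
last_combo   <- nth_combo(letters[1:5], 3, 10)
```

As the answer warns, looking up the Nth entry this way costs N - 1 steps per call, so it trades speed for a very small memory footprint.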

+1