I am trying to take a very large set of records with several indexes, calculate the aggregate statistics for groups defined by a subset of indexes, and then insert them into each row of the table. The problem here is that these are very large tables - 10 M rows each.
Code for reproducing the data is below.
The basic idea is that there is a set of indices, for example ix1, ix2, ix3, ..., ixK. In general, I select only a couple of them, say ix1 and ix2. Then I compute an aggregate of the column named val over all rows that share the same values of ix1 and ix2 (for every combination that occurs). To keep it simple, I will focus on the sum.
I have tried the following methods:
Via sparse matrices: convert the values to a coordinate list, i.e. (ix1, ix2, val), then create a sparse matrix. This sums everything up nicely, and then I only need to convert back from the sparse matrix representation to a coordinate list. Speed: good, but it does more than necessary and does not generalize to higher dimensions (for example, ix1, ix2, ix3) or to functions more general than the sum.
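For illustration, the sparse matrix route might look like the sketch below. It assumes the Matrix package and uses a tiny made-up data set rather than the full table; the variable names M and coords are only placeholders.

```r
library(Matrix)

# Tiny made-up data, standing in for the 10^7-row table
ix1 <- c(1L, 1L, 2L, 2L, 2L)
ix2 <- c(1L, 1L, 1L, 3L, 3L)
val <- c(0.5, 1.5, 2.0, 1.0, 3.0)

# sparseMatrix() adds up the x values of duplicated (i, j) pairs,
# so each stored entry is the group sum for one (ix1, ix2) pair
M <- sparseMatrix(i = ix1, j = ix2, x = val)

# Back to a coordinate list: one row (i, j, x) per non-zero entry
coords <- as.data.frame(summary(M))
```

Note that a group whose sum is exactly zero would not be distinguishable from an absent pair in this representation.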
Using lapply and split: by creating a new index that is unique for each (ix1, ix2, ...) n-tuple, I can use split and lapply. The bad news is that split converts the unique index to a factor, and this conversion is very time-consuming. Try system.time({zz <- as.factor(1:10^7)}).
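A rough sketch of the split and lapply approach, again on tiny made-up data. Building the combined key with paste() is just one possible choice and is an assumption here, not necessarily what was used on the real table.

```r
# Tiny made-up data, standing in for the 10^7-row table
ix1 <- c(1L, 1L, 2L, 2L, 2L)
ix2 <- c(1L, 1L, 1L, 3L, 3L)
val <- c(0.5, 1.5, 2.0, 1.0, 3.0)

# One key per (ix1, ix2) pair (hypothetical key construction)
key <- paste(ix1, ix2, sep = "_")

# split() coerces the key to a factor internally; this is the slow step
groups <- split(val, key)
sums <- unlist(lapply(groups, sum))

# Look up each row's group total by name
row_sum <- unname(sums[key])
```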
Now I am trying data.table, with a command like sumDT <- DT[,sum(val),by = c("ix1","ix2")]. However, I do not yet see how to combine sumDT with DT, other than through DT2 <- merge(DT, sumDT, by = c("ix1","ix2")).
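To make the aggregate-then-merge step concrete, here is a small sketch using the same two commands on a made-up five-row table (the values are illustrative only).

```r
library(data.table)

# Tiny made-up table, standing in for the 10^7-row DT
DT <- data.table(ix1 = c(1L, 1L, 2L, 2L, 2L),
                 ix2 = c(1L, 1L, 1L, 3L, 3L),
                 ix3 = c(1L, 2L, 1L, 1L, 2L),
                 val = c(0.5, 1.5, 2.0, 1.0, 3.0))

# One row per (ix1, ix2) group; the unnamed result column is called V1
sumDT <- DT[, sum(val), by = c("ix1", "ix2")]

# Attach the group sum to every original row
DT2 <- merge(DT, sumDT, by = c("ix1", "ix2"))
```

DT2 then has the same number of rows as DT, with the group sum repeated in column V1 for every row of the group.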
Is there a faster way in data.table to combine the two than the merge operation I described?
[I also tried bigsplit from the bigtabulate package and some other methods. Anything that converts to a factor is pretty much out: as far as I can tell, that conversion is very slow.]
Code for generating the data. Naturally, it is better to try a smaller N first to see whether something works, but not all methods scale well for N >> 1000.
library(data.table)

N <- 10^7
set.seed(2011)
ix1 <- 1 + floor(rexp(N, 0.01))
ix2 <- 1 + floor(rexp(N, 0.01))
ix3 <- 1 + floor(rexp(N, 0.01))
val <- runif(N)
DF <- data.frame(ix1 = ix1, ix2 = ix2, ix3 = ix3, val = val)
DF <- DF[order(DF[,1],DF[,2],DF[,3]),]
DT <- as.data.table(DF)