R: Tables and inserts with data.table

I am trying to take a very large set of records with several indexes, calculate aggregate statistics for groups defined by a subset of those indexes, and then insert the statistics into each row of the table. The problem is that these are very large tables: 10M rows each.

Code for reproducing data below.

The basic idea is that there is a set of indices, for example ix1, ix2, ix3, ..., ixK. In general, I select only a couple of them, say ix1 and ix2. Then, for each observed combination of ix1 and ix2, I compute an aggregate over all rows with those values in a column named val. To keep it simple, I will focus on the sum.
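To make the goal concrete, here is a toy base-R illustration of the desired result (the five-row toy data and the groupsum name are invented for this example; ave() does exactly this, but its internal factor conversion makes it far too slow at 10^7 rows):

    toy <- data.frame(ix1 = c(1, 1, 2, 2, 1),
                      ix2 = c(1, 1, 1, 2, 1),
                      val = c(0.50, 0.25, 1.00, 2.00, 0.25))
    toy$groupsum <- ave(toy$val, toy$ix1, toy$ix2, FUN = sum)  # per-(ix1, ix2) sum, one per row
    toy
      ix1 ix2  val groupsum
    1   1   1 0.50        1
    2   1   1 0.25        1
    3   2   1 1.00        1
    4   2   2 2.00        2
    5   1   1 0.25        1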

I have tried the following methods:

  • Via sparse matrices: convert the values to a coordinate list, i.e. (ix1, ix2, val), then build a sparse matrix; construction sums duplicate entries nicely, and afterwards I only need to convert back from the sparse-matrix representation to a coordinate list. Speed: good, but it does more work than necessary and does not generalize to higher dimensions (for example, ix1, ix2, ix3) or to functions more general than the sum. (See the first sketch after this list.)

  • Using lapply and split: by creating a new index unique to each (ix1, ix2, ...) n-tuple, I can use split and apply. The bad news is that split converts the unique index to a factor, and that conversion is very time-consuming. Try system.time(zz <- as.factor(1:10^7)). (See the second sketch after this list.)

  • Now I am trying data.table, with a command like sumDT <- DT[, sum(val), by = c("ix1","ix2")]. However, I still do not see how to combine sumDT with DT, other than via DT2 <- merge(DT, sumDT, by = c("ix1","ix2")).
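For concreteness, a minimal sketch of the sparse-matrix route, assuming the Matrix package and the ix1, ix2, val vectors from the generation code below; it handles exactly two positive-integer indices and only sums:

    library(Matrix)
    M   <- sparseMatrix(i = ix1, j = ix2, x = val)  # duplicate (i, j) cells are summed on construction
    agg <- summary(M)                               # back to an (i, j, x) coordinate list
    grp <- M[cbind(ix1, ix2)]                       # per-row group totals, aligned with val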
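And a minimal sketch of the split route; the composite key that split() coerces to a factor is exactly the slow step described above:

    idx  <- paste(ix1, ix2, sep = ".")                # composite key, one string per row
    sums <- vapply(split(val, idx), sum, numeric(1))  # split() coerces idx to a factor here
    grp  <- sums[idx]                                 # broadcast the group sums back onto the rows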

Is there a faster way to combine this data.table with its aggregates than the merge operation I described?

[I also tried bigsplit from the bigtabulate package and some other methods. Anything that converts to a factor is at a disadvantage; as far as I can tell, the factor conversion is very slow.]


Code for generating the data is below. Naturally, it is better to try a smaller N first to see that something works, but not all methods scale well as N grows to 10^7.

    library(data.table)

    N <- 10^7
    set.seed(2011)
    ix1 <- 1 + floor(rexp(N, 0.01))
    ix2 <- 1 + floor(rexp(N, 0.01))
    ix3 <- 1 + floor(rexp(N, 0.01))
    val <- runif(N)
    DF <- data.frame(ix1 = ix1, ix2 = ix2, ix3 = ix3, val = val)
    DF <- DF[order(DF[,1], DF[,2], DF[,3]), ]
    DT <- as.data.table(DF)
1 answer

Well, you may find that merging is not so bad as long as your key is set correctly.

Set up the problem:

    N <- 10^6  ## not 10^7 because RAM is tight right now
    set.seed(2011)
    ix1 <- 1 + floor(rexp(N, 0.01))
    ix2 <- 1 + floor(rexp(N, 0.01))
    ix3 <- 1 + floor(rexp(N, 0.01))
    val <- runif(N)
    DT <- data.table(ix1=ix1, ix2=ix2, ix3=ix3, val=val,
                     key=c("ix1", "ix2"))

Now you can calculate the summary statistics:

 info <- DT[, list(summary=sum(val)), by=key(DT)] 

And combine the columns "the data.table way", or just with merge:

    m1 <- DT[info]         ## the data.table way
    m2 <- merge(DT, info)  ## if you're just used to merge
    identical(m1, m2)
    [1] TRUE

If either of the merging methods is too slow, you can try a trickier way of building info that lines up with DT row for row:

    info2 <- DT[, list(summary=rep(sum(val), length(val))), by=key(DT)]
    m3 <- transform(DT, summary=info2$summary)
    identical(m1, m3)
    [1] TRUE

Now let's look at the timings:

    #######################################################################
    ## Using data.table[ ... ] or merge
    system.time(info <- DT[, list(summary=sum(val)), by=key(DT)])
       user  system elapsed
      0.203   0.024   0.232
    system.time(DT[info])
       user  system elapsed
      0.217   0.078   0.296
    system.time(merge(DT, info))
       user  system elapsed
      0.981   0.202   1.185

    ########################################################################
    ## Now the two parts of the last version done separately:
    system.time(info2 <- DT[, list(summary=rep(sum(val), length(val))), by=key(DT)])
       user  system elapsed
      0.574   0.040   0.616
    system.time(transform(DT, summary=info2$summary))
       user  system elapsed
      0.173   0.093   0.267

Or you can skip the intermediate construction of the info table, if the following doesn't seem too inscrutable for your tastes:

    system.time(m5 <- DT[ DT[, list(summary=sum(val)), by=key(DT)] ])
       user  system elapsed
      0.424   0.101   0.525
    identical(m5, m1)
    # [1] TRUE
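If your version of data.table supports := together with by, grouped assignment by reference is another way to avoid both the intermediate table and the join altogether; a minimal sketch (the summary column name is just an example):

    DT[, summary := sum(val), by = key(DT)]  # writes the group sums into DT in place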
