I have a large data set (more than 20 million observations), which I analyze with the survey package, and even simple queries take a long time to run. I have tried to speed up my code, but I would like to know whether there are more efficient approaches.
In my test, I compare the speed of three approaches to using svyby / svytotal:
- Simple svyby / svytotal
- Parallel computing with foreach %dopar% using 7 cores
- Compiled version of option 2
Spoiler: option 3 is more than twice as fast as option 1, but it is not suitable for large data sets, because it relies on parallel computation, which quickly hits memory limits when working with large data. I run into this problem despite having 16 GB of RAM. There are some workarounds for this memory limit, but none of them seem to apply to survey design objects.
Any ideas on how to make this faster without crashing due to memory limitations?
My code with a reproducible example:
# Load packages
library(survey)
library(data.table)
library(compiler)
library(foreach)
library(doParallel)
options(digits = 3)
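The lines creating the example data and the design object appear to have been lost when pasting. A minimal sketch of what that setup presumably looks like, assuming dclus1 is built from the api example data shipped with the survey package (stype, dnum and cname are columns of apiclus1) and that Vcount is a simple record counter (my assumption):

# Assumed reproducible setup: api example data from the survey package
data(api)
apiclus1$Vcount <- 1  # hypothetical counting variable used in the queries below
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)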
1) Simple code
t1 <- Sys.time()
table1 <- svyby(~Vcount, ~stype+dnum+cname, design = dclus1, svytotal)
T1 <- Sys.time() - t1
2) Parallel computing with foreach %dopar% using 7 cores (see the sketch below)
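The code for this option seems to have been dropped when the question was formatted. A minimal sketch of what it presumably does, registering 7 workers and splitting the design into subsets before running the same svyby call on each one (splitting by stype is my assumption; the foreach body matches the function defined in option 3):

# Register a parallel backend with 7 workers
cl <- makeCluster(7)
registerDoParallel(cl)

# Split the design into one subset per level of stype (assumed split variable)
list_subsets <- lapply(levels(apiclus1$stype),
                       function(x) subset(dclus1, stype == x))

t2 <- Sys.time()
table2 <- foreach(i = list_subsets, .combine = rbind, .packages = "survey") %dopar% {
  svyby(~Vcount, ~stype+dnum+cname, design = i, svytotal)
}
T2 <- Sys.time() - t2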
3) Compiled version of option 2
# Make a function of the previous query
query2 <- function(list_subsets) {
  foreach(i = list_subsets, .combine = rbind, .packages = "survey") %dopar% {
    svyby(~Vcount, ~stype+dnum+cname, design = i, svytotal)
  }
}
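The compilation and timing step is missing here; a minimal sketch, assuming cmpfun() from the already-loaded compiler package is used and list_subsets comes from the option 2 sketch above:

# Byte-compile the function (assumed use of compiler::cmpfun)
query2_compiled <- cmpfun(query2)

t3 <- Sys.time()
table3 <- query2_compiled(list_subsets)
T3 <- Sys.time() - t3

# Shut down the workers when done
stopCluster(cl)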
Results

> T1: 1.9 secs
> T2: 1.13 secs
> T3: 0.58 secs

barplot(c(T1, T2, T3),
        names.arg = c("1) simple table", "2) parallel", "3) compiled parallel"),
        ylab = "Seconds")
