Increase query speed using integrated survey design in R

I have a large data set (more than 20 million observations) that I analyze with the survey package, and even simple queries take a long time to run. I have tried to speed up my code, but I would like to know whether there are more efficient approaches.

In my test, I compare the speed of three approaches that use svyby / svytotal :

  • Simple svyby / svytotal
  • Parallel computing with foreach dopar using 7 cores
  • A compiled version of option 2

Spoiler: option 3 is more than twice as fast as option 1, but it is not well suited to large data sets, because it relies on parallel computation, which quickly hits memory limits with data of this size. I run into this problem even with 16 GB of RAM. There are several ways to work around this memory cap, but none of them apply to survey design objects.
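One workaround I have been considering is to run the parallel loop over smaller batches of subsets, so that only one batch of results is held in memory at a time. This is just a rough sketch, reusing `list_subsets` and the parallel backend from option 2 below; `batch_size` is an arbitrary placeholder, and I have not verified this on the full 20-million-row data set:

    # rough sketch: process the subset designs in batches to bound peak memory use
    batch_size <- 50   # placeholder -- tune to available RAM
    batches <- split(list_subsets, ceiling(seq_along(list_subsets) / batch_size))

    table_batched <- NULL
    for (b in batches) {
      res_b <- foreach(i = b, .combine = rbind, .packages = "survey") %dopar% {
        options(survey.lonely.psu = "remove")
        svyby(~Vcount, ~stype + dnum + cname, design = i, svytotal)
      }
      table_batched <- rbind(table_batched, res_b)
      rm(res_b); gc()   # release memory before starting the next batch
    }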

Any ideas on how to do this faster and not crash due to memory limitations?

My code with a reproducible example:

    # Load packages
    library(survey)
    library(data.table)
    library(compiler)
    library(foreach)
    library(doParallel)
    options(digits = 3)

    # Load data
    data(api)

    # Convert data to data.table format (mostly to increase the speed of the process)
    apiclus1 <- as.data.table(apiclus1)

    # Replicate each observation 1000 times
    apiclus1 <- apiclus1[rep(seq_len(nrow(apiclus1)), 1000), ]

    # create a count variable
    apiclus1[, Vcount := 1]

    # create survey design
    dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

1) Simple code

    t1 <- Sys.time()
    table1 <- svyby(~Vcount, ~stype + dnum + cname, design = dclus1, svytotal)
    T1 <- Sys.time() - t1

2) Parallel computing with foreach %dopar% on 7 cores

    # in this option, I create a list with different subsets of the survey design
    # that will be passed to different CPU cores to work on at the same time
    subdesign <- function(i){ subset(dclus1, dnum == i) }
    groups <- unique(apiclus1$dnum)

    # apply the function and collect all subsets in a list
    list_subsets <- lapply(groups, subdesign)
    i <- NULL

    # start the parallel backend
    registerDoParallel(cores = 7)

    t2 <- Sys.time()
    table2 <- foreach(i = list_subsets, .combine = rbind, .packages = "survey") %dopar% {
      options(survey.lonely.psu = "remove")
      svyby(~Vcount, ~stype + dnum + cname, design = i, svytotal)
    }
    T2 <- Sys.time() - t2

3) Compiled version of option 2

    # wrap the previous query in a function
    query2 <- function(list_subsets) {
      foreach(i = list_subsets, .combine = rbind, .packages = "survey") %dopar% {
        svyby(~Vcount, ~stype + dnum + cname, design = i, svytotal)
      }
    }

    # compile the function to increase speed
    query3 <- cmpfun(query2)

    t3 <- Sys.time()
    table3 <- query3(list_subsets)
    T3 <- Sys.time() - t3

Results

    T1: 1.9 secs
    T2: 1.13 secs
    T3: 0.58 secs

    barplot(c(T1, T2, T3),
            names.arg = c("1) simple table", "2) parallel", "3) compiled parallel"),
            ylab = "Seconds")


1 answer

Thank you for laying this question out so well. Working efficiently with large survey datasets in R probably requires some basic SQL syntax (which is much easier to learn than R). MonetDB is the only large-data option compatible with the survey package; exploring other high-performance packages will (probably) not be fruitful. Generally, when I explore a huge data set, I write SQL queries directly rather than using the survey package, because the standard-error calculations are computationally intensive (and variances are not that useful during interactive data exploration). Notice how the final SQL timing below blows all of the others away. For a quick weighted mean, use something like "SELECT by_column , SUM( your_column * the_weight ) / SUM( the_weight ) FROM yourdata GROUP BY by_column"
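For instance, using the `db` connection and the `apiclus1` table created in the code further down, a quick weighted mean could look like the sketch below (illustrative only; `api00` is an arbitrary analysis variable chosen for the example):

    # illustrative sketch: weighted mean of api00 by stype, computed directly in SQL
    res_wm <- dbGetQuery(
        db ,
        "SELECT stype , SUM( api00 * pw ) / SUM( pw ) AS wtd_mean
         FROM apiclus1 GROUP BY stype"
    )
    res_wm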

When you need standard errors interactively, linearized designs ( svydesign ) are often slower than replication designs ( svrepdesign ), but constructing the replicate weights (as I did with jk1w_dclus1 below) sometimes requires more familiarity with survey methods than some users are comfortable with.
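(As a side note, and not part of the timings below: the survey package's as.svrepdesign() can convert an existing linearized design straight into a replicate-weight design, which avoids building the jackknife weights by hand. A minimal sketch, assuming the in-memory dclus1 object defined in the code below:)

    # minimal sketch: convert a linearized design to a JK1 replicate-weight design
    jk1_alt <- as.svrepdesign(dclus1, type = "JK1")
    svytotal(~vcount, jk1_alt)   # same point estimate, jackknife standard error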

    # Load packages
    library(MonetDB.R)
    library(MonetDBLite)
    library(DBI)      # suggested in comments and needed on OSX
    library(survey)

    # Load data
    data(api)

    # Replicate each observation 10000 times
    apiclus1 <- apiclus1[rep(seq_len(nrow(apiclus1)), 10000), ]

    # create a count variable
    apiclus1$vcount <- 1

    # create survey design
    dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

    dbfolder <- tempdir()
    db <- dbConnect(MonetDBLite(), dbfolder)
    dbWriteTable(db, 'apiclus1', apiclus1)

    db_dclus1 <-
        svydesign(
            weight = ~pw ,
            id = ~dnum ,
            data = "apiclus1" ,
            dbtype = "MonetDBLite" ,
            dbname = dbfolder ,
            fpc = ~fpc
        )

    # you provided a design without strata,
    # so type="JK1" matches that most closely.
    # but see survey:::as.svrepdesign for other linearization-to-replication options
    jk1w <- jk1weights(psu = apiclus1$dnum, fpc = apiclus1$fpc)

    # after the replicate-weights have been constructed,
    # here's the `svrepdesign` call..
    jk1w_dclus1 <-
        svrepdesign(
            weight = ~pw ,
            type = "JK1" ,
            repweights = jk1w$repweights ,
            combined.weights = FALSE ,
            scale = jk1w$scale ,
            rscales = jk1w$rscales ,
            data = 'apiclus1' ,
            dbtype = "MonetDBLite" ,
            dbname = dbfolder
        )

    # slow
    system.time(res1 <- svyby(~vcount, ~stype + dnum + cname, design = dclus1, svytotal))
    #  user  system elapsed
    # 17.40    2.86   20.27

    # faster
    system.time(res2 <- svyby(~vcount, ~stype + dnum + cname, design = db_dclus1, svytotal))
    #  user  system elapsed
    # 13.00    1.20   14.18

    # fastest
    system.time(res3 <- svyby(~vcount, ~stype + dnum + cname, design = jk1w_dclus1, svytotal))
    #  user  system elapsed
    # 10.75    1.19   11.96

    # same standard errors across the board
    all.equal(SE(res1), SE(res2))
    all.equal(SE(res2), SE(res3))

    # NOTE: the replicate-weighted design will be slightly different
    # for certain designs.  however this technique is defensible
    # and gets used in
    # https://github.com/ajdamico/asdfree/tree/master/Censo%20Demografico

    # at the point you do not care about standard errors,
    # learn some sql:
    system.time(res4 <- dbGetQuery(db,
        "SELECT stype , dnum , cname , SUM( pw ) FROM apiclus1 GROUP BY stype , dnum , cname"))
    #  user  system elapsed
    #  0.15    0.20    0.23
    # because this is near-instantaneous, no matter how much data you have

    # same numbers as res1:
    all.equal(as.numeric(sort(coef(res1))), sort(res4$L1))