How to speed up summarise and ddply?

I have a data frame with 2 million rows and 15 columns. I want to group by 3 of these columns with ddply (all 3 are factors, and there are 780,000 unique combinations of these factors) and get a weighted mean of 3 columns (with weights defined by my data set). The following is fast enough:

    system.time(a2 <- aggregate(cbind(col1, col2, col3) ~ fac1 + fac2 + fac3,
                                data = aggdf, FUN = mean))
       user  system elapsed
     91.358   4.747 115.727

The problem is that I want to use weighted.mean instead of mean to compute my aggregated columns.

If I try the following ddply on the same data frame (note the conversion to an immutable data frame with idata.frame), it still has not finished after 20 minutes:

    x <- ddply(idata.frame(aggdf), c("fac1", "fac2", "fac3"), summarise,
               w    = sum(w),
               col1 = weighted.mean(col1, w),
               col2 = weighted.mean(col2, w),
               col3 = weighted.mean(col3, w))

This operation appears to be CPU-hungry, but not very RAM-intensive.

EDIT: So I wrote this little function, which "cheats" a bit by exploiting a property of the weighted mean (weighted.mean(x, w) is just sum(x * w) / sum(w)), so the multiplication and division can be done on the whole object rather than on each slice.

    weighted_mean_cols <- function(df, bycols, aggcols, weightcol) {
      # pre-multiply the value columns by the weights
      df[, aggcols] <- df[, aggcols] * df[, weightcol]
      # sum the weights and the weighted values within each group
      df <- aggregate(df[, c(weightcol, aggcols)], by = as.list(df[, bycols]), sum)
      # divide the per-group weighted sums by the per-group weight sums
      df[, aggcols] <- df[, aggcols] / df[, weightcol]
      df
    }

When I run it as:

    a2 <- weighted_mean_cols(aggdf, c("fac1", "fac2", "fac3"),
                             c("col1", "col2", "col3"), "w")

I get good performance and somewhat reusable, elegant code.
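As a quick sanity check (purely illustrative, not part of the timings above), the function agrees with a direct weighted.mean call on a tiny toy data frame:

    # Toy data (illustrative only): two grouping columns, one value column, weights.
    toy <- data.frame(fac1 = c("a", "a", "b"),
                      fac2 = c("x", "x", "y"),
                      col1 = c(1, 3, 5),
                      w    = c(1, 3, 2))
    weighted_mean_cols(toy, c("fac1", "fac2"), "col1", "w")
    #   fac1 fac2 w col1
    # 1    a    x 4  2.5
    # 2    b    y 2  5.0
    weighted.mean(c(1, 3), c(1, 3))   # 2.5 for group (a, x) -- matches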

+7
2 answers

If you're going to use your edit, why not use rowsum and save yourself a few minutes of execution time?

    nr <- 2e6
    nc <- 3
    aggdf <- data.frame(matrix(rnorm(nr*nc), nr, nc),
                        matrix(sample(100, nr*nc, TRUE), nr, nc),
                        rnorm(nr))
    colnames(aggdf) <- c("col1", "col2", "col3", "fac1", "fac2", "fac3", "w")
    system.time({
      aggsums <- rowsum(data.frame(aggdf[, c("col1", "col2", "col3")] * aggdf$w,
                                   w = aggdf$w),
                        interaction(aggdf[, c("fac1", "fac2", "fac3")]))
      agg_wtd_mean <- aggsums[, 1:3] / aggsums[, 4]
    })
    #    user  system elapsed
    #   16.21    0.77   16.99
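Note that rowsum drops the grouping columns and leaves the interaction levels as the row names of the result. One possible way to recover them as columns (a sketch, assuming the factor levels themselves contain no ".", the default separator used by interaction):

    # Sketch: split the "fac1.fac2.fac3" row names back into grouping columns.
    # Assumes no factor level contains "." itself.
    keys <- do.call(rbind, strsplit(rownames(aggsums), ".", fixed = TRUE))
    agg_wtd_mean <- data.frame(keys, agg_wtd_mean)
    colnames(agg_wtd_mean)[1:3] <- c("fac1", "fac2", "fac3")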
+2

Although ddply is hard to beat for elegance and ease of code, I find that for big data, tapply is much faster. In your case, I would use something along the lines of

 do.call("cbind", list((w <- tapply(..)), tapply(..))) 

Sorry for the dots and for possibly misunderstanding the question; but I'm in a bit of a rush and must catch a bus in about minus five minutes!
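Reading the dots above as placeholders, a fleshed-out version of that idea (a sketch only, not the answerer's actual code) might combine tapply with the same pre-multiplication trick from the question:

    # Sketch: per-group weighted means via tapply, reusing the pre-multiplication trick.
    grp  <- interaction(aggdf$fac1, aggdf$fac2, aggdf$fac3, drop = TRUE)
    wsum <- tapply(aggdf$w, grp, sum)                        # summed weights per group
    a3 <- do.call("cbind", list(
      w    = wsum,
      col1 = tapply(aggdf$col1 * aggdf$w, grp, sum) / wsum,  # weighted mean of col1
      col2 = tapply(aggdf$col2 * aggdf$w, grp, sum) / wsum,
      col3 = tapply(aggdf$col3 * aggdf$w, grp, sum) / wsum
    ))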

+5
