Does the by() function build a growing list?

Does the by function create a list that grows one item at a time?

I need to process a data frame with approximately 4M observations, grouped by a factor column. The situation is similar to the example below:

 > # Make 4M rows of data
 > x = data.frame(col1=1:4000000, col2=10000001:14000000)
 > # Make a factor
 > x[,"f"] = x[,"col1"] - x[,"col1"] %% 5
 >
 > head(x)
   col1     col2 f
 1    1 10000001 0
 2    2 10000002 0
 3    3 10000003 0
 4    4 10000004 0
 5    5 10000005 5
 6    6 10000006 5

Now a tapply on one of the columns takes a reasonable amount of time:

 > t1 = Sys.time()
 > z = tapply(x[, 1], x[, "f"], mean)
 > Sys.time() - t1
 Time difference of 22.14491 secs

But if I do this:

 z = by(x[, 1], x[, "f"], mean) 

it doesn't finish in anywhere near the same time (I gave up after a minute).

Of course, in the example above tapply could be used, but I really need to handle several columns together. What is the best way to do this?
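
For example, this is roughly the multi-column call I have in mind (a sketch only; colMeans is just a stand-in for the real per-group computation):

 > # Roughly what I'm after: one call that sees several columns per group
 > z = by(x[, c("col1", "col2")], x[, "f"], colMeans)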

+7
2 answers

by is slower than tapply because it wraps tapply. Let's look at some benchmarks: tapply in this situation is more than 3x faster than using by.

UPDATED to include @Roland's excellent recommendation:

 library(rbenchmark)
 library(data.table)
 dt <- data.table(x, key="f")

 using.tapply <- quote(tapply(x[, 1], x[, "f"], mean))
 using.by     <- quote(by(x[, 1], x[, "f"], mean))
 using.dtable <- quote(dt[, mean(col1), by=key(dt)])

 times <- benchmark(using.tapply, using.dtable, using.by,
                    replications=10, order="relative")
 times[, c("test", "elapsed", "relative")]

 #------------------------#
 #         RESULTS        #
 #------------------------#

 # COMPARING tapply VS by #
 #-----------------------------------
 #           test elapsed relative
 # 1 using.tapply   2.453    1.000
 # 2     using.by   8.889    3.624

 # COMPARING data.table VS tapply VS by #
 #------------------------------------------#
 #           test elapsed relative
 # 2 using.dtable   0.168    1.000
 # 1 using.tapply   2.396   14.262
 # 3     using.by   8.566   50.988

If x$f is a factor, the efficiency loss between tapply and by is even greater!

Note, though, that both improve relative to the non-factor inputs, while data.table stays about the same or gets slightly worse:

 x[, "f"] <- as.factor(x[, "f"]) dt <- data.table(x,key="f") times <- benchmark(using.tapply, using.dtable, using.by, replications=10, order="relative") times[,c("test", "elapsed", "relative")] # test elapsed relative # 2 using.dtable 0.175 1.000 # 1 using.tapply 1.803 10.303 # 3 using.by 7.854 44.880 



As for the how and why, the short answer is in the documentation itself.

?by:

Description

Function by is an object-oriented wrapper for tapply applied to data frames.

Now look at the source for by (or more specifically, by.data.frame):

 > by.data.frame
 function (data, INDICES, FUN, ..., simplify = TRUE)
 {
     if (!is.list(INDICES)) {
         IND <- vector("list", 1L)
         IND[[1L]] <- INDICES
         names(IND) <- deparse(substitute(INDICES))[1L]
     }
     else IND <- INDICES
     FUNx <- function(x) FUN(data[x, , drop = FALSE], ...)
     nd <- nrow(data)
     ans <- eval(substitute(tapply(seq_len(nd), IND, FUNx,
         simplify = simplify)), data)
     attr(ans, "call") <- match.call()
     class(ans) <- "by"
     ans
 }

We see immediately that there is still a call to tapply, plus a lot of extra overhead: FUNx re-subsets the data frame for every group, and there are calls to deparse(substitute(.)) and eval(substitute(.)), which are relatively slow. So it makes sense that your tapply will be relatively faster than a similar by call.
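
To see that the wrapper machinery, not the grouping itself, drives the gap, here is a minimal sketch (reusing x and rbenchmark from above; FUNx and the test labels are just names for this sketch) that mimics what by.data.frame does internally: tapply over row indices with a FUN that re-subsets the data frame for each group:

 # Mimic by.data.frame's inner loop: subset the whole data frame for each
 # group, then compute on the subset -- versus handing tapply the bare vector.
 FUNx <- function(i) mean(x[i, , drop = FALSE][["col1"]])

 benchmark(plain.tapply   = tapply(x[, 1], x[, "f"], mean),
           indexed.tapply = tapply(seq_len(nrow(x)), x[, "f"], FUNx),
           replications = 1, order = "relative")[, c("test", "elapsed", "relative")]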

+4

As for the best way to do this: with 4M rows, you should use data.table .

 library(data.table)
 dt <- data.table(x, key="f")

 dt[, mean(col1), by=key(dt)]
 dt[, list(mean1=mean(col1), mean2=mean(col2)), by=key(dt)]
 dt[, lapply(.SD, mean), by=key(dt)]
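
A small follow-up sketch (same dt as above; .SDcols is a standard data.table argument): if the table carries columns you don't want aggregated, .SDcols restricts which columns .SD passes to each group:

 # Aggregate only col1 and col2 per group; other columns are not copied into .SD
 dt[, lapply(.SD, mean), by=key(dt), .SDcols=c("col1", "col2")]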
+3
