Note: this answer was edited to change the function being computed from rowSums to colSums (using lapply(.SD, sum) in the data.table case).
I do not think you will get the result faster than with data.table. Here is a benchmark of plyr against data.table. Of course, if the time-consuming part is your own function, you can use doMC to run plyr in parallel (assuming you have many cores or a cluster to work on); a sketch of this follows the benchmark below. Otherwise, I would stick with data.table. Here's an analysis with large test data and a dummy function:
# create a huge data.frame with repeating id values
len  <- 1e5
reps <- sample(1:20, len, replace = TRUE)
x <- data.frame(id = rep(1:len, reps))
x <- transform(x, v1 = rnorm(nrow(x)), v2 = rnorm(nrow(x)))

> nrow(x)
[1] 1048534    # ~1 million rows

# construct functions for data.table and plyr

# method 1: using data.table
DATA.TABLE <- function() {
    require(data.table)
    x.dt <- data.table(x, key = "id")
    x.dt.out <- x.dt[, lapply(.SD, sum), by = id]
}

# method 2: using plyr
PLYR <- function() {
    require(plyr)
    x.plyr.out <- ddply(x, .(id), colSums)
}

# benchmark them
> require(rbenchmark)
> benchmark(DATA.TABLE(), PLYR(), order = "elapsed", replications = 1)[1:5]
          test replications elapsed relative user.self
1 DATA.TABLE()            1   1.006    1.000     0.992
2       PLYR()            1  67.755   67.351    67.688
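If your real per-group function is the expensive part, plyr can be parallelised with doMC as mentioned above. This is only a minimal sketch, assuming a multi-core Unix-like machine and the x created above; the core count of 4 is just an example:

require(plyr)
require(doMC)
registerDoMC(cores = 4)   # register 4 worker cores (adjust to your machine)
# .parallel = TRUE tells ddply to dispatch the per-id pieces to the registered backend
x.plyr.par <- ddply(x, .(id), colSums, .parallel = TRUE)

Note that doMC relies on forking, so it works on Linux/macOS but not on Windows, and for very cheap per-group functions the parallel overhead can outweigh the gain.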
On a data.frame with about 1 million rows, data.table finishes in roughly 1 second (0.992 s user time). The speed-up of data.table over plyr (admittedly, just for computing column sums here) is about 68x. Depending on how long your own function takes, this speed-up will vary, but data.table will still be faster. plyr implements the split-apply-combine strategy; I do not think you would get a comparable speed-up by doing the split, apply and combine steps yourself with base R, but of course you can try (a sketch follows).
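For reference, a rough base-R version of the same split-apply-combine computation might look like the following. This is a sketch rather than part of the original benchmark, and it assumes the x defined above:

# split the value columns by id, sum each piece, then bind the results back together
pieces <- split(x[, c("v1", "v2")], x$id)
out <- do.call(rbind, lapply(pieces, colSums))   # matrix with one row per id (rownames are the ids)

Timing this yourself against the two functions above is the easiest way to see where it falls.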
I also ran the code with 10 million rows: data.table finished in 5.893 seconds, while plyr took 6300 seconds.