I have a very large data frame (265,874 x 30) with three sensible grouping variables: an age category (1-6), dates (5,479 of them) and geographic locality (4 in total). Each row consists of one value of each of these plus 27 count variables. I want to group by each of the grouping variables, then take colSums over the remaining 27 variables within each subgroup. I am trying to use dplyr (v0.2) for this, because doing it manually ends up setting up a lot of redundant things (or resorting to a loop over the grouping parameters, for lack of an elegant solution).
Example code:
    countData <- sample(0:10, 2000, replace = TRUE)
    dates <- sample(seq(as.Date("2010/1/1"), as.Date("2010/01/30"), "days"), 200, replace = TRUE)
    locality <- sample(1:2, 2000, replace = TRUE)
    ageCat <- sample(1:2, 2000, replace = TRUE)
    sampleDF <- data.frame(dates, locality, ageCat, matrix(countData, nrow = 200, ncol = 10))
What I would like to do is:
    library("dplyr")
    sampleDF %.%
      group_by(locality, ageCat, dates) %.%
      do(colSums(.[, -(1:3)]))
but this doesn't quite work, since colSums() does not return a data frame. If I cast the result to one, it works:
    sampleDF %.%
      group_by(locality, ageCat, dates) %.%
      do(data.frame(matrix(colSums(.[, -(1:3)]), nrow = 1, ncol = 10)))
but the final do(...) bit seems very clunky.
Any thoughts on how to do this more elegantly or efficiently? I think the question boils down to: what is the best way to use the do() function and the . operator to summarize a data frame via colSums?
Note: the do(.) operator only applies to dplyr 0.2, so you need to get it from GitHub (link), not from CRAN.
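To make the target result concrete, here is a minimal base R sketch using aggregate(). It is only a version-independent reference for checking whatever dplyr returns, not a proposed answer. The object names (refDF, refSums) and the smaller 200-row sample are made up for illustration.

```r
# Hypothetical 200-row sample with the same layout as sampleDF above:
# three grouping columns followed by ten count columns (X1..X10).
set.seed(1)
n <- 200
refDF <- data.frame(
  dates    = sample(seq(as.Date("2010-01-01"), as.Date("2010-01-30"), by = "day"),
                    n, replace = TRUE),
  locality = sample(1:2, n, replace = TRUE),
  ageCat   = sample(1:2, n, replace = TRUE),
  matrix(sample(0:10, n * 10, replace = TRUE), nrow = n, ncol = 10)
)

# "." on the left-hand side means "every column not used for grouping",
# so each count column is summed within each locality/ageCat/dates group.
refSums <- aggregate(. ~ locality + ageCat + dates, data = refDF, FUN = sum)
head(refSums)
```

The result has one row per observed group and one summed column per count variable, which is the shape I am after from the dplyr pipeline.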
Edit: results from the suggestions
Three solutions:
My suggestion in the post: 146.765 seconds elapsed.
@joran's suggestion below: 6.902 seconds.
@eddi's suggestion in the comments, using data.table: 6.715 seconds.
I didn't bother to replicate the runs; I just used system.time() to get a rough gauge. From the looks of it, dplyr and data.table perform about the same on my data set, and both are significantly faster, when used properly, than the hack solution I came up with yesterday.
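For anyone wanting to repeat this kind of rough timing, below is a sketch of a grouped column sum in data.table wrapped in system.time(). It assumes the data.table package is installed and uses a small made-up sample (timeDF, timeDT and grpSums are illustrative names). It shows the common lapply(.SD, sum) idiom and is not necessarily the exact code from the comments, so the timings above should not be read as coming from it.

```r
library(data.table)

# Hypothetical small sample: three grouping columns plus ten count columns.
set.seed(1)
n <- 200
timeDF <- data.frame(
  dates    = sample(seq(as.Date("2010-01-01"), as.Date("2010-01-30"), by = "day"),
                    n, replace = TRUE),
  locality = sample(1:2, n, replace = TRUE),
  ageCat   = sample(1:2, n, replace = TRUE),
  matrix(sample(0:10, n * 10, replace = TRUE), nrow = n, ncol = 10)
)
timeDT <- as.data.table(timeDF)

# .SD holds every column not named in `by`, so lapply(.SD, sum) sums each
# count column within each group. system.time() gives a single rough timing.
timing <- system.time(
  grpSums <- timeDT[, lapply(.SD, sum), by = .(locality, ageCat, dates)]
)
timing["elapsed"]
```

A single system.time() call like this is only a coarse gauge; repeated runs would be needed for a proper comparison.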