dplyr: colSums on grouped (group_by) data frames, elegantly

I have a very large data frame (265,874 x 30) with three grouping variables: age category (1-6), dates (5479 distinct values) and geographic locality (4 total). Each record consists of a value for each of these plus 27 count variables. I want to group by each of the grouping variables, then take colSums over the resulting sub-grouped 27 count variables. I am trying to use dplyr (v0.2) for this, because doing it manually means setting up a lot of redundant code (or resorting to a loop over the grouping values, for lack of an elegant solution).

Code example:

    # Toy data: 200 observations with a date, locality, age category and 10 count columns
    countData <- sample(0:10, 2000, replace = TRUE)
    dates <- sample(seq(as.Date("2010/1/1"), as.Date("2010/01/30"), "days"), 200, replace = TRUE)
    locality <- sample(1:2, 200, replace = TRUE)
    ageCat <- sample(1:2, 200, replace = TRUE)
    sampleDF <- data.frame(dates, locality, ageCat, matrix(countData, nrow = 200, ncol = 10))

What I would like to do is:

 library("dplyr") sampleDF %.% group_by(locality, ageCat, dates) %.% do(colSums(.[, -(1:3)])) 

but this doesn't quite work, since the result returned by colSums() is not a data frame. If I coerce it, it works:

    sampleDF %.%
      group_by(locality, ageCat, dates) %.%
      do(data.frame(matrix(colSums(.[, -(1:3)]), nrow = 1, ncol = 10)))

but the final do(...) bit seems very awkward.

Any thoughts on how to do this more elegantly or efficiently? I suppose the question boils down to: what is the best way to use do() and the . operator to summarise a data frame via colSums?

Note: the do(.) operator only applies to dplyr 0.2, so you need to get it from GitHub ( link ), not from CRAN.
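If it helps, installing the development version from GitHub at the time typically looked something like this (a minimal sketch, assuming the devtools package is installed and that the repository lives under hadley/dplyr):

    # Assumes devtools is installed; pulls dplyr from GitHub instead of CRAN
    # install.packages("devtools")
    devtools::install_github("hadley/dplyr")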

Edit: results from the suggestions

Three solutions:

  • My suggestion in the original post: 146.765 seconds elapsed.

  • @joran's suggestion below: 6.902 seconds.

  • @eddi's suggestion in the comments, using data.table: 6.715 seconds.

I did not run proper replications; I just used system.time() to get a rough calibration. It appears that dplyr and data.table perform about the same on my data set, and both are significantly faster, when used correctly, than the hack solution I came up with yesterday.
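@eddi's exact comment is not reproduced here, but the usual data.table idiom for group-wise column sums, timed in the same rough way with system.time(), would look something like this (a sketch, assuming a direct conversion of sampleDF):

    library(data.table)

    # Convert, then sum every non-grouping column (.SD) within each group
    sampleDT <- as.data.table(sampleDF)
    system.time(
      result <- sampleDT[, lapply(.SD, sum), by = list(locality, ageCat, dates)]
    )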

r dplyr
1 answer

Unless I'm missing something, this looks like a job for summarise_each (a sort of colwise analogue from plyr):

    sampleDF %.%
      group_by(locality, ageCat, dates) %.%
      summarise_each(funs(sum))

The grouping columns are excluded from the summarising function by default, and you can select only a subset of the columns to apply the function to, using the same syntax as select().
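For example, to sum only some of the count columns (a sketch, assuming the default X1 through X10 column names created by the matrix() call in the question's sample data):

    # Sum only the first three count columns; the grouping columns are carried along automatically
    sampleDF %.%
      group_by(locality, ageCat, dates) %.%
      summarise_each(funs(sum), X1:X3)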

(summarise_each is in dplyr version 0.2, but not in 0.1.3, as far as I know.)
