I ran into a problem while trying to figure out a way to do an aggregation with dplyr in R, but for some reason could not come up with a solution (although I suspect it should be fairly easy).
I have a dataset as follows:
structure(list(date = structure(c(16431, 16431, 16431, 16432, 16432, 16432, 16433, 16433, 16433), class = "Date"), colour = structure(c(3L, 1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L), .Label = c("blue", "green", "red"), class = "factor"), shape = structure(c(2L, 2L, 3L, 3L, 3L, 2L, 1L, 1L, 1L), .Label = c("circle", "square", "triangle" ), class = "factor"), value = c(100, 130, 100, 180, 125, 190, 120, 100, 140)), .Names = c("date", "colour", "shape", "value" ), row.names = c(NA, -9L), class = "data.frame")
which is as follows:
        date colour    shape value
1 2014-12-27    red   square   100
2 2014-12-27   blue   square   130
3 2014-12-27   blue triangle   100
4 2014-12-28  green triangle   180
5 2014-12-28  green triangle   125
6 2014-12-28    red   square   190
7 2014-12-29    red   circle   120
8 2014-12-29   blue   circle   100
9 2014-12-29   blue   circle   140
My goal is to calculate, per day, the most frequent colour, the most frequent shape, and the average value. My expected result is as follows:
        date colour    shape value
1 27/12/2014   blue   square   110
2 28/12/2014  green triangle   165
3 29/12/2014   blue   circle   120
I ended up solving this with split, writing my own function to compute the above for each data.frame, and then using snow::clusterApply for the parallel work. It was quite efficient (my original dataset is about 10 million rows), but I am wondering whether the same result can be achieved within a single dplyr chain. Efficiency is really important here, so being able to run it in the same chain matters a lot.
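For reference, one way this might look in a single dplyr chain is the sketch below. It assumes a small helper, here called `most_frequent` (a name I made up, not part of dplyr), that returns the statistical mode of a factor via `table`; note that ties are resolved in favour of the first factor level, which may or may not be what you want.

```r
library(dplyr)

df <- structure(list(
  date = structure(c(16431, 16431, 16431, 16432, 16432, 16432,
                     16433, 16433, 16433), class = "Date"),
  colour = structure(c(3L, 1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L),
                     .Label = c("blue", "green", "red"), class = "factor"),
  shape = structure(c(2L, 2L, 3L, 3L, 3L, 2L, 1L, 1L, 1L),
                    .Label = c("circle", "square", "triangle"), class = "factor"),
  value = c(100, 130, 100, 180, 125, 190, 120, 100, 140)),
  .Names = c("date", "colour", "shape", "value"),
  row.names = c(NA, -9L), class = "data.frame")

# Hypothetical helper: most frequent level of a factor (ties -> first level)
most_frequent <- function(x) names(which.max(table(x)))

result <- df %>%
  group_by(date) %>%
  summarise(colour = most_frequent(colour),
            shape  = most_frequent(shape),
            value  = mean(value))
```

Whether this is fast enough at 10 million rows is something you would need to benchmark; `table` inside `summarise` is not the cheapest mode implementation.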