Aggregation with data.table in R

The task is to aggregate a numeric vector of values by a combination of factors with data.table in R. As an example, take the following data table:

    library(data.table)
    library(plyr)
    dtb <- data.table(cbind(expand.grid(month = rep(month.abb[1:3], each = 3), fac = letters[1:3]), value = rnorm(27)))

Note that each unique combination of "month" and "fac" appears three times. Therefore, when I average the values by these two factors, I expect a data frame with 9 unique rows:

    (agg1 <- ddply(dtb, c("month", "fac"), function(dfr) mean(dfr$value)))
      month fac          V1
    1   Jan   a -0.36030953
    2   Jan   b -0.58444588
    3   Jan   c -0.15472876
    4   Feb   a -0.05674483
    5   Feb   b  0.26415972
    6   Feb   c -1.62346772
    7   Mar   a  0.24560510
    8   Mar   b  0.82548140
    9   Mar   c  0.18721114

However, when I aggregate with data.table, I keep getting a result row for every redundant occurrence of each two-factor combination:

    (agg2 <- dtb[, value := mean(value), by = list(month, fac)])
        month fac       value
     1:   Jan   a -0.36030953
     2:   Jan   a -0.36030953
     3:   Jan   a -0.36030953
     4:   Feb   a -0.05674483
     5:   Feb   a -0.05674483
     6:   Feb   a -0.05674483
     7:   Mar   a  0.24560510
     8:   Mar   a  0.24560510
     9:   Mar   a  0.24560510
    10:   Jan   b -0.58444588
    11:   Jan   b -0.58444588
    12:   Jan   b -0.58444588
    13:   Feb   b  0.26415972
    14:   Feb   b  0.26415972
    15:   Feb   b  0.26415972
    16:   Mar   b  0.82548140
    17:   Mar   b  0.82548140
    18:   Mar   b  0.82548140
    19:   Jan   c -0.15472876
    20:   Jan   c -0.15472876
    21:   Jan   c -0.15472876
    22:   Feb   c -1.62346772
    23:   Feb   c -1.62346772
    24:   Feb   c -1.62346772
    25:   Mar   c  0.18721114
    26:   Mar   c  0.18721114
    27:   Mar   c  0.18721114
        month fac       value

Is there an elegant way to reduce these results to a single row per unique combination of factors with data.table?

2 answers

The issue (and the reasoning behind it) comes down to the fact that the aggregated value is being assigned with :=, not simply computed.

It’s easier to see this in action if you look at a data table with more columns than just the ones used for the calculation.

    # So, let's add a new column
    dtb[, newCol := LETTERS[seq(length(value))]]

Note that if we just want to output the computed value, then the expression on the RHS, as you have it, is just fine.

    # This gives the expected results
    dtb[, mean(value), by = list(month, fac)]

    # This, on the other hand, assigns the respective values to *each* row
    dtb[, value := mean(value), by = list(month, fac)]

In other words, without := the data is collapsed so that only one value per group is returned.
However, if you want to save this value back into the SAME data table (which is what happens when you use the := operator), then all rows identified in i (all rows by default) are assigned a value. This makes sense when you look at the output with the extra column.

Assigning this data.table to agg2 then still carries along all of the rows.
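To make the difference concrete, here is a minimal sketch that rebuilds sample data shaped like the question's and compares the row counts of the two forms. The seed and the name `agg` are my own additions for illustration; the rest follows the calls already shown.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
dtb <- data.table(expand.grid(month = rep(month.abb[1:3], each = 3),
                              fac = letters[1:3]),
                  value = rnorm(27))

# Computing in j without := returns a new table with one row per group
agg <- dtb[, list(value = mean(value)), by = list(month, fac)]
nrow(agg)   # 9

# Assigning with := updates dtb in place and keeps every original row
agg2 <- dtb[, value := mean(value), by = list(month, fac)]
nrow(agg2)  # 27
```

The row count is the quickest check: 9 means the result was collapsed to one row per group, 27 means the group mean was written back onto every original row.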

Therefore, if you want to copy to a new table only those rows from your original table that are unique, you can

 a. wrap the original table inside `unique()` before assigning it
 b. assign the table, above, that is returned when you are not assigning the RHS output (which is what @Arun suggested)

An example of option a. would be:

  agg2 <- unique(dtb[, value := mean (value), by = list (month, fac)]) 
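One caveat with this form: because := works by reference, the line above also overwrites the value column of dtb itself. If the original values are still needed, one possible workaround is to run the assignment on a copy(). The sketch below assumes sample data shaped like the question's; the seed and the names `original`, `agg_a` and `agg_b` are hypothetical.

```r
library(data.table)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible
dtb <- data.table(expand.grid(month = rep(month.abb[1:3], each = 3),
                              fac = letters[1:3]),
                  value = rnorm(27))
original <- copy(dtb)  # untouched copy, kept for comparison

# Option a, run on a copy so that dtb itself is not modified
agg_a <- unique(copy(dtb)[, value := mean(value), by = list(month, fac)])

# Option b: aggregate directly, no assignment involved
agg_b <- dtb[, list(value = mean(value)), by = list(month, fac)]

nrow(agg_a)  # 9
nrow(agg_b)  # 9
all.equal(dtb$value, original$value)  # dtb still holds the raw values
```

Both options end up with one row per month/fac combination here; option b simply avoids the detour through assignment.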

The following example may help illustrate.

(You will need to copy + paste this, as the output is omitted)

    # SAMPLE DATA, as above
    library(data.table)
    dtb.bak <- data.table(expand.grid(month = rep(month.abb[1:3], each = 3), fac = letters[1:3]), value = rnorm(27))

    # METHOD 1 #
    #------------#
    dtb <- copy(dtb.bak)  # restore, from sample data.
    dtb[, value := mean(value), by = list(month, fac)]
    dtb
    # this is what you would like to assign
    unique(dtb)

    # METHOD 2 #
    #------------#
    dtb <- copy(dtb.bak)  # restore, from sample data.
    # this is what you would like to assign
    # next two lines are the same, only difference is column name
    dtb[, mean(value), by = list(month, fac)]
    dtb[, list("mean" = mean(value)), by = list(month, fac)]  # quote marks added for clarity
    # dtb is unchanged.
    dtb

    # NOW COMPARE THE SAME TWO METHODS, BUT IF THERE IS AN ADDITIONAL COLUMN
    dtb.bak[, newCol := rep(c("A", "B", "A"), length(value)/3)]
    dtb1 <- copy(dtb.bak)  # restore, from sample data.
    dtb2 <- copy(dtb.bak)  # restore, from sample data.

    # METHOD 1 #
    dtb1[, value := mean(value), by = list(month, fac)]
    dtb1
    unique(dtb1)

    # METHOD 2 #
    dtb2[, list("mean" = mean(value)), by = list(month, fac)]  # quote marks added for clarity
    dtb2

    # METHOD 2, WITH ADDED COLUMNS IN list() in `j`
    dtb2[, list("mean" = mean(value), newCol), by = list(month, fac)]  # quote marks added for clarity
    # notice this has more rows than unique(dtb1)

You should use:

 agg2 <- dtb[, list(value = mean(value)), by = list (month, fac)] 

:= recycles the values of the RHS to match the number of elements in the LHS. Run ?':=' to find out more about this.
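A tiny sketch of that recycling, using a hypothetical three-row toy table (the names `dt`, `g`, `v` and `m` are made up for illustration and are not from the question):

```r
library(data.table)

# Hypothetical toy table: group "x" has two rows, group "y" has one
dt <- data.table(g = c("x", "x", "y"), v = c(1, 3, 10))

# mean(v) is a single number per group; := recycles it across the group's rows
dt[, m := mean(v), by = g]
dt$m  # 2 2 10
```

The table still has three rows: the per-group mean is repeated for each row of its group, which is the behaviour the question ran into.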
