Join two tables and modify a column in one step

Background

I am new to the data.table package and am currently learning how to use it efficiently. I have two tables: first I want to aggregate the second, then join it with the first, and then modify a column in the joined table. Ideally (and for my own understanding) all of this happens in one step.

Package version

sessionInfo()
# R version 3.1.0 (2014-04-10)
# Platform: i386-w64-mingw32/i386 (32-bit)

# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     

# other attached packages:
# [1] data.table_1.9.4

# loaded via a namespace (and not attached):
# [1] chron_2.3-45  plyr_1.8.1    Rcpp_0.11.2   reshape2_1.4  stringr_0.6.2
# [6] tools_3.1.0    

Code

What I tried can be seen in this minimal example:

library(data.table)
set.seed(1)
DT1 <- data.table(id = LETTERS[1:4], x = rnorm(4), key = "id")
DT2 <- data.table(id = rep(LETTERS[1:4], each = 3), y = 1:12, z = rep(1, 12), key = "id")
DT1[DT2[, lapply(.SD, mean), by = "id"]] # simple join works fine
#    id         x  y z
# 1:  A -0.6264538  2 1
# 2:  B  0.1836433  5 1
# 3:  C -0.8356286  8 1
# 4:  D  1.5952808 11 1

# however, adding a 'j' argument does not work
DT1[DT2[, lapply(.SD, mean), by = "id"], x := -x] # (1)

# in fact the above statement changes the 'x' column in 'DT1':
DT1
#    id          x
# 1:  A  0.6264538
# 2:  B -0.1836433
# 3:  C  0.8356286
# 4:  D -1.5952808  

I suppose this has something to do with the clever way data.table handles data (it avoids making copies unless necessary, so it modifies objects by reference). Consequently, the following code works:

DT3 <- copy(DT1[DT2[, lapply(.SD, mean), by = "id"]])[, x := -x]
(DT4 <- DT1[DT2[, lapply(.SD, mean), by = "id"]][, x := -x]) # (2)
#    id          x  y z
# 1:  A -0.6264538  2 1
# 2:  B  0.1836433  5 1
# 3:  C -0.8356286  8 1
# 4:  D  1.5952808 11 1
identical(DT3, DT4)
# [1] TRUE
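
To illustrate the by-reference behaviour mentioned above, here is a minimal sketch. The names DT, alias and snap are made up for this illustration and do not appear in the question; the point is only that plain assignment does not copy a data.table, whereas copy() does.

```r
library(data.table)

DT    <- data.table(a = 1:3)
alias <- DT        # plain assignment: both names refer to the same table
snap  <- copy(DT)  # copy() makes an independent deep copy

DT[, a := a * 10L] # `:=` modifies the table in place

alias$a            # reflects the update made through DT
snap$a             # still holds the original values
```

This is why an explicit copy() is only needed when the original table must stay untouched.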

Questions

  • What is the best way to do this? "Best" in terms of time and memory?
  • Why does (1) not work, while (2) does?
Answers

(1)

DT1[DT2[, lapply(.SD, mean), by = "id"], x := -x] # (1)

Here DT1 is modified in place by x := -x; DT2[, ...] is only used as the i argument, to select the rows to update.

(2)

 DT3 <- DT1[DT2[, lapply(.SD, mean), by = "id"]][, x := -x]

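Here the join creates a new data.table, and the second [ applies x := -x to that result, so DT1 itself is left unchanged.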
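
The difference between the two forms can be checked with a short sketch. It reuses the tables from the question; the helper names agg, orig and res are introduced here only for illustration. The last statement is an additional variant, not taken from the answer above: it assumes a data.table version in which the i. prefix can be used in j to refer to columns of the table passed as i.

```r
library(data.table)

set.seed(1)
DT1 <- data.table(id = LETTERS[1:4], x = rnorm(4), key = "id")
DT2 <- data.table(id = rep(LETTERS[1:4], each = 3), y = 1:12, z = rep(1, 12), key = "id")
agg  <- DT2[, lapply(.SD, mean), by = id]  # one row per id with mean y and z
orig <- copy(DT1)                          # snapshot for comparison

# (1) `:=` inside the join updates DT1 itself; agg only selects the rows.
DT1[agg, x := -x]
names(DT1)                                 # still only "id" and "x"

# (2) The join builds a new table; the chained `[` modifies only that result.
DT1 <- copy(orig)
res <- DT1[agg][, x := -x]
names(res)                                 # "id" "x" "y" "z"

# Variant: pull y and z into DT1 by reference, without a separate joined table.
DT1[agg, `:=`(x = -x, y = i.y, z = i.z)]
```

Whether the variant is preferable depends on whether DT1 may be changed: it avoids allocating a joined table, but it overwrites DT1 rather than producing a new object.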


An alternative using dplyr:

library("dplyr")

set.seed(1)
# tibble() has no `key` argument; passing key = "id" would only add a column named "key"
DT1 <- tibble(id = LETTERS[1:4], x = rnorm(4))
DT2 <- tibble(id = rep(LETTERS[1:4], each = 3), y = 1:12, z = rep(1, 12))

DT2 %>%
  group_by(id) %>%
  summarise(across(y:z, mean)) %>%  # mean of y and z per id (needs dplyr >= 1.0)
  left_join(DT1, by = "id") %>%     # attach x from DT1
  mutate(x = -x)

(Note that this uses dplyr rather than data.table.)
