Creating a new column in an R data.table based on values in another column and grouping

Question


I have a data.table with dates, zip codes, and purchase amounts.

    library(data.table)
    set.seed(88)
    DT <- data.table(date = Sys.Date()-365 + sort(sample(1:100, 10)),
                     zip = sample(c("2000", "1150", "3000"), 10, replace = TRUE),
                     purchaseAmount = sample(1:20, 10))

This creates the following:

              date  zip purchaseAmount
     1: 2016-01-08 1150              5
     2: 2016-01-15 3000             15
     3: 2016-02-15 1150             16
     4: 2016-02-20 2000             18
     5: 2016-03-07 2000             19
     6: 2016-03-15 2000             11
     7: 2016-03-17 2000              6
     8: 2016-04-02 1150             17
     9: 2016-04-08 3000              7
    10: 2016-04-09 3000             20

I would like to add a fourth column, earlierPurchases. This column should sum all the values in purchaseAmount over the previous x days within the same zip code.

EDIT: As suggested by Frank, here is the expected result:

              date  zip purchaseAmount new_col
     1: 2016-01-08 1150              5       5
     2: 2016-01-15 3000             15      15
     3: 2016-02-15 1150             16      16
     4: 2016-02-20 2000             18      18
     5: 2016-03-07 2000             19      19
     6: 2016-03-15 2000             11      30
     7: 2016-03-17 2000              6      36
     8: 2016-04-02 1150             17      17
     9: 2016-04-08 3000              7       7
    10: 2016-04-09 3000             20      27

Is there a way to do this in data.table, or should I just write a function with a loop?

+7

r data.table

Mantelimies Jan 03 '17 at 19:00


2 answers

I did not find a data.table solution; here is how I approached it:

    library(dplyr)

    earlierPurchases <- vector()
    for (i in 1:nrow(DT)) {
      # rows in the same zip with a strictly earlier date
      temp <- dplyr::filter(DT, zip == zip[i] & date < date[i])
      earlierPurchases[i] <- sum(temp$purchaseAmount)
    }
    DT <- cbind(DT, earlierPurchases)

It worked pretty fast.

-1

Derek Corcoran Jan 03 '17 at 19:13


Frank · Accepted Answer · 2017-01-03T19:23:27+0000

This works:

    DT[, new_col :=
      DT[.(zip = zip, d0 = date - 10, d1 = date), on=.(zip, date >= d0, date <= d1),
        sum(purchaseAmount)
      , by=.EACHI ]$V1
    ]

              date  zip purchaseAmount new_col
     1: 2016-01-08 1150              5       5
     2: 2016-01-15 3000             15      15
     3: 2016-02-15 1150             16      16
     4: 2016-02-20 2000             18      18
     5: 2016-03-07 2000             19      19
     6: 2016-03-15 2000             11      30
     7: 2016-03-17 2000              6      36
     8: 2016-04-02 1150             17      17
     9: 2016-04-08 3000              7       7
    10: 2016-04-09 3000             20      27

This uses a "non-equi" join: it takes each row, finds all rows that meet the criteria in the on= expression for that row, and then sums per row (by=.EACHI). In this case, a non-equi join is probably less efficient than a rolling-sum approach.
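To make the per-row logic concrete, here is a minimal sketch that checks the join against a plain loop. It assumes a hard-coded copy of the question's table (the original uses Sys.Date(), so its dates shift over time), and the names `join_sum` and `loop_sum` are illustrative only. The 10-day window comes from the `date - 10` in the code above.

```r
library(data.table)

# Hard-coded copy of the question's table (assumption: fixed dates for reproducibility)
DT <- data.table(
  date = as.Date(c("2016-01-08", "2016-01-15", "2016-02-15", "2016-02-20",
                   "2016-03-07", "2016-03-15", "2016-03-17", "2016-04-02",
                   "2016-04-08", "2016-04-09")),
  zip = c("1150", "3000", "1150", "2000", "2000", "2000", "2000",
          "1150", "3000", "3000"),
  purchaseAmount = c(5, 15, 16, 18, 19, 11, 6, 17, 7, 20)
)

# Non-equi join: for each row, sum purchases in the same zip
# dated within the 10 days up to and including that row's date
join_sum <- DT[.(zip = zip, d0 = date - 10, d1 = date),
               on = .(zip, date >= d0, date <= d1),
               sum(purchaseAmount), by = .EACHI]$V1

# The same computation written as an explicit loop, for comparison
loop_sum <- sapply(seq_len(nrow(DT)), function(i) {
  in_window <- DT$zip == DT$zip[i] &
    DT$date >= DT$date[i] - 10 &
    DT$date <= DT$date[i]
  sum(DT$purchaseAmount[in_window])
})

# Both approaches should agree row by row
stopifnot(isTRUE(all.equal(join_sum, loop_sum)))
```

The loop is easier to read but scans the whole table once per row; the join expresses the same window condition declaratively in `on=`.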

How it works.

To add a column to a data.table, the usual syntax is DT[, new_col := expression]. Here, the expression also works on its own, outside of DT[...]. Try running it yourself:

    DT[.(zip = zip, d0 = date - 10, d1 = date), on=.(zip, date >= d0, date <= d1),
      sum(purchaseAmount)
    , by=.EACHI ]$V1

You can simplify it step by step until it is just the join:

    DT[.(zip = zip, d0 = date - 10, d1 = date), on=.(zip, date >= d0, date <= d1),
      sum(purchaseAmount)
    , by=.EACHI ]
    # note that V1 is the default name for computed columns

    DT[.(zip = zip, d0 = date - 10, d1 = date), on=.(zip, date >= d0, date <= d1)]
    # now we're down to just the join

The join syntax has the form x[i, on=.(xcol = icol, xcol2 < icol2)], as described on the help page that opens when you type ?data.table into the R console with the data.table package loaded.
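As a rough illustration of that x[i, on=...] form, the sketch below uses two small tables. The tables and the names `x`, `i`, `id`, `val`, and `upper` are hypothetical and not from the question. For each row of `i`, it counts the rows of `x` that satisfy both conditions.

```r
library(data.table)

# Hypothetical toy tables, not taken from the question
x <- data.table(id = c("a", "a", "a", "b"), val = c(1, 5, 8, 3))
i <- data.table(id = c("a", "b"), upper = c(6, 10))

# For each row of i: match rows of x with the same id and val < upper,
# then count the matches for that row of i (by = .EACHI)
res <- x[i, on = .(id, val < upper), .N, by = .EACHI]
print(res)
```

In each condition inside `on=`, the left-hand side names a column of `x` and the right-hand side names a column of `i`; a bare name such as `id` stands for an equality match on a column present in both tables.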

To get started with data.table, I would suggest reading the vignettes. After that, this will probably look much more legible.
