Data.table with a sliding join that calculates the average condition for a date

Question

Data.table with a sliding join that calculates the average condition for a date

I could do this by going through my data set several times, but thought there should be a more efficient way to do this through data.table. Here's what the dataset looks like:

CaseID         Won     OwnerID      Time_period    Finished
  1            yes        A              1              no
  1            yes        A              3              no
  1            yes        A              5              yes
  2            no         A              4              no
  2            no         A              6              yes
  3            yes        A              2              yes
  4            no         A              3              yes
  5            15         B              2              no

For each row, by owner, I want to create an average number of cases completed before that period that are won.

CaseID         Won     OwnerID      Time_period     Finished     AvgWonByOwner  
  1            yes        A              1              no            NA
  1            yes        A              3              no             1
  1            yes        A              5              yes           .5
  2            no         A              4              no            .5
  2            no         A              6              yes           2/3
  3            yes        A              2              yes           NA
  4            no         A              3              yes           1
  5            15         B              2              no            NA

Looking at it in detail, it seems ridiculously complex. I thought you could do it with some kind of sliding merge, but I don’t know how to set the condition where the average value is calculated only from Won to the date of the line and where it should have the same owner identifier.

Edit 1: Explanation for the numbers in the last column

AvgWonByOwner          Explanation
   NA                  t = 1, No cases finished yet, this could be 0 too
   1                   t = 3, case 3 finished and is won, so average wins is 1
  .5                   t = 5, case 3 finished, won; case 4 finished lost; average = .5
  .5                   t = 4, case 3 finished, won; case 4 finished lost; average = .5
  2/3                  t = 6, case 3 finished, won, case 4 finished lost, case 1 finished won, average: 2/3
   NA                  t = 1, No cases finished yet, this could be 0 too
   1                   t = 3, case 3 finished and is won, so average wins is 1
   NA                  t = 1, No cases finished yet, this could be 0 too

+4

r data.table

Luke Nov 12 '13 at 21:34

source share

2

## Compute a data.table recording the win percentage at end of each time period
B <- dt[Finished=="yes",]
B[,winpct := (cumsum(Won=="yes")/seq_along(Won)),by=OwnerID]

## Shift forward by one time step, as per OP description of problem
B[,Time_period := Time_period + 1]
setkeyv(B, key(dt))

## Append win percentage column back to original data.table
cbind(dt, AvgWonByOwner=B[dt, winpct, roll=TRUE][["winpct"]])
#    CaseID Won OwnerID Time_period Finished AvgWonByOwner
# 1:      1 yes       A           1       no            NA
# 2:      3 yes       A           2      yes            NA
# 3:      1 yes       A           3       no     1.0000000
# 4:      4  no       A           3      yes     1.0000000
# 5:      2  no       A           4       no     0.5000000
# 6:      1 yes       A           5      yes     0.5000000
# 7:      2  no       A           6      yes     0.6666667
# 8:      5  15       B           2       no            NA

+2

Josh O'Brien 13 . '13 0:22

eddi · Accepted Answer · 2013-11-12T23:55:26+0000

dt = data.table(structure(list(CaseID = c(1, 1, 1, 2, 2, 3, 4, 5), Won = structure(c(3L, 
3L, 3L, 2L, 2L, 3L, 2L, 1L), .Label = c("15", "no", "yes"), class = "factor"), 
    OwnerID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("A", 
    "B"), class = "factor"), Time_period = c(1L, 3L, 5L, 4L, 
    6L, 2L, 3L, 2L), Finished = structure(c(1L, 1L, 2L, 1L, 2L, 
    2L, 2L, 1L), .Label = c("no", "yes"), class = "factor")), .Names = c("CaseID", 
"Won", "OwnerID", "Time_period", "Finished"), row.names = c(NA, 
-8L), class = c("data.table", "data.frame")))

# order
setkey(dt, OwnerID, Time_period)

# calculate the required ratio but including current time
dt[, ratio := cumsum(Finished == "yes" & Won == "yes") /
              cumsum(Finished == "yes"),
     by = list(OwnerID)]

# shift to satisfy the strict inequality as per OP
dt[, avgWon := c(NaN, ratio[-.N]), by = OwnerID]

# take the first one for each time (that is last one from previous time)
# so that all of the outcomes happening at same time are accounted for
dt[, avgWon := avgWon[1], by = key(dt)]

dt[order(OwnerID, CaseID)]
#   CaseID Won OwnerID Time_period Finished     ratio    avgWon
#1:      1 yes       A           1       no       NaN       NaN
#2:      1 yes       A           3       no 1.0000000 1.0000000
#3:      1 yes       A           5      yes 0.6666667 0.5000000
#4:      2  no       A           4       no 0.5000000 0.5000000
#5:      2  no       A           6      yes 0.5000000 0.6666667
#6:      3 yes       A           2      yes 1.0000000       NaN
#7:      4  no       A           3      yes 0.5000000 1.0000000
#8:      5  15       B           2       no       NaN       NaN

Data.table with a sliding join that calculates the average condition for a date

More articles: