Add a countdown column to a data.table giving the number of rows until a special row is found

I have a data.table with ordered data, and I want to add a column that tells me how many records remain until the next "special" record, which resets the countdown.

For instance:

    library(data.table)
    DT = data.table(idx = c(1,3,3,4,6,7,7,8,9),
                    name = c("a", "a", "a", "b", "a", "a", "b", "a", "b"))
    setkey(DT, idx)
    # manually add the answer
    DT[, countdown := c(3,2,1,0,2,1,0,1,0)]

Gives

    > DT
       idx name countdown
    1:   1    a         3
    2:   3    a         2
    3:   3    a         1
    4:   4    b         0
    5:   6    a         2
    6:   7    a         1
    7:   7    b         0
    8:   8    a         1
    9:   9    b         0

Notice how the countdown column gives the number of rows remaining until the next row named "b". The question is how to create this column in code.

Note that the key is not evenly spaced and may contain duplicates, so it is not much help in solving the problem. In general the non-b names may differ, but I could add a dummy column that is simply TRUE/FALSE if a solution requires it.

3 answers

Here is another idea:

    ## Create groups that end at each occurrence of "b"
    DT[, cd:=0L]
    DT[name=="b", cd:=1L]
    DT[, cd:=rev(cumsum(rev(cd)))]
    ## Count down within them
    DT[, cd:=max(.I) - .I, by=cd]
    #    idx name cd
    # 1:   1    a  3
    # 2:   3    a  2
    # 3:   3    a  1
    # 4:   4    b  0
    # 5:   6    a  2
    # 6:   7    a  1
    # 7:   7    b  0
    # 8:   8    a  1
    # 9:   9    b  0
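To see why the reversed cumulative sum produces the right groups, the intermediate vectors can be inspected in plain base R. This is only an illustrative sketch, not part of the answer: the variable names `flag` and `grp` are made up here, and `ave()` stands in for the grouped data.table step.

```r
# Base R walk-through of the grouping trick (no packages needed).
name <- c("a", "a", "a", "b", "a", "a", "b", "a", "b")

# 1 marks a special row, 0 marks everything else.
flag <- as.integer(name == "b")

# Cumulative sum taken from the bottom up: each row gets the number of
# special rows at or below it, so every group ends at its "b" row.
grp <- rev(cumsum(rev(flag)))

# Within each group, count the rows that remain after the current one.
countdown <- ave(seq_along(name), grp, FUN = function(i) max(i) - i)

print(data.frame(name, flag, grp, countdown))
```

Rows after the last "b", if there were any, would all fall into group 0 and count down to the end of the table, which may or may not be what is wanted.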

I am sure (or at least hope) that a pure data.table solution will turn up, but in the meantime you can use rle. Because the count runs downward, we use rev to reverse the name values before proceeding.

    output <- sequence(rle(rev(DT$name))$lengths)
    makezero <- cumsum(rle(rev(DT$name))$lengths)[c(TRUE, FALSE)]
    output[makezero] <- 0
    DT[, countdown := rev(output)]
    DT
    #    idx name countdown
    # 1:   1    a         3
    # 2:   3    a         2
    # 3:   3    a         1
    # 4:   4    b         0
    # 5:   6    a         2
    # 6:   7    a         1
    # 7:   7    b         0
    # 8:   8    a         1
    # 9:   9    b         0
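The rle steps can be unpacked on a plain vector, as in the sketch below. It is a variation rather than the answer's exact code: the positions to zero are picked by run value (`runs$values == "b"`) instead of the alternating `c(TRUE, FALSE)` mask, which only lines up when the last row of the table is a "b". The names `runs`, `counter` and `zero_at` are made up for illustration, and like the original it assumes every run of "b" has length one.

```r
name <- c("a", "a", "a", "b", "a", "a", "b", "a", "b")

# Run-length encode the names, read from the bottom of the table upward.
runs <- rle(rev(name))
print(runs$values)
print(runs$lengths)

# sequence() restarts a 1, 2, 3, ... counter for every run.
counter <- sequence(runs$lengths)

# End position of each "b" run (assumed to be a single row each).
zero_at <- cumsum(runs$lengths)[runs$values == "b"]
counter[zero_at] <- 0

# Flip back into the original row order.
countdown <- rev(counter)
print(countdown)
```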

Here's a mixture of Josh's and Ananda's solutions: I use rle to create the groups and then count down within them as in Josh's answer:

    t <- rle(DT$name)
    t <- t$lengths[t$values == "a"]
    # label groups by run index so that runs of equal length stay separate
    DT[, cd := rep(seq_along(t), t+1)]
    DT[, cd:=max(.I) - .I, by=cd]

Even better: taking advantage of the fact that there is always exactly one b after each run of a's (or so I am assuming here), you can do it more directly:

    t <- rle(DT$name)
    t <- t$lengths[t$values == "a"]
    DT[, cd := rev(sequence(rev(t+1)))-1]

Edit: The OP's comment makes clear that there can be more than one b in a row, and in that case every b must be 0. The first step is to create groups that each consist of a run of a's followed by its run of consecutive b's.

    DT <- data.table(idx=sample(10), name=c("a","a","a","b","b","a","a","b","a","b"))
    t <- rle(DT$name)
    val <- cumsum(t$lengths)[t$values == "b"]
    DT[, grp := rep(seq(val), c(val[1], diff(val)))]
    DT[, val := c(rev(seq_len(sum(name == "a"))), rep(0, sum(name == "b"))), by = grp]
    #     idx name grp val
    #  1:   1    a   1   3
    #  2:   7    a   1   2
    #  3:   9    a   1   1
    #  4:   4    b   1   0
    #  5:   2    b   1   0
    #  6:   8    a   2   2
    #  7:   6    a   2   1
    #  8:   3    b   2   0
    #  9:  10    a   3   1
    # 10:   5    b   3   0
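For comparison, the reversed cumulative sum idea also seems to cope with consecutive b's, because every b closes a group of its own. The sketch below is a hedged illustration rather than a tested general solution: `DT2`, `is_special`, `grp` and `countdown` are names made up here, and `idx` is fixed instead of `sample(10)` so the result is reproducible. The logical helper column is the TRUE/FALSE dummy column the question mentions, so the other names can be anything.

```r
library(data.table)

# Same names as the edited example; idx is fixed here for reproducibility.
DT2 <- data.table(idx = 1:10,
                  name = c("a", "a", "a", "b", "b", "a", "a", "b", "a", "b"))

# Hypothetical helper column: TRUE on special rows, whatever the other names are.
DT2[, is_special := name == "b"]

# Number of special rows at or below each row; every "b" closes a group.
DT2[, grp := rev(cumsum(rev(is_special)))]

# Rows remaining in the group after the current one.
DT2[, countdown := .N - seq_len(.N), by = grp]

print(DT2)
```

A second b directly after another b ends up alone in its group, so it gets 0 like the first one.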