Cumulative sums over run lengths. Can this loop be vectorized?

I have a data frame on which I compute the run length encoding for a specific column. The values of the dir column are either -1, 0, or 1.

dir.rle <- rle(df$dir)

Then I take the run lengths and compute segmented cumulative sums of another column in the data frame. I am using a for loop, but I feel there should be a smarter way to do this.

    ndx <- 1
    for(i in 1:length(dir.rle$lengths)) {
        l <- dir.rle$lengths[i] - 1
        s <- ndx
        e <- ndx+l
        tmp[s:e,]$cumval <- cumsum(df[s:e,]$val)
        ndx <- e + 1
    }

The lengths of the dir runs determine the start, s, and end, e, of each run. The above code works, but it doesn't look like idiomatic R. I feel there must be a way to do this without a loop.

3 answers

This can be broken down into two steps. First, if we create an indexing column based on the rle, we can group by it and run cumsum within each group. The grouping itself can then be done with any number of aggregation tools. I will show two options: one uses data.table and the other plyr.

    library(data.table)
    library(plyr)
    # data.table is the same thing as a data.frame for most purposes

    # Fake data
    dat <- data.table(dir = sample(-1:1, 20, TRUE), value = rnorm(20))
    dir.rle <- rle(dat$dir)

    # Compute an indexing column to group by
    dat <- transform(dat, indexer = rep(1:length(dir.rle$lengths), dir.rle$lengths))

    # What does the indexer column look like?
    > head(dat)
         dir      value indexer
    [1,]   1  0.5045807       1
    [2,]   0  0.2660617       2
    [3,]   1  1.0369641       3
    [4,]   1 -0.4514342       3
    [5,]  -1 -0.3968631       4
    [6,]  -1 -2.1517093       4

    # data.table approach
    dat[, cumsum(value), by = indexer]

    # plyr approach
    ddply(dat, "indexer", summarize, V1 = cumsum(value))
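If the goal is to keep the running sums next to the original rows, as the loop in the question does, one possible variation is to assign them by reference. This is only a sketch: it assumes a data.table version that provides rleid() (1.9.6 or later), and the toy data and the cumval column name are made up for illustration.

```r
library(data.table)

# Hypothetical toy data, small enough to check by hand
dat <- data.table(dir   = c(1, 0, 1, 1, -1, -1),
                  value = c(1, 2, 3, 4, 5, 6))

# rleid() numbers consecutive runs of dir, so no explicit rle() call
# or separate indexer column is needed. := adds cumval in place.
dat[, cumval := cumsum(value), by = rleid(dir)]

print(dat)
```

The cumulative sum restarts every time dir changes, so the two separate runs of 1 at the top are treated as different groups.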

Both Spacedman and Chase make the key point that the grouping variable simplifies things (and Chase outlines two great ways to go from there).

I will simply add an alternative way of constructing this grouping variable. It does not use rle and, at least to me, feels more intuitive. Basically, at every point where diff() detects a change in value, the cumsum that forms the grouping variable is incremented by one:

    df$group <- c(0, cumsum(!(diff(df$dir)==0)))

    # Or, equivalently
    df$group <- c(0, cumsum(as.logical(diff(df$dir))))
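To see how this relates to the rle-based index, here is a small check on a made-up direction vector (the data and variable names are illustrative only). The two grouping variables should label the same runs, differing only in that the diff-based one starts at 0 instead of 1.

```r
# Hypothetical toy direction vector
dir <- c(1, 0, 1, 1, -1, -1, -1, 0)

# rle-based run index: 1, 2, 3, ... each repeated by its run length
r <- rle(dir)
group_rle <- rep(seq_along(r$lengths), r$lengths)

# diff-based run index: starts at 0 and increases at every change in value
group_diff <- c(0, cumsum(diff(dir) != 0))

# Same runs, shifted by one
same_runs <- all(group_diff + 1 == group_rle)
print(same_runs)
```

Since the values are only used as group labels, the offset makes no difference to any grouped computation.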

Add a group column to the data frame. Something like:

    df=data.frame(z=rnorm(100)) # dummy data
    df$dir = sign(df$z) # dummy +/- 1
    rl = rle(df$dir)
    df$group = rep(1:length(rl$lengths),times=rl$lengths)

then use tapply to summarize within groups:

 tapply(df$z,df$group,sum) 
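Note that tapply with sum gives one total per run. If what is wanted is the running sum within each run, as in the question's loop, base R's ave() with FUN = cumsum is one way to get it. A minimal sketch on made-up data, borrowing the question's val and cumval column names:

```r
# Hypothetical toy data
df <- data.frame(dir = c(1, 1, 0, -1, -1, -1),
                 val = c(2, 3, 5, 1, 1, 4))

# Group column built from the run lengths, as above
rl <- rle(df$dir)
df$group <- rep(seq_along(rl$lengths), times = rl$lengths)

# One total per run
run_totals <- tapply(df$val, df$group, sum)

# One value per row: a cumulative sum that restarts at each run
df$cumval <- ave(df$val, df$group, FUN = cumsum)

print(run_totals)
print(df)
```

ave() returns a vector of the same length as its input, so the result can be assigned straight back as a column without any index bookkeeping.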
