How to compare values adjacent in sequence within the same group

Let's say I have something like this:

```r
set.seed(0)
the.df <- data.frame(
  x   = rep(letters[1:3], each = 4),
  n   = rep(0:3, 3),
  val = round(runif(12)))
the.df
#    x n val
# 1  a 0   1
# 2  a 1   0
# 3  a 2   0
# 4  a 3   1
# 5  b 0   1
# 6  b 1   0
# 7  b 2   1
# 8  b 3   1
# 9  c 0   1
# 10 c 1   1
# 11 c 2   0
# 12 c 3   0
```

Inside each x, starting from n==2 (going from small to large n), I want to set val to 0 if the previous val (in terms of n) is 0; otherwise, leave it as is.

For example, in the subset x=="b" I first ignore the two rows where n < 2. Then on row 7, since the previous val is 0 (the.df$val[the.df$x=="b" & the.df$n==1]), I set val to 0 (the.df$val[the.df$x=="b" & the.df$n==2] <- 0). On row 8, the val for the previous n is now 0 (we just set it), so I want to set val here to 0 as well (the.df$val[the.df$x=="b" & the.df$n==3] <- 0).

Assume the data.frame is not sorted, so any order-dependent procedure will require sorting first. I also cannot assume that all rows are present (for example, the row the.df[the.df$x=="a" & the.df$n==1, ] may be missing).

It seems that the hardest part is evaluating val sequentially. I can do this with a loop, but I expect that to be inefficient (I have millions of rows). Is there a way to do this more efficiently?
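To make the intent concrete, the loop I have in mind looks something like this (just a sketch of the slow version, assuming the data are sorted by x and then n first):

```r
# Naive row-by-row loop: propagate a 0 forward within each group from n >= 2
slow <- the.df[order(the.df$x, the.df$n), ]
for (i in seq_len(nrow(slow))[-1]) {
  same.grp <- slow$x[i] == slow$x[i - 1]
  if (same.grp && slow$n[i] >= 2 && slow$val[i - 1] == 0)
    slow$val[i] <- 0
}
```

This gives the desired result but touches every row in an interpreted loop, which is what I would like to avoid.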

EDIT: desired output

```r
the.df
#    x n val wanted
# 1  a 0   1      1
# 2  a 1   0      0
# 3  a 2   0      0
# 4  a 3   1      0
# 5  b 0   1      1
# 6  b 1   0      0
# 7  b 2   1      0
# 8  b 3   1      0
# 9  c 0   1      1
# 10 c 1   1      1
# 11 c 2   0      0
# 12 c 3   0      0
```

In addition, I am not opposed to creating new columns (for example, to hold the desired values).


Using data.table, I would try the following:

```r
library(data.table)
setDT(the.df)[order(n),
              val := if (length(indx <- which(val[2:.N] == 0L)))
                         c(val[1:(indx[1L] + 1L)], rep(0L, .N - (indx[1L] + 1L))),
              by = x]
the.df
#     x n val
#  1: a 0   1
#  2: a 1   0
#  3: a 2   0
#  4: a 3   0
#  5: b 0   1
#  6: b 1   0
#  7: b 2   0
#  8: b 3   0
#  9: c 0   1
# 10: c 1   1
# 11: c 2   0
# 12: c 3   0
```

This simultaneously orders the data by n (as you said, it is not ordered in real life) and recreates val only when the condition holds (meaning that if the condition is not met, val is left untouched).


Hopefully this will be implemented in data.table in the near future, and then the code could simply be:

```r
setDT(the.df)[order(n), val[n > 2] := if (val[2L] == 0) 0L, by = x]
```

which would be a big improvement both performance- and syntax-wise.


A base R approach could be:

```r
df <- the.df[order(the.df$x, the.df$n), ]
df$val <- ave(df$val, df$x, FUN = fun)
```

As for fun, @DavidArenburg's answer transliterated into plain R could be:

```r
fun0 <- function(v) {
    idx <- which.max(v[2:length(v)] == 0L) + 1L
    if (length(idx))
        v[idx:length(v)] <- 0L
    v
}
```

It seems like a good idea to formulate the solution as a standalone function first, because it is then easy to verify. fun0 fails for some edge cases, e.g.

```r
> fun0(0)
[1] 0 0 0
> fun0(1)
[1] 0 0 0
> fun0(c(1, 1))
[1] 1 0
```

A better version:

```r
fun1 <- function(v) {
    tst <- tail(v, -1) == 0L
    if (any(tst)) {
        idx <- which.max(tst) + 1L
        v[idx:length(v)] <- 0L
    }
    v
}
```
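A quick sanity check (my own, not part of the original answer) confirms that fun1 handles the short-vector cases that tripped up fun0, as well as the propagation case from the question:

```r
# Short vectors are returned unchanged; a 0 from position 2 onward
# zeroes out everything after it.
stopifnot(
  identical(fun1(0), 0),
  identical(fun1(1), 1),
  identical(fun1(c(1, 1)), c(1, 1)),
  identical(fun1(c(1, 0, 1, 1)), c(1, 0, 0, 0))  # the x == "b" pattern
)
```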

And even better, following @Arun

```r
fun <- function(v) if (length(v) > 2) c(v[1], cummin(v[-1])) else v
```
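Putting the pieces together on the example data (a quick check, not part of the original answer): cummin() over everything after the first element propagates the first 0 forward, which is exactly the required rule.

```r
set.seed(0)
the.df <- data.frame(x   = rep(letters[1:3], each = 4),
                     n   = rep(0:3, 3),
                     val = round(runif(12)))
fun <- function(v) if (length(v) > 2) c(v[1], cummin(v[-1])) else v

df <- the.df[order(the.df$x, the.df$n), ]
df$val <- ave(df$val, df$x, FUN = fun)
df$val
# [1] 1 0 0 0 1 0 0 0 1 1 0 0   -- matches the 'wanted' column
```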

It is competitive (within an order of magnitude) with the data.table solution, with ordering and everything else completing in under 1 s for data with ~10M rows; see @m-dz's timings. At millions of rows per second, further optimization is probably not worth pursuing.

However, when there is a very large number of small groups (for example, 2M groups of 5 rows each), an improvement is to avoid the per-group function calls made by ave() (via tapply()), using the group id to offset the values so that a single cummin() over the whole column does the per-group work. For example:

```r
df <- df[order(df$x, df$n), ]
grp <- match(df$x, unique(df$x))  # strictly sequential group ids
keep <- duplicated(grp)           # ignore the first row of each group
df$val[keep] <- cummin(df$val[keep] - grp[keep]) + grp[keep]
```
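To see why the offset works (my illustration, not in the original answer): subtracting the strictly increasing group id makes every value in group g+1 smaller than anything in group g, so cummin() can never carry a minimum across a group boundary; adding the id back recovers the per-group cumulative minimum.

```r
val <- c(1, 0, 1,  1, 1, 0)  # two groups of three rows
grp <- c(1, 1, 1,  2, 2, 2)

cummin(val - grp) + grp
# [1] 1 0 0 1 1 0   -- the 0 in group 1 does not leak into group 2,
#                      unlike a plain cummin(val)
```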

Hmmm, this should be pretty efficient if you switch to data.table...

```r
library(data.table)

# Define the.df as a data.table (or use data.table::setDT())
set.seed(0)
the.df <- data.table(
  x   = rep(letters[1:3], each = 4),
  n   = rep(0:3, 3),
  val = round(runif(12))
)

m_dz <- function() {
  setorder(the.df, x, n)
  repeat {
    # Get IDs of rows to change
    # ids <- which(the.df[, (n > 1) & (val == 1) & (shift(val, 1L, type = "lag") == 0)])
    ids <- the.df[(n > 1) & (val == 1) & (shift(val, 1L, type = "lag") == 0), , which = TRUE]
    # If no IDs, break
    if (length(ids) == 0) break
    # Set val to 0
    # for (i in ids) set(the.df, i = i, j = "val", value = 0)
    set(the.df, i = ids, j = "val", value = 0)
  }
  return(the.df)
}
```

Edit: the above function has been slightly modified thanks to @jangorecki, i.e. it now uses which = TRUE and set(the.df, i = ids, j = "val", value = 0), which made the timings much more stable (no more very high maximum timings).

Edit: timing comparison against @DavidArenburg's answer on a slightly larger table, with m_dz() updated (the @FoldedChromatin answer was skipped).

My function is a little faster on the median and upper quantiles, but there is a rather large spread in the timings (see the max...), and I cannot figure out why. Hopefully the timing methodology is correct (returning the result into another object, etc.).

Anything more will kill my computer :(

```r
set.seed(0)
groups_ids <- replicate(300, paste(sample(LETTERS, 5, replace = TRUE), collapse = ""))
size1 <- length(unique(groups_ids))
size2 <- round(1e7 / size1)

the.df1 <- data.table(
  x   = rep(groups_ids, each = size2),
  n   = rep(0:(size2 - 1), size1),
  val = round(runif(size1 * size2))
)
the.df2 <- copy(the.df1)

# m-dz
m_dz <- function() {
  setorder(the.df1, x, n)
  repeat {
    ids <- the.df1[(n > 1) & (val == 1) & (shift(val, 1L, type = "lag") == 0), , which = TRUE]
    if (length(ids) == 0) break
    set(the.df1, i = ids, j = "val", value = 0)
  }
  return(the.df1)
}

# David Arenburg
DavidArenburg <- function() {
  setorder(the.df2, x, n)
  the.df2[, val := if (length(indx <- which.max(val[2:.N] == 0) + 1L))
                       c(val[1:indx], rep(0L, .N - indx)),
          by = x]
  return(the.df2)
}

library(microbenchmark)
microbenchmark(
  res1 <- m_dz(),
  res2 <- DavidArenburg(),
  times = 100
)
# Unit: milliseconds
#                     expr      min       lq     mean   median       uq       max neval cld
#           res1 <- m_dz() 247.4136 268.5005 363.0117 288.4216 312.7307 7071.0960   100   a
#  res2 <- DavidArenburg() 270.6074 281.3935 314.7864 303.5229 328.1210  525.8095   100   a

identical(res1, res2)
# [1] TRUE
```

Edit: (old) results for an even bigger table:

```r
set.seed(0)
groups_ids <- replicate(300, paste(sample(LETTERS, 5, replace = TRUE), collapse = ""))
size1 <- length(unique(groups_ids))
size2 <- round(1e8 / size1)

# Unit: seconds
#                     expr      min       lq     mean   median       uq       max neval cld
#           res1 <- m_dz() 5.599855 5.800264 8.773817 5.923721 6.021132 289.85107   100   a
#          res2 <- m_dz2() 5.571911 5.836191 9.047958 5.970952 6.123419 310.65280   100   a
#  res3 <- DavidArenburg() 9.183145 9.519756 9.714105 9.723325 9.918377  10.28965   100   a
```

Why not just use by?

```r
> set.seed(0)
> the.df <- data.frame(x   = rep(letters[1:3], each = 4),
+                      n   = rep(0:3, 3),
+                      val = round(runif(12)))
> the.df
   x n val
1  a 0   1
2  a 1   0
3  a 2   0
4  a 3   1
5  b 0   1
6  b 1   0
7  b 2   1
8  b 3   1
9  c 0   1
10 c 1   1
11 c 2   0
12 c 3   0
> Mod.df <- by(the.df, INDICES = the.df$x, function(x) {
+   # note: this zeroes val at n == 2 unconditionally and only propagates
+   # one step, so it works for this example but is not fully general
+   x$val[x$n == 2] = 0
+   Which = which(x$n == 2 & x$val == 0) + 1
+   x$val[Which] = 0
+   x})
> do.call(rbind, Mod.df)
     x n val
a.1  a 0   1
a.2  a 1   0
a.3  a 2   0
a.4  a 3   0
b.5  b 0   1
b.6  b 1   0
b.7  b 2   0
b.8  b 3   0
c.9  c 0   1
c.10 c 1   1
c.11 c 2   0
c.12 c 3   0
```
