Clean an R data frame so that no row's value is more than 2 times the value in the next row

I have a data frame like the following:

    dist <- c(1.1, 1.0, 10.0, 5.0, 2.1, 12.2, 3.3, 3.4)
    id <- rep("A", length(dist))
    df <- cbind.data.frame(id, dist)
    df
    #   id dist
    # 1  A  1.1
    # 2  A  1.0
    # 3  A 10.0
    # 4  A  5.0
    # 5  A  2.1
    # 6  A 12.2
    # 7  A  3.3
    # 8  A  3.4

I need to clean it so that no value in the dist column is ever more than 2 times the value in the next row. The cleaned data frame should look like this:

      id dist
    1  A  1.1
    2  A  1.0
    5  A  2.1
    7  A  3.3
    8  A  3.4

I tried writing a function with a for loop and an if statement to clean it:

    cleaner <- function(df, dist, times_larger) {
      for (i in 1:(nrow(df) - 1)) {
        if (df$dist[i] > df$dist[i + 1] * times_larger) {
          df <- df[-i, ]
          break
        }
      }
      df
    }

Obviously, if I do not break out of the loop, it throws an error, because the number of rows in df changes along the way. If I manually run the function on df several times:

 df<-cleaner(df,"dist",2) 

it gets cleaned the way I want.
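The error mentioned above can be reproduced in isolation. This is only an illustrative sketch (the variable names `idx` and `res` are made up for the example): the index range of a `for` loop is computed once, so after a row is deleted the last index points past the end of the vector, the lookup returns `NA`, and `if` cannot evaluate an `NA` condition.

```r
dist <- c(1.1, 1.0, 10.0, 5.0, 2.1, 12.2, 3.3, 3.4)

# The loop range is fixed up front: 1:7 for the original 8 values.
idx <- 1:(length(dist) - 1)

# Simulate one deletion inside the loop; only 7 values remain.
dist <- dist[-4]

# The last iteration (i = 7) would look at position 8, which no longer exists.
dist[max(idx) + 1]   # NA

# if () cannot handle an NA condition, so the comparison raises an error.
res <- tryCatch(
  if (dist[7] > dist[8] * 2) "drop" else "keep",
  error = function(e) "error"
)
res                  # "error"
```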

I also tried various other function constructions and applying them to the data frame, but without luck.

Does anyone have a good suggestion for how to repeat a function on a data frame until it no longer changes, a better way to structure the function, or perhaps a better way to clean the data altogether?

Any suggestions are much appreciated.

3 answers

You can try lead from dplyr:

    library(dplyr) # dplyr_0.4.0
    filter(df, dist < 2 * lead(dist, default = Inf))
    #   id dist
    # 1  A  1.1
    # 2  A  1.0
    # 3  A  2.1
    # 4  A  3.3
    # 5  A  3.4

Or use a similar approach in data.table. The new shift function was introduced in the development version of data.table. By default, type is "lag" and fill is NA, so we set type to "lead" and change fill to Inf (inspired by @Marat Talipov's post).

    library(data.table) # data.table_1.9.5
    setDT(df)[dist < 2 * shift(dist, type = 'lead', fill = Inf)]
    #    id dist
    # 1:  A  1.1
    # 2:  A  1.0
    # 3:  A  2.1
    # 4:  A  3.3
    # 5:  A  3.4

Update

If a dist value is exactly 2 times the next value, the solutions above remove that row. To keep such rows, add a small tolerance to the shifted value:

    setDT(df)[dist < 2 * (shift(dist, type = 'lead', fill = Inf) + .Machine$double.eps)]
    #    id dist
    # 1:  A  1.1
    # 2:  A  1.0
    # 3:  A  2.1
    # 4:  A  3.3
    # 5:  A  3.4

Using another example, suggested in a comment by @Henrik:

    df1 <- data.frame(dist = as.numeric(3:1))
    setDT(df1)[dist < 2 * (shift(dist, type = 'lead', fill = Inf) + .Machine$double.eps)]
    #    dist
    # 1:    3
    # 2:    2
    # 3:    1
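One caveat about the epsilon adjustment above: `.Machine$double.eps` is the spacing of doubles at 1, so when the next value is 2 or larger the added epsilon is rounded away and an exact tie is still dropped. A possible alternative, offered here only as a sketch and not as part of the original answer, is to compare with `<=` instead. The example uses plain base R vectors, and the names `nxt`, `eps_keep` and `le_keep` are hypothetical.

```r
dist <- c(4, 2, 1)
nxt  <- c(dist[-1], Inf)   # value in the next row; Inf for the last row

# Epsilon version: 2 + eps rounds back to 2, so the tie 4 vs 2 * 2 is lost.
eps_keep <- dist < 2 * (nxt + .Machine$double.eps)
eps_keep                   # FALSE  TRUE  TRUE

# Non-strict comparison keeps exact ties regardless of magnitude.
le_keep <- dist <= 2 * nxt
le_keep                    # TRUE TRUE TRUE
```

With data.table, the same idea would presumably read `setDT(df)[dist <= 2 * shift(dist, type = 'lead', fill = Inf)]`.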

Benchmarks

    library(dplyr)
    library(data.table)
    library(microbenchmark)

    set.seed(49)
    df <- data.frame(id = 'A', dist = rnorm(1e7, 20))
    df1 <- copy(df)

    akrun1  <- function() { filter(df, dist < 2 * lead(dist, default = Inf)) }
    akrun2  <- function() { setDT(df1)[dist < 2 * shift(dist, type = 'lead', fill = Inf)] }
    marat   <- function() { subset(df, dist < c(2 * dist[-1], Inf)) }
    Colonel <- function() { df[with(df, dist < 2 * c(dist[-1], tail(dist, 1))), ] }

    microbenchmark(akrun1(), akrun2(), marat(), Colonel(),
                   unit = 'relative', times = 20L)
    # Unit: relative
    #       expr      min       lq     mean   median       uq      max neval cld
    #   akrun1() 2.029087 1.990739 1.864697 1.965247 1.773722 1.727474    20  b
    #   akrun2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    20 a
    #    marat() 8.032147 8.137982 7.359821 7.937062 7.134686 5.837623    20    d
    #  Colonel() 7.094465 7.045000 6.473552 6.903460 6.197737 5.359575    20   c

You can shift the dist column one element to the left, multiply it by two, and compare it with the original dist:

    subset(df, dist < c(2 * dist[-1], Inf))
    #   id dist
    # 1  A  1.1
    # 2  A  1.0
    # 5  A  2.1
    # 7  A  3.3
    # 8  A  3.4
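A single vectorized pass compares each row with its original neighbour, so removing a row can bring together two rows that again break the rule. For the "repeat until nothing changes" part of the question, one option is to wrap the same comparison in a loop. This is a minimal sketch, assuming the column is named `dist` and contains no `NA` values; the helper name `clean_until_stable` is made up for illustration.

```r
# Hypothetical helper: reapply the one-pass filter until no row is dropped.
clean_until_stable <- function(df, times_larger = 2) {
  repeat {
    # TRUE for rows that are smaller than times_larger * the next row's value
    keep <- df$dist < times_larger * c(df$dist[-1], Inf)
    if (all(keep)) break
    df <- df[keep, , drop = FALSE]
  }
  df
}

df <- data.frame(id = "A", dist = c(1.1, 1.0, 10.0, 5.0, 2.1, 12.2, 3.3, 3.4))
clean_until_stable(df)$dist                      # 1.1 1.0 2.1 3.3 3.4

# A case where one pass is not enough: dropping 10 puts 5 next to 2.
df2 <- data.frame(id = "A", dist = c(5, 10, 2))
subset(df2, dist < c(2 * dist[-1], Inf))$dist    # 5 2
clean_until_stable(df2)$dist                     # 2
```

For the data in the question, one pass already reaches the stable result, so both approaches return the same rows.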

Base R solution:

    df[with(df, dist < 2 * c(dist[-1], tail(dist, 1))), ]
    #   id dist
    # 1  A  1.1
    # 2  A  1.0
    # 5  A  2.1
    # 7  A  3.3
    # 8  A  3.4

If there are no zero elements:

 df[with(df, dist/c(dist[-1], tail(dist,1)))<2,] 
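The "no zero elements" condition matters because the ratio form divides by the next value. A small sketch with made-up data (`df0`, `nxt`, `ratio_keep` and `prod_keep` are hypothetical names) shows the difference: `0/0` gives `NaN`, the comparison then yields `NA`, and indexing a data frame with `NA` produces a row of `NA`s, while the multiplication form stays well defined.

```r
df0 <- data.frame(id = "A", dist = c(4, 0, 0, 1))
nxt <- with(df0, c(dist[-1], tail(dist, 1)))   # 0 0 1 1

# Ratio form: 4/0 is Inf, 0/0 is NaN, so the second comparison is NA.
ratio_keep <- df0$dist / nxt < 2
ratio_keep                                     # FALSE    NA  TRUE  TRUE

# Product form: no division, so no NaN and no NA.
prod_keep <- df0$dist < 2 * nxt
prod_keep                                      # FALSE FALSE  TRUE  TRUE

df0[prod_keep, ]
```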