Assigning a value to each run of consecutive numbers with the same sign in R

I am trying to create a data frame with a column whose values represent the length of each run of positive or negative numbers, for example:

Time   V  Length
 0.5  -2    1.5
 1.0  -1    1.5
 1.5   0    0.0
 2.0   2    1.0
 2.5   0    0.0
 3.0   1    1.75
 3.5   2    1.75
 4.0   1    1.75
 4.5  -1    0.75
 5.0  -3    0.75

The Length column summarizes the time during which the value stayed positive or negative. Zeros are assigned 0, since they are inflection points. If there is no zero separating a sign change, the times on either side of the change are averaged. For example, the run at Times 3.0-4.0 is bounded by the zero at Time 2.5 and the interpolated crossing at (4.0 + 4.5)/2 = 4.25, giving a length of 4.25 - 2.5 = 1.75.

I am trying to approximate the time these values spend either positive or negative. I have tried this with a for loop, with varying degrees of success, but I would like to avoid loops because I work with extremely large datasets.

I spent some time on sign and diff, as they are used in this question about sign changes. I also looked at this question, which uses transform and aggregate to summarize consecutive duplicate values. I feel like I can use these in combination with sign and/or diff, but I'm not sure how to assign the resulting sums back to the ranges that created them, or how to deal with the places where I take the inflection average.
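For reference, on the example data below, sign and diff give the following (the run boundaries show up as the nonzero entries of the diff; the -2 is the crossing with no separating zero):

sign(dat$V)
# [1] -1 -1  0  1  0  1  1  1 -1 -1
diff(sign(dat$V))
# [1]  0  1  1 -1  1  0  0 -2  0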

Any suggestions would be appreciated. Here is an example dataset:

dat <- data.frame(Time = seq(0.5, 5, 0.5),
                  V = c(-2, -1, 0, 2, 0, 1, 2, 1, -1, -3))
4 answers

First, find the indices of "Time" where you want to interpolate: consecutive "V" values where the sign flips between positive and negative with no zero in between; these have an abs(diff(sign(V))) of two.

 id <- which(abs(c(0, diff(sign(dat$V)))) == 2) 
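For the example data this picks out a single index (a quick check, assuming the dat from the question):

id
# [1] 9

That is, the sign flips from positive at Time 4.0 straight to negative at Time 4.5, with no zero in between.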

Add rows to the original data with "Time" set to the midpoint between each such index and the previous one, and "V" set to zero. Also add rows with V = 0 at Time = 0 and at the last time step (as suggested by @Gregor). Then order by "Time".

d2 <- rbind(dat,
            data.frame(Time = (dat$Time[id] + dat$Time[id - 1])/2, V = 0),
            data.frame(Time = c(0, max(dat$Time)), V = c(0, 0)))
d2 <- d2[order(d2$Time), ]

Calculate the time differences between the zero-valued time steps, and replicate them across the rows using "zero group indices":

 d2$Length <- diff(d2$Time[d2$V == 0])[cumsum(d2$V == 0)] 
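For the example data, the zeros in d2 sit at Times 0, 1.5, 2.5, 4.25 and 5, so the diff above yields the run lengths, and cumsum(d2$V == 0) maps each row of d2 to the gap it falls in (a quick check):

diff(d2$Time[d2$V == 0])
# [1] 1.50 1.00 1.75 0.75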

Merge the values back into the original data:

merge(dat, d2)
#    Time  V Length
# 1   0.5 -2   1.50
# 2   1.0 -1   1.50
# 3   1.5  0   1.00
# 4   2.0  2   1.00
# 5   2.5  0   1.75
# 6   3.0  1   1.75
# 7   3.5  2   1.75
# 8   4.0  1   1.75
# 9   4.5 -1   0.75
# 10  5.0 -3   0.75

Finally, set "Length" to 0 where V == 0.
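A minimal sketch of those last two steps together (the name out is just illustrative):

out <- merge(dat, d2)
out$Length[out$V == 0] <- 0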


This works, at least for your test case, and it should be pretty efficient. It makes some assumptions; I will try to point out the big ones.

First we extract the vectors and prepend a 0. We also set the last V to 0. The calculation is based on time differences between zeros, so we need zeros at both the start and the end. Your example output apparently assumes V = 0 at Time = 0, hence the initial 0; and it stops abruptly at the maximum time, so we set the last V = 0 as well:

Time = c(0, dat$Time)
V = c(0, dat$V)
V[length(V)] = 0

To fill in the missing zeros, we use approx to linearly interpolate sign(V). This also assumes the sampling is regular, so we can get away with doubling the sampling frequency to recover all the missing zeros.

 ap = approx(Time, sign(V), xout = seq(0, max(Time), by = 0.25)) 

The values we want to fill in are the durations between zeros, both observed and interpolated. In order, these are:

 dur = diff(ap$x[ap$y == 0]) 
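For the example data, the interpolated sign curve hits zero at Times 0, 1.5, 2.5, 4.25 and 5, so (a quick check):

ap$x[ap$y == 0]
# [1] 0.00 1.50 2.50 4.25 5.00
dur
# [1] 1.50 1.00 1.75 0.75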

Finally, we need indices into the input data to assign each duration. This is the hackiest part of this answer, but it seems to work. Perhaps someone can offer a nice simplification.

# first use rleid to get the sign groupings
group = data.table::rleid(sign(dat$V))

# then we need to set the groups corresponding to 0 values to 0
# and reduce any group numbers following 0s correspondingly;
# lastly we add 1 to everything so that we can stick 0 at the
# front of our durations and assign those to the 0 V values
ind = (group - cumsum(dat$V == 0)) * (dat$V != 0) + 1

# fill it in
dat$Length = c(0, dur)[ind]
dat
#    Time  V Length
# 1   0.5 -2   1.50
# 2   1.0 -1   1.50
# 3   1.5  0   0.00
# 4   2.0  2   1.00
# 5   2.5  0   0.00
# 6   3.0  1   1.75
# 7   3.5  2   1.75
# 8   4.0  1   1.75
# 9   4.5 -1   0.75
# 10  5.0 -3   0.75
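To see how the index construction plays out on the example data (a quick check; each entry of ind picks the matching duration out of c(0, dur), with 1 selecting the leading 0 for the V == 0 rows):

data.table::rleid(sign(dat$V))
# [1] 1 1 2 3 4 5 5 5 6 6
ind
# [1] 2 2 1 3 1 4 4 4 5 5
c(0, dur)
# [1] 0.00 1.50 1.00 1.75 0.75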

It took me more time than I would like to admit, but here is my solution.

Since you said you want to use this on large datasets (so speed matters), I use Rcpp to write a loop that does all the checking. For the speed comparison, I also create another sample dataset with 500,000 data points and check the timing (I tried to compare against the other answers, but could not port them to data.table; without that it would be an unfair comparison...). Once that is done, I am happy to update the speed comparison!

Part 1: My solution

My solution looks like this:

(in length_time.cpp)

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector length_time(NumericVector time, NumericVector v) {
  double start = 0;
  double time_i, v_i;
  bool last_positive = v[0] > 0;
  bool last_negative = v[0] < 0;
  int length_i = time.length();
  NumericVector ret_vec(length_i);

  for (int i = 0; i < length_i; ++i) {
    time_i = time[i];
    v_i = v[i];

    if (v_i == 0) { // inflection
      if (i > 0) {
        // if this is not the beginning, then a regime has ended!
        ret_vec[i - 1] = time_i - start;
        start = time_i;
      }
    } else if ((v_i > 0 && last_negative) || (v_i < 0 && last_positive)) {
      ret_vec[i - 1] = (time_i + time[i - 1]) / 2 - start;
      start = (time_i + time[i - 1]) / 2;
    }

    last_positive = v_i > 0;
    last_negative = v_i < 0;
  }
  ret_vec[length_i - 1] = time[length_i - 1] - start;

  // ret_vec now only has the values at the last observation of each regime;
  // do something like a reverse na_locf...
  double tmp_val = ret_vec[length_i - 1];
  for (int i = length_i - 1; i >= 0; --i) {
    if (v[i] == 0) {
      ret_vec[i] = 0;
    } else if (ret_vec[i] == 0) {
      ret_vec[i] = tmp_val;
    } else {
      tmp_val = ret_vec[i];
    }
  }
  return ret_vec;
}

and then in an R file (i.e. length_time.R):

library(Rcpp)
# setwd("...")  # to find the .cpp file
sourceCpp("length_time.cpp")

dat$Length <- length_time(dat$Time, dat$V)
dat
#    Time  V Length
# 1   0.5 -2   1.50
# 2   1.0 -1   1.50
# 3   1.5  0   0.00
# 4   2.0  2   1.00
# 5   2.5  0   0.00
# 6   3.0  1   1.75
# 7   3.5  2   1.75
# 8   4.0  1   1.75
# 9   4.5 -1   0.75
# 10  5.0 -3   0.75

This seems to work on the sample dataset.

Part 2: Speed testing

library(data.table)
library(microbenchmark)

n <- 10000
set.seed(1235278)
dt <- data.table(time = seq(from = 0.5, by = 0.5, length.out = n),
                 v = cumsum(round(rnorm(n, sd = 1))))
dt[, chg := v >= 0 & shift(v, 1, fill = 0) <= 0]

plot(dt$time, dt$v, type = "l")
abline(h = 0)
for (i in dt[chg == T, time]) abline(v = i, lty = 2, col = "red")

This produces a series with 985 zero crossings.

(Plot of v against time, with the zero crossings marked by dashed red lines.)

Speed testing with microbenchmark gives:

microbenchmark(dt[, length := length_time(time, v)])
# Unit: milliseconds
#                                      expr      min     lq     mean   median       uq      max neval
# dt[, `:=`(length, length_time(time, v))] 2.625714 2.7184 3.054021 2.817353 3.077489 5.235689   100

That is around 3 milliseconds for the calculation on 500,000 observations.

Does this help you?


Here is my attempt, done completely in base R:

Joseph <- function(df) {
  is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol
  v <- df$V
  t <- df$Time
  sv <- sign(v)
  nR <- length(v)
  v0 <- which(v == 0)
  ## This line and (t[id] + t[id - 1L])/2 from @Henrik
  id <- which(abs(c(0, diff(sv))) > 1)
  myZeros <- sort(c(v0 * t[1L], (t[id] + t[id - 1L])/2))

  ## Actual values that will populate the Length column
  lenVals <- diff(c(0, myZeros, t[nR]))

  ## remove values that result from repeating zeros in the df$V column
  lenVals <- lenVals[lenVals != t[1L] | c(!is.wholenumber(myZeros/t[1L]), F)]

  ## Below we need to determine how long to replicate
  ## each of the lenVals above, so we need to find
  ## the starting place and length of each run...
  ## rle is a great candidate for both of these
  m <- rle(sv)
  ml <- m$lengths
  cm <- cumsum(ml)
  zm <- m$values != 0     ## non-zero values, i.e. we won't populate anything here
  rl <- m$lengths[zm]     ## non-zero run-lengths
  st <- cm[zm] - rl + 1L  ## starting indices

  out <- vector(mode = 'numeric', length = nR)
  for (i in 1:length(st)) { out[st[i]:(st[i] + rl[i] - 1L)] <- lenVals[i] }
  df$Length <- out
  df
}

Here is the result on the example:

Joseph(dat)
   Time  V Length
1   0.5 -2   1.50
2   1.0 -1   1.50
3   1.5  0   0.00
4   2.0  2   1.00
5   2.5  0   0.00
6   3.0  1   1.75
7   3.5  2   1.75
8   4.0  1   1.75
9   4.5 -1   0.75
10  5.0 -3   0.75

And here is a benchmark on a larger example:

set.seed(142)
datBig <- data.frame(Time = seq(0.5, 50000, 0.5),
                     V = sample(-3:3, 10^5, replace = TRUE))

library(compiler)
library(data.table)
library(microbenchmark)

c.Joseph <- cmpfun(Joseph)
c.Henrik <- cmpfun(Henrik)
c.Gregor <- cmpfun(Gregor)

microbenchmark(c.Joseph(datBig), c.Gregor(datBig), c.Henrik(datBig),
               David(datBig), times = 10)
Unit: milliseconds
             expr        min         lq       mean      median         uq       max neval cld
    David(datBig)    2.20602   2.617742    4.35927    2.788686    3.13630  114.0674    10 a
 c.Joseph(datBig)   61.91015   62.62090   95.44083   64.43548   93.20945  225.4576    10 b
 c.Gregor(datBig)   59.25738   63.32861  126.29857   72.65927  214.35961  229.5022    10 b
 c.Henrik(datBig) 1511.82449 1678.65330 1727.14751 1730.24842 1816.42601 1871.4476    10 c

As @Gregor noted, the goal is to find the x-distance between each occurrence of zero. This can be seen visually by plotting (again, as @Gregor pointed out; lots of kudos, btw). For example, if we plot the first 20 values of datBig, we get:

(Plot of the first 20 datBig values, with the zero crossings visible on the x-axis.)

This shows that the x-distances over which the graph stays either positive or negative (i.e., not equal to zero; zero-length stretches occur where zeros repeat) are approximately:

2.0, 1.25, 0.5, 0.75, 2.0, 1.0, 0.75, 0.5

t1 <- c.Joseph(datBig)
t2 <- c.Gregor(datBig)
t3 <- c.Henrik(datBig)
t4 <- David(datBig)

## Correct values according to the plot above (an x above a value indicates an incorrect value)
## 2.00 2.00 2.00 0.00 1.25 1.25 0.50 0.75 0.00 0.00 2.00 2.00 2.00 0.00 0.00 0.00 1.00 0.00 0.75 0.50

## all correct
t1$Length[1:20]
 [1] 2.00 2.00 2.00 0.00 1.25 1.25 0.50 0.75 0.00 0.00 2.00 2.00 2.00 0.00 0.00 0.00 1.00 0.00 0.75 0.50

## mostly correct
t2$Length[1:20]
                                                       x    x    x                   x              x
 [1] 2.00 2.00 2.00 0.00 1.25 1.25 0.50 0.75 0.00 0.00 0.75 0.75 0.75 0.00 0.00 0.00 0.50 0.00 0.75 0.25

## least correct
t3$Length[1:20]
                    x    x         x    x         x    x    x    x    x              x    x    x    x
 [1] 2.00 2.00 2.00 0.50 1.00 1.25 0.75 1.25 0.00 1.75 1.75 0.00 1.50 1.50 0.00 0.00 1.25 1.25 1.25 1.25

## all correct
t4$Length[1:20]
 [1] 2.00 2.00 2.00 0.00 1.25 1.25 0.50 0.75 0.00 0.00 2.00 2.00 2.00 0.00 0.00 0.00 1.00 0.00 0.75 0.50

## agreement with David's solution
all.equal(t4$Length, t1$Length)
[1] TRUE

Well, it seems that the Rcpp solution provided by David is not only accurate but blazing fast.

