Moving values between rows without a for loop in R

I wrote code to organize data sampled at different frequencies, but I made extensive use of for loops, which significantly slow things down on large data sets. I have been going through my code looking for ways to remove the for loops and speed it up, but one of the loops has me stumped.

As an example, suppose the data were taken at a frequency of 3 Hz, so I get three rows for every second of data. However, variables A, B, and C are each sampled at 1 Hz, so I get one value every three rows for each of them. The variables are sampled sequentially within each one-second period, which gives the data its diagonal structure.

To complicate matters further, sometimes a row is missing from the original dataset.

My goal is this: having identified the rows I want to keep, I want to move the non-NA values from the subsequent rows up into those keeper rows. If it weren't for the missing-data problem, I would always keep the row containing the value for the first variable; but if that row is missing, I keep the next row instead.

In the example below, the sixth and tenth samples are missing.

A <- c(1, NA, NA, 4, NA, 7, NA, NA, NA, NA)
B <- c(NA, 2, NA, NA, 5, NA, 8, NA, 11, NA)
C <- c(NA, NA, 3, NA, NA, NA, NA, 9, NA, 12)
test_df <- data.frame(A = A, B = B, C = C)
test_df
    A  B  C
1   1 NA NA
2  NA  2 NA
3  NA NA  3
4   4 NA NA
5  NA  5 NA
6   7 NA NA
7  NA  8 NA
8  NA NA  9
9  NA 11 NA
10 NA NA 12
keep_rows <- c(1, 4, 6, 9)

After the values have been moved up into the keeper rows, I will delete the intermediate rows, leaving the following:

test_df <- test_df[keep_rows, ]
test_df
   A  B  C
1  1  2  3
2  4  5 NA
3  7  8  9
4 NA 11 12

In the end, I want exactly one row per second of data, with NA values remaining only where a row of the source data was lost.

Does anyone have ideas on how to move the data without a for loop? I'd be grateful for any help! Sorry if this question is too verbose; I wanted to err on the side of too much information rather than too little.
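Edit: for reference, a loop-free way to identify the keeper rows themselves might look like the sketch below. It is just an idea, and it assumes each row carries exactly one non-NA value and that the variables cycle in order (A, B, C) within each second, so a new second starts whenever the variable index fails to increase:

v <- max.col(1 * !is.na(test_df), ties.method = "first")  # which variable each row carries (1 = A, 2 = B, 3 = C)
keep_rows <- which(c(TRUE, diff(v) <= 0))                  # a new second starts when the index resets
keep_rows
# [1] 1 4 6 9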

3 answers

This should do it. Shift B up one row and C up two rows so that each second's samples align, then drop the all-NA rows:

test_df = with(test_df, cbind(A[1:(length(A)-2)], B[2:(length(B)-1)], C[3:length(C)]))
test_df = data.frame(test_df[!apply(test_df, 1, function(x) all(is.na(x))), ])
colnames(test_df) = c('A', 'B', 'C')
> test_df
   A  B  C
1  1  2  3
2  4  5 NA
3  7  8  9
4 NA 11 12

And if you want something even faster:

 test_df = data.frame(test_df[rowSums(is.na(test_df)) != ncol(test_df), ]) 

Building on @John Colby's great answer, we can get rid of the apply step and speed it up quite a bit (about 20x):

# Create a bigger test set
A <- c(1, NA, NA, 4, NA, 7, NA, NA, NA, NA)
B <- c(NA, 2, NA, NA, 5, NA, 8, NA, 11, NA)
C <- c(NA, NA, 3, NA, NA, NA, NA, 9, NA, 12)
n = 1e6
test_df = data.frame(A = rep(A, len = n), B = rep(B, len = n), C = rep(C, len = n))

# John Colby's method, 9.66 secs
system.time({
  df1 = with(test_df, cbind(A[1:(length(A)-2)], B[2:(length(B)-1)], C[3:length(C)]))
  df1 = data.frame(df1[!apply(df1, 1, function(x) all(is.na(x))), ])
  colnames(df1) = c('A', 'B', 'C')
})

# My method, 0.48 secs
system.time({
  df2 = with(test_df, data.frame(A = A[1:(length(A)-2)], B = B[2:(length(B)-1)], C = C[3:length(C)]))
  df2 = df2[is.finite(with(df2, A|B|C)), ]
  row.names(df2) <- NULL
})

identical(df1, df2)  # TRUE

The trick is that A|B|C is NA only if all three values are NA. This turns out to be much faster than calling all(is.na(x)) on each row of the matrix with apply.
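A minimal illustration of why this works (the values below are purely for demonstration; note the trick assumes the data contain no zeros, since 0 | NA is also NA):

1 | NA           # TRUE: one known, non-zero value is enough to decide the OR
NA | NA | NA     # NA: only an all-NA row stays undecided
is.finite(NA)    # FALSE
is.finite(TRUE)  # TRUE, so is.finite(A|B|C) is TRUE exactly for the rows to keep
0 | NA           # NA: the caveat -- a genuine zero would look like a missing row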

EDIT: @John has a different approach that is also fast. I added code to convert the result to a data.frame with the correct names, and timed it. It comes out at about the same speed as my solution.

# John's method, 0.50 secs
system.time({
  test_m = with(test_df, cbind(A[1:(length(A)-2)], B[2:(length(B)-1)], C[3:length(C)]))
  test_m[is.na(test_m)] <- -1
  test_m <- test_m[rowSums(test_m) > -3, ]
  test_m[test_m == -1] <- NA
  df3 <- data.frame(test_m)
  colnames(df3) = c('A', 'B', 'C')
})

identical(df1, df3)  # TRUE

EDIT AGAIN: ...and @John Colby's updated answer is even faster!

# John Colby's method, 0.39 secs
system.time({
  df4 = with(test_df, cbind(A[1:(length(A)-2)], B[2:(length(B)-1)], C[3:length(C)]))
  df4 = data.frame(df4[rowSums(is.na(df4)) != ncol(df4), ])
  colnames(df4) = c('A', 'B', 'C')
})

identical(df1, df4)  # TRUE

Your question was about doing the move without a loop, so apparently you already have the first step worked out. The shift itself is just vector indexing:

> test_m <- with(test_df, cbind(A[1:(length(A)-2)], B[2:(length(B)-1)], C[3:length(C)]))
> test_m
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]   NA   NA   NA
[3,]   NA   NA   NA
[4,]    4    5   NA
[5,]   NA   NA   NA
[6,]    7    8    9
[7,]   NA   NA   NA
[8,]   NA   11   12

This is now a matrix. You can exclude the rows that contain no data points at all without a loop. If you want to get back to a data.frame you can use a different method, but this one will run fastest on large data sets. I like to simply replace NA with an impossible value... maybe -1, but you'll know your data better... maybe -pi.

 test_m[is.na(test_m)] <- -1 

Now select the rows by testing for those impossible numbers; with three columns, a row with no data sums to exactly -3, while any row with real data sums higher (assuming the real values are positive):

 test_m <- test_m[rowSums(test_m) > -3,] 

And, if you want, you can put the NAs back.

test_m[test_m == -1] <- NA
test_m
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5   NA
[3,]    7    8    9
[4,]   NA   11   12

There is no loop (neither for nor apply), and the one function applied across the rows of the matrix, rowSums(), is specially optimized and runs very fast.
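For what it's worth, here is a sketch of my own (not part of the answer above) that wraps the shift plus the rowSums() filter into a single function for any number of staggered columns; the lapply only loops over the handful of columns, so the per-row work stays vectorized:

collapse_staggered <- function(df) {
  k <- ncol(df)
  n <- nrow(df)
  # shift column j up by j-1 rows so each period's samples align
  m <- do.call(cbind, lapply(seq_len(k), function(j) df[[j]][j:(n - k + j)]))
  # drop rows where every entry is NA, then restore the column names
  out <- data.frame(m[rowSums(is.na(m)) != k, , drop = FALSE])
  names(out) <- names(df)
  out
}

collapse_staggered(data.frame(A = A, B = B, C = C))
#    A  B  C
# 1  1  2  3
# 2  4  5 NA
# 3  7  8  9
# 4 NA 11 12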
