Avoiding row-wise processing of a data.frame in R

I was wondering what the best way is to avoid row-wise processing in R, since most vectorized operations are carried out in internal C routines. For example, I have a data frame a:

      chromosome_name start_position end_position strand
    1              15       35574797     35575181      1
    2              15       35590448     35591641     -1
    3              15       35688422     35688645      1
    4              13       75402690     75404217      1
    5              15       35692892     35693969      1

I want startOFgene to be either start_position or end_position, depending on whether the strand is positive or negative. One way to avoid a for loop is to split the data.frame into the +1 strand rows and the -1 strand rows and perform the selection on each part. What would be another way to speed this up? The split method does not scale if there is some other complex processing to do for each row.
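For concreteness, here is a minimal sketch of the split approach described above. It assumes the data frame is rebuilt by hand from the printed table (the column types are a guess, stored here as plain numerics), and the names pos, neg and res are only illustrative.

```r
# The example data frame, rebuilt by hand from the printed table (assumed types)
a <- data.frame(
  chromosome_name = c(15, 15, 15, 13, 15),
  start_position  = c(35574797, 35590448, 35688422, 75402690, 35692892),
  end_position    = c(35575181, 35591641, 35688645, 75404217, 35693969),
  strand          = c(1, -1, 1, 1, 1)
)

# Split approach: handle each strand separately
pos <- a[a$strand == 1, ]
neg <- a[a$strand == -1, ]
pos$startOFgene <- pos$start_position
neg$startOFgene <- neg$end_position

# Recombine, then sort by the original row names to restore the row order
res <- rbind(pos, neg)
res <- res[order(as.integer(rownames(res))), ]
```

Every extra per-row rule would need another split and another recombination step, which is why this gets awkward quickly.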

2 answers

Maybe it's fast enough ...

    transform(a, startOFgene = ifelse(strand == 1, start_position, end_position))
      chromosome_name start_position end_position strand startOFgene
    1              15       35574797     35575181      1    35574797
    2              15       35590448     35591641     -1    35591641
    3              15       35688422     35688645      1    35688422
    4              13       75402690     75404217      1    75402690
    5              15       35692892     35693969      1    35692892
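This works because ifelse() is vectorized: the test strand == 1 is evaluated for the whole column at once, and each result is picked element by element from one of the two columns. A small sketch comparing it with an explicit loop, assuming the example data frame is rebuilt by hand from the question (vec and loop are illustrative names):

```r
# The example data frame, rebuilt by hand from the question (assumed types)
a <- data.frame(
  chromosome_name = c(15, 15, 15, 13, 15),
  start_position  = c(35574797, 35590448, 35688422, 75402690, 35692892),
  end_position    = c(35575181, 35591641, 35688645, 75404217, 35693969),
  strand          = c(1, -1, 1, 1, 1)
)

# Vectorized: one call over the whole column
vec <- ifelse(a$strand == 1, a$start_position, a$end_position)

# Row-by-row equivalent, shown only for comparison
loop <- numeric(nrow(a))
for (i in seq_len(nrow(a))) {
  loop[i] <- if (a$strand[i] == 1) a$start_position[i] else a$end_position[i]
}

# Both approaches should give the same values
stopifnot(identical(vec, loop))
```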

First, since all your columns are integer/numeric, you could use a matrix instead of a data.frame. Many operations on a matrix are much faster than the same operation on a data.frame, although in this case the two are not very different. Then you can use logical subsetting to create the startOFgene column.

    # Create some large-ish data
    M <- do.call(rbind, replicate(1e3, as.matrix(a), simplify=FALSE))
    M <- do.call(rbind, replicate(1e3, M, simplify=FALSE))
    A <- as.data.frame(M)

    # Create startOFgene column in a matrix
    m <- function() {
      M <- cbind(M, startOFgene=M[,"start_position"])
      negStrand <- sign(M[,"strand"]) < 0
      M[negStrand,"startOFgene"] <- M[negStrand,"end_position"]
    }

    # Create startOFgene column in a data.frame
    d <- function() {
      A$startOFgene <- A$start_position
      negStrand <- sign(A$strand) < 0
      A$startOFgene[negStrand] <- A$end_position[negStrand]
    }

    library(rbenchmark)
    benchmark(m(), d(), replications=10)[,1:6]
    #   test replications elapsed relative user.self sys.self
    # 2  d()           10  18.804    1.000    16.501    2.224
    # 1  m()           10  19.713    1.048    16.457    3.152
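One caveat, as far as I can tell: m() and d() above only modify copies local to the function, which is fine for timing but leaves M and A themselves unchanged. A minimal sketch of the same logical-subsetting idea that returns the result, tried on the small example from the question (add_start is a hypothetical helper name, and the data frame is rebuilt by hand):

```r
# The example data frame, rebuilt by hand from the question (assumed types)
a <- data.frame(
  chromosome_name = c(15, 15, 15, 13, 15),
  start_position  = c(35574797, 35590448, 35688422, 75402690, 35692892),
  end_position    = c(35575181, 35591641, 35688645, 75404217, 35693969),
  strand          = c(1, -1, 1, 1, 1)
)

# Hypothetical helper: same logical-subsetting steps as d(), but returns the result
add_start <- function(df) {
  df$startOFgene <- df$start_position                       # default: start of the + strand
  negStrand <- df$strand < 0                                # TRUE for rows on the - strand
  df$startOFgene[negStrand] <- df$end_position[negStrand]   # overwrite only those rows
  df
}

res <- add_start(a)
```

The input a is left untouched, so the call has to be assigned to something (here res) to keep the new column.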