Rewrite the loop using apply

Newbie question: this double loop over a data frame of about 50 thousand elements evaluates very slowly, taking more than 30 seconds. I read on the Internet that I should use some form of the apply function to fix this, but I still cannot get the code to work. Starting with a first data.frame that holds the results, the goal is to get a second data.frame where only the values that exceed the target are kept, and everything else is set to 0.

This code works:

 ExcessGain = function(Value, Target){ max(0, Value - Target) }

 Pcnt_O_O_x = data.frame()
 for (j in 1:ncol(Pcnt_O_O)){
   for (i in 1:nrow(Pcnt_O_O)){
     Pcnt_O_O_x[i,j] = ExcessGain(Pcnt_O_O[i,j], GainTargetPcnt)
   }
 }

Can I speed this up by using the apply function instead of the inner loop?

1 answer

Your function looks like it just subtracts the target value from each cell of your data, and any negative values are then replaced with 0. In that case you do not need loops at all: you can use R's built-in vectorised operations to do this:

 set.seed(123)

 # If you have a data.frame of all numeric elements, turn it into a matrix first
 df <- as.matrix( data.frame( matrix( runif(25) , nrow = 5 ) ) )
 target <- 0.5

 df
 #          X1        X2        X3         X4        X5
 #1 0.2875775 0.0455565 0.9568333 0.89982497 0.8895393
 #2 0.7883051 0.5281055 0.4533342 0.24608773 0.6928034
 #3 0.4089769 0.8924190 0.6775706 0.04205953 0.6405068
 #4 0.8830174 0.5514350 0.5726334 0.32792072 0.9942698
 #5 0.9404673 0.4566147 0.1029247 0.95450365 0.6557058

 df2 <- df - target
 df2
 #           X1          X2          X3         X4        X5
 #1 -0.21242248 -0.45444350  0.45683335  0.3998250 0.3895393
 #2  0.28830514  0.02810549 -0.04666584 -0.2539123 0.1928034
 #3 -0.09102308  0.39241904  0.17757064 -0.4579405 0.1405068
 #4  0.38301740  0.05143501  0.07263340 -0.1720793 0.4942698
 #5  0.44046728 -0.04338526 -0.39707532  0.4545036 0.1557058

 df2[ df2 < 0 ] <- 0
 df2
 #         X1         X2        X3        X4        X5
 #1 0.0000000 0.00000000 0.4568333 0.3998250 0.3895393
 #2 0.2883051 0.02810549 0.0000000 0.0000000 0.1928034
 #3 0.0000000 0.39241904 0.1775706 0.0000000 0.1405068
 #4 0.3830174 0.05143501 0.0726334 0.0000000 0.4942698
 #5 0.4404673 0.00000000 0.0000000 0.4545036 0.1557058
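As a side note (not in the original answer), the subtraction and the zero-clamping can also be combined into a single step with R's vectorised pmax(), which takes element-wise maxima:

 # One-step equivalent: pmax() takes the element-wise maximum of each
 # difference and 0, clamping negative values in a single vectorised call
 df2 <- pmax( df - target , 0 )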

Here are a few benchmarks showing the difference in speed when working on a matrix as opposed to a data.frame. f.df( df ) and fm( m ) are two functions that carry out this operation on a data.frame and on a matrix respectively, each with 1 million elements:
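The definitions of f.df and fm are not shown in the answer; a minimal sketch of what they could plausibly look like, assuming each one simply wraps the loop-free subtract-and-clamp from the example above:

 # Hypothetical reconstructions of the two benchmarked functions: the same
 # operation applied to a data.frame and to a matrix of 1 million elements
 f.df <- function( df ){ df2 <- df - target ; df2[ df2 < 0 ] <- 0 ; df2 }
 fm   <- function( m ){ m2 <- m - target ; m2[ m2 < 0 ] <- 0 ; m2 }
 df   <- data.frame( matrix( runif( 1e6 ) , nrow = 1000 ) )
 m    <- as.matrix( df )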

 require( microbenchmark )
 microbenchmark( f.df( df ) , fm( m ) , times = 10L )
 #Unit: milliseconds
 #     expr        min         lq     median         uq       max neval
 # f.df(df) 6944.09808 9009.39684 9233.18528 9533.75089 10036.5963    10
 #    fm(m)   37.26433   39.00189   40.46229   41.15626   130.6983    10

Working on the matrix is two orders of magnitude faster when the matrix is large.

If you really do need to use the apply function, you can apply it over each matrix cell as follows:

 m <- matrix( runif(25) , nrow = 5 )
 target <- 0.5
 apply( m , 1:2 , function(x) max( x - target , 0 ) )
 #          [,1]      [,2]       [,3]      [,4]      [,5]
 #[1,] 0.4575807 0.0000000 0.15935928 0.0000000 0.1948637
 #[2,] 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
 #[3,] 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
 #[4,] 0.3912719 0.0000000 0.06155316 0.1533290 0.0000000
 #[5,] 0.3228921 0.4697041 0.23554353 0.1352888 0.0000000
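Tying this back to the objects named in the question (Pcnt_O_O and GainTargetPcnt), the whole double loop could be replaced by a single vectorised line along these lines:

 # Convert to a matrix, subtract the target, clamp negatives to 0 with
 # pmax(), then convert back to a data.frame
 Pcnt_O_O_x <- as.data.frame( pmax( as.matrix( Pcnt_O_O ) - GainTargetPcnt , 0 ) )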
