Why is this devectorized Julia code more than 20 times slower?

I thought that in Julia (unlike in R or Matlab) devectorized code was often faster than vectorized code. But I am not finding that to be the case. Here is an example:

    julia> x = Float64[1:10000000];

    julia> y = Array(Float64, length(x));

    julia> @time for i = 1:length(x) y[i] = exp(x[i]) end;
    elapsed time: 7.014107314 seconds (959983704 bytes allocated, 25.39% gc time)

    julia> @time y = exp(x);
    elapsed time: 0.364695612 seconds (80000128 bytes allocated)

Why is the vectorized code so much faster? It looks like the devectorized code allocates more than 10 times as much memory. But in fact, only a few bytes should need to be allocated to exponentiate any number of floats. Is there a way to write the devectorized code so that it does not allocate so much memory, and consequently runs faster than the vectorized code?

Thanks!


Consider the following code snippet:

    x = Float64[1:10000000];
    y = Array(Float64, length(x));

    function nonglobal_devec!(x, y)
        for i = 1:length(x)
            y[i] = exp(x[i])
        end
    end

    function nonglobal_vec(x)
        exp(x)
    end

    @time nonglobal_devec!(x, y);   # A
    @time y = nonglobal_vec(x);     # B

    x = Float64[1:10000000];
    y = Array(Float64, length(x));
    @time for i = 1:length(x) y[i] = exp(x[i]) end   # C
    @time y = exp(x)                                 # D

which gives times

    A: elapsed time: 0.072701108 seconds (115508 bytes allocated)
    B: elapsed time: 0.074584697 seconds (80201532 bytes allocated)
    C: elapsed time: 2.029597656 seconds (959990464 bytes allocated, 22.86% gc time)
    D: elapsed time: 0.058509661 seconds (80000128 bytes allocated)

The outlier, C, is slow because it runs in the global scope, where type inference does not operate, so slower code is generated.

The relative timings of A and B are subject to some variability, because functions are compiled the first time they are used. If we run them again, we get

    A2: elapsed time: 0.038542212 seconds (80 bytes allocated)
    B2: elapsed time: 0.063630172 seconds (80000128 bytes allocated)

which makes sense, since A2 allocates essentially no memory (the 80 bytes are for the return value of the function), while B2 creates a new vector. Also note that B2 allocates the same amount of memory as D; the extra memory on the first run was allocated during compilation.
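One way to see this directly (a minimal sketch, not part of the original timings, reusing the nonglobal_devec! function defined above) is to "warm up" the function with one call before timing it, so the measurement excludes compilation:

    # Sketch: call once to trigger compilation, then time the second call.
    x = Float64[1:10000000];
    y = Array(Float64, length(x));

    nonglobal_devec!(x, y)        # first call compiles the function
    @time nonglobal_devec!(x, y)  # now measures only the loop itself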

Finally, devectorization versus vectorization is a case-by-case matter. For example, if you implemented matrix multiplication naively with loops and without any awareness of caching, you would likely be much slower than the vectorized A*b, which uses BLAS.
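To illustrate that last point (a hypothetical sketch, not from the original answer; naive_matmul! is a made-up name), compare a cache-unaware triple loop against the built-in BLAS-backed multiplication:

    # Sketch: naive devectorized matrix multiply vs. BLAS-backed A*B.
    function naive_matmul!(C, A, B)
        n = size(A, 1)
        for i = 1:n, j = 1:n
            s = 0.0
            for k = 1:n
                s += A[i,k] * B[k,j]   # row-wise access to A is cache-unfriendly
            end                        # in column-major Julia arrays
            C[i,j] = s
        end
        return C
    end

    A = rand(500, 500); B = rand(500, 500); C = zeros(500, 500)
    naive_matmul!(C, A, B)    # warm up (compilation)
    @time naive_matmul!(C, A, B)
    @time A*B                 # calls optimized BLAS; typically far faster

Even though both versions are written in a type-stable function, the BLAS call wins because of blocking and other cache optimizations, not because vectorized code is inherently faster in Julia.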

