BLAS vs. parallel updates for Julia SharedArray objects

I am interested in using Julia's SharedArray objects for a scientific computing project. My current implementation calls BLAS for all matrix-vector operations, but I thought that SharedArrays might offer some speedup on multicore machines. My idea is simply to update the output vector index by index, farming the index updates out to worker processes.

Previous discussions here about SharedArrays and here about shared-memory objects have not given clear guidance on this issue. It seems intuitively simple, but after testing I am somewhat confused as to why this approach works so poorly (see code below). For starters, @parallel for seems to allocate a lot of memory. And if I prefix the loop with @sync, which seems reasonable if the entire output vector is needed later, then the parallel loop is significantly slower (although without @sync the loop is very fast).

Am I misunderstanding the correct use of SharedArray objects? Or have I divided up the calculations inefficiently?

    ### test for speed gain w/ SharedArray vs. Array ###

    # problem dimensions
    n = 10000; p = 25000

    # set BLAS threads; 64 seems reasonable in testing
    blas_set_num_threads(64)

    # make normal Arrays
    x = randn(n,p)
    y = ones(p)
    z = zeros(n)

    # make SharedArrays
    X = convert(SharedArray{Float64,2}, x)
    Y = convert(SharedArray{Float64,1}, y)
    Z = convert(SharedArray{Float64,1}, z)

    # run BLAS.gemv! on Arrays twice, time second case
    BLAS.gemv!('N', 1.0, x, y, 0.0, z)
    @time BLAS.gemv!('N', 1.0, x, y, 0.0, z)

    # does BLAS work equally well for SharedArrays?
    # check timing result and ensure same answer
    BLAS.gemv!('N', 1.0, X, Y, 0.0, Z)
    @time BLAS.gemv!('N', 1.0, X, Y, 0.0, Z)
    println("$(isequal(z,Z))") # should be true

    # SharedArrays can be updated in parallel
    # code a loop to farm updates to worker nodes
    # use transposed X to place rows of X in columnar format
    # should (hopefully) help with performance issues from stride
    Xt = X'
    @parallel for i = 1:n
        Z[i] = dot(Y, Xt[:,i])
    end

    # now time the synchronized copy of this
    @time @sync @parallel for i = 1:n
        Z[i] = dot(Y, Xt[:,i])
    end

    # still get same result?
    println("$(isequal(z,Z))") # should be true

Output from running the test with 4 workers + 1 master process:

    elapsed time: 0.109010169 seconds (80 bytes allocated)
    elapsed time: 0.110858551 seconds (80 bytes allocated)
    true
    elapsed time: 1.726231048 seconds (119936 bytes allocated)
    true
1 answer

You have several problems, the most important of which is that Xt[:,i] creates a new array (a memory allocation) on every iteration. Here is a demo that gets you closer to what you want:

    n = 10000; p = 25000

    # make normal Arrays
    x = randn(n,p)
    y = ones(p)
    z = zeros(n)

    # make SharedArrays
    X = convert(SharedArray, x)
    Y = convert(SharedArray, y)
    Z = convert(SharedArray, z)

    Xt = X'

    @everywhere function dotcol(a, B, j)
        length(a) == size(B,1) || throw(DimensionMismatch("a and B must have the same number of rows"))
        s = 0.0
        @inbounds @simd for i = 1:length(a)
            s += a[i]*B[i,j]
        end
        s
    end

    function run1!(Z, Y, Xt)
        for j = 1:size(Xt, 2)
            Z[j] = dotcol(Y, Xt, j)
        end
        Z
    end

    function runp!(Z, Y, Xt)
        @sync @parallel for j = 1:size(Xt, 2)
            Z[j] = dotcol(Y, Xt, j)
        end
        Z
    end

    run1!(Z, Y, Xt)
    runp!(Z, Y, Xt)
    @time run1!(Z, Y, Xt)
    zc = copy(sdata(Z))
    fill!(Z, -1)
    @time runp!(Z, Y, Xt)
    @show sdata(Z) == zc

Results (when starting with julia -p 8):

    julia> include("/tmp/paralleldot.jl")
    elapsed time: 0.465755791 seconds (80 bytes allocated)
    elapsed time: 0.076751406 seconds (282 kB allocated)
    sdata(Z) == zc = true
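As a quick sanity check on the allocation point above, you can time a single column update with each kernel. This is a rough, untested sketch using only the variables from the demo; the exact byte counts it reports will vary with the Julia version:

    # Rough check: slicing copies the column, the hand-written kernel reads it in place.
    @time dot(Y, Xt[:, 1])    # allocates a temporary column (~p Float64s)
    @time dotcol(Y, Xt, 1)    # no per-column copy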

For comparison, using BLAS on the same machine:

    julia> blas_set_num_threads(8)

    julia> @time A_mul_B!(Z, X, Y);
    elapsed time: 0.067611858 seconds (80 bytes allocated)

So the plain-Julia implementation is at least competitive with BLAS.
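As a side note, an alternative to hand-writing the dotcol kernel would be to take a non-copying view of the column and reuse dot. The sketch below is untested and not part of the answer above: dotcol_view and runp_view! are hypothetical names, sub is the pre-0.5 spelling of what later became view, and it assumes a generic dot method that accepts a SharedArray/SubArray pair. Whether it matches the explicit @inbounds @simd loop in speed would need to be measured.

    # Hypothetical variant (untested): avoid the copying slice Xt[:, j] by
    # taking a lightweight SubArray view of the column and reusing dot.
    @everywhere dotcol_view(a, B, j) = dot(a, sub(B, :, j))

    function runp_view!(Z, Y, Xt)
        @sync @parallel for j = 1:size(Xt, 2)
            Z[j] = dotcol_view(Y, Xt, j)
        end
        Z
    end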

