When I run this in the Julia v0.3.7 REPL on my 64-bit Windows 8.1 machine with 8 logical CPU cores:
blas_set_num_threads(CPU_CORES)
const v=ones(Float64,100000)
@time for k=1:1000000;s=dot(v,v);end
I observe in the CPU meter of Task Manager or Process Explorer that only 12.5% of the CPU is used (1 logical CPU core). I observe the same with Julia v0.3.5, on both Windows 7 and Windows 8.1, and also when starting with "Julia -p 8" on the command line. Going back to running the Julia REPL without the "-p 8" command line option, I tried this test:
blas_set_num_threads(CPU_CORES)
@time peakflops(10000)
In this case, the CPU meter shows 100% CPU usage.
Because dot() and peakflops() both use BLAS (OpenBLAS in my case), I expected both to use the number of threads set by blas_set_num_threads(). However, only the latter function actually does. Is the behavior of dot() due to a bug, possibly in OpenBLAS?
I tried to work around this shortcoming in Julia by using a matrix multiplication function. However, I am iterating dot() operations over sub-vectors of GB-sized 2D arrays, where the sub-vectors use contiguous memory. Matrix multiplication forces me to transpose each vector, which creates a copy. That is expensive inside the inner loop. So my choice seems to be either to learn how to use Julia's parallel processing commands/macros or to go back to Python (where Intel's MKL BLAS works as expected for ddot()). Because dot() accounts for 99% of [...], [...] OpenBLAS [...] Julia [...].
[...] dot(). [...] SharedArray [...]? [...] SharedArray [...]? [...] 100,000 [...] dot() [...]. [...] Julia [...] BLAS?
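The point about contiguous sub-vectors can be illustrated with a minimal standard-library Python sketch. It assumes a small row-major buffer (the shape `ROWS` x `COLS` and the helper names `row` and `dot` are hypothetical, not from the post) and shows that a contiguous sub-vector can be passed to a dot product as a view, without copying. The plain-Python `dot` only stands in for a BLAS ddot call.

```python
from array import array

ROWS, COLS = 3, 4  # hypothetical small shape standing in for the GB-sized 2D array

# One flat, row-major buffer of doubles: row r occupies buf[r*COLS : (r+1)*COLS].
buf = array("d", (float(i) for i in range(ROWS * COLS)))
view = memoryview(buf)


def row(r):
    """Return a zero-copy view of row r (a contiguous sub-vector)."""
    return view[r * COLS:(r + 1) * COLS]


def dot(x, y):
    """Plain-Python dot product; a BLAS ddot call would take its place."""
    return sum(a * b for a, b in zip(x, y))


r0, r1 = row(0), row(1)
result = dot(r0, r1)  # dot product of rows [0, 1, 2, 3] and [4, 5, 6, 7]

# A write through the original buffer is visible in the view, so no copy was made.
buf[0] = 10.0
shares_memory = (r0[0] == 10.0)
print(result, shares_memory)
```

Whether a given array library hands out views or copies for a particular slice depends on its memory layout, so this is only a picture of the idea, not a statement about Julia's arrays.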
[...]:
BLAS v. [...] Julia SharedArray
([...]) dot() [...].
1:
With Julia started with "-p 8", I tried innersimd() in place of dot(), taken from:
http://docs.julialang.org/en/release-0.3/manual/performance-tips/
It used only 1 core. innersimd() also used only 1 core with ::SharedArray{Float64, 1} arguments in place of ::Array{Float64, 1}. :(
2:
Julia matrix multiplication (BLAS' gemm!()):
blas_set_num_threads(CPU_CORES)
const A=ones(Float64,(4,100000))
const B=ones(Float64,(100000,4))
@time for k=1:100000;s=A*B;end
[...] "-p" [...].
3:
Python:
import numpy as np
from scipy.linalg.blas import ddot
from timeit import default_timer as timer
v = np.ones(100000)
start = timer()
for k in range(1000000):
    s = ddot(v, v)  # BLAS double-precision dot product
exec_time = timer() - start
print()
print("Execution took", str(round(exec_time, 3)), "seconds")
On 64-bit Anaconda3 v2.1.0 / WinPython: 7.5 seconds. For comparison, Julia 0.3.7 with OpenBLAS takes 28 seconds. Python is about 4 times faster than Julia [...] OpenBLAS ddot().
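The two timings can be turned into rough throughput figures. The sketch below uses only numbers from the post (vector length, loop count, 7.5 and 28 seconds) plus the usual approximation of 2N floating-point operations per length-N dot product. The variable and function names are illustrative.

```python
# Figures taken from the post: vector length, loop count, and wall-clock times.
N = 100_000             # elements per vector
ITERATIONS = 1_000_000  # number of dot products timed
PYTHON_SECONDS = 7.5    # reported for the Python ddot loop
JULIA_SECONDS = 28.0    # reported for the Julia 0.3.7 dot loop

# A length-N dot product costs N multiplies and N - 1 adds; 2 * N is the usual approximation.
FLOPS_PER_DOT = 2 * N


def gflops(seconds):
    """Sustained rate in GFLOP/s for the whole timed loop."""
    return FLOPS_PER_DOT * ITERATIONS / seconds / 1e9


python_rate = gflops(PYTHON_SECONDS)
julia_rate = gflops(JULIA_SECONDS)
speedup = JULIA_SECONDS / PYTHON_SECONDS

print(f"Python: {python_rate:.1f} GFLOP/s")
print(f"Julia:  {julia_rate:.1f} GFLOP/s")
print(f"Ratio:  {speedup:.2f}x")
```

The ratio is what the "about 4 times" figure rounds from. It says nothing by itself about how many cores either run used.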
4:
In Python, [...] (4xN) * (Nx2) (N = 100000) [...]. [...] 8 [...]. [...] Julia [...] Python, [...] Julia: [...] 100000 [...] 4 [...] ddot() [...] OpenBLAS ddot() ([...])? [...] 4. [...] OpenBLAS [...] Julia "-p 8" [...].
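For reference, a matrix product bundles several dot products into one call: each entry of (4xN) * (Nx2) is the dot product of one row of the left matrix with one column of the right, so there are 8 length-N dot products in total. Here is a minimal pure-Python sketch of that relationship. `N` is reduced to a hypothetical 1000 to keep it fast, and the helpers `dot` and `matmul` are illustrative, not BLAS calls.

```python
def dot(x, y):
    """Plain-Python dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def matmul(A, B):
    """Multiply A (m x n) by B (n x p), both given as lists of rows."""
    columns_of_B = list(zip(*B))  # materializes the columns of B (this is a copy)
    return [[dot(row, col) for col in columns_of_B] for row in A]


N = 1000  # hypothetical; smaller than the post's N = 100000

A = [[1.0] * N for _ in range(4)]  # 4 x N, all ones
B = [[1.0] * 2 for _ in range(N)]  # N x 2, all ones

C = matmul(A, B)
dot_products_in_C = len(C) * len(C[0])

# Entry (0, 0) equals the dot product of row 0 of A with column 0 of B.
first_column_of_B = [row[0] for row in B]
entry_matches = (C[0][0] == dot(A[0], first_column_of_B))
print(dot_products_in_C, entry_matches)
```

Note that `matmul` here has to gather the columns of `B` before it can take dot products, which is the same kind of copy the post is trying to avoid.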
5:
Julia v0.3.7 with "-p 8", [...] OpenBLAS gemm!() ([...]):
blas_set_num_threads(CPU_CORES)
const a = rand(10000, 10000)
@time a * a