I am trying to optimize the simulation of a simple dynamical system in which both the response of a network and its parameters (weights) evolve according to simple linear equations. The simulation needs to run for tens of millions of time steps, but the network size will typically be small, so performance is dominated not by the matrix-vector products but by temporary arrays, bounds checks, and other less visible factors. Since I am new to Julia, I would appreciate any hints on how to optimize performance further.
```julia
function train_network(A, T, Of, cs, dt)
    N, I = size(T)
    z = zeros(I)
    r = zeros(N)

    @inbounds for t in 1:size(cs, 1)
        # ... (rest of the loop body is truncated in the original post)
```
Timing the network for 2,000,000 steps with

```julia
@time train_network(A, T, Of, cs, dt)
```

gives

```
3.420486 seconds (26.12 M allocations: 2.299 GB, 6.65% gc time)
```
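For context, here is a minimal self-contained sketch of the kind of loop being timed. Since the snippet above is cut off, the concrete update equations (the readout z and the rank-1 learning rule) are assumptions rather than the original model:

```julia
# Minimal sketch: state r evolves linearly, z is read out through the
# weights Of, and Of receives a rank-1 update each step. Every vectorized
# expression below allocates fresh temporaries on each iteration, which is
# what dominates the cost at small network sizes.
function train_network(A, T, Of, cs, dt)
    N, I = size(T)
    z = zeros(I)
    r = zeros(N)
    @inbounds for t in 1:size(cs, 1)
        r += dt * (A*r + T*z)          # state update (allocates!)
        z  = Of * r                    # readout (allocates!)
        Of += dt * (cs[t] .- z) * r'   # rank-1 weight update (allocates!)
    end
    return r, z, Of
end

# Example call matching the timing above (sizes are made up):
# A, T, Of = randn(20, 20) / 20, randn(20, 5), zeros(5, 20)
# @time train_network(A, T, Of, randn(2_000_000), 1e-3)
```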
Update 1
Following David Sanders' advice, I got rid of the @devec macro and wrote out the loops, as sketched below. This indeed reduces array allocations and boosts performance by about 25%; here are the new numbers:
```
2.648113 seconds (18.00 M allocations: 1.669 GB, 5.60% gc time)
```
The smaller the network, the bigger the gain. A gist of the updated simulation code can be found here.
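As an illustration of what the devectorization amounts to, here is a sketch assuming a state update of the form r += dt*(A*r + T*c); the scratch buffer dr is a hypothetical name:

```julia
# Devectorized equivalent of r += dt*(A*r + T*c) with zero allocations.
# Accumulating into the preallocated buffer dr before touching r preserves
# the semantics of the vectorized expression.
function update_state!(r, dr, A, T, c, dt)
    N, I = size(T)
    fill!(dr, 0.0)
    @inbounds for j in 1:N        # column-major friendly loop order
        rj = r[j]
        for i in 1:N
            dr[i] += A[i, j] * rj
        end
    end
    @inbounds for k in 1:I
        ck = c[k]
        for i in 1:N
            dr[i] += T[i, k] * ck
        end
    end
    @inbounds for i in 1:N
        r[i] += dt * dr[i]        # apply the Euler step in place
    end
    return r
end
```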
Update 2
A significant part of the memory allocations came from the matrix-vector products. To get rid of those, I replaced these products with the in-place BLAS operation BLAS.gemv!, which reduces timings by another 25% and memory allocation by 90%:
```
1.990031 seconds (2.00 M allocations: 152.589 MB, 0.69% gc time)
```
Updated code here.
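For reference, the pattern looks roughly like this in current Julia syntax: BLAS.gemv!(tA, α, A, x, β, y) overwrites y with α*A*x + β*y, so both products land in a preallocated buffer (dr is again a hypothetical name):

```julia
using LinearAlgebra

N, I = 20, 5
A, T = randn(N, N), randn(N, I)
r, c = zeros(N), randn(I)
dr   = zeros(N)                      # preallocated buffer (hypothetical)
dt   = 1e-3

BLAS.gemv!('N', dt, A, r, 0.0, dr)   # dr  = dt * (A*r), in place
BLAS.gemv!('N', dt, T, c, 1.0, dr)   # dr += dt * (T*c), in place
r .+= dr                             # apply the step
```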
Update 3
The remaining big operation, the rank-1 update, can also be replaced by two in-place BLAS calls, namely BLAS.scal! for the scaling and BLAS.ger! for the rank-1 update (sketched below). The caveat is that both calls are quite slow if more than one thread is used (an OpenBLAS problem?), so it is best to set
```julia
blas_set_num_threads(1)
```
This gives a 15% timing gain for a network of size 20 and a 50% gain for a network of size 50. There are no more memory allocations, and the new timings are
```
1.638287 seconds (11 allocations: 1.266 KB)
```
Again, the updated code can be found here.
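As a sketch of the two calls (the weight matrix W and the rule W ← γ*W + η*z*r' are assumptions, since the exact learning rule is not shown above); note that blas_set_num_threads was later renamed to BLAS.set_num_threads:

```julia
using LinearAlgebra

BLAS.set_num_threads(1)          # single-threaded BLAS is faster here

N, I = 20, 5
W    = randn(I, N)               # hypothetical weight matrix
z, r = randn(I), randn(N)
γ, η = 0.999, 1e-3               # assumed decay and learning rate

# W ← γ*W + η*z*r' as two in-place BLAS calls:
BLAS.scal!(length(W), γ, W, 1)   # scale every element of W by γ
BLAS.ger!(η, z, r, W)            # rank-1 update: W += η * z * r'
```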
Update 4
To compare the results so far, I wrote a basic Cython script. The main difference is that I do not use BLAS calls but have loops: injecting low-level BLAS calls is a pain in Cython, and numpy dot calls have too much overhead for small network sizes (I tried ...). Timings are
```
CPU times: user 3.46 s, sys: 6 ms, total: 3.47 s
Wall time: 3.47 s
```
which is roughly on par with the original Julia version (from which 50% has been shaved off so far).