Julia: Optimize Simulation of a Simple Dynamic System

I am trying to speed up the simulation of a simple dynamical system in which both the response of the network and its parameters (weights) evolve according to simple linear equations. The simulation has to run over tens of millions of time steps, but the network size will typically be small. Consequently, performance is dominated not by the matrix-vector products but by temporary arrays, bounds checks, and other less obvious factors. Since I am new to Julia, I would appreciate any hints for improving performance further.

using Devectorize

function train_network(A, T, Of, cs, dt)
    N, I = size(T)
    z = zeros(I)
    r = zeros(N)

    @inbounds for t in 1:size(cs, 1)
        # precompute
        Az  = A*z
        Ofr = Of*r

        # compute training signal
        @devec z += dt.*(Az + cs[t] - 0.5.*z)
        I_teach = T*(Az + cs[t])
        Tz = T*z

        # rate updates
        @devec r += dt.*(I_teach - Ofr - 0.1.*r)

        # weight updates
        for i in 1:I
            @devec T[:, i] += dt.*1e-3.*(z[i].*r - T[:, i])
        end
        for n in 1:N
            @devec Of[:, n] += dt.*1e-3.*(Tz.*r[n] - Of[:, n])
        end
    end
end

# init parameters
N, I = 20, 2
dt = 1e-3

# init weights
T = rand(N, I)*N
A = rand(I, I)
Of = rand(N, N)/N

# simulation time & input
sim_T = 2000
ts = 0:dt:sim_T
cs = randn(size(ts, 1), I)

Timing the network (2,000,000 steps) with

 @time train_network(A, T, Of, cs, dt) 

gives timings

 3.420486 seconds (26.12 M allocations: 2.299 GB, 6.65% gc time) 

Update 1

Following David Sanders' advice, I got rid of the @devec macro and wrote the loops out explicitly. This indeed reduces array allocations and improves performance by about 25%; here are the new numbers:

 2.648113 seconds (18.00 M allocations: 1.669 GB, 5.60% gc time) 

The smaller the network size, the bigger the gain. A gist of the updated simulation code can be found here.
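In case the gist becomes unavailable, here is roughly what the devectorized updates inside the time loop look like when written as explicit loops (a sketch based on the vectorized code above, with Az, Ofr, I_teach and Tz computed as before):

# state and rate updates as plain loops, no temporary arrays
for i in 1:I
    z[i] += dt*(Az[i] + cs[t] - 0.5*z[i])
end
for n in 1:N
    r[n] += dt*(I_teach[n] - Ofr[n] - 0.1*r[n])
end

# weight updates as explicit double loops (inner loop over rows, column-major order)
for i in 1:I, n in 1:N
    T[n, i] += dt*1e-3*(z[i]*r[n] - T[n, i])
end
for n in 1:N, m in 1:N
    Of[m, n] += dt*1e-3*(Tz[m]*r[n] - Of[m, n])
end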

Update 2

A significant part of the memory allocation comes from the matrix-vector products. To get rid of those, I replaced these products with the in-place BLAS operation BLAS.gemv!, which reduces timings by another 25% and memory allocation by 90%:

 1.990031 seconds (2.00 M allocations: 152.589 MB, 0.69% gc time) 

Updated code here.
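For reference, the replacement of a product such as Az = A*z looks roughly like this (a sketch; the output vectors must be preallocated outside the time loop, and on current Julia versions BLAS lives in the LinearAlgebra standard library):

Az  = zeros(I)   # preallocated once, before the time loop
Ofr = zeros(N)

# inside the loop, instead of Az = A*z and Ofr = Of*r:
BLAS.gemv!('N', 1.0, A, z, 0.0, Az)    # Az  = 1.0*A*z  + 0.0*Az, in place
BLAS.gemv!('N', 1.0, Of, r, 0.0, Ofr)  # Ofr = 1.0*Of*r + 0.0*Ofr, in place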

Update 3

The rank-1 weight updates can also be replaced by two in-place BLAS calls, namely BLAS.scal! for the scaling and BLAS.ger! for the rank-1 update itself. The caveat is that both calls are pretty slow when more than one thread is used (a problem with OpenBLAS?), so it is best to set

 blas_set_num_threads(1) 

This gives a roughly 15% speedup for a network of size 20 and a 50% speedup for a network of size 50. There are no memory allocations left, and the new timings are

 1.638287 seconds (11 allocations: 1.266 KB) 

Again, the updated code can be found here.
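For completeness, here is a sketch of how the column-wise weight updates map onto the two BLAS calls (my reconstruction; writing c = dt*1e-3, the loops over columns are equivalent to the rank-1 updates T += c*(r*z' - T) and Of += c*(Tz*r' - Of)):

c = dt*1e-3
BLAS.scal!(length(T), 1.0 - c, T, 1)    # T  *= (1 - c)
BLAS.ger!(c, r, z, T)                   # T  += c * r * z'
BLAS.scal!(length(Of), 1.0 - c, Of, 1)  # Of *= (1 - c)
BLAS.ger!(c, Tz, r, Of)                 # Of += c * Tz * r'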

Update 4

I wrote a basic Cython script to compare the results so far. The main difference is that I do not use BLAS calls but explicit loops: injecting low-level BLAS calls is a pain in Cython, and numpy dot calls have too much overhead for small network sizes (I tried ...). The timing is

 CPU times: user 3.46 s, sys: 6 ms, total: 3.47 s, Wall time: 3.47 s 

which roughly matches the original Julia version (off which about 50% has since been shaved).

1 answer

Although you are using the Devectorize.jl package, I suggest just writing all these vectorized operations out explicitly as simple loops. I expect this will give you a significant performance boost.

The Devectorize package is certainly a great contribution, but to see the hoops it jumps through to do the dirty work for you, you can do something like this (an example from the package's README):

using Devectorize

a = rand(2,2); b = rand(2,2); c = rand(2,2);

julia> macroexpand(:(@devec r = exp(a + b) .* sum(c)))

Here macroexpand is a function that shows you the code into which the @devec macro expands its argument (the rest of the line). I won't spoil the surprise by posting the result here, but it is not just the simple for loop that you would write by hand.
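For comparison, the simple loop you would write by hand for that example is something like this (using the a, b, c defined above):

# hand-written equivalent of  r = exp(a + b) .* sum(c)
s = sum(c)
r = similar(a)
for i in 1:length(a)
    r[i] = exp(a[i] + b[i]) * s
end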

Also, the fact that you see huge allocations indicates that not all of the vectorized operations are being handled correctly.

By the way, do not forget to do a small warm-up run first, so that you are not timing the compilation step.
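For example, a minimal sketch (note that train_network mutates T and Of, so the warm-up call here uses copies):

# warm-up call on a few time steps, so that @time does not include JIT compilation
train_network(copy(A), copy(T), copy(Of), cs[1:10, :], dt)
@time train_network(A, T, Of, cs, dt)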

[Tangential note: here exp is a function that applies the usual exponential to each element of the matrix, equivalent to map(exp, a+b). expm gives the matrix exponential. There has been talk of deprecating this elementwise use of exp.]
