I have a large vector of vectors of strings: there are about 50,000 string vectors, each containing 2-15 strings of 1-20 characters.
MyScoringOperation is a function that operates on one vector of strings (one datum) and returns an array of 10,100 scores (as Float64s). MyScoringOperation takes roughly 0.01 seconds per call (depending on the length of the datum).
function MyScoringOperation(state::State, datum::Vector{String}) ... score::Vector{Float64}
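To make the shapes concrete, here is a hypothetical stand-in: the real State type and scoring logic are not shown in the question, so the dummy below only matches the described input/output types.

```julia
# Hypothetical stand-in for MyScoringOperation: the real scoring logic is not
# shown above, so this dummy only matches the described shapes -- one datum
# (a Vector{String}) in, a 10,100-element Vector{Float64} out.
function MyScoringOperation(state, datum::Vector{String})
    score = zeros(10100)                       # Vector{Float64} of length 10,100
    for s in datum
        score[mod1(length(s), 10100)] += 1.0   # dummy: bucket by string length
    end
    return score
end
```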
What I have amounts to a nested loop. The outer loop typically runs for 500 iterations:
data::Vector{Vector{String}} = loaddata()
state = init_state()
for ii in 1:500
    score_total = zeros(10100)
    for datum in data
        score_total += MyScoringOperation(state, datum)
    end
end
On one machine, on a small test case of 3,000 data items (rather than 50,000), each outer-loop iteration takes 100-300 seconds.
I have 3 powerful servers with Julia 0.3.9 installed (I can easily get 3 more, and then hundreds more at the next scale-up).
I have basic experience with @parallel, but it seems to spend a lot of time copying the state (and it more or less hangs on a smaller test case).
It looks like this:
data::Vector{Vector{String}} = loaddata()
state = init_state()
for ii in 1:500
    score_total = @parallel(+) for datum in data
        MyScoringOperation(state, datum)
    end
    state = update(state, score_total)
end
My understanding of how this @parallel implementation works is that, for each ii, it:

1. partitions data into a chunk for each worker
2. sends that chunk to each worker
3. has each worker process its chunk
4. sums the results in the main process as they become available
I would like to remove step 2, so that instead of sending a chunk of the data to each worker, I just send a range of indexes to each worker, and they look the data up in their own copy of data. Or better still, give each worker only its own chunk up front and have it reuse that chunk on every iteration (saving a lot of RAM).
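A hedged sketch of this "ship indexes, not data" idea. It is written against the modern Distributed stdlib (the 0.3-era Julia in the question has @everywhere/@spawnat built in without the `using` line), and loaddata / MyScoringOperation / init_state / update are replaced by tiny stubs so the sketch is self-contained and runs on a single process:

```julia
# Sketch only: each worker builds its OWN copy of `data` once, so each
# iteration only ships the (small) state and an index range, not the data.
using Distributed

@everywhere begin
    loaddata() = [["abc", "de"] for _ in 1:100]          # stub: 100 tiny datums
    MyScoringOperation(state, datum) = fill(1.0, 10100)  # stub scorer
    const data = loaddata()                              # per-worker copy, built once
    function score_chunk(state, idxs)
        total = zeros(10100)
        for i in idxs
            total += MyScoringOperation(state, data[i])
        end
        return total
    end
end

init_state() = 0                        # stub
update(state, score_total) = state + 1  # stub

function main(niter)
    state = init_state()
    ws = workers()
    n = length(data)
    m = length(ws)
    # One contiguous index chunk per worker, fixed for the whole run.
    chunks = [round(Int, (j-1)*n/m)+1 : round(Int, j*n/m) for j in 1:m]
    score_total = zeros(10100)
    for ii in 1:niter
        # Only state and an index range travel to each worker here.
        futures = map(j -> @spawnat(ws[j], score_chunk(state, chunks[j])), 1:m)
        score_total = reduce(+, map(fetch, futures))
        state = update(state, score_total)
    end
    return state, score_total
end
```

On a real cluster you would addprocs(...) first; the point is that data is loaded once per worker and reused across all 500 outer iterations, instead of being reserialized and resent every time.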
Profiling supports my belief about how @parallel functions. For a problem of similar scope (with even less data), the non-parallel version runs in 0.09 seconds, while the parallel version takes 185 seconds. The profiler shows that almost 100% of that time is spent on network IO.