My questions first:
- Is it possible to prevent Julia from copying variables on every iteration of a parallel for loop?
- If not, how can I implement parallel reduce operations in Julia?
Now the details:
I have this program:
```julia
data = DataFrames.readtable("...") # a big baby (~100MB)
filter_functions = [fct1, fct2, fct3 ...] # (x::DataFrame) -> y::DataFrame
filtered_data = @parallel vcat for fct in filter_functions
    fct(data)::DataFrame
end
```
It works as intended, but each parallel call to fct(data) on another worker copies the whole data frame, which makes everything very slow.
Ideally, I would like to load the data once and have each worker keep reusing its pre-loaded copy. I came up with this code for that:
```julia
@everywhere data = DataFrames.readtable("...") # a big baby (~100MB)
@everywhere filter_functions = [fct1, fct2, fct3 ...] # (x::DataFrame) -> y::DataFrame
@everywhere for i in 1:length(filter_functions)
    # round-robin: handle function i only on the matching worker
    if (myid() - 1) % nworkers() == i % nworkers()
        fct = filter_functions[i]
        filtered_data_temp = fct(data)
    end
    # How to vcat all the filtered_data_temp ?
end
```
But now I have another problem: I cannot figure out how to vcat() all the filtered_data_temp results into a single variable on the process with myid() == 1.
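For reference, here is the kind of pattern I have in mind, as an untested sketch: load the data once on every worker with @everywhere, and wrap the per-function work in a small helper (apply_filter is a name I made up) so that pmap only ships an integer index to each worker, and only the already-filtered (hence small) DataFrame travels back to process 1:

```julia
@everywhere using DataFrames
@everywhere data = DataFrames.readtable("...") # each worker loads its own copy once
@everywhere filter_functions = [fct1, fct2, fct3 ...] # same placeholders as above

# Hypothetical helper: runs on whichever worker pmap picks, and resolves
# `filter_functions` and `data` as that worker's own globals, so the big
# data frame itself is never shipped with the task.
@everywhere apply_filter(i) = filter_functions[i](data)::DataFrame

# Only the index i is sent out; only the filtered result comes back,
# and vcat stitches everything together on process 1.
filtered_data = vcat(pmap(apply_filter, 1:length(filter_functions))...)
```

If I understand Julia's serialization rules correctly, the crucial point is that data is only ever referenced inside a function already defined on the worker, never captured in a closure that the master has to serialize.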
Any hint would be much appreciated.
Note: I am aware of the existing question about working in parallel with a big, persistent data structure in Julia. However, I don't think it applies to my problem, because all my filter_functions really do operate on the array as a whole.
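Since each function needs the whole data set on a single worker anyway, the explicit variant my @everywhere loop above was groping toward might look like this (again an untested sketch, reusing the hypothetical apply_filter helper from the previous snippet):

```julia
# Pin function i to a worker by hand, round-robin over workers().
refs = Any[]
for i in 1:length(filter_functions)
    w = workers()[(i - 1) % nworkers() + 1]
    push!(refs, @spawnat w apply_filter(i))
end

# fetch() pulls each filtered DataFrame back to process 1, where vcat
# concatenates them into one variable.
filtered_data = vcat([fetch(r) for r in refs]...)
```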
parallel-processing julia-lang
Antoine trouve