Parallel computing in Julia with big data

My questions first:

  • Is it possible to stop Julia from copying variables to every worker on each iteration of a parallel for loop?
  • If not, how can I implement a parallel reduce operation in Julia?

Now the details:

I have this program:

    data = DataFrames.readtable("...") # a big baby (~100MB)
    filter_functions = [ fct1, fct2, fct3 ... ] # (x::DataFrame) -> y::DataFrame
    filtered_data = @parallel vcat for fct in filter_functions
        fct(data)::DataFrame
    end

It works fine functionally, but every parallel call to fct(data) on another worker copies the whole DataFrame, which makes everything very slow.

Ideally, I would like to load the data once and then reuse the pre-loaded copy on each worker every time. I came up with this code to do that:

    @everywhere data = DataFrames.readtable("...") # a big baby (~100MB)
    @everywhere filter_functions = [ fct1, fct2, fct3 ... ] # (x::DataFrame) -> y::DataFrame
    @everywhere for i in 1:length(filter_functions)
        if (myid()-1) == (i-1) % nworkers() + 1 # round-robin over the workers (ids 2..nworkers()+1)
            fct = filter_functions[i]
            filtered_data_temp = fct(data)
        end
        # How to vcat all the filtered_data_temp ?
    end

But now I have another problem: I can't work out how to vcat() all the filtered_data_temp into one variable on the process with myid() == 1.

Any insight would be much appreciated.

Note: I am aware of working in parallel with big, persistent data structures in Julia. However, I don't think that applies to my problem, because my filter_functions all operate on the array as a whole.

+8
parallel-processing julia-lang
2 answers

In the end, I found the solution to my question here: Julia: How to copy data to another processor in Julia.

In particular, it introduces the following primitive to fetch a variable from another process:

 getfrom(p::Int, nm::Symbol; mod=Main) = fetch(@spawnat(p, getfield(mod, nm))) 
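For example (a small illustration of my own, not taken from the linked answer), once a global has been defined on the workers you can pull a particular worker's copy of it back to the calling process:

    # Illustration only: define a global on every process, then fetch
    # worker 2's copy of it from the calling process.
    @everywhere x = myid() * 10
    getfrom(2, :x) # returns 20, i.e. the value of x on worker 2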

Below is how I use it:

    @everywhere data = DataFrames.readtable("...") # a big baby (~100MB)
    @everywhere filter_functions = [ fct1, fct2, fct3 ... ] # (x::DataFrame) -> y::DataFrame

    # Execute the filter functions; each worker keeps only its own results
    @everywhere local_results = ... # some type (e.g. an empty DataFrame), initialised once per process
    @everywhere for i in 1:length(filter_functions)
        if (myid()-1) == (i-1) % nworkers() + 1 # same round-robin assignment as in the question
            fct = filter_functions[i]
            filtered_data_temp = fct(data)
            global local_results = vcat(local_results, filtered_data_temp)
        end
    end

    # Concatenate all the local results on the master process
    all_results = ... # some type
    for wid in workers()
        worker_local_results = getfrom(wid, :local_results)
        all_results = vcat(all_results, worker_local_results)
    end
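As a side note, here is a sketch of my own (not part of the solution I found): since data and the filter functions are already defined with @everywhere, pmap can also do the scheduling, and the global data referenced inside the closure is resolved on each worker rather than shipped over the network:

    # Sketch only: assumes `data` and `filter_functions` were defined with
    # @everywhere as above, so `data` refers to the worker-local global.
    results = pmap(f -> f(data), filter_functions) # one DataFrame per filter function
    all_results = vcat(results...)                 # concatenate on the master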
+4

You might want to have a look at / load your data into Distributed Arrays.

EDIT: Maybe something like this:

    data = DataFrames.readtable("...")
    dfiltered_data = distribute(data) # distributes data among processes automagically
    filter_functions = [ fct1, fct2, fct3 ... ]
    for fct in filter_functions
        dfiltered_data = fct(dfiltered_data)::DataFrame
    end

You can also check the package's unit tests for more examples.
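To give a feel for the chunk-wise behaviour, here is a minimal sketch of my own (assuming a plain numeric array rather than a DataFrame, since distribute works on AbstractArrays):

    using DistributedArrays
    @everywhere using DistributedArrays

    A = rand(10^6)          # stand-in for the big dataset
    dA = distribute(A)      # splits the array across the workers
    dB = map(x -> 2x, dA)   # applied chunk by chunk; each worker only touches its local part
    B = convert(Array, dB)  # gather the result back on the master if needed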

+10
