Parallel computing in Julia with big data

My questions first:

  • Is it possible to stop Julia from copying variables to every worker on each iteration of a parallel for loop?
  • If not, how can I implement a parallel reduce operation in Julia?

Now the details:

I have this program:

    data = DataFrames.readtable("...") # a big baby (~100MB)
    filter_functions = [ fct1, fct2, fct3 ... ] # (x::DataFrame) -> y::DataFrame
    filtered_data = @parallel vcat for fct in filter_functions
        fct(data)::DataFrame
    end

It works fine functionally, but every parallel call to fct(data) on another worker copies the whole DataFrame, which makes everything very slow.

Ideally, I would like to load the data once and then reuse the pre-loaded copy on each worker every time. I came up with this code to do that:

    @everywhere data = DataFrames.readtable("...") # a big baby (~100MB)
    @everywhere filter_functions = [ fct1, fct2, fct3 ... ] # (x::DataFrame) -> y::DataFrame
    @everywhere for i in 1:length(filter_functions)
        if (myid()-1) == (i-1) % nworkers() + 1 # round-robin over the workers (ids 2..nworkers()+1)
            fct = filter_functions[i]
            filtered_data_temp = fct(data)
        end
        # How to vcat all the filtered_data_temp ?
    end

But now I have another problem: I can't work out how to vcat() all the filtered_data_temp into one variable on the process with myid() == 1.

Any insight would be much appreciated.

Note: I am aware of working in parallel with big, persistent data structures in Julia. However, I don't think that applies to my problem, because my filter_functions all operate on the array as a whole.

+8
parallel-processing julia-lang
2 answers

In the end, I found the solution to my question here: Julia: How to copy data to another processor in Julia.

In particular, it introduces the following primitive to fetch a variable from another process:

 getfrom(p::Int, nm::Symbol; mod=Main) = fetch(@spawnat(p, getfield(mod, nm))) 
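For example (a small illustration of my own, not taken from the linked answer), once a global has been defined on the workers you can pull a particular worker's copy of it back to the calling process:

    # Illustration only: define a global on every process, then fetch
    # worker 2's copy of it from the calling process.
    @everywhere x = myid() * 10
    getfrom(2, :x) # returns 20, i.e. the value of x on worker 2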

Below is how I use it:

    @everywhere data = DataFrames.readtable("...") # a big baby (~100MB)
    @everywhere filter_functions = [ fct1, fct2, fct3 ... ] # (x::DataFrame) -> y::DataFrame

    # Execute the filter functions; each worker keeps only its own results
    @everywhere local_results = ... # some type (e.g. an empty DataFrame), initialised once per process
    @everywhere for i in 1:length(filter_functions)
        if (myid()-1) == (i-1) % nworkers() + 1 # same round-robin assignment as in the question
            fct = filter_functions[i]
            filtered_data_temp = fct(data)
            global local_results = vcat(local_results, filtered_data_temp)
        end
    end

    # Concatenate all the local results on the master process
    all_results = ... # some type
    for wid in workers()
        worker_local_results = getfrom(wid, :local_results)
        all_results = vcat(all_results, worker_local_results)
    end
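As a side note, here is a sketch of my own (not part of the solution I found): since data and the filter functions are already defined with @everywhere, pmap can also do the scheduling, and the global data referenced inside the closure is resolved on each worker rather than shipped over the network:

    # Sketch only: assumes `data` and `filter_functions` were defined with
    # @everywhere as above, so `data` refers to the worker-local global.
    results = pmap(f -> f(data), filter_functions) # one DataFrame per filter function
    all_results = vcat(results...)                 # concatenate on the master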
+4

You might want to have a look at / load your data into Distributed Arrays.

EDIT: Maybe something like this:

    data = DataFrames.readtable("...")
    dfiltered_data = distribute(data) # distributes data among processes automagically
    filter_functions = [ fct1, fct2, fct3 ... ]
    for fct in filter_functions
        dfiltered_data = fct(dfiltered_data)::DataFrame
    end

You can also check the package's unit tests for more examples.
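To give a feel for the chunk-wise behaviour, here is a minimal sketch of my own (assuming a plain numeric array rather than a DataFrame, since distribute works on AbstractArrays):

    using DistributedArrays
    @everywhere using DistributedArrays

    A = rand(10^6)          # stand-in for the big dataset
    dA = distribute(A)      # splits the array across the workers
    dB = map(x -> 2x, dA)   # applied chunk by chunk; each worker only touches its local part
    B = convert(Array, dB)  # gather the result back on the master if needed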

+10
