A parallel depth-first search in Erlang is slower than its sequential counterpart

I am trying to implement a modified parallel depth-first search algorithm in Erlang (let's call it *dfs_mod*).

All I want to get are the “dead ends”, i.e. the paths that are returned when *dfs_mod* visits a vertex with no neighbours or a vertex whose neighbours have all been visited already. I save every such path to ets_table1 if my user-defined function fun1(Path) returns true, and to ets_table2 if fun1(Path) returns false (I need to filter the resulting dead-end paths with some client-defined filter).
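A minimal sketch of how that dead-end handling could look; the table names, the tuple shape stored in ETS, and fun1/1 itself are taken from the description above, everything else is an assumption rather than the original code:

```erlang
%% Sketch: store a finished dead-end path in one of two ETS tables,
%% depending on the user-defined filter fun1/1 (assumed names and shapes).
save_dead_end(Path) ->
    case fun1(Path) of
        true  -> ets:insert(ets_table1, {Path});
        false -> ets:insert(ets_table2, {Path})
    end.
```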

I implemented a sequential version of this algorithm, and for some strange reason it performs better than the parallel one.

The idea behind the parallel implementation is simple (a rough sketch follows the list):

  • visit a Vertex from [Vertex|Other_vertices] = Unvisited_neighbours;
  • add this Vertex to the current path;
  • send {self(), wait} to the 'collector' process;
  • spawn a new process that runs *dfs_mod* on the unvisited neighbours of the current Vertex;
  • continue running *dfs_mod* on the remaining vertices (Other_vertices) in the current process;
  • when there are no more vertices to visit, send {self(), done} to the collector process and terminate.
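Here is roughly what one step of that scheme could look like; the function signature, the Collector argument and the visited-check are assumptions, not the original code:

```erlang
%% Rough sketch of one step of the parallel search.
dfs_mod(Graph, [Vertex | Other_vertices], Path, Collector) ->
    NewPath = [Vertex | Path],                          % add Vertex to the current path
    Collector ! {self(), wait},                         % tell the collector more work is coming
    Neighbours = digraph:out_neighbours(Graph, Vertex),
    Unvisited = [V || V <- Neighbours, not lists:member(V, NewPath)],
    %% search the subtree under Vertex in a new process...
    spawn(fun() -> dfs_mod(Graph, Unvisited, NewPath, Collector) end),
    %% ...and keep working on the remaining vertices in this process
    dfs_mod(Graph, Other_vertices, Path, Collector);
dfs_mod(_Graph, [], _Path, Collector) ->
    %% a dead end: this is where the path would be saved (cf. fun1/ETS above)
    Collector ! {self(), done},                         % nothing left to visit here
    ok.
```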

So, basically, every time I visit a vertex with unvisited neighbours, I spawn a new depth-first search process and then continue with the other vertices.

Immediately after spawning the first *dfs_mod* process, I start collecting all {Pid, wait} and {Pid, done} messages (a wait message keeps the collector waiting for the corresponding done message). After N milliseconds of waiting, the collector function returns ok.
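A collector along these lines might look roughly like this; the pure timeout-based stop mirrors the description above, and the function name and Timeout handling are assumptions:

```erlang
%% Sketch of a collector: drain {Pid, wait} / {Pid, done} messages; once no
%% message has arrived for Timeout milliseconds, assume all workers have
%% finished and return ok.
collect(Timeout) ->
    receive
        {_Pid, wait} -> collect(Timeout);
        {_Pid, done} -> collect(Timeout)
    after Timeout ->
        ok
    end.
```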


For some reason, this parallel implementation takes anywhere from 8 to 160 seconds, while the sequential version takes only 4 seconds (the test was a fully connected digraph with 5 vertices on a machine with an Intel i5 processor).

Here are my thoughts on why the performance is so poor:

  • I pass the digraph Graph to every new process that runs *dfs_mod*. Maybe calling digraph:out_neighbours(Graph) on the same digraph from many processes causes the slowdown?
  • I accumulate the current path in a list and pass it to each new *dfs_mod* process; maybe copying so many lists around is the problem?
  • I use an ETS table to save a path every time I visit a new vertex and add it to the path. The ETS options are [bag, public, {write_concurrency, true}], but maybe I am doing something wrong with it?
  • every time I visit a new vertex and add it to the path, I check the path with the custom function fun1() (it basically checks whether the path contains vertices labelled with the letter "n" before any vertices labelled with "m", and returns true/false accordingly; see the sketch after this list). Maybe this fun1() slows things down?
  • I tried running *dfs_mod* without collecting the done and wait messages at all, but htop shows a lot of Erlang activity for quite some time after *dfs_mod* returns ok in the shell, so I don't think the message passing is what slows it down.
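For reference, a check in the spirit of the fun1() described above might look roughly like this; the vertex naming (atoms such as n1, m2) and the exact rule are assumptions, since the real fun1/1 is not shown:

```erlang
%% Sketch of a fun1/1-style filter: true if a vertex whose name starts with
%% "n" occurs in the path before any vertex whose name starts with "m".
fun1(Path) ->
    n_before_m([hd(atom_to_list(V)) || V <- Path]).

n_before_m([$n | _])   -> true;
n_before_m([$m | _])   -> false;
n_before_m([_ | Rest]) -> n_before_m(Rest);
n_before_m([])         -> false.
```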

How can I make my parallel *dfs_mod* faster than its sequential counterpart?

Edit: when I run the parallel *dfs_mod*, pman shows no processes at all, although htop shows that all 4 CPU threads are busy.

1 answer

There is no quick way to find out without code, but here is a quick list of reasons why this might happen:

  • Perhaps you are confusing parallelism and concurrency. Erlang's model is not about parallelism at all; it aims at concurrency first (independent units of code). Parallelism is only an optimization of that (running some of those units of code at the same time). Usually, parallelism takes a higher-level form: say you want to run your sort function on 50 different structures, so you decide to run 50 instances of the sequential sort function (see the sketch after this list).

  • You may have synchronization points or sequential bottlenecks, effectively turning your parallel solution into a sequential one.

  • The overhead of copying data, context switching, and so on dwarfs the gains you get from parallelism. The former is especially true for large data sets that you split into smaller subsets and then join back into one large set. The latter is especially true for highly sequential code, as can be seen in process-ring benchmarks.
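To make the first point concrete, here is a small sketch of what "higher-level" parallelism could look like in Erlang: run the ordinary sequential function on many independent inputs at once, instead of splitting one call into sub-processes. The parallel_map/2 name and the 50-lists usage are illustrative, not from the question:

```erlang
%% Sketch: spawn one worker per input, each running the sequential F, then
%% collect the results in the original order via unique references.
parallel_map(F, Items) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn(fun() -> Parent ! {Ref, F(Item)} end),
                Ref
            end || Item <- Items],
    [receive {Ref, Result} -> Result end || Ref <- Refs].

%% Usage: sort 50 independent lists in parallel, each with sequential lists:sort/1.
%% Sorted = parallel_map(fun lists:sort/1, FiftyLists).
```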

If I wanted to optimize this, I would try to minimize messaging and data copying.

If I were the one working on this, I would keep the sequential version. It does what it says it should do, and once it is part of a larger system, as soon as you have more processes than cores, the parallelism will come from the many calls to the sort function rather than from splits inside the sort function. In the long run, a part of a server or service that uses the sequential version N times should have a smaller negative impact than a parallel one that ends up creating many, many more processes for the same task and risks overburdening the system more.

