How can I find which jobs in the dependency tree can run in parallel?

I am currently using a graph to store dependencies, and then run all the vertices that have no dependencies. It works, but it feels stupid. Is there a better algorithm or data structure that I should use?

#!/usr/bin/perl
use strict;
use warnings;
use Graph;

#FIXME: naive implementation, there may be a much better way to do this
sub run_in_parallel {
    my $g = shift->copy;
    while (my @v = $g->vertices) {
        my @run = grep { $g->is_successorless_vertex($_) } @v;
        print "running ", join(", ", @run), " in parallel\n";
        for my $component (@run) {
            $g->delete_vertex($component);
        }
    }
}

my $g = Graph->new;
while (<DATA>) {
    my ($component, @dependencies) = split;
    $g->add_vertex($component) unless $g->has_vertex($component);
    for my $dependency (@dependencies) {
        $g->add_vertex($dependency) unless $g->has_vertex($dependency);
        $g->add_edge($component, $dependency);
    }
}
run_in_parallel($g);

# component dependency list
__DATA__
a b c d
b e
c f g
d
e
f
g
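The wave-by-wave approach in the script above is essentially a level-by-level run of Kahn's topological-sort algorithm. Here is a minimal, language-agnostic sketch in Python rather than Perl (all names are mine, not from the code above), using the same dependency data:

```python
def parallel_waves(deps):
    """deps maps each job to the set of jobs it depends on.
    Returns a list of waves; all jobs within one wave can run in parallel."""
    # Track the unfinished dependencies of every job.
    remaining = {job: set(d) for job, d in deps.items()}
    waves = []
    while remaining:
        # Jobs whose dependencies have all finished are ready to run.
        ready = sorted(j for j, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        for j in ready:
            del remaining[j]
        # Mark the finished wave as satisfied for everyone else.
        for d in remaining.values():
            d.difference_update(ready)
    return waves

deps = {
    "a": {"b", "c", "d"},
    "b": {"e"},
    "c": {"f", "g"},
    "d": set(), "e": set(), "f": set(), "g": set(),
}
print(parallel_waves(deps))  # [['d', 'e', 'f', 'g'], ['b', 'c'], ['a']]
```

This is the same O(V + E)-per-pass idea as grepping for successorless vertices, just expressed with explicit dependency counts.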
+6
algorithm data-structures perl
6 answers

Another idea is to use a Petri net. The nodes in your schedule become its places, and the actions become its transitions. Each place should have exactly one enabling transition, even if it has no dependencies; that way you do not even need to place initial tokens.

This also covers the proposal by Karl Bielefeld.
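As a rough illustration of this encoding (a toy token game, not a real Petri-net library; the naming is mine): each job is a transition with one output place, and a transition fires once every one of its input places, i.e. the output places of its dependencies, holds a token.

```python
def petri_schedule(deps):
    """Fire transitions (jobs) whose input places all hold a token;
    firing a job puts a token in its single output place."""
    tokens = set()           # places currently holding a token; place name == job name
    fired, order = set(), []
    while len(fired) < len(deps):
        # A transition is enabled when all of its input places are marked.
        enabled = sorted(j for j in deps
                         if j not in fired and set(deps[j]) <= tokens)
        if not enabled:
            raise ValueError("cycle: no transition enabled")
        order.append(enabled)            # these can fire concurrently
        for j in enabled:
            fired.add(j)
            tokens.add(j)                # token appears in j's output place
    return order

deps = {"a": ["b", "c", "d"], "b": ["e"], "c": ["f", "g"],
        "d": [], "e": [], "f": [], "g": []}
print(petri_schedule(deps))  # [['d', 'e', 'f', 'g'], ['b', 'c'], ['a']]
```

Dependency-free jobs have no input places, so they are enabled from the start with an empty marking, which is the point made above about not needing initial tokens.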

+2

You can run in parallel any tasks that have no unfinished dependencies. For example, in the data set shown, you can run d, e, f, and g in parallel at the start. When f and g have finished, you can run c, even if d and e are still running, and so on. All your algorithm needs to do is, every time a task ends, mark it as completed and re-evaluate whether any tasks have now become available to launch.
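A sketch of that re-evaluation loop, in Python for brevity (the Scheduler class and its method names are invented for illustration, not taken from any library):

```python
class Scheduler:
    """Track unfinished dependencies; re-evaluate readiness on each completion."""
    def __init__(self, deps):
        # deps maps each job to the set of jobs it depends on.
        self.remaining = {j: set(d) for j, d in deps.items()}
        self.started = set()

    def ready(self):
        # Jobs with no unfinished dependencies that have not been started yet.
        return sorted(j for j, d in self.remaining.items()
                      if not d and j not in self.started)

    def start(self, job):
        self.started.add(job)

    def complete(self, job):
        # Mark the job done and report which jobs just became runnable.
        del self.remaining[job]
        self.started.discard(job)
        for d in self.remaining.values():
            d.discard(job)
        return self.ready()

deps = {"a": {"b", "c", "d"}, "b": {"e"}, "c": {"f", "g"},
        "d": set(), "e": set(), "f": set(), "g": set()}
s = Scheduler(deps)
print(s.ready())            # ['d', 'e', 'f', 'g']
for j in s.ready():
    s.start(j)
print(s.complete("f"))      # [] -- g is still unfinished, so c cannot start yet
print(s.complete("g"))      # ['c'] -- c can start while d and e are still running
```

Note how c becomes runnable as soon as f and g are done, exactly as described above, without waiting for d and e.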

+1

I do not think this is kludgy, except that the syntax is a bit verbose. You can fix this with shell functions.

You handle dependencies that are given out of order, but you do not yet handle cycles.

It describes what needs to be done very clearly, in a way that makes it easy to verify that the implementation is correct.

One could also build the reversed graph and traverse it, e.g. with Graph::Traversal::BFS.

Perhaps I would not use Graph at all, but would represent the graph as a hash of hashrefs, unless the graph is also needed for other purposes (for example, to be printed as a diagram, or analyzed or rewritten with graph algorithms).

You might want to add cycle detection.
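For that suggested cycle detection, a standard approach is a three-color depth-first search; here is a Python sketch (the function name is mine):

```python
def find_cycle(deps):
    """Three-color DFS over deps (job -> list of dependencies).
    Returns one dependency cycle as a list of jobs, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2     # unvisited / on current path / finished
    color = {j: WHITE for j in deps}
    stack = []                        # the current DFS path

    def visit(j):
        color[j] = GRAY
        stack.append(j)
        for d in deps.get(j, ()):
            if color.get(d, WHITE) == GRAY:       # back edge -> cycle found
                return stack[stack.index(d):] + [d]
            if color.get(d, WHITE) == WHITE:
                found = visit(d)
                if found:
                    return found
        stack.pop()
        color[j] = BLACK
        return None

    for j in deps:
        if color[j] == WHITE:
            found = visit(j)
            if found:
                return found
    return None

print(find_cycle({"a": ["b"], "b": ["c"], "c": ["a"]}))  # ['a', 'b', 'c', 'a']
print(find_cycle({"a": ["b"], "b": []}))                 # None
```

Alternatively, in the wave-based algorithm a cycle shows up for free: the graph is non-empty but no vertex is successorless, so the while loop can simply die at that point.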

+1

It just so happens that I was thinking about this recently while answering someone else's question.

Your algorithm looks right overall, though I'm not sure there is a more efficient way to solve it. One thing, however: you should not pre-compute the batches the way your sample code does. As written, it runs wave by wave: if A, B and C can start at the very beginning, it starts all three and waits for all three to complete before looking further down the list. But what if D depends only on A, and A is the first of the three to finish, before B and C? You lose the opportunity to start D right away. So you really want to keep re-evaluating the schedule as tasks complete: every time a task ends, check again which tasks have become runnable, rather than working from a pre-computed lineup.
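To see what the batch-by-batch behaviour can cost, here is a small simulation in Python (the scenario and all names are hypothetical, chosen only to illustrate the point) comparing the two policies with unlimited workers:

```python
import heapq

def makespan(deps, dur, batch):
    """Simulate scheduling with unlimited workers. batch=True waits for a
    whole wave to finish before starting anything new (the pre-computed-waves
    behaviour); batch=False starts a job the moment its deps are done."""
    remaining = {j: set(d) for j, d in deps.items()}
    running = []                      # min-heap of (finish_time, job)
    now = 0.0

    def start_ready():
        # Launch every job whose dependencies are all finished.
        for j in sorted(j for j, d in remaining.items() if not d):
            del remaining[j]
            heapq.heappush(running, (now + dur[j], j))

    start_ready()
    while running:
        if batch:                     # drain the entire current wave first
            while running:
                now, j = heapq.heappop(running)
                for d in remaining.values():
                    d.discard(j)
        else:                         # react to each individual completion
            now, j = heapq.heappop(running)
            for d in remaining.values():
                d.discard(j)
        start_ready()
    return now

deps = {"A": set(), "B": set(), "C": set(), "D": {"A"}}
dur = {"A": 1, "B": 3, "C": 3, "D": 1}
print(makespan(deps, dur, batch=True))   # 4.0 -- D also waits for B and C
print(makespan(deps, dur, batch=False))  # 3.0 -- D starts as soon as A ends
```

With these durations, reacting per completion shaves a full time unit off the schedule, because D overlaps with the still-running B and C.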

I actually started writing an answer using Algorithm::Dependency::Ordered, which is very much in the spirit of Ether's answer. It is lighter on memory than Graph, but I just realized it partly suffers from the same usage problem. I still think it can be made to work; if so, I will update this answer. :)

+1

Honestly, I do not know if I would bother with a full graph representation here; instead, I would keep the tasks in either a list or a hash (depending on whether you need random access, or whether you sort by priority or some other constraint) and store each task's direct dependencies in a field of the task itself.

In fact, you do not really want to pre-compute which tasks can run in parallel, since that list will constantly change as tasks complete. Instead, just start everything you can right away, and whenever resources become available, dynamically look for the next best candidate.

 # pseudocode:
 use List::MoreUtils 'any';

 while (my $job = get_next_job_that_I_would_ideally_run_now()) {
     # any { not } is subtly different than not all {} -- see how empty lists are handled!
     next if any { not $_->finished() } $job->get_dependent_jobs();

     # Job has no dependencies, or they are all complete: safe to run.
     # This probably involves forking or spawning a subthread to perform the
     # execution; when it is done, the finished() state will be set so the
     # jobs that depend on it can then run.
     $job->execute;

     # If no more resources are available for starting more jobs, wait here
     # until something is freed up; then continue the loop and look for the
     # next best job to start.
 }
0

It seems to me that you can get rid of most of the vertex grepping by tracking the end nodes as you add them to the graph, and again as you process the node queue.

 if (@dependencies) {
     for my $dependency (@dependencies) {
         unless ($g->has_vertex($dependency)) {
             $g->add_vertex($dependency);
         }
         $g->add_edge($component, $dependency);
     }
 }
 else {
     push @end_components, $component;
 }

Then, once you have processed your data set, start one thread of execution (however you implement your parallelism) for each end node. When a thread finishes, add to the queue any predecessors of that node which have no other successors left. Keep processing the queue until both the graph and the queue are empty.

 sub run_in_parallel {
     my $g            = shift->copy;
     my $end_vertices = shift;
     my @queue        = @$end_vertices;

     while ($g->vertices) {
         print "running ", join(", ", @queue), " in parallel\n";
         for my $component (splice @queue) {
             # ... when the process for $component has finished:
             my @predecessors = $g->predecessors($component);
             $g->delete_vertex($component);
             push @queue, grep { $g->is_successorless_vertex($_) } @predecessors;
         }
         # Add some error check for non-empty graph + empty queue + no
         # processes waiting to finish.
     }
 }

This change adds a bit of bookkeeping, and is somewhat more fragile than your original approach. Consider encapsulating these behaviors in an object that either contains or inherits from Graph.

The biggest performance benefit will come with large dependency graphs. With small graphs you will see minimal improvement, since the cost of grepping the vertices is tiny compared to the cost of managing the processes.

This code has not been fully tested.

0
