Opencl priority queue

Question

Opencl priority queue

I am trying to implement an A-Star search algorithm in OpenCL and cannot find a way to implement a priority queue for it. Here is a general idea of what I'm trying to do in my .cl file

//HOW TO IMPLEMENT THESE?? void extractFromPriorityQueue(); void insertIntoPriorityQueue(); //HOW TO IMPLEMENT THESE?? __kernel void startAStar(//necessary inputs) { int id = get_global_id(0); int currentNode = extractFromPriorityQueue(priorityQueueArray,id); if(currentNode==0){ return; } int earliest_edge = vertexArray[currentNode-1]; int next_vertex_edge = vertexArray[currentNode]; for(int i=earliest_edge;i<next_vertex_edge;i++){ int child = edgeArray[i]; float weight = weightArray[i]; gCostArray[child-1] = gCostArray[currentNode] + weight; hCostArray[child-1] = computeHeuristic(currentNode,child,coordinateArray); fCostArray[child-1] = gCostArray[child-1] + hCostArray[child-1]; insertIntoPriorityQueue(priorityQueueArray,child); } }

Also, is it necessary to synchronize the priority queue in this case?

+4

a-star opencl

user949110 Nov 30 '12 at 7:14

source share

2 answers

Tim child · Answer 1 · 2013-06-29T20:48:18+0000

Below are links to paper, pptx, and the source for various non-blocking data blocks, including a skip list and a priority queue. However, the source code is CUDA. CUDA code is close enough to OpenCL, and you can understand the essence of its implementation in OpenCL.

The priority queue is synchronized using atomic operations. Queue nodes are distributed on the host and transferred as a global array of nodes to the functions. The new node is obtained using the atomic increment of the array counter.

Nodes are inserted into the queue using atomic comparisons and call exchanges. Paper and ppx explain the work and problems of concurrency.

http://www.cse.iitk.ac.in/users/mainakc/projects.html

See the entry on the page above.

Concurrent Programming / Runtime Support [ICPADS 2012] [PDF] [Source] [Voice Slides (PPTX)] Prabhakar Misra and Manak Chaudhuri. Performance evaluation of parallel, loose data structures on GPUs. Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems, pp. 53-60, December 2012

Link to the source code http://www.cse.iitk.ac.in/users/mainakc/lockfree.html

gagallo7 · Answer 2 · 2014-01-05T16:51:09+0000

There is another way if atomic operations are not supported. You can use the idea to parallelize the Dijkstra shortest path algorithm from the article by Harish and Narayan. Acceleration of large graph algorithms on a GPU using CUDA .

They suggest duplicating arrays for synchronization with the idea of Mask, Cost, and Update arrays.

The mask is a unique boolean array to represent a queue with this size. If the i element in this array is true, the i element is in the queue.

There are two cores that guarantee synchronization:

The first core reads only the values from the Cost array and only writes to the update arrays.
The second core updates only the cost if they differ from Update. In this case, the updated item will be installed correctly.

The idea works, and there is an implementation made for an example from the OpenCL programming book: https://code.google.com/p/opencl-book-samples/source/browse/trunk/src/Chapter_16#Chapter_16%2FDijkstra .

Opencl priority queue

More articles: