Is it possible to split CUDA jobs between the GPU and the CPU?

I am having trouble understanding how to split a workload between the GPU and the CPU. I have a large log file; I need to read each line and then run about 5 million operations on it (testing various scenarios). My current approach is to read a few hundred lines, put them in an array, and send them to each GPU. That works fine, but because there is so much work per line and so many lines, it takes a long time. I noticed that while this is happening, my CPU cores are basically doing nothing. I'm on EC2, so I have 2 quad-core Xeons and 2 Teslas; one core reads the file (runs the main program) and the GPUs do the work, so I'm wondering what I can do to pull the other 7 cores into the process?

I'm also a bit confused about how to design a program that balances tasks between the GPUs and the CPU, because they finish their tasks at different times, so I can't just send work to all of them at once. I was thinking about setting up a queue (I'm new to C, so I'm not sure if that's feasible), but is there any way to find out when a GPU job has completed (since I thought submitting jobs to CUDA was asynchronous)? My kernel is very similar to a regular C function, so converting it to run on the CPU is not a problem; balancing the work is the problem. I went through "CUDA by Example" again, but couldn't find anything about this kind of balancing.

Any suggestions would be great.

+4
2 answers

I think the key is to make the application multi-threaded, following all the usual best practices for that, and to have two types of worker threads: one that does its work on the GPU, and one that does its work on the CPU. So you need a thread pool and a queue.

http://en.wikipedia.org/wiki/Thread_pool_pattern

The queue can be very simple. You can have one shared integer that is the index of the current line in the log file. When a thread is ready to get more work, it locks that index, reads some number of lines from the log file starting at the line indicated by the index, increments the index by the number of lines it read, and then unlocks.
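Here is a minimal sketch of that shared index in C with pthreads. The names (`work_queue`, `fetch_chunk`) and the chunk size are illustrative, not from the original question:

```c
/* Sketch of the shared work queue: one mutex-protected integer that
 * tracks the next unclaimed line of the log file. */
#include <pthread.h>

#define CHUNK_SIZE 200          /* lines handed out per request (illustrative) */

typedef struct {
    pthread_mutex_t lock;       /* init with PTHREAD_MUTEX_INITIALIZER */
    long next_line;             /* index of the first line not yet claimed */
    long total_lines;           /* number of lines in the log file */
} work_queue;

/* Claim the next chunk of lines. Stores the first claimed line in *start
 * and returns the number of lines claimed (0 when the file is exhausted). */
static long fetch_chunk(work_queue *q, long *start)
{
    pthread_mutex_lock(&q->lock);
    long remaining = q->total_lines - q->next_line;
    long n = remaining < CHUNK_SIZE ? remaining : CHUNK_SIZE;
    if (n < 0)
        n = 0;
    *start = q->next_line;
    q->next_line += n;
    pthread_mutex_unlock(&q->lock);
    return n;
}
```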

When a worker thread is done with one chunk of the log file, it returns its results to the main thread and grabs another chunk (or exits if there are no lines left to process).
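As a sketch of what the GPU flavor of such a worker could look like, reusing `fetch_chunk()` from above: kernel launches are indeed asynchronous, but `cudaDeviceSynchronize()` blocks until the device has finished, which is one way for the thread to know its job is done. The kernel name and the buffer handling are placeholders:

```c
#include <cuda_runtime.h>

/* GPU worker loop (sketch). A CPU worker would have the same structure,
 * just processing the chunk with an ordinary C function instead. */
static void *gpu_worker(void *arg)
{
    work_queue *q = (work_queue *)arg;
    long start, n;

    while ((n = fetch_chunk(q, &start)) > 0) {
        /* ... copy lines [start, start + n) to the device ... */
        /* process_lines_kernel<<<blocks, threads>>>(dev_buf, n); */
        cudaDeviceSynchronize();   /* returns only once the kernel is done */
        /* ... copy results back and hand them to the main thread ... */
    }
    return NULL;
}
```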

The application launches some combination of GPU and CPU worker threads so that all available GPUs and CPU cores are used.

One problem you may run into is that if all the CPU cores are kept busy, GPU performance may suffer, since small delays creep into submitting new work to the GPUs and collecting their results. You may need to experiment with the number of threads and their affinity. For example, you may need to reserve one CPU core for each GPU by manipulating thread affinity.
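On Linux, reserving a core can be done with `pthread_setaffinity_np()`. A sketch, assuming glibc and that the core numbering is up to you:

```c
#define _GNU_SOURCE             /* must precede the includes for affinity APIs */
#include <pthread.h>
#include <sched.h>

/* Pin a thread to a single core so a GPU-feeding thread is not
 * starved by the CPU workers. Returns 0 on success. */
static int pin_to_core(pthread_t thread, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(thread, sizeof(set), &set);
}
```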

+4

Since you say the lines can be processed independently, you can divide the work between two separate processes:

- one CPU + GPU process
- one CPU-only process that uses the remaining 7 cores

You can start each process at a different offset - for example, the first process reads lines 1-50, 101-150, etc., and the second reads lines 51-100, 151-200, etc.
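The interleaving itself is a one-liner. A sketch, where the block size and the two-process count are illustrative:

```c
#define BLOCK 50                /* lines per block (illustrative) */
#define NUM_PROCS 2

/* Does this process own the given (1-based) line number?
 * Process 0 gets lines 1-50, 101-150, ...; process 1 gets 51-100, 151-200, ... */
static int owns_line(long line, int proc_id)
{
    return (((line - 1) / BLOCK) % NUM_PROCS) == proc_id;
}
```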

This avoids the headache of optimizing the CPU-GPU interaction.

+1

Source: https://habr.com/ru/post/1415732/

