Correct way to write kernel functions in CUDA?

I am about to start converting a program I wrote to CUDA, in the hope of increasing its processing speed.

Now, obviously, my old program executes many functions one after another. I have separated these functions out of my main program and call each of them in order:

    int main()
    {
        // initialization of variables
        function1();
        function2();
        function3();
        // print result
        return 0;
    }

These functions are inherently serial, because function2 depends on the results of function1.

OK, now I want to convert these functions into kernels and run the tasks within each function in parallel.

Is it as simple as rewriting each function to be parallel and then, in my main program, calling each kernel one after another? Is that slower than it needs to be? For example, can the GPU perform the next parallel operation without going back to the CPU to launch the next kernel?

Obviously, I will keep all run-time variables in GPU memory to limit the amount of data transfer, so should I even worry about the time it takes between kernel calls?

I hope this question is clear; if not, please ask me to clarify. Thanks.

And here is an additional question so I can sanity-check my reasoning. Ultimately, the input to this program is a video file, and each frame is pushed through the various functions to produce a result. My plan is to grab several frames at a time (for example, 8 unique frames), divide the total number of blocks I have among those 8 frames, and then have the threads within each block perform even more parallel operations on the image data, such as vector addition, Fourier transforms, etc.
Is this the right way to approach the problem?
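
To make the plan concrete, here is a rough sketch of the kind of indexing I have in mind (the names and the placeholder operation are made up; it is only meant to illustrate the frame-to-block mapping):

    // Sketch only: blockIdx.y selects which of the 8 buffered frames a block works on,
    // blockIdx.x / threadIdx.x parallelize the per-frame work (vector add, FFT, ...).
    __global__ void process_frames(const float* frames, float* results, int pixels_per_frame)
    {
        int frame = blockIdx.y;                             // one of the 8 frames
        int pixel = blockIdx.x * blockDim.x + threadIdx.x;  // element within that frame
        if (pixel < pixels_per_frame) {
            const float* my_frame = frames + frame * pixels_per_frame;
            results[frame * pixels_per_frame + pixel] = my_frame[pixel] * 2.0f;  // placeholder op
        }
    }

    // Host side: dim3 grid((pixels_per_frame + 255) / 256, 8);
    //            process_frames<<<grid, 256>>>(d_frames, d_results, pixels_per_frame);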

+7
3 answers

In some cases you can run a program at close to its full potential speed on the GPU with very little porting effort from a plain CPU version, and this may be one of those cases.

If you have a function like this:

    void process_single_video_frame(void* part_of_frame)
    {
        // initialize variables
        ...
        intermediate_result_1 = function1(part_of_frame);
        intermediate_result_2 = function2(intermediate_result_1);
        intermediate_result_3 = function3(intermediate_result_2);
        store_results(intermediate_result_3);
    }

and you can process many part_of_frames at the same time (say, a few thousand),

and function1(), function2() and function3() follow nearly the same code paths (that is, the program flow does not depend heavily on the contents of the frame),

then local memory may do all the work for you. Local memory is a type of memory that is physically stored in global memory, but it differs from global memory in a subtle yet profound way: the memory is simply interleaved so that adjacent threads access adjacent 32-bit words, which gives fully coalesced memory accesses as long as all threads read from the same index of their local memory arrays.

The flow of your program would be to start by copying your part_of_frame into a local array and preparing other local arrays for the intermediate results. You then pass pointers to these local arrays between the various functions of your code.

Some pseudocode:

    const int size_of_one_frame_part = 1000;

    __global__ void my_kernel(int* all_parts_of_frames)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Copy this thread's part of the frame from global to local memory.
        int my_local_array[size_of_one_frame_part];
        memcpy(my_local_array,
               all_parts_of_frames + i * size_of_one_frame_part,
               size_of_one_frame_part * sizeof(int));

        int local_intermediate_1[100];
        function1(local_intermediate_1, my_local_array);
        ...
    }

    __device__ void function1(int* dst, int* src)
    {
        ...
    }

So this approach may allow you to use your CPU functions almost unchanged, since the parallelism does not come from creating parallelized versions of your functions, but from running the entire chain of functions in parallel, one instance per thread. And this, in turn, is made possible by the hardware support for interleaving the memory of the local arrays.
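
To make this concrete, here is a slightly fuller (still simplified) sketch of the whole chain; the buffer sizes, function bodies and launch configuration are placeholders:

    #define PART_SIZE 1000
    #define RESULT_SIZE 100

    // The same CPU-style functions, now compiled as __device__ functions.
    __device__ void function1(int* dst, const int* src) { /* ... serial per-thread code ... */ }
    __device__ void function2(int* dst, const int* src) { /* ... */ }
    __device__ void function3(int* dst, const int* src) { /* ... */ }

    // Every thread runs the whole function chain on its own part of a frame.
    __global__ void process_all_parts(const int* all_parts, int* all_results, int num_parts)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= num_parts) return;

        int my_part[PART_SIZE];      // local memory, interleaved across threads by the hardware
        int tmp1[RESULT_SIZE];
        int tmp2[RESULT_SIZE];
        int result[RESULT_SIZE];

        // Copy this thread's part of the frame from global to local memory.
        for (int k = 0; k < PART_SIZE; ++k)
            my_part[k] = all_parts[i * PART_SIZE + k];

        function1(tmp1, my_part);
        function2(tmp2, tmp1);
        function3(result, tmp2);

        for (int k = 0; k < RESULT_SIZE; ++k)
            all_results[i * RESULT_SIZE + k] = result[k];
    }

    // Launched with e.g.:
    // process_all_parts<<<(num_parts + 127) / 128, 128>>>(d_parts, d_results, num_parts);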

Notes:

  • The initial copy of part_of_frame from global to local memory is not coalesced, but hopefully you will have enough computation to hide it.

  • On devices of compute capability <= 1.3, only 16 KiB of local memory is available per thread, which may not be enough for your part_of_frame and the other intermediate data. But on compute capability >= 2.0, this has been expanded to 512 KiB, which should be plenty.

+6

Answering some of your questions:

Calling a kernel is not that expensive, so don't be afraid of the program flow returning from the GPU to the CPU. As long as you keep your results in GPU memory, there will not be much overhead. If you want, you can write a kernel that simply calls other device functions in sequence. AFAIK this will be harder to debug and profile; I'm not sure you can even profile functions called from inside a kernel.
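
For example, the host-side flow could look roughly like this (the kernel names and sizes are only illustrative); the intermediate data never leaves the GPU, and the only transfers are one upload and one download:

    #include <cuda_runtime.h>

    // Sketch only: function1_kernel / function2_kernel / function3_kernel are placeholder names.
    void run_pipeline(const float* h_in, float* h_out, int n)
    {
        float *d_in, *d_tmp, *d_out;
        cudaMalloc(&d_in,  n * sizeof(float));
        cudaMalloc(&d_tmp, n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));

        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);   // one upload

        int threads = 256, blocks = (n + threads - 1) / threads;
        function1_kernel<<<blocks, threads>>>(d_tmp, d_in,  n);  // launches are asynchronous: the CPU
        function2_kernel<<<blocks, threads>>>(d_tmp, d_tmp, n);  // only queues them, and the GPU runs
        function3_kernel<<<blocks, threads>>>(d_out, d_tmp, n);  // them in order on the default stream

        cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost); // one download at the end

        cudaFree(d_in); cudaFree(d_tmp); cudaFree(d_out);
    }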

Regarding parallelization:

I think any idea that lets you run the computation across many data elements in parallel is a good one. The more your code resembles a shader, the better (meaning it has the characteristics needed to run fast on the GPU). The multi-frame idea is a good one. Some hints for it: minimize synchronization as much as possible, access memory as little as possible, try to increase the ratio of computation time to I/O time, make use of GPU registers and shared memory, and prefer many-reads-from-one over one-writes-to-many patterns.
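
As one illustration of the registers/shared-memory hint, a block can load a tile of data into shared memory once and then have many threads read it repeatedly (a minimal sketch, assuming 256 threads per block; the names are made up):

    // Each thread does one coalesced global read, then several cheap shared-memory reads.
    __global__ void blur_rows(const float* in, float* out, int n)
    {
        __shared__ float tile[258];                       // 256 elements plus one halo element per side
        int x = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x + 1] = (x < n) ? in[x] : 0.0f;   // one read from global memory per thread
        if (threadIdx.x == 0)
            tile[0] = (x > 0) ? in[x - 1] : 0.0f;
        if (threadIdx.x == blockDim.x - 1)
            tile[blockDim.x + 1] = (x + 1 < n) ? in[x + 1] : 0.0f;
        __syncthreads();                                  // one synchronization point per block

        if (x < n)                                        // neighbours are re-read from fast shared memory
            out[x] = (tile[threadIdx.x] + tile[threadIdx.x + 1] + tile[threadIdx.x + 2]) / 3.0f;
    }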

+5

If the GPU has enough resources to process the 3 functions in a single kernel, you can either put your functions into one big kernel or launch 3 kernels to run the functions separately. Performance-wise there is little difference, because a kernel launch has only a slight hardware overhead and low software overhead.

However, if GPU resources are insufficient, putting three functions into one kernel may sacrifice performance. In that case it is better to put each function into a separate kernel.
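
One rough way to judge whether resources are tight (assuming CUDA 6.5 or later; big_fused_kernel is a placeholder name) is to compile with nvcc --ptxas-options=-v to see register and shared-memory usage, and to query the occupancy of the fused kernel; very low occupancy suggests that splitting it into separate kernels may pay off:

    // Fragment only: big_fused_kernel stands in for your combined kernel.
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, big_fused_kernel, 256, 0);
    printf("Active blocks per SM at 256 threads/block: %d\n", blocks_per_sm);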

+1
