Overlapping Transfers and Device Computation in OpenCL

I am getting started with OpenCL, and there is something I find hard to understand. I want to improve how images are transferred between the host and the device by overlapping transfers with computation. I drew a diagram to explain myself better.

Top: what I have now. Bottom: what I want. HtD (Host to Device) and DtH (Device to Host) are memory transfers. K1 and K2 are the kernels.

I was thinking about using mapped memory, but isn't the first transfer (Host to Device) done with the clSetKernelArg() command? Or do I need to split the input image into sub-images and use mapping to retrieve the output image?

Thanks.

Edit: Additional Information

K1 processes the input image and produces an output image; K2 then processes the output image from K1.

So I want to transfer MemInput in several parts for K1, and I want to read back and save on the host the MemOutput produced by K2.

+6

4 answers

As you may have already seen, you transfer data from the host to the device using clEnqueueWriteBuffer and similar commands.

All commands that have the "enqueue" keyword in their name share a special property: the commands are not executed directly; execution is triggered when you flush them using clFinish, clFlush, clEnqueueWaitForEvents, a blocking clEnqueueWriteBuffer, and a few others.

This means that everything happens at once (potentially concurrently), and you have to synchronize it using event objects. Since everything can happen at once, you could do something like this (each point happens at the same time):

  • Transfer data A
  • Process data A and transfer data B
  • Process data B, transfer data C, and retrieve data A'
  • Process data C and retrieve data B'
  • Retrieve data C'

Remember: enqueueing tasks without event objects can lead to simultaneous execution of all enqueued items!

To make sure "process data B" does not happen before transfer B has finished, you have to retrieve an event object from clEnqueueWriteBuffer and supply it as an event to wait on to clEnqueueNDRangeKernel:

 cl_event evt;
 clEnqueueWriteBuffer(... , bufferB , ... , ... , ... , bufferBdata , 0 , NULL , &evt);
 clEnqueueNDRangeKernel(... , kernelB , ... , ... , ... , ... , 1 , &evt , NULL);

Instead of supplying NULL, each command can of course wait on certain event objects and generate a new event object of its own. The next-to-last parameter is an array, so one command can wait on several events!
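Putting the pieces together, the double-buffered pipeline from the bullet list above could be sketched as follows. This is only pseudocode: the queue, buffer, kernel, and event names are made up for illustration, the "..." arguments are elided, and it assumes non-blocking (CL_FALSE) transfers plus a setup that allows the enqueued commands to actually run concurrently:

```
cl_event wA, wB, kA, kB, rA;

clEnqueueWriteBuffer(q, bufA, CL_FALSE, ..., dataA, 0, NULL, &wA);   // transfer A
clEnqueueWriteBuffer(q, bufB, CL_FALSE, ..., dataB, 0, NULL, &wB);   // transfer B
clEnqueueNDRangeKernel(q, kernel, ..., 1, &wA, &kA);                 // process A (waits on transfer A)
clEnqueueNDRangeKernel(q, kernel, ..., 1, &wB, &kB);                 // process B (waits on transfer B)
clEnqueueReadBuffer(q, bufA, CL_FALSE, ..., outA, 1, &kA, &rA);      // retrieve A' (waits on process A)
...
clFinish(q);   // flush and wait for everything
```

Each kernel launch waits only on the event of the write it depends on, so the runtime is free to overlap transfer B with the processing of A where the hardware supports it.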


EDIT: To summarize the comments below: which data-transfer command acts where?

                                   CPU               GPU
                                                 bufA      bufB
 array[] = {...}
 clCreateBuffer()          ----->               [     ]              // create (empty) buffer in GPU memory *
 clCreateBuffer()          ----->               [     ]   [     ]    // create second (empty) buffer in GPU memory *
 clEnqueueWriteBuffer()    -arr->               [array]   [     ]    // copy from CPU to GPU
 clEnqueueCopyBuffer()                          [array] -> [array]   // copy from GPU to GPU
 clEnqueueReadBuffer()     <-arr-               [array]   [array]    // copy from GPU to CPU

* You can initialize a buffer directly by providing the data via the host_ptr parameter.
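For the starred note above, a minimal sketch of creating a buffer that is initialized from host memory at creation time (error handling omitted; `ctx`, `n`, and `err` are assumed to exist):

```
float data[N] = { /* ... */ };
// CL_MEM_COPY_HOST_PTR copies the contents of host_ptr into the new buffer,
// so no separate clEnqueueWriteBuffer is needed for the initial upload.
cl_mem buf = clCreateBuffer(ctx,
                            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            N * sizeof(float),
                            data,      // host_ptr
                            &err);
```

With CL_MEM_USE_HOST_PTR instead, the implementation may keep using the host allocation directly rather than copying it.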

+5

Many OpenCL platforms do not support out-of-order command queues; the way most vendors suggest overlapping DMA and compute is to use multiple (in-order) command queues. You can use events to ensure that dependencies are executed in the correct order. NVIDIA has sample code that shows overlapped DMA and compute done this way (although it is suboptimal; it can go a little faster than they claim).

+3

When you create a command queue, you need to enable out-of-order execution in its properties. See CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE and clCreateCommandQueue.
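For reference, a sketch of creating such a queue with the OpenCL 1.x entry point (in OpenCL 2.0 and later, the same property is passed to clCreateCommandQueueWithProperties instead); `context` and `device` are assumed to already exist:

```
cl_int err;
cl_command_queue queue = clCreateCommandQueue(
    context, device,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,   // allow commands to execute out of order
    &err);
// If err == CL_INVALID_QUEUE_PROPERTIES, the property is valid
// but not supported by this device.
```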

This will let you set up your smaller task chains and link them to one another. All of this is done on the host.

host pseudo-code:

 for i in taskChainList
     enqueueWriteDataFromHost
     enqueueKernel(K1)
     enqueueKernel(K2)
     enqueueReadFromDevice
 clFinish

As you enqueue the tasks, put the previous task's cl_event into each task's event_wait_list. The "enqueueWriteDataFromHost" I have above, of course, should not have to wait for another event before it starts.

Alternatively,

 cl_event prevWriteEvent;
 cl_event newWriteEvent;
 for i in taskChainList
     enqueueWriteDataFromHost   // Pass *prevWriteEvent as the event_wait_list, and update it
                                // with the newWriteEvent that the enqueue function produces.
                                // Now each write will wait on the one before it.
     enqueueKernel(K1)
     enqueueKernel(K2)
     enqueueReadFromDevice      // The reads shouldn't come back out of order, but they could
                                // (if the last block of processing were much faster than the
                                // second-to-last, for example).
 clFinish
+2

The right way (the one I use, and it works great) is to create two command queues, one for I/O and another for processing. Both must be in the same context.

You can use events to control the scheduling of both queues, and the operations will execute in parallel (when they can). It really works, even if the device does not support an out-of-order queue.

For example, you can enqueue the transfers for all 100 images to the GPU in the I/O queue and get their events. Then set those events as triggers for the kernels. The DtoH transfers are triggered by the kernel events in turn. Even if you enqueue all of these jobs AT ONCE, they will be processed in order, with I/O running in parallel.
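In pseudocode, the two-queue scheme might look like this (the queue, buffer, and event names are illustrative, non-blocking calls are assumed, and the "..." arguments are elided):

```
for i in images
    enqueueWriteBuffer(ioQueue, in[i], ..., &writeEvt[i])                    // HtD on the I/O queue
    enqueueKernel(computeQueue, K1, ..., wait on writeEvt[i], &k1Evt[i])
    enqueueKernel(computeQueue, K2, ..., wait on k1Evt[i],   &k2Evt[i])
    enqueueReadBuffer(ioQueue, out[i], ..., wait on k2Evt[i], &readEvt[i])   // DtoH on the I/O queue
clFinish(ioQueue)
clFinish(computeQueue)
```

While computeQueue is running K1/K2 for image i, ioQueue can already be transferring image i+1, so transfers and compute overlap even though each queue is in-order.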

+2

Source: https://habr.com/ru/post/925221/
