As you may have seen already, you transfer data from the host to the device using clEnqueueWriteBuffer and the like.
All commands that have the keyword "enqueue" in their name share a special property: they are not executed immediately, but only once you trigger the queue, e.g. with clFinish, clFlush, clEnqueueWaitForEvents, by calling clEnqueueWriteBuffer in blocking mode, and a few more.
This means that everything (potentially) happens at the same time, and you have to synchronize it yourself using event objects. Since everything may run concurrently, you could do something like this (each point happens at the same time):
- Transfer data A
- Process data A & transfer data B
- Process data B & transfer data C & retrieve data A'
- Process data C & retrieve data B'
- Retrieve data C'
Remember: submitting tasks without event objects may lead to simultaneous execution of every command in the queue!
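Here is a minimal sketch of such an overlap. It assumes an out-of-order command queue (on an in-order queue the commands would simply run back to back) and already-created buffers and kernels; all names (queue, kernelA, bufA, hostA, ...) are placeholders:

    #include <CL/cl.h>

    /* Sketch: overlap the upload of chunk B with the processing of chunk A.
       All handles are assumed to be valid; error checks omitted. */
    void pipeline(cl_command_queue queue,
                  cl_kernel kernelA, cl_kernel kernelB,
                  cl_mem bufA, cl_mem bufB,
                  const float *hostA, const float *hostB,
                  size_t bytes, size_t gws)
    {
        cl_event wroteA, wroteB, ranA, ranB;

        /* Non-blocking uploads: only guaranteed done once the queue is synchronized. */
        clEnqueueWriteBuffer(queue, bufA, CL_FALSE, 0, bytes, hostA, 0, NULL, &wroteA);
        clEnqueueWriteBuffer(queue, bufB, CL_FALSE, 0, bytes, hostB, 0, NULL, &wroteB);

        /* Each kernel waits only for "its" upload, so processing A may
           overlap with the transfer of B. */
        clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &gws, NULL, 1, &wroteA, &ranA);
        clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &gws, NULL, 1, &wroteB, &ranB);

        clFinish(queue);  /* block until everything enqueued above has completed */

        clReleaseEvent(wroteA); clReleaseEvent(wroteB);
        clReleaseEvent(ranA);   clReleaseEvent(ranB);
    }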
To make sure "process data B" does not happen before transfer B, you have to retrieve an event object from clEnqueueWriteBuffer and supply it as an event to wait for to, for instance, clEnqueueNDRangeKernel:
    cl_event evt;
    clEnqueueWriteBuffer(..., bufferB, ..., ..., ..., bufferBdata, 0, NULL, &evt);
    clEnqueueNDRangeKernel(..., kernelB, ..., ..., ..., ..., 1, &evt, NULL);
Instead of supplying NULL, each command can of course wait on certain events AND generate a new event object of its own. The second-to-last parameter is an array, so you can wait for several events at once, as shown in the sketch below.
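For example, a kernel that consumes two buffers can be made to wait for both uploads at once (a sketch with placeholder names):

    #include <CL/cl.h>

    /* Sketch: the kernel must not start before *both* input buffers have arrived. */
    void wait_for_two(cl_command_queue queue, cl_kernel kernel,
                      cl_mem bufA, cl_mem bufB,
                      const void *dataA, const void *dataB,
                      size_t bytes, size_t gws)
    {
        cl_event uploads[2];
        clEnqueueWriteBuffer(queue, bufA, CL_FALSE, 0, bytes, dataA, 0, NULL, &uploads[0]);
        clEnqueueWriteBuffer(queue, bufB, CL_FALSE, 0, bytes, dataB, 0, NULL, &uploads[1]);

        /* The wait list is an array: pass its length and its base pointer. */
        cl_event done;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 2, uploads, &done);

        clWaitForEvents(1, &done);  /* host-side wait for the kernel to finish */
        clReleaseEvent(uploads[0]);
        clReleaseEvent(uploads[1]);
        clReleaseEvent(done);
    }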
EDIT: To summarize the comments below

Data transfer: which command acts where?

    CPU                              GPU
                                     BufA        BufB
    array[] = {...}
    clCreateBuffer()       ----->    [     ]                 // Create (empty) buffer in GPU memory *
    clCreateBuffer()       ----->    [     ]     [     ]     // Create a second (empty) buffer *
    clEnqueueWriteBuffer() -arr->    [array]     [     ]     // Copy from CPU to GPU
    clEnqueueCopyBuffer()            [array] ->  [array]     // Copy from GPU to GPU
    clEnqueueReadBuffer()  <-arr-    [array]     [array]     // Copy from GPU to CPU
* You can also create a buffer that is initialized with your data right away by supplying the host_ptr parameter.
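Spelled out as code, this is roughly the following (a sketch assuming a valid context ctx and queue queue; error checks omitted):

    #include <CL/cl.h>

    void round_trip(cl_context ctx, cl_command_queue queue)
    {
        float array[16] = { 1.0f, 2.0f, 3.0f };     /* host data */
        const size_t bytes = sizeof array;

        /* Create (empty) buffers in GPU memory */
        cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, NULL);
        cl_mem bufB = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, NULL);

        /* CPU -> GPU (blocking write) */
        clEnqueueWriteBuffer(queue, bufA, CL_TRUE, 0, bytes, array, 0, NULL, NULL);
        /* GPU -> GPU */
        clEnqueueCopyBuffer(queue, bufA, bufB, 0, 0, bytes, 0, NULL, NULL);
        /* GPU -> CPU (blocking read, so `array` holds the result afterwards) */
        clEnqueueReadBuffer(queue, bufB, CL_TRUE, 0, bytes, array, 0, NULL, NULL);

        /* (*) Or create a buffer that is initialized from host data right away: */
        cl_mem bufC = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     bytes, array, NULL);

        clReleaseMemObject(bufA);
        clReleaseMemObject(bufB);
        clReleaseMemObject(bufC);
    }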