Is there a better way to synchronize all invocations of a compute shader?

I want to implement an algorithm like the following in a compute shader:

  • for each pixel in the image (step 1)
    • compute something and store it in a temporary image
  • for each pixel in the image (step 2)
    • wait until all 8 of its neighboring pixels have completed step 1
    • read the temporary image at its 8 neighboring pixels
    • use those values to calculate the result

My workgroup layout is layout (local_size_x = 256) in; and I dispatch with glDispatchCompute(1, 256, 1);
Before reading the temporary image in step 2, each pixel requires all of its neighbors to have completed step 1. Therefore, I put a memoryBarrier() call between steps 1 and 2, since the OpenGL Programming Guide, 8th Edition says that the memory barrier functions apply globally, and not just within the same local workgroup.
But this does not work properly.

To demonstrate the result, consider a simplified but similar problem:

  • draw a black rectangle on a white image
  • for each pixel in the image
    • if it is black, store 1 in the temporary image
    • otherwise store 0 in the temporary image
  • for each pixel in the image
    • if it is black, or at least one of its 8 neighboring pixels is black, set it to black

Each run should grow the black rectangle by one pixel on every side. In practice, however, the rectangle becomes uneven as it grows.

So, does memoryBarrier() really wait until all invocations launched by the same glDispatchCompute call have finished their memory accesses?

After I implemented a lock between steps 2 and 3 of the simplified problem, the result was as expected. (Later, however, I discovered that this can sometimes crash the program by exceeding the Windows time-out limit: http://nvidia.custhelp.com/app/answers/detail/a_id/3007 )
(p is the current location and p + e[i] is the location of one of its 8 nearest pixels. Instead of image variables I use shader storage buffer objects, so I added the posi() function to convert an ivec2 to an array index.)

    bool finished;
    do
    {
        finished = true;
        for(int i = 1; i < 9; i++)
        {
            if(!outOfBound(p+e[i]) && lock[posi(p+e[i])] != 1)
            {
                finished = false;
            }
        }
    }while(!finished);

If I have misunderstood memoryBarrier() and it cannot do what I want, is there a better way to synchronize the invocations of a compute shader?

Update: adding the compute shader code

Here is my compute shader code for the black rectangle example described above.
tag is the image used to determine whether a pixel is black or white; it is initialized with a small black rectangle on a white background. temp is set to zero before this compute shader starts. The commented-out code is the lock described above. With this lock enabled, the shader gives the desired result.

    #version 430 core

    layout (local_size_x = 256) in;

    const ivec2 e[9] = {
        ivec2(0,0), ivec2(1,0), ivec2(0,1), ivec2(-1,0), ivec2(0,-1),
        ivec2(1,1), ivec2(-1,1), ivec2(-1,-1), ivec2(1,-1)
    };

    layout(std430, binding = 14) coherent buffer tag_buff  { int tag[];  };
    layout(std430, binding = 15) coherent buffer temp_buff { int temp[]; };
    layout(std430, binding = 16) coherent buffer lock_buff { int lock[]; };

    int posi(ivec2 point)
    {
        return point.y * 256 + point.x;
    }

    bool outOfBound(ivec2 p)
    {
        return p.x < 0 || p.x >= 256 || p.y < 0 || p.y >= 256;
    }

    void main()
    {
        ivec2 p = ivec2(gl_GlobalInvocationID.xy);
        int x = tag[posi(p)];
        temp[posi(p)] = x;
        //lock[posi(p)] = 1;
        memoryBarrier();

        //bool finished;
        //do
        //{
        //    finished = true;
        //    for(int i = 1; i < 9; i++)
        //    {
        //        if(!outOfBound(p+e[i]) && lock[posi(p+e[i])] != 1)
        //        {
        //            finished = false;
        //        }
        //    }
        //}while(!finished);

        // if it is black, or at least one of its 8 neighboring pixels is black,
        // set itself to black
        for(int i = 0; i < 9; i++)
        {
            if(!outOfBound(p+e[i]) && temp[posi(p+e[i])] == 1)
            {
                tag[posi(p)] = 1;
            }
        }
    }

Later, I copied lock into another SSBO after setting its elements to 1 and calling memoryBarrier(), then read that SSBO in a fragment shader and drew it on the screen. Some lock elements were not set to 1. I also tried image variables instead of SSBOs, in both the fragment shader and the compute shader, only to find that memoryBarrier and coherent made no difference. It seems that neither memoryBarrier nor coherent is working.

After reading several sources, I think I understand what is happening here. I state my understanding below; if it is wrong, please correct me.

memoryBarrier cannot synchronize invocations by synchronizing memory accesses. More specifically, all memoryBarrier does is wait for the completion of memory accesses that invocations have already issued. It will not wait for memory accesses whose code has not been executed yet, even if that code comes before the memoryBarrier in the source. The OpenGL Programming Guide says: "When memoryBarrier() is called, it ensures that any writes to memory that have been performed by the shader invocation have been committed to memory rather than lingering in caches or being scheduled after the call to memoryBarrier()." For example, suppose there are three invocations A, B and C. If A and B have both executed imageStore() on a coherent image variable, then a subsequent memoryBarrier() in A or B ensures that this imageStore() has changed the data in main memory, not only in the cache. But if C has not yet executed its imageStore() when A or B calls memoryBarrier(), that call will not wait until C executes its imageStore(). Therefore, memoryBarrier cannot help me implement the algorithm.

1 answer

I ran into a similar problem. I am not an expert, but I believe that I have found a good solution.

You have correctly identified that memoryBarrier ensures that previous writes are visible.

However, memoryBarrier on its own is almost useless, since it imposes no ordering on execution. So even with a memoryBarrier in place, some invocations may run to completion before others have even started. memoryBarrier cannot make visible writes that have not happened yet.
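The point above can be made concrete with a small CPU model in C. This is only a sketch under assumptions that are not in the original post: the image is shrunk to N = 16, each row stands for one workgroup (as local_size_x = 256 with glDispatchCompute(1, 256, 1) gives at the original size), and the workgroups are run strictly one after another, which is one ordering the specification permits but not necessarily what a real GPU does. The function name single_dispatch_sequential_rows is hypothetical.

```c
#include <string.h>

#define N 16  /* image side in this model; the question uses 256 */

/* Index helper mirroring posi() from the question. */
int posi(int x, int y) { return y * N + x; }

int out_of_bound(int x, int y) {
    return x < 0 || x >= N || y < 0 || y >= N;
}

/* Step 2 for one pixel: turn black if the pixel itself or any of its
   8 neighbours is marked 1 in temp. */
void step2(int *tag, const int *temp, int x, int y) {
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++)
            if (!out_of_bound(x + dx, y + dy) &&
                temp[posi(x + dx, y + dy)] == 1)
                tag[posi(x, y)] = 1;
}

/* Model of ONE dispatch of the single-shader version, with the
   workgroups (rows) executed one after another. */
void single_dispatch_sequential_rows(int *tag) {
    int temp[N * N];
    memset(temp, 0, sizeof temp);      /* temp is zeroed before the dispatch */
    for (int y = 0; y < N; y++) {      /* one workgroup per row */
        for (int x = 0; x < N; x++)    /* step 1 for the whole row */
            temp[posi(x, y)] = tag[posi(x, y)];
        /* memoryBarrier() would sit here: it publishes this row's writes,
           but row y + 1 has not executed step 1 yet, so its temp is still 0. */
        for (int x = 0; x < N; x++)    /* step 2 for the whole row */
            step2(tag, temp, x, y);
    }
}
```

Starting from a black square covering x and y in 6..9, this model grows the square sideways and toward larger y, but row y = 5 stays white, because row 6 had not yet written temp when row 5 read it. That is the kind of uneven growth described in the question.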

To fix this, there is barrier:

For any given static instance of barrier in a compute shader, all invocations within the same workgroup must enter it before any are allowed to continue beyond it.

Note the phrase "within the same workgroup": barrier does not synchronize across workgroups within one glDispatchCompute call; it only synchronizes within a workgroup.

Obviously, barrier does not help with your problem, so you implemented your own barrier, which has drawbacks:

  • The compiler / driver / scheduler does not know that this is a barrier, so it cannot optimize for it.
  • Your barrier uses a spin lock, which keeps the processor busy while waiting. This increases the run time, up to the point where the watchdog timer kicks in.

If the driver knew about the barrier, it could schedule the invocations that have not yet reached the barrier. With your solution, the driver schedules all invocations blindly, spending resources on those that are already waiting instead of launching those that have not yet reached the barrier.

What to do instead?

Solution

To get a barrier across all invocations, simply issue several glDispatchCompute calls, interleaved with the appropriate calls to glMemoryBarrier.

Splitting the work into multiple glDispatchCompute calls creates a barrier between them. glMemoryBarrier makes the writes of earlier dispatches visible to later ones.
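As a rough illustration of the split, here is a CPU model in C in which each "dispatch" is a loop that finishes for every pixel before the next one begins. It is a sketch, not the author's code: the names pass1, pass2 and grow_once are hypothetical, the image is shrunk to N = 16, and the GL calls appear only in comments to show where they would go on the host side, assuming the shader from the question is split into two programs that both use SSBOs.

```c
#define N 16  /* image side in this model; the question uses 256 */

/* Index helper mirroring posi() from the question. */
int posi(int x, int y) { return y * N + x; }

int out_of_bound(int x, int y) {
    return x < 0 || x >= N || y < 0 || y >= N;
}

/* Body of the first shader: each invocation copies its own tag to temp. */
void pass1(const int *tag, int *temp, int x, int y) {
    temp[posi(x, y)] = tag[posi(x, y)];
}

/* Body of the second shader: each invocation reads temp at itself and
   its 8 neighbours, and writes only its own tag. */
void pass2(int *tag, const int *temp, int x, int y) {
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++)
            if (!out_of_bound(x + dx, y + dy) &&
                temp[posi(x + dx, y + dy)] == 1)
                tag[posi(x, y)] = 1;
}

/* Grow the black region by one pixel using two separate "dispatches". */
void grow_once(int *tag, int *temp) {
    /* GPU side: glUseProgram(<first program>); glDispatchCompute(1, 256, 1); */
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            pass1(tag, temp, x, y);

    /* GPU side: glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
       makes the temp writes visible to the next dispatch. */

    /* GPU side: glUseProgram(<second program>); glDispatchCompute(1, 256, 1); */
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            pass2(tag, temp, x, y);

    /* GPU side: another glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT)
       before anything reads tag through an SSBO again. */
}
```

Because the first loop has written all of temp before the second loop reads any of it, the order in which pixels are visited inside either loop no longer matters. Calling grow_once repeatedly on a black square grows it by exactly one pixel on every side each time.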
