HLSL Branch Avoidance

I have a shader in which I want to move half of the vertices in the vertex shader. I am trying to decide the best way to do this from a performance standpoint, because we are dealing with well over 100,000 vertices, so speed is critical. I have looked at 3 different methods. (This is pseudocode, but enough to give you the idea. I can't give out the <complex formula>, but I can say that it involves a sin() call and a function call (it just returns a number, but it is still a function call), as well as a bunch of basic floating-point arithmetic.)

 if (y < 0.5) { x += <complex formula>; } 

This has the advantage that the <complex formula> is only evaluated half the time, but the disadvantage is that it definitely causes a branch, which may actually be slower than the formula itself. It is the most readable option, but in this context we care more about speed than readability.

 x += step(y, 0.5) * <complex formula>; 

Using the HLSL function step() (which returns 1 if the first parameter is less than or equal to the second, and 0 otherwise) eliminates the branch, but now the <complex formula> is evaluated every time, and its result is multiplied by 0 (so the effort is wasted) half the time.

 x += (y < 0.5) ? <complex formula> : 0; 

This one I don't know about. Does ?: cause a branch? And if not, are both sides of the expression evaluated, or only the relevant one?

The final possibility is that the <complex formula> could be moved back to the CPU instead of the GPU, but I worry that the CPU will be slower at computing sin() and the other operations, which could result in a net loss. It also means that one more number has to be passed to the shader, which could add overhead as well. Does anyone have any insight into which approach would be best?


Addendum:

According to http://msdn.microsoft.com/en-us/library/windows/desktop/bb509665%28v=vs.85%29.aspx, the step() function uses ?: internally, so it is probably no better than my third solution, and potentially worse, since the <complex formula> is definitely evaluated every time, whereas with a plain ?: it might be evaluated only half the time. (Nobody has answered that part of the question yet.) To avoid both of them, though:

 x += (1.0 - y) * <complex formula>; 

may well be better than any of them, since no comparison is made anywhere. (And y is always either 0 or 1.) It still evaluates the <complex formula> needlessly half the time, but that may be worth it to avoid branches altogether.
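For concreteness, the four candidates above can be written out as complete HLSL functions. This is only a sketch: `ComplexFormula` and `Helper` are hypothetical stand-ins for the undisclosed <complex formula> (a sin() call, a function call that returns a number, and some arithmetic), and the function names are made up for illustration.

```hlsl
// Hypothetical stand-ins for the undisclosed <complex formula>.
float Helper() { return 2.0; }   // "just returns a number"

float ComplexFormula(float x)
{
    return sin(x * 3.0) * Helper() + 0.25;
}

// Variant 1: explicit if. The formula is only needed when y < 0.5.
float MoveWithIf(float x, float y)
{
    if (y < 0.5) { x += ComplexFormula(x); }
    return x;
}

// Variant 2: step() as a 0/1 mask. step(y, 0.5) is 1 when y <= 0.5,
// so it differs from variant 1 only at exactly y == 0.5.
float MoveWithStep(float x, float y)
{
    return x + step(y, 0.5) * ComplexFormula(x);
}

// Variant 3: ternary select.
float MoveWithTernary(float x, float y)
{
    return x + ((y < 0.5) ? ComplexFormula(x) : 0.0);
}

// Variant 4: arithmetic mask. Only equivalent when y is exactly 0 or 1.
float MoveWithMask(float x, float y)
{
    return x + (1.0 - y) * ComplexFormula(x);
}
```

All four produce the same x when y is 0 or 1. They differ only in what the compiler is allowed to skip. As far as I know, the HLSL documentation for versions before HLSL 2021 describes ?: as not short-circuiting, so variant 3 counts as evaluating both operands at the language level. Which instructions that turns into is still up to the compiler.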

2 answers

Perhaps check out this answer.

My guess (this is a performance question: measure it!) is that you are better off keeping the if.

Reason number one: the shader compiler, in theory (and if invoked correctly), should be smart enough to make the best choice between a jump instruction and something similar to the step function when it compiles your if. The only way to improve on its choice is by profiling [1]. Note that at this level of detail the result is probably hardware-dependent.

[1] Or, if you have specific knowledge about how your data is laid out, read on...
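If profiling suggests the compiler's default choice is the wrong one, HLSL lets you state a preference on the if itself with the `[branch]` and `[flatten]` attributes. A minimal sketch, assuming a hypothetical `ComplexFormula` in place of the real formula:

```hlsl
// Hypothetical stand-in for the undisclosed <complex formula>.
float ComplexFormula(float x) { return sin(x * 3.0) + 0.25; }

float MoveBranch(float x, float y)
{
    [branch]    // request real flow control: skip the formula when y >= 0.5
    if (y < 0.5) { x += ComplexFormula(x); }
    return x;
}

float MoveFlatten(float x, float y)
{
    [flatten]   // request that both sides are evaluated and one is selected
    if (y < 0.5) { x += ComplexFormula(x); }
    return x;
}
```

These attributes are requests. Whether they are honored depends on the target shader profile and the hardware, so it is worth looking at the compiled output (for example, the assembly listing that fxc writes with /Fc) and timing both versions.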

Reason number two is the way shader units work: if even a single fragment or vertex in a unit takes a different branch from the others, then the shader unit must execute both branches. But if they all take the same branch, the other branch is skipped. So although the decision is made per unit rather than per vertex, it is still possible for the expensive branch to be skipped.

For fragments, shader units have on-screen locality, meaning that you get the best performance when groups of neighboring pixels all take the same branch (see the picture in my linked answer). To be honest, I don't know how vertices are grouped into units, but if your data is grouped accordingly, you should get the desired performance benefit.

Finally: it is worth pointing out that your <complex formula>, if you are saying you can hoist it out of your HLSL manually, may well be getting hoisted into a CPU-based pre-shader anyway (on PC at least; I am going from memory on whether Xbox 360 supports this, and I don't know about PS3). You can check this by decompiling the shader. If it is something that you only need to calculate once per draw (rather than per vertex or per fragment), then it is probably best to do it on the CPU.
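To illustrate the "once per draw" case: an expression that reads only uniforms has the same value for every vertex, so it can be computed once on the CPU and uploaded as a single number. The sketch below is an assumption about what such a formula could look like. `Time`, `Amplitude`, `PrecomputedOffset`, and both function names are hypothetical and not taken from the question.

```hlsl
// Hypothetical uniforms, set once per draw call.
float Time;
float Amplitude;

// Depends only on uniforms, so every vertex in the draw gets the same
// value. This is the kind of expression that could be hoisted.
float PerDrawOffset()
{
    return sin(Time) * Amplitude;
}

// Manual alternative: the application computes sin(Time) * Amplitude on
// the CPU and uploads the result in this (hypothetical) uniform.
float PrecomputedOffset;

float4 MoveVertex(float4 position, float y)
{
    if (y < 0.5) { position.x += PrecomputedOffset; }
    return position;
}
```

If the real formula also reads per-vertex inputs such as x or y, this does not apply, and the work has to stay in the vertex shader.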


I got tired of my conditionals being ignored, so I just wrote another kernel and did the override in the C code instead. If you need it to be accurate every time, I suggest this fix.

