What is the maximum theoretical speed due to SSE for simple binary subtraction?

I'm trying to figure out whether my inner code loop is hitting a hardware design barrier or a barrier in my own understanding. There's a bit more to it, but the simplest question I can ask is this:

If I have the following code:

float px[32768], py[32768], pz[32768];
float xref, yref, zref, deltax, deltay, deltaz;
int i, j;

initialize_with_random(px);
initialize_with_random(py);
initialize_with_random(pz);

for (i = 0; i < 32768 - 1; i++) {
    xref = px[i];
    yref = py[i];
    zref = pz[i];
    for (j = 0; j < 32768 - 1; j++) {
        deltax = xref - px[j];
        deltay = yref - py[j];
        deltaz = zref - pz[j];
    }
}

What kind of maximum theoretical speedup could I see by going to SSE instructions, in a situation where I have full control over the code (assembly, intrinsics, whatever) but no control over the runtime other than the architecture (i.e. a multi-user environment, so I can't do anything about how the OS kernel assigns time to my specific process)?

Right now I see a 3x speedup with my code, when I would have expected the SSE vector width to deliver more than that (presumably the maximum theoretical speedup is 4x). (I tried things like making deltax / deltay / deltaz arrays in case the compiler wasn't smart enough to advance them automatically, but I still only see 3x.) I use the Intel C compiler with the appropriate vectorization flags, but no intrinsics.
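For reference, here is a minimal intrinsics sketch of what the vectorized inner loop could look like (the function name and signature are mine, not from the original code; the unaligned load/store variants are used so the sketch works regardless of array alignment):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Sketch: one pass of the inner loop with SSE, handling 4 j-values per
   iteration. Assumes n is a multiple of 4. _mm_loadu_ps / _mm_storeu_ps
   tolerate any alignment; use _mm_load_ps / _mm_store_ps instead if the
   arrays are guaranteed 16-byte aligned. */
void deltas_sse(const float *px, const float *py, const float *pz,
                float xref, float yref, float zref,
                float *dx, float *dy, float *dz, int n)
{
    __m128 vx = _mm_set1_ps(xref);  /* broadcast xref into all 4 lanes */
    __m128 vy = _mm_set1_ps(yref);
    __m128 vz = _mm_set1_ps(zref);
    for (int j = 0; j < n; j += 4) {
        _mm_storeu_ps(&dx[j], _mm_sub_ps(vx, _mm_loadu_ps(&px[j])));
        _mm_storeu_ps(&dy[j], _mm_sub_ps(vy, _mm_loadu_ps(&py[j])));
        _mm_storeu_ps(&dz[j], _mm_sub_ps(vz, _mm_loadu_ps(&pz[j])));
    }
}
```

Each `_mm_sub_ps` performs 4 single-precision subtractions at once, which is where the theoretical 4x over scalar code comes from.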

+4
4 answers

It depends on the processor, but the theoretical maximum will not exceed 4x. I don't know of a CPU that can execute more than one SSE instruction per cycle, which means it can compute at most 4 values per cycle.

Most processors can also execute at least one scalar floating-point instruction per cycle, so in that case the theoretical maximum speedup is 4x.

But you will have to look up the actual throughput figures for the specific processor you are running on.

A practical 3x speedup is not bad.

+4

I think you will probably have to interleave the inner loop somehow. A three-component vector fits in one SSE operation, but that only uses three of the four lanes. To get to 4, you would have to pack 3 components from the first vector plus 1 from the next, then 2 and 2, and so on. If you built some kind of queue that loads and processes the data 4 components at a time and then separates them afterwards, it could work.

Edit: you can unroll the inner loop to process 4 vectors per iteration (assuming the array size is always a multiple of 4). That accomplishes what I described above.
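A sketch of that unrolling in plain C (the function name and signature are illustrative; the array names follow the question's code). Laid out this way, each x/y/z subtraction covers 4 consecutive elements, so the compiler can map each group to one full-width SUBPS:

```c
/* Process 4 elements of the inner loop per iteration so each of the
   three subtraction streams fills a full 4-wide SSE register.
   Assumes n is a multiple of 4. */
void deltas_unrolled(const float *px, const float *py, const float *pz,
                     float xref, float yref, float zref,
                     float *dx, float *dy, float *dz, int n)
{
    for (int j = 0; j < n; j += 4) {
        for (int k = 0; k < 4; k++) {   /* 4 lanes -> one SSE op per array */
            dx[j + k] = xref - px[j + k];
            dy[j + k] = yref - py[j + k];
            dz[j + k] = zref - pz[j + k];
        }
    }
}
```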

+2

Consider: how wide is a float? How wide is an SSE register? The ratio should give you a reasonable upper bound.
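Spelling that ratio out (a trivial check, not part of the original answer): a single-precision float is 32 bits and an XMM register is 128 bits, so the bound is 4.

```c
/* Upper bound from register width alone: a 128-bit XMM register holds
   128 / 32 = 4 single-precision floats, so 4x is the best case. */
int floats_per_sse_register(void)
{
    return 128 / (8 * (int)sizeof(float));
}
```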

It is also worth noting that modern superscalar pipelines play havoc with clean speedup estimates.

+1

You should consider loop tiling - the way you access values in the inner loop probably causes a lot of thrashing in the L1 data cache. It's not catastrophic, because the whole working set (384 KB) probably still fits in L2, but there is a large latency difference between an L1 cache hit and an L2 cache hit, so tiling can make a big difference for you.
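A sketch of what that tiling could look like for the question's loop (function name, tile size, and the delta output arrays are mine for illustration): walk the j range in blocks small enough to stay resident in L1, and reuse each block across all i values before moving to the next block.

```c
#define N    32768
#define TILE 1024  /* 3 arrays * 1024 floats * 4 B = 12 KB, well inside L1 */

/* Tiled version: the j-block loaded into L1 is reused for every i before
   the next block is touched, instead of streaming all 384 KB per i. */
void deltas_tiled(const float px[N], const float py[N], const float pz[N],
                  float dx[N], float dy[N], float dz[N])
{
    for (int jj = 0; jj < N; jj += TILE) {       /* pick a j-block once */
        for (int i = 0; i < N; i++) {            /* ...reuse it for each i */
            float xref = px[i], yref = py[i], zref = pz[i];
            for (int j = jj; j < jj + TILE; j++) {
                dx[j] = xref - px[j];
                dy[j] = yref - py[j];
                dz[j] = zref - pz[j];
            }
        }
    }
}
```

As in the original snippet, each i overwrites the previous deltas; in real code you would consume them inside the i loop.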

0
