Tiny SSE addpd loop a little slower than scalar on AMD Phenom II?

Yes, I have read "SIMD code is slower than scalar code". No, this is not a duplicate.

I have been doing 2D maths for some time and am in the process of porting my code base from C to C++. There are several walls I ran into with C — basically I really need polymorphism — but that's another story. Anyway, the port provided an excellent opportunity to write a 2D vector class, including SSE implementations of the common mathematical operations. Yes, I know libraries exist for this, but I wanted to understand what is going on myself, and I don't use anything more complicated than += .

My implementation is via <immintrin.h> , with

 union { __m128d ss; struct { double x; double y; }; };
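For context, a class built around such a union might look like the sketch below. The Vec2 name, constructor, and operator+= are my own illustration of the setup described, not the asker's actual code (note that the anonymous struct and reading the inactive union member are common compiler extensions, not strictly standard C++):

```cpp
#include <immintrin.h>

// Sketch of a 2D vector punning between packed and scalar views.
struct Vec2 {
    union {
        __m128d ss;                      // packed SSE representation
        struct { double x; double y; };  // scalar access (anonymous struct:
                                         // a GCC/Clang/MSVC extension)
    };
    // _mm_set_pd takes the HIGH element first, so x_ lands in element 0.
    Vec2(double x_, double y_) : ss(_mm_set_pd(y_, x_)) {}
    Vec2& operator+=(const Vec2& o) {
        ss = _mm_add_pd(ss, o.ss);       // one packed add for both components
        return *this;
    }
};
```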

SSE seemed slow, so I looked at the generated ASM output. After fixing some silly pointer mistakes, I ended up with the following instruction sequences, each running a billion times in a loop (processor: AMD Phenom II at 3.7 GHz):

SSE enabled: 1.1 to 1.8 seconds (varies)

 add    $0x1, %eax
 addpd  %xmm0, %xmm1
 cmp    $0x3b9aca00, %eax
 jne    4006c8

SSE disabled: 1.0 seconds (fairly constant)

 add    $0x1, %eax
 addsd  %xmm0, %xmm3
 cmp    $0x3b9aca00, %eax
 addsd  %xmm2, %xmm1
 jne    400630

The only conclusion I can draw from this is that addsd is somehow faster than addpd , and that pipelining means the extra instruction is offset by the ability to partially overlap the faster scalar additions.

So my question is: is it worth it — will packed SSE actually help in practice — or should I just stop worrying about silly optimizations and let the compiler handle everything in scalar mode?

+4
source share
3 answers

This calls for more loop unrolling and possibly prefetching. Your arithmetic intensity is very low: 1 arithmetic operation per 2 memory operations, so you need to cram as many of them into your pipeline as possible.
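As an illustration of that unrolling advice — a sketch where the function name, the choice of four accumulators, and the assumption that `n` is a multiple of 8 doubles are all mine:

```cpp
#include <immintrin.h>
#include <cstddef>

// Sum an array of doubles with four independent packed accumulators, so
// consecutive addpd instructions don't stall waiting on each other.
// Assumes n is a multiple of 8 (four __m128d loads per iteration).
__m128d sum_unrolled(const double* p, std::size_t n) {
    __m128d acc0 = _mm_setzero_pd(), acc1 = _mm_setzero_pd();
    __m128d acc2 = _mm_setzero_pd(), acc3 = _mm_setzero_pd();
    for (std::size_t i = 0; i < n; i += 8) {
        acc0 = _mm_add_pd(acc0, _mm_loadu_pd(p + i));      // independent
        acc1 = _mm_add_pd(acc1, _mm_loadu_pd(p + i + 2));  // chains keep
        acc2 = _mm_add_pd(acc2, _mm_loadu_pd(p + i + 4));  // the FADD unit
        acc3 = _mm_add_pd(acc3, _mm_loadu_pd(p + i + 6));  // busy
    }
    // Combine the four partial sums at the end.
    return _mm_add_pd(_mm_add_pd(acc0, acc1), _mm_add_pd(acc2, acc3));
}
```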

Also, don't use a union: use __m128d directly and use _mm_load_pd to populate it from your data. __m128d inside a union generates bad code, where every element access makes a round trip through memory (store, then reload), which is harmful.
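A minimal sketch of that suggestion — the `add2` function and its signature are my own illustration, not from the answer:

```cpp
#include <immintrin.h>

// Keep data as plain doubles; move it into and out of __m128d explicitly
// instead of type-punning through a union.
void add2(const double* a, const double* b, double* out) {
    __m128d va = _mm_load_pd(a);   // _mm_load_pd requires 16-byte-aligned
    __m128d vb = _mm_load_pd(b);   // pointers; use _mm_loadu_pd otherwise
    _mm_store_pd(out, _mm_add_pd(va, vb));
}
```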

+7
source

Just for the record, Agner Fog's tables confirm that K10 runs addpd and addsd at identical performance: 1 m-op for the FADD unit, with 4-cycle latency. The earlier K8 had only 64-bit execution units and split addpd into two m-ops.

So both loops are bound by a loop-carried dependency chain. The scalar loop has two separate 4-cycle chains, but even that only keeps the FADD unit busy half the time (instead of 1/4).
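A back-of-the-envelope check (my own arithmetic, not part of the original answer): a chain of 10^9 dependent 4-cycle adds at 3.7 GHz should take about 1.08 s, which lines up with the times measured in the question; the scalar loop's two chains run in parallel, so its critical path works out the same.

```cpp
// Seconds to retire n back-to-back dependent FP adds, each `lat` cycles,
// on a core at `ghz` GHz: the chain allows one add per `lat` cycles.
double chain_seconds(double n, double lat, double ghz) {
    return n * lat / (ghz * 1e9);
}
```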

Something else in the pipeline must be coming into play — perhaps code alignment, or simply the ordering of the instructions. AMD is more sensitive to this than Intel, IIRC. I'm not curious enough to read up on the K10 pipeline and dig through Agner Fog's docs for an explanation.

K10 does not macro-fuse cmp / jcc into a single operation, so splitting them apart is not actually a problem. (Bulldozer-family CPUs do fuse them, and Intel does, of course.)

+2
source

2D math is not that CPU-intensive (compared to 3D math), so I very much doubt it is worth spending a lot of time on this. It is worth optimizing if:

  • Your profiler says the code is a hot spot.
  • Your code is slow. (I take it this is for a game?)
  • You have already optimized high-level algorithms.

I ran some SSE tests on my setups (an AMD APU @ 3 GHz × 4 cores, an old Intel 1.8 GHz × 2) and found that SSE won in most of the cases I tested. However, those were 3D operations, not 2D.

Scalar code has more opportunity for parallelism, IIRC: four registers are in use instead of two, and there are fewer dependencies. If there were more register pressure, the vectorized code might do better. Take that with a grain of salt, though — I haven't tested it.

+1
source
