Yes, I have read "SIMD code is slower than the scalar code". No, this is not a duplicate.
I have been doing 2D maths for some time and am in the process of porting my code base from C to C++. There were several walls I kept hitting with C, which is why I really want polymorphism, but that is another story. Anyway, this seemed like an excellent opportunity to write a 2D vector class, including SSE implementations of the common mathematical operations. Yes, I know there are libraries out there, but I wanted to try to understand what is going on myself, and so far I am not using anything more complicated than +=.
My implementation uses <immintrin.h>, with

    union { __m128d ss; struct { double x; double y; }; };
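In full, the class looks roughly like this (a simplified sketch: the name vec2, the constructor and the USE_SSE switch are illustrative, not my exact code):

    #include <immintrin.h>

    struct vec2 {
        union {
            __m128d ss;                      // both doubles packed into one SSE register
            struct { double x; double y; };  // anonymous struct: widely supported compiler extension
        };

        vec2(double x_, double y_) : ss(_mm_set_pd(y_, x_)) {}  // _mm_set_pd takes (high, low)

    #ifdef USE_SSE
        // packed add: a single addpd operates on both components at once
        vec2& operator+=(const vec2& o) { ss = _mm_add_pd(ss, o.ss); return *this; }
    #else
        // scalar fallback: two independent addsd instructions
        vec2& operator+=(const vec2& o) { x += o.x; y += o.y; return *this; }
    #endif
    };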
SSE seemed slow, so I looked at the generated ASM output. After fixing something stupid with a pointer, I ended up with the following sets of instructions, each running a billion times in a loop (processor: AMD Phenom II at 3.7 GHz):
SSE enabled: 1.1 to 1.8 seconds (varies)
    add    $0x1, %eax
    addpd  %xmm0, %xmm1
    cmp    $0x3b9aca00, %eax
    jne    4006c8
SSE disabled: 1.0 seconds (fairly constant)
    add    $0x1, %eax
    addsd  %xmm0, %xmm3
    cmp    $0x3b9aca00, %eax
    addsd  %xmm2, %xmm1
    jne    400630
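For reference, the timing loop has essentially this shape (a simplified sketch with made-up values, not my exact benchmark; the 0x3b9aca00 in the cmp above is one billion):

    #include <cstdio>

    int main() {
        vec2 a(1.0, 2.0), b(0.5, 0.25);     // vec2 from the sketch above
        for (int i = 0; i < 1000000000; ++i)
            a += b;
        std::printf("%f %f\n", a.x, a.y);   // use the result so the loop is not optimised away
        return 0;
    }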
The only conclusion I can draw from this is that addsd is faster than addpd, and that pipelining means the extra instruction is offset by the two scalar adds being independent, so they can partially overlap.
So my question is: is it worth it? Will SSE actually help in practice, or should I stop worrying about silly micro-optimisations and just let the compiler handle it in scalar mode?