I decided it was time to use a fast reciprocal square root. So, I tried writing a function (which would be marked inline in production):
float sqrt_recip(float x) { return _mm_cvtss_f32( _mm_rsqrt_ss( _mm_set_ps1(x) ) ); }
TL;DR: my question is: "How can I get GCC and ICC to output the minimal assembly (two instructions) for the above function, preferably without resorting to raw assembly (i.e. sticking to intrinsics)?"
As written, on ICC 13.0.1, GCC 5.2.0 and Clang 3.7, the output is:
shufps xmm0, xmm0, 0
rsqrtss xmm0, xmm0
ret
This makes sense, since I used _mm_set_ps1 to broadcast x into all components of the register. But I don't actually need to do that; I would prefer to emit only the last two lines. Sure, shufps is only one cycle, but rsqrtss itself is only three to five cycles. That is 20% to 33% overhead (1/5 to 1/3), and it is completely useless.
Some things I tried:
I tried simply not setting the other components:
union { __m128 v; float f[4]; } u;
u.f[0] = x;
return _mm_cvtss_f32(_mm_rsqrt_ss(u.v));
This actually works for Clang, but the output for ICC and GCC in particular is terrible.
Instead of broadcasting, you can zero-fill (i.e. use _mm_set_ss). Again, neither GCC's nor ICC's output is optimal. In GCC's case, it actually adds:
movss DWORD PTR [rsp-12], xmm0
movss xmm0, DWORD PTR [rsp-12]
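For reference, the zero-filling variant described above might look like the following minimal sketch. The name sqrt_recip_zero is hypothetical (it does not appear in the original code), and the sketch assumes an x86 target with SSE enabled. It only shows the source form; it does not by itself change what GCC or ICC emit.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_set_ss, _mm_rsqrt_ss, _mm_cvtss_f32 */

/* Hypothetical name, used only for illustration.
 * _mm_set_ss(x) places x in the lowest lane and zeroes the upper three,
 * so no broadcast (shufps) is requested. _mm_rsqrt_ss then computes an
 * approximate 1/sqrt(x) in the lowest lane only. */
static inline float sqrt_recip_zero(float x)
{
    __m128 v = _mm_set_ss(x);          /* [x, 0, 0, 0] */
    __m128 r = _mm_rsqrt_ss(v);        /* approx. 1/sqrt(x) in lane 0 */
    return _mm_cvtss_f32(r);           /* extract lane 0 as a scalar */
}
```

Since rsqrtss is an approximation (Intel documents a maximum relative error of 1.5 * 2^-12), sqrt_recip_zero(4.0f) returns a value close to, but not necessarily exactly, 0.5f.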