Getting the fewest instructions for an rsqrtss wrapper

I figured it was about time to use the fast reciprocal square root. So, I tried writing a function (which would be marked inline in production):

 float sqrt_recip(float x) {
     return _mm_cvtss_f32( _mm_rsqrt_ss( _mm_set_ps1(x) ) ); //same as _mm_set1_ps
 }

TL;DR: My question is, "How can I get GCC and ICC to output the minimal assembly (two instructions) for the above function, preferably without resorting to raw assembly (sticking with intrinsics)?"

As written, on ICC 13.0.1, GCC 5.2.0 and Clang 3.7, the output is:

 shufps xmm0, xmm0, 0
 rsqrtss xmm0, xmm0
 ret

This makes sense, since I used _mm_set_ps1 to scatter x into all components of the register. But I don't really need to do that; I would prefer to get just the last two lines. Sure, shufps is only one cycle, but rsqrtss itself is only about three to five. That makes the shuffle a 20% to 33% overhead that is completely useless.


Some things I tried:

  • I tried just not setting the other components:
    union { __m128 v; float f[4]; } u;
    u.f[0] = x;
    return _mm_cvtss_f32(_mm_rsqrt_ss(u.v));
    This actually works for Clang, but the output for ICC, and GCC in particular, is appalling.

  • Instead of scattering, zero-fill (that is, use _mm_set_ss). Again, neither GCC's nor ICC's output is optimal. In GCC's case, it amusingly adds a store and reload through the stack:
    movss DWORD PTR [rsp-12], xmm0
    movss xmm0, DWORD PTR [rsp-12]
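
For anyone who wants to reproduce the comparison, here is a minimal sketch that puts the scatter and zero-fill variants side by side in one translation unit. The function names sqrt_recip_broadcast and sqrt_recip_zero_fill are made up for illustration, and the functions are deliberately left non-inline so that each one shows up in the assembly listing. It assumes an x86 target with SSE available.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_rsqrt_ss, _mm_set_ss, ... */

/* Variant from the question: scatter x into all four lanes first.
   (Hypothetical name; this is the form that produced the extra shufps.) */
float sqrt_recip_broadcast(float x) {
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ps1(x)));
}

/* Zero-fill variant: x goes in the low lane, the upper three lanes are 0.
   (Hypothetical name; this is the form where GCC added the stack round trip.) */
float sqrt_recip_zero_fill(float x) {
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}
```

Compiling with something like gcc -O2 -S -masm=intel (Clang accepts the same flags) should write out the instructions generated for each variant. The exact output will likely differ between compiler versions and flags. Both variants return only an approximation of 1/sqrt(x), since rsqrtss is accurate to roughly 12 bits.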

