Intel SSE: Why does `_mm_extract_ps` return `int` instead of `float`?

Why does _mm_extract_ps return an int instead of a float?

What is the correct way to read a single float from an XMM register in C?

Or, to put the question another way: what is the opposite of _mm_set_ps?

+7
4 answers

From the MSDN docs, I believe you can cast the result to a float.

Note from their example that the value 0xc0a40000 is the bit pattern of -5.125 (a.m128_f32[1]).
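A minimal sketch of what that casting looks like in practice (using a union rather than a pointer cast, which would violate strict aliasing; the function name is mine, and the answers below explain why this can still compile poorly):

    #include <immintrin.h>

    float elem1(__m128 a) {
        union { int i; float f; } u;
        u.i = _mm_extract_ps(a, 1);   // raw bits of element 1, e.g. 0xc0a40000
        return u.f;                   // reinterpret those bits as a float (-5.125 in the MSDN example)
    }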

Update: I highly recommend the answers from @doug65536 and @PeterCordes (below) instead of mine, which tends to produce poorly performing code on many compilers.

+1

None of the other answers actually address the question: why does it return an int?

The reason is that the extractps instruction actually copies a vector component into a general-purpose register. It does seem pretty dumb for it to return an int, but that is what really happens: the raw bits of the floating-point value end up in a general-purpose register (which holds integers).

If your compiler is configured to generate SSE for all floating-point operations, then the closest thing to "extracting" a value into a register is to shuffle the value into the low component of the vector and then cast it to a scalar float. This keeps the vector component in an SSE register:

    /* returns component 2 of the vector (the third element in memory order) */
    float foo(__m128 b)
    {
        return _mm_cvtss_f32(_mm_shuffle_ps(b, b, _MM_SHUFFLE(0, 0, 0, 2)));
    }

The _mm_cvtss_f32 intrinsic is free; it generates no instructions. It only makes the compiler treat the xmm register as a float so it can be returned as such.

_mm_shuffle_ps gets the desired value into the lowest component. The _MM_SHUFFLE macro creates an immediate operand for the resulting shufps instruction.

In the example, the 2 selects the float in bits 95:64 of the 127:0 register (the third 32-bit component from the start, in memory order) and places it in bits 31:0 of the result (the first component, in memory order).
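For reference, this is roughly how the standard macro from xmmintrin.h builds that immediate (shown here as a sketch, not a verbatim copy of any particular header):

    /* Packs four 2-bit element selectors, highest element first, into the imm8: */
    #define _MM_SHUFFLE(fp3, fp2, fp1, fp0) \
        (((fp3) << 6) | ((fp2) << 4) | ((fp1) << 2) | (fp0))

    /* So _MM_SHUFFLE(0, 0, 0, 2) == 0x02: element 2 goes to the low slot of the
       result, and element 0 is broadcast into the three upper slots. */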

The generated code will most likely return the value naturally, in a register, like any other floating-point return value, without inefficiently writing it to memory and reading it back.

If you are generating code that uses the x87 FPU for floating point (for ordinary C code that is not SSE optimized), this will probably produce inefficient code: the compiler will likely store the SSE vector component to memory, then use fld to read it back onto the x87 register stack. Typically, 64-bit platforms don't use x87 (they use SSE for all floating point, mostly scalar instructions unless the compiler vectorizes).

I should add that I always use C++, so I'm not sure whether it is more efficient to pass __m128 by value or by pointer in C. In C++ I would use const __m128 &, and this kind of code would live in a header so the compiler can inline it.

+17

Confusingly, int _mm_extract_ps() is not for getting a scalar float element from a vector. The intrinsic doesn't expose the memory-destination form of the instruction (which can be useful for that purpose). This isn't the only case where the intrinsics can't directly express everything an instruction is useful for. :(

gcc and clang know how the asm instruction works and will use it for you when compiling other shuffles; type-punning the result of _mm_extract_ps to a float usually results in terrible asm from gcc (extractps eax, xmm0, 2 / mov [mem], eax).

The name makes sense if you think of _mm_extract_ps as extracting the bit pattern of an IEEE 754 binary32 float from the CPU's FP domain into the integer domain (as a C scalar int), rather than manipulating FP bit patterns with integer vector operations. According to my testing with gcc, clang and icc (see below), this is the only "portable" use case where _mm_extract_ps compiles to good asm on all compilers. Everything else is just hoping the compiler does what you want.
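So the one use that does compile well everywhere looks something like this (a sketch; the function name is mine):

    #include <immintrin.h>

    /* Pull the raw IEEE-754 binary32 bit pattern of element 2 into the integer
       domain, e.g. to test the sign bit with scalar integer code. */
    int elem2_is_negative(__m128 v) {
        int bits = _mm_extract_ps(v, 2);   // extractps eax, xmm0, 2
        return bits < 0;                   // the float's sign bit is the int's sign bit
    }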


The corresponding asm instruction is EXTRACTPS r/m32, xmm, imm8. Note that the destination can be memory or an integer register, but not another XMM register. It is the FP equivalent of PEXTRD r/m32, xmm, imm8 (also in SSE4.1), where the integer-register destination form is more useful. Its insert counterpart is INSERTPS xmm1, xmm2/m32, imm8.

Perhaps this similarity with PEXTRD keeps the hardware implementation simple without hurting the extract-to-memory use case (for asm, not for intrinsics), or maybe the SSE4.1 designers at Intel thought it was actually more useful this way than as a non-destructive FP-domain copy-and-shuffle (which x86 sorely lacks without AVX). There are FP-vector instructions with an XMM source and a memory-or-xmm destination, e.g. MOVSS xmm2/m32, xmm, so that kind of instruction would not have been new. Fun fact: the opcodes for PEXTRD and EXTRACTPS differ only in the last bit.


In asm, a scalar float is just the low element of an XMM register (or 4 bytes of memory). The upper XMM elements don't even need to be zeroed for instructions like ADDSS to work without raising extra FP exceptions. In calling conventions that pass/return FP args in XMM registers (e.g. all the usual x86-64 ABIs), float foo(float a) must assume the upper elements of XMM0 hold garbage on entry, but may leave garbage in the upper elements of XMM0 when it returns. (Details).
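As a tiny illustration of this (a sketch; the function name is mine), reading the low element costs nothing because it already is the scalar:

    #include <immintrin.h>

    /* _mm_cvtss_f32 typically emits no instruction here: the low 32 bits of the
       vector passed in xmm0 are already the float return value in xmm0. */
    float low_elem(__m128 v) {
        return _mm_cvtss_f32(v);
    }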

As @doug points out, other shuffle instructions can be used to move a float element of a vector to the bottom of an xmm register. This was already solvable with SSE1/SSE2 shuffles, and it seems EXTRACTPS and INSERTPS weren't trying to solve it for register operands.


SSE4.1 INSERTPS xmm1, xmm2/m32, imm8 is one of the best ways for compilers to implement _mm_set_ss(function_arg) when the scalar float is already in a register and they can't/don't optimize away zeroing the upper elements. (Which is most of the time for compilers other than clang.) That related question also discusses the failure of the intrinsics to expose the load or store versions of instructions like EXTRACTPS, INSERTPS and PMOVZX that take a memory operand narrower than 128b (and thus don't require alignment even without AVX). It can be impossible to write safe code that compiles as efficiently as what you can do in asm.
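For illustration, a minimal sketch of the case this paragraph describes (the function name is mine; the codegen notes are typical, not guaranteed):

    #include <immintrin.h>

    /* Widen a scalar float argument to a vector with the upper elements zeroed.
       Without SSE4.1 a compiler typically needs xorps + movss here; with SSE4.1
       it can use a single insertps. Clang can often prove the upper elements are
       already zero and emit nothing at all. */
    __m128 scalar_to_vec(float x) {
        return _mm_set_ss(x);
    }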

Without AVX's 3-operand SHUFPS, x86 doesn't provide a fully efficient, general-purpose way to copy-and-shuffle an FP vector the way integer PSHUFD can. SHUFPS is a different beast unless it's used in-place with src=dst. Preserving the original requires a MOVAPS, which costs a uop and latency on CPUs before IvyBridge, and always costs code size. Using PSHUFD between FP instructions costs latency (bypass delays). (See this horizontal-sum answer for some tricks, like using SSE3's MOVSHDUP.)

SSE4.1 INSERTPS can extract one element into a separate register, but AFAIK it still has a dependency on the old value of the destination, even when all of the original value is replaced. False dependencies like that are bad for out-of-order execution. Using an xor-zeroed register as the destination for INSERTPS would still be 2 uops, and has lower latency than MOVAPS+SHUFPS on SSE4.1 CPUs without mov-elimination for zero-latency MOVAPS (only Penryn, Nehalem, Sandybridge; also Silvermont if you count low-power CPUs). The code size is slightly worse, though.
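A sketch of that extract-with-INSERTPS idea, assuming SSE4.1 (the function name and the choice of reusing v as the destination operand are mine):

    #include <immintrin.h>

    /* imm8 = 0x8E: take source element 2 (bits 7:6), write it to destination
       element 0 (bits 5:4), and zero elements 1..3 (zero-mask 0b1110 in bits 3:0).
       Passing v as the destination operand avoids a false dependency on some
       unrelated register's old value. */
    float elem2_via_insertps(__m128 v) {
        __m128 e2 = _mm_insert_ps(v, v, 0x8E);
        return _mm_cvtss_f32(e2);
    }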


Using _mm_extract_ps and then type-punning the result back to a float (as suggested in the currently accepted answer and its comments) is a bad idea. It is easy for your code to compile to something horrible (like EXTRACTPS to memory and then a load back into an XMM register) on either gcc or icc. Clang seems to be immune to the braindead behaviour and does its usual shuffle-compilation with its own choice of shuffle instructions (including appropriate use of EXTRACTPS).

I tried these examples with gcc5.4 -O3 -msse4.1 -mtune=haswell, clang3.8.1 and icc17, on the Godbolt compiler explorer. I used C mode, not C++, but union-based type punning is allowed in GNU C++ as an extension to ISO C++. Pointer casting for type punning violates strict aliasing in C99 and C++, even with GNU extensions.

    #include <immintrin.h>

    // gcc:bad  clang:good  icc:good
    void extr_unsafe_ptrcast(__m128 v, float *p) {
      // violates strict aliasing
      *(int*)p = _mm_extract_ps(v, 2);
    }
      gcc:   # the others use extractps with a memory dest
        extractps   eax, xmm0, 2
        mov         DWORD PTR [rdi], eax
        ret

    // gcc:good  clang:good  icc:bad
    void extr_pun(__m128 v, float *p) {
      // union type punning is safe in C99 (and GNU C and GNU C++)
      union floatpun { int i; float f; } fp;
      fp.i = _mm_extract_ps(v, 2);
      *p = fp.f;   // compiles to an extractps straight to memory
    }
      icc:
        vextractps  eax, xmm0, 2
        mov         DWORD PTR [rdi], eax
        ret

    // gcc:good  clang:good  icc:horrible
    void extr_gnu(__m128 v, float *p) {
      // gcc uses extractps with a memory dest, icc does extr_store
      *p = v[2];
    }
      gcc/clang:
        extractps   DWORD PTR [rdi], xmm0, 2
      icc:
        vmovups     XMMWORD PTR [-24+rsp], xmm0
        mov         eax, DWORD PTR [-16+rsp]   # reload from red-zone tmp buffer
        mov         DWORD PTR [rdi], eax

    // gcc:good  clang:good  icc:poor
    void extr_shuf(__m128 v, float *p) {
      __m128 e2 = _mm_shuffle_ps(v,v, 2);
      *p = _mm_cvtss_f32(e2);   // gcc uses extractps
    }
      icc:   (the others: extractps straight to memory)
        vshufps     xmm1, xmm0, xmm0, 2
        vmovss      DWORD PTR [rdi], xmm1

If you want the end result in an xmm register, it's up to the compiler to optimize away your extract and do something completely different. Gcc and clang both succeed, but ICC doesn't.

    // gcc:good  clang:good  icc:bad
    float ret_pun(__m128 v) {
      union floatpun { int i; float f; } fp;
      fp.i = _mm_extract_ps(v, 2);
      return fp.f;
    }
      gcc:
        unpckhps    xmm0, xmm0
      clang:
        shufpd      xmm0, xmm0, 1
      icc17:
        vextractps  DWORD PTR [-8+rsp], xmm0, 2
        vmovss      xmm0, DWORD PTR [-8+rsp]

Note that icc also did poorly for extr_pun, so union-based type punning isn't a good option for it either.

The clear winner here is doing the shuffle "manually" with _mm_shuffle_ps(v,v, 2) and using _mm_cvtss_f32. We got optimal code from every compiler for both register and memory destinations, except for ICC, which failed to use EXTRACTPS for the memory-dest case. With AVX, SHUFPS + a separate store is still only 2 uops on Intel CPUs, just larger code size and it needs a tmp register. Without AVX, though, it would cost a MOVAPS to avoid destroying the original vector :/


According to Agner Fog's instruction tables, all Intel CPUs except Nehalem implement the register-destination versions of both PEXTRD and EXTRACTPS with multiple uops: usually just a shuffle uop + a MOVD uop to move data from the vector domain to the gp-integer domain. Nehalem register-destination EXTRACTPS is 1 uop for port 5, with 1+2 cycle latency (1 + bypass delay).

I have no idea why they managed to implement EXTRACTPS as a single uop but not PEXTRD (which is 2 uops and runs with 2+1 cycle latency). Nehalem MOVD is 1 uop (and runs on any ALU port), with 1+1 cycle latency. (I think the +1 is for the bypass delay between the vector-integer and general-purpose-integer domains.)

Nehalem cares a lot about the vector-FP vs. integer domain distinction; SnB-family CPUs have smaller (sometimes zero) bypass-delay latencies between domains.

The memory-destination versions of PEXTRD and EXTRACTPS are both 2 uops on Nehalem.

On Broadwell and later, memory-destination EXTRACTPS and PEXTRD are 2 uops, but on Sandybridge through Haswell, memory-destination EXTRACTPS is 3 uops. Memory-destination PEXTRD is 2 uops on everything except Sandybridge, where it's 3. This seems weird, and Agner Fog's tables do sometimes have errors, but it's possible: micro-fusion doesn't work with some instructions on some microarchitectures.

If either instruction had turned out to be extremely useful for something important (e.g. inside inner loops), CPU designers would have built execution units that could do the whole thing as one uop (or maybe 2 for the memory-dest). But that potentially requires more bits in the internal uop format (which Sandybridge simplified).

Fun fact: _mm_extract_epi32(vec, 0) compiles (on most compilers) to movd eax, xmm0, which is shorter and faster than pextrd eax, xmm0, 0.
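Relatedly, _mm_cvtsi128_si32 is the intrinsic that maps directly to movd r32, xmm, so the two functions below should compile to the same thing on most compilers (a small sketch; the function names are mine):

    #include <immintrin.h>

    int low_int_a(__m128i v) { return _mm_cvtsi128_si32(v); }    // movd eax, xmm0
    int low_int_b(__m128i v) { return _mm_extract_epi32(v, 0); } // usually also movd eax, xmm0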

Interestingly, they perform differently on Nehalem (which cares a lot about the vector-FP vs. integer domain distinction, and came out soon after SSE4.1 was introduced in Penryn (45nm Core2)). EXTRACTPS with a register destination is 1 uop, with 1+2 cycle latency (the +2 from the bypass delay between the FP and integer domains). PEXTRD is 2 uops and runs with 2+1 cycle latency.

+4

Try _mm_storeu_ps, or any of the variations of SSE store operations.
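A minimal sketch of that approach (the function name is mine):

    #include <immintrin.h>

    float third_elem(__m128 v) {
        float tmp[4];
        _mm_storeu_ps(tmp, v);   // store all four floats (no alignment requirement)
        return tmp[2];           // then index the one you want
    }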

+1
