Confusingly, int _mm_extract_ps() is not for getting a scalar float element from a vector. The intrinsic doesn't expose the memory-destination form of the instruction (which can be useful for that purpose). This isn't the only case where intrinsics can't directly express everything an instruction is useful for. :(
gcc and clang know how the asm instruction works and will use it that way for you when compiling other shuffles; type-punning the result of _mm_extract_ps to a float usually results in horrible asm from gcc ( extractps eax, xmm0, 2 / mov [mem], eax ).
The name makes sense if you think of _mm_extract_ps as extracting the IEEE 754 binary32 bit-pattern out of the FP domain of the CPU into the integer domain (as a scalar C int), rather than manipulating FP bit-patterns with integer vector ops. According to my testing with gcc, clang and icc (see below), this is the only "portable" use-case where _mm_extract_ps compiles to good asm across all compilers. Everything else is just a compiler-specific hack to get the asm you want.
The corresponding asm instruction is EXTRACTPS r/m32, xmm, imm8 . Note that the destination can be memory or an integer register, but not another XMM register. It's the FP equivalent of PEXTRD r/m32, xmm, imm8 (also in SSE4.1), where the integer-register destination form is more obviously useful. EXTRACTPS is not the inverse of INSERTPS xmm1, xmm2/m32, imm8 .
Perhaps this similarity with PEXTRD simplifies the internal implementation without hurting the extract-to-memory use-case (for asm, not for intrinsics), or maybe the SSE4.1 designers at Intel thought it was actually more useful this way than as a non-destructive FP-domain copy-and-shuffle (which x86 seriously lacks without AVX). There are FP-vector instructions with an XMM source and a memory-or-xmm destination, e.g. MOVSS xmm2/m32, xmm , so this kind of instruction would not have been new. Fun fact: the opcodes for PEXTRD and EXTRACTPS differ only in the last bit.
In asm, a scalar float is just the low element of an XMM register (or 4 bytes in memory). The upper XMM elements don't even need to be zeroed for instructions like ADDSS to work without raising extra FP exceptions. In calling conventions that pass/return FP args in XMM registers (e.g. all the usual x86-64 ABIs), float foo(float a) must assume that the upper elements of XMM0 hold garbage on entry, but can leave garbage in the upper elements of XMM0 in the return value. ( Details ).
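To illustrate, the intrinsic for "read the low element as a scalar" compiles to zero instructions, because in asm the scalar float already is the low XMM element (low_element is a hypothetical helper name, not from the original answer):

```c
#include <immintrin.h>

// _mm_cvtss_f32 just reinterprets the low element of the vector as a
// scalar float: no shuffle, no zeroing of the upper elements, no asm at all.
float low_element(__m128 v) {
    return _mm_cvtss_f32(v);
}
```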
As @doug points out, other shuffle instructions can be used to get a float element of a vector into the bottom of an xmm register. This was already a mostly-solved problem with SSE1/SSE2, and it seems EXTRACTPS and INSERTPS weren't trying to solve it for register operands.
SSE4.1 INSERTPS xmm1, xmm2/m32, imm8 is one of the best ways for compilers to implement _mm_set_ss(function_arg) when the scalar float is already in a register and they can't/don't optimize away zeroing the upper elements. (Which is most of the time, for compilers other than clang). That linked question also discusses the failure of the intrinsics to expose the load or store versions of instructions like EXTRACTPS, INSERTPS and PMOVZX that have a memory operand narrower than 128b (and thus don't require alignment even without AVX). It can be impossible to write safe code that compiles as efficiently as what you can do in asm.
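For reference, this is the _mm_set_ss pattern being discussed; a sketch (scalar_to_vector is a hypothetical name). The semantics are: the scalar lands in element 0 and elements 1..3 are zeroed, which with SSE4.1 a compiler can do in one INSERTPS using the zero-mask bits of the immediate when the float is already in a register:

```c
#include <immintrin.h>

// _mm_set_ss puts x into element 0 and zeroes elements 1..3.
__m128 scalar_to_vector(float x) {
    return _mm_set_ss(x);
}
```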
Without AVX's 3-operand SHUFPS, x86 doesn't provide a fully efficient, general-purpose way to copy-and-shuffle an FP vector the way integer PSHUFD can. SHUFPS is a different beast unless used in-place with src=dst. Preserving the original requires a MOVAPS, which costs a uop and latency on CPUs before IvyBridge, and always costs code-size. Using PSHUFD between FP instructions costs latency (bypass delays). (See this horizontal-sum answer for some tricks, like using SSE3 MOVSHDUP).
SSE4.1 INSERTPS can extract one element into a separate register, but AFAIK it still has a dependency on the old value of the destination, even when all of the original values get replaced. False dependencies like that are bad for out-of-order execution. xor-zeroing a register as a destination for INSERTPS would still be 2 uops, with lower latency than MOVAPS+SHUFPS on SSE4.1 CPUs without mov-elimination for zero-latency MOVAPS (only Penryn, Nehalem, Sandybridge; also Silvermont if you count low-power CPUs). The code-size is slightly worse, though.
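The copy-and-shuffle asymmetry between the two domains can be seen with a pair of element-reversal helpers (hypothetical names; the codegen claims are what the text above describes, not something this sketch verifies):

```c
#include <immintrin.h>

// Integer domain: pshufd is non-destructive, so one instruction
// copies and shuffles. imm 0x1B = reverse the four 32-bit elements.
__m128i rev_epi32(__m128i v) {
    return _mm_shuffle_epi32(v, 0x1B);
}

// FP domain: without AVX, keeping v alive typically costs the compiler
// a MOVAPS + SHUFPS (with AVX, a single non-destructive vshufps).
__m128 rev_ps(__m128 v) {
    return _mm_shuffle_ps(v, v, 0x1B);
}
```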
Using _mm_extract_ps and then type-punning the result back to a float (as suggested in the currently-accepted answer and its comments) is a bad idea. It's easy for your code to compile to something horrible (like EXTRACTPS to memory and then a load back into an XMM register) on either gcc or icc. Clang seems to be immune to this braindead behaviour, and does its usual thing of compiling shuffles with its own choice of shuffle instructions (including appropriate use of EXTRACTPS).
I tried these examples with gcc5.4 -O3 -msse4.1 -mtune=haswell , clang3.8.1 and icc17, on the Godbolt compiler explorer. I used C mode, not C++, but union-based type punning is allowed in GNU C++ as an extension to ISO C++. Pointer-casting for type punning violates strict aliasing in C99 and C++, even with GNU extensions.
```c
#include <immintrin.h>

// gcc:bad  clang:good  icc:good
void extr_unsafe_ptrcast(__m128 v, float *p) {
    // violates strict aliasing
    *(int*)p = _mm_extract_ps(v, 2);
}
```

gcc:

```
extractps eax, xmm0, 2
mov [mem], eax
```
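As a sketch of a pointer-cast-free alternative (extr_memcpy is a hypothetical name, not one of the functions tested above): memcpy-based type punning is legal in both ISO C and ISO C++, unlike the cast above.

```c
#include <immintrin.h>
#include <string.h>

// _mm_extract_ps needs SSE4.1; the target attribute (gcc/clang) lets this
// compile without -msse4.1 on the whole file. Compilers typically optimize
// the memcpy down to a single 4-byte store.
__attribute__((target("sse4.1")))
void extr_memcpy(__m128 v, float *p) {
    int tmp = _mm_extract_ps(v, 2);  // bit-pattern of element 2, as an int
    memcpy(p, &tmp, sizeof(tmp));    // safe pun: copy the bytes into a float
}
```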
If you want the final result in an xmm register, it's up to the compiler to optimize away your extract and do something completely different. Gcc and clang both succeed, but ICC doesn't.
```c
// gcc:good  clang:good  icc:bad
float ret_pun(__m128 v) {
    union floatpun { int i; float f; } fp;
    fp.i = _mm_extract_ps(v, 2);
    return fp.f;
}
```

```
gcc:    unpckhps xmm0, xmm0
clang:  shufpd xmm0, xmm0, 1
icc17:  vextractps DWORD PTR [-8+rsp], xmm0, 2
        vmovss xmm0, DWORD PTR [-8+rsp]
```
Note that icc also does poorly for extr_pun, so it's not just the register-return case that trips it up.
The clear winner here is doing the shuffle "manually" with _mm_shuffle_ps(v,v, 2) and using _mm_cvtss_f32 . We got optimal code from every compiler for both register and memory destinations, except for ICC, which couldn't use EXTRACTPS for the memory-dest case. With AVX, SHUFPS + a separate store is still only 2 uops on Intel CPUs, just larger code-size, and it needs a tmp register. Without AVX, though, it would cost a MOVAPS to avoid destroying the original vector :/
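Spelled out, the winning pattern looks like this (extract_elem2 is a hypothetical wrapper name):

```c
#include <immintrin.h>

// Shuffle the wanted element down to element 0, then read it as a scalar.
// imm8 = 2 selects source element 2 for lane 0; _mm_cvtss_f32 is free.
float extract_elem2(__m128 v) {
    return _mm_cvtss_f32(_mm_shuffle_ps(v, v, 2));
}
```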
According to Agner Fog's instruction tables, all Intel CPUs except Nehalem implement the register-destination versions of both PEXTRD and EXTRACTPS as multiple uops: usually just a shuffle uop + a MOVD uop to move the data from the vector domain to gp-integer. Nehalem's register-destination EXTRACTPS is 1 uop for port 5, with 1+2 cycle latency (1 + bypass delay).
I have no idea why they managed to implement EXTRACTPS as a single uop but not PEXTRD (which is 2 uops and runs with 2+1 cycle latency). Nehalem MOVD is 1 uop (and runs on any ALU port), with 1+1 cycle latency. (I think the +1 is for the bypass delay between vec-int and general-purpose integer registers.)
Nehalem cares a lot about the distinction between the vector-FP and integer domains; SnB-family CPUs have smaller (sometimes zero) bypass-delay latencies between domains.
The memory-destination versions of PEXTRD and EXTRACTPS are both 2 uops on Nehalem.
On Broadwell and later, memory-destination EXTRACTPS and PEXTRD are 2 uops, but on Sandybridge through Haswell, memory-destination EXTRACTPS is 3 uops. Memory-destination PEXTRD is 2 uops on everything except Sandybridge, where it's 3. This seems odd, and Agner Fog's tables do sometimes have errors, but it's possible: micro-fusion doesn't work with some instructions on some microarchitectures.
If either instruction had turned out to be extremely useful for anything important (e.g. inside inner loops), CPU designers would have built execution units that could do the whole thing as a single uop (or maybe 2 for the memory-dest). But that potentially requires more bits in the internal uop format (which Sandybridge simplified).
Fun fact: _mm_extract_epi32(vec, 0) compiles (on most compilers) to movd eax, xmm0 , which is shorter and faster than pextrd eax, xmm0, 0 .
Interestingly, EXTRACTPS and PEXTRD perform differently on Nehalem (which cares a lot about the vector-FP vs. integer domains, and came out soon after SSE4.1 was introduced in Penryn (45nm Core2)). Register-destination EXTRACTPS: 1 uop, with 1+2 cycle latency (the +2 from the bypass delay between the FP and integer domains). PEXTRD: 2 uops, running with 2+1 cycle latency.
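The fun fact above can be made explicit: _mm_cvtsi128_si32 is the intrinsic that maps directly to movd eax, xmm0 (low_int is a hypothetical wrapper name):

```c
#include <immintrin.h>

// Reads the low 32-bit integer element: compiles to a single movd,
// the same asm most compilers emit for _mm_extract_epi32(v, 0).
int low_int(__m128i v) {
    return _mm_cvtsi128_si32(v);
}
```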