How to get data out of AVX registers?

Using MSVC 2013 and AVX1, I have 8 floats in a register:

__m256 foo = _mm256_fmadd_ps(a, b, c);

Now I want to call inline void print(float) {...} on each of the 8 floats. Intel's AVX intrinsics seem to make this pretty complicated:

    print(_castu32_f32(_mm256_extract_epi32(foo, 0)));
    print(_castu32_f32(_mm256_extract_epi32(foo, 1)));
    print(_castu32_f32(_mm256_extract_epi32(foo, 2)));
    // ...

but MSVC doesn't even define either of those two intrinsics. Of course, I could write the values to memory and load them back from there, but I suspect that at the assembly level there is no need to spill the register.
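For reference, the store-to-memory fallback mentioned here looks roughly like this (a minimal sketch; the helper name and the load standing in for the fmadd result are my own, and the target("avx") attribute is only needed so gcc/clang can compile this without -mavx; MSVC compiles intrinsics without it):

```cpp
#include <immintrin.h>

// Store/reload sketch: spill the whole vector with one aligned 32-byte
// store, then read back one float at a time.
__attribute__((target("avx")))
void spill_and_read(const float in[8], float out[8])
{
    __m256 foo = _mm256_loadu_ps(in);   // stand-in for the fmadd result
    alignas(32) float a[8];
    _mm256_store_ps(a, foo);            // one store instead of 8 extracts
    for (int i = 0; i < 8; ++i)
        out[i] = a[i];                  // here you would call print(a[i])
}
```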

Bonus question: naturally, I would like to write

    for (int i = 0; i != 8; ++i)
        print(_castu32_f32(_mm256_extract_epi32(foo, i)));

but that doesn't compile: _mm256_extract_epi32 requires a compile-time-constant index. How do I loop over the 8 floats in a __m256 foo?

4 answers

Caution: _mm256_fmadd_ps is not part of AVX1. FMA3 has its own CPUID feature bit, and Intel didn't introduce it until Haswell. AMD introduced FMA3 with Piledriver (AVX1 + FMA4 + FMA3, without AVX2).


At the asm level, if you want to get eight 32-bit elements into integer registers, it is actually faster to store to the stack and then do scalar loads. pextrd is a 2-uop instruction on the SnB family and the Bulldozer family (and on Nehalem and Silvermont, which do not support AVX).

The only CPU where vextractf128 + 2x movd + 6x pextrd isn't terrible is AMD Jaguar (cheap pextrd, and only one load port). (See Agner Fog's instruction tables.)

A wide aligned store can forward to narrow loads that overlap it. (And of course you can use movd to get the low element, so you have a mix of load-port and ALU-port uops.)


Also, you appear to be extracting floats with an integer extract and then converting them back to float. That seems horrible.

What you actually want is each float in the low element of its own xmm register. vextractf128 is obviously the way to start, bringing element 4 to the bottom of a new xmm reg. Then 6x AVX shufps can easily get the other three elements of each half. (Or movshdup and movhlps have shorter encodings: no immediate byte.)

7 shuffle uops are worth considering versus 1 store and 7 loads, but not if you were going to spill the vector for a function call anyway.
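A sketch of that shuffle tree (the helper name and the load feeding it are my own; this is the movshdup/movhlps variant with no immediate byte, and target("avx") is just so gcc/clang compile it without -mavx):

```cpp
#include <immintrin.h>

// Get each element into the low lane of its own xmm register without
// touching memory: 1x vextractf128, then movshdup/movhlps per half.
__attribute__((target("avx")))
void unpack_with_shuffles(const float in[8], float out[8])
{
    __m256 v    = _mm256_loadu_ps(in);
    __m128 lo   = _mm256_castps256_ps128(v);   // low half: free, no instruction
    __m128 hi   = _mm256_extractf128_ps(v, 1); // vextractf128: high half
    __m128 lo23 = _mm_movehl_ps(lo, lo);       // movhlps: elements 2,3 to bottom
    __m128 hi23 = _mm_movehl_ps(hi, hi);
    out[0] = _mm_cvtss_f32(lo);
    out[1] = _mm_cvtss_f32(_mm_movehdup_ps(lo));   // movshdup: element 1 to bottom
    out[2] = _mm_cvtss_f32(lo23);
    out[3] = _mm_cvtss_f32(_mm_movehdup_ps(lo23));
    out[4] = _mm_cvtss_f32(hi);
    out[5] = _mm_cvtss_f32(_mm_movehdup_ps(hi));
    out[6] = _mm_cvtss_f32(hi23);
    out[7] = _mm_cvtss_f32(_mm_movehdup_ps(hi23));
}
```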


ABI considerations:

You're on Windows, where xmm6-15 are call-preserved (only the low 128 bits; the upper halves of ymm6-15 are call-clobbered). Yet another reason to start with vextractf128.

In the SysV ABI, all xmm/ymm/zmm registers are call-clobbered, so every print() call requires a spill/reload. The only sane thing to do there is store to memory and call print with the original vector (i.e. print the low element first, since the function will ignore the rest of the register). Then movss xmm0, [rsp+4] and call print on the second element, etc.

It does you no good to get all 8 floats nicely unpacked into 8 separate vector regs, because they would all have to be spilled separately before the first function call anyway!


Assuming you only have AVX (i.e. no AVX2), you could do something like this:

    float extract_float(const __m128 v, const int i)
    {
        float x;
        _MM_EXTRACT_FLOAT(x, v, i);
        return x;
    }

    void print(const __m128 v)
    {
        print(extract_float(v, 0));
        print(extract_float(v, 1));
        print(extract_float(v, 2));
        print(extract_float(v, 3));
    }

    void print(const __m256 v)
    {
        print(_mm256_extractf128_ps(v, 0));
        print(_mm256_extractf128_ps(v, 1));
    }

However, I think I would probably just use a union:

    union U256f {
        __m256 v;
        float a[8];
    };

    void print(const __m256 v)
    {
        const U256f u = { v };

        for (int i = 0; i < 8; ++i)
            print(u.a[i]);
    }
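If the union punning bothers you (it's well-defined in C and supported by MSVC and gcc, but technically UB in ISO C++), a memcpy expresses the same thing in well-defined C++ and compilers optimize it to the same store. A minimal sketch, with my own helper name and a load standing in for the real computation (target("avx") is only so gcc/clang compile it without -mavx):

```cpp
#include <immintrin.h>
#include <cstring>

// memcpy-based punning: same store/reload as the union, but well-defined C++.
__attribute__((target("avx")))
void m256_to_floats(const float in[8], float out[8])
{
    __m256 v = _mm256_loadu_ps(in);   // stand-in for the real computation
    static_assert(sizeof(v) == 8 * sizeof(float), "__m256 holds 8 floats");
    std::memcpy(out, &v, sizeof(v));  // compiles to a plain 32-byte store
}
```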

(Unfinished answer. Posting anyway in case it helps anyone, or in case I come back to it. Usually, if you need to interface with scalar code you can't vectorize, it's fine to just store the vector to a local array and then reload it one element at a time.)


See my other answer for the asm details. This answer is about the C++ side of things.


If you use Agner Fog's Vector Class Library, its wrapper classes overload operator[] to work exactly the way you'd expect, even for non-constant indices. This often compiles to a store/reload, but it makes the C++ source much easier to write. With optimization enabled, you'll probably get decent results. (Except that the low element might get stored/reloaded instead of just being used in place, so you might want to special-case vec[0] into _mm_cvtss_f32(vec) or something.)
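As a rough sketch of what such a wrapper's operator[] boils down to (this is not the real Vec8f class, just an illustration of the store/reload it typically compiles to; the names and the helper are my own, and target("avx") only lets gcc/clang build this without -mavx):

```cpp
#include <immintrin.h>

// Toy VCL-style wrapper: operator[] spills to a local array and indexes it.
struct Vec8Sketch {
    __m256 v;
    __attribute__((target("avx")))
    float operator[](int i) const {
        alignas(32) float a[8];
        _mm256_store_ps(a, v);   // typically becomes one store + one scalar reload
        return a[i];
    }
};

// Helper so callers don't need AVX codegen themselves (name is my own).
__attribute__((target("avx")))
float nth_element(const float in[8], int i)
{
    Vec8Sketch s{_mm256_loadu_ps(in)};
    return s[i];
}
```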

See also my github repo with mostly-untested changes to Agner's VCL, to generate better code for some functions.


There is an _MM_EXTRACT_FLOAT wrapper macro, but it's weird and only defined with SSE4.1. I think it's intended to go with SSE4.1's extractps (which can extract the binary representation of a float into an integer register, or store it to memory). gcc compiles it to an FP shuffle when the destination is a float. Be careful that other compilers don't compile it to an actual extractps instruction if you want the result as a float, because that's not what extractps does. (That is what insertps does, but a simpler FP shuffle would take fewer instruction bytes; e.g. shufps with AVX is good.)

The macro is weird because it takes 3 args: _MM_EXTRACT_FLOAT(dest, src_m128, idx), so you can't even use it as an initializer for a float local variable.


To iterate over a vector

gcc will fully unroll a loop like that for you, but only at -O1 or higher. At -O0 it gives an error message.

    float bad_hsum(__m128 & fv) {
        float sum = 0;
        for (int i = 0; i < 4; i++) {
            float f;
            _MM_EXTRACT_FLOAT(f, fv, i);  // works only at -O1 or higher
            sum += f;
        }
        return sum;
    }
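A variant that works at any optimization level avoids the macro entirely by storing and reloading (the helper name is mine; this needs only baseline SSE, which every x86-64 compiler enables by default):

```cpp
#include <immintrin.h>

// hsum via store/reload: no immediate index needed, so it compiles at -O0 too.
float hsum4(const float in[4])
{
    __m128 fv = _mm_loadu_ps(in);
    alignas(16) float a[4];
    _mm_store_ps(a, fv);
    return a[0] + a[1] + a[2] + a[3];
}
```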
    float valueAVX(__m256 a, int i)
    {
        float ret = 0;
        switch (i) {
            case 0:
                ret = _mm_cvtss_f32(_mm256_extractf128_ps(a, 0));
                break;
            case 1: {
                __m128 lo = _mm256_extractf128_ps(a, 0);
                ret = _mm_cvtss_f32(_mm_shuffle_ps(lo, lo, 1));
                break;
            }
            case 2: {
                __m128 lo = _mm256_extractf128_ps(a, 0);
                ret = _mm_cvtss_f32(_mm_movehl_ps(lo, lo));
                break;
            }
            case 3: {
                __m128 lo = _mm256_extractf128_ps(a, 0);
                __m128 tw = _mm_movehl_ps(lo, lo);
                ret = _mm_cvtss_f32(_mm_shuffle_ps(tw, tw, 1));
                break;
            }
            case 4:
                ret = _mm_cvtss_f32(_mm256_extractf128_ps(a, 1));
                break;
            case 5: {
                __m128 hi = _mm256_extractf128_ps(a, 1);
                ret = _mm_cvtss_f32(_mm_shuffle_ps(hi, hi, 1));
                break;
            }
            case 6: {
                __m128 hi = _mm256_extractf128_ps(a, 1);
                ret = _mm_cvtss_f32(_mm_movehl_ps(hi, hi));
                break;
            }
            case 7: {
                __m128 hi = _mm256_extractf128_ps(a, 1);
                __m128 tw = _mm_movehl_ps(hi, hi);
                ret = _mm_cvtss_f32(_mm_shuffle_ps(tw, tw, 1));
                break;
            }
        }
        return ret;
    }
