Caution: _mm256_fmadd_ps is not part of AVX1. FMA3 has its own feature bit, and on Intel it was only introduced with Haswell. AMD introduced FMA3 with Piledriver (AVX1 + FMA4 + FMA3, but no AVX2).
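Since FMA3 is a separate feature bit, code that may run on pre-Haswell or pre-Piledriver CPUs should check for it separately from AVX. A minimal sketch, assuming GCC or Clang on x86 (the function names can_use_avx and can_use_fma3 are made up for illustration; __builtin_cpu_supports is the compiler builtin):

```c
#include <stdbool.h>

/* Hypothetical helpers: report whether the running CPU (and OS) can
   execute AVX and FMA3 instructions. GCC/Clang on x86 only. */
bool can_use_avx(void)
{
    return __builtin_cpu_supports("avx") != 0;
}

bool can_use_fma3(void)
{
    /* _mm256_fmadd_ps needs the FMA bit as well as AVX. */
    return __builtin_cpu_supports("avx") != 0
        && __builtin_cpu_supports("fma") != 0;
}
```

You would pick the _mm256_fmadd_ps code path only when can_use_fma3() returns true, and fall back to separate multiply and add otherwise.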
At the asm level, if you want to get eight 32-bit elements into integer registers, it's actually faster to store to the stack and then do scalar loads. pextrd is a 2-uop instruction on SnB-family and Bulldozer-family CPUs (and on Nehalem and Silvermont, which don't support AVX).
The only CPU where vextractf128 + 2x movd + 6x pextrd isn't terrible is AMD Jaguar (cheap pextrd, and only one load port). See Agner Fog's instruction tables.
A wide aligned store can forward to overlapping narrow loads. (Of course, you can use movd to get the low element, so you have a mix of load-port and ALU-port uops.)
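In C with intrinsics, the store-then-reload approach might look roughly like this. It's only a sketch: extract_i32_via_store is a made-up name, the load from src stands in for whatever computed the vector, and target("avx") is a GCC/Clang attribute used so it builds without -mavx. An optimizing compiler is free to turn the loop into something else, so this shows the pattern rather than guaranteeing the asm.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: get the eight 32-bit elements of a 256-bit
   integer vector out as scalars with one wide store plus narrow loads. */
__attribute__((target("avx")))
void extract_i32_via_store(const int32_t *src, int32_t out[8])
{
    /* Stand-in for a vector produced by earlier computation. */
    __m256i v = _mm256_loadu_si256((const __m256i *)src);

    /* One 32-byte aligned store to a stack slot. */
    __attribute__((aligned(32))) int32_t tmp[8];
    _mm256_store_si256((__m256i *)tmp, v);

    /* Narrow scalar loads from the slot just stored. */
    for (int i = 0; i < 8; i++)
        out[i] = tmp[i];
}
```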
Of course, you seem to be extracting the floats with an integer extract and then converting them back to float. That seems horrible.
What you actually need is each float in the low element of its own xmm register. vextractf128 is obviously the way to start, bringing element 4 to the bottom of a new xmm reg. Then 6x AVX shufps can easily get the other three elements of each half. (Or use movshdup and movhlps, which have shorter encodings: no immediate byte.)
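With intrinsics, the shuffle sequence could be sketched like this. Assumptions: extract_via_shuffles and low3 are made-up names, the load from src stands in for the computed vector, and target("avx") is a GCC/Clang attribute. Writing the results to out[] is only there to make the result checkable; in real code each shuffled register would be passed on directly with the float in its low element.

```c
#include <immintrin.h>

/* Hypothetical helper: bring elements 1..3 of a 128-bit half down to
   lane 0 with three shuffles, and write elements 0..3 to out. */
__attribute__((target("avx")))
static void low3(__m128 h, float out[4])
{
    out[0] = _mm_cvtss_f32(h);                       /* already in lane 0 */
    out[1] = _mm_cvtss_f32(_mm_movehdup_ps(h));      /* movshdup: 1 -> 0 */
    out[2] = _mm_cvtss_f32(_mm_movehl_ps(h, h));     /* movhlps:  2 -> 0 */
    out[3] = _mm_cvtss_f32(
        _mm_shuffle_ps(h, h, _MM_SHUFFLE(3, 3, 3, 3))); /* shufps: 3 -> 0 */
}

__attribute__((target("avx")))
void extract_via_shuffles(const float *src, float out[8])
{
    __m256 v  = _mm256_loadu_ps(src);             /* stand-in for real work */
    __m128 lo = _mm256_castps256_ps128(v);        /* elements 0..3 */
    __m128 hi = _mm256_extractf128_ps(v, 1);      /* vextractf128: 4..7 */
    low3(lo, out);
    low3(hi, out + 4);
}
```

That is one vextractf128 plus three shuffles per half, so seven shuffles in total.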
7 shuffle uops are worth considering versus 1 store and 7 load uops, but not if you were going to spill the vector for a function call anyway.
ABI considerations:
You're on Windows, where xmm6-15 are call-preserved (only the low 128 bits; the upper halves of ymm6-15 are call-clobbered). This is yet another reason to start with vextractf128.
In the SysV ABI, all the xmm / ymm / zmm registers are call-clobbered, so every call to print() requires a spill / reload. The only sane thing to do there is to store the vector to memory and call print with the original vector (i.e. print the low element, because it will ignore the rest of the register). Then movss xmm0, [rsp+4] and call print for the second element, etc.
It does you no good to get all 8 floats nicely unpacked into 8 vector regs, because they would all have to be spilled separately before the first function call!
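For the function-call case, a C-level sketch of "spill once, then reload one element per call" could look like this. The names print_float and print_each are hypothetical stand-ins for the print() routine discussed above, the load from src stands in for the computed vector, and target("avx") is a GCC/Clang attribute.

```c
#include <immintrin.h>
#include <stdio.h>

/* Hypothetical stand-in for the print() function. */
void print_float(float x)
{
    printf("%f\n", x);
}

/* Store the vector once, then pass each element to print.
   Returns the number of calls made. */
__attribute__((target("avx")))
int print_each(const float *src, void (*print)(float))
{
    __m256 v = _mm256_loadu_ps(src);          /* stand-in for real work */

    __attribute__((aligned(32))) float spill[8];
    _mm256_store_ps(spill, v);                /* one spill before any call */

    int calls = 0;
    for (int i = 0; i < 8; i++) {
        print(spill[i]);   /* roughly: movss xmm0, [rsp + 4*i] / call print */
        calls++;
    }
    return calls;
}
```

Nothing vector-related has to stay live across the calls here, which is why the call-clobbered registers stop mattering.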