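uint64_t A[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
__m256i row0 = _mm256_loadu_si256((__m256i*)&A[ 0]); // 0 1 2 3
__m256i row1 = _mm256_loadu_si256((__m256i*)&A[ 4]); // 4 5 6 7
__m256i row2 = _mm256_loadu_si256((__m256i*)&A[ 8]); // 8 9 a b
__m256i row3 = _mm256_loadu_si256((__m256i*)&A[12]); // c d e f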
I don't have the hardware to test this right now, but something like the following should do what you want:
__m256i tmp3, tmp2, tmp1, tmp0;
// interleave 64-bit elements within each 128-bit lane
tmp0 = _mm256_unpacklo_epi64(row0, row1); // 0 4 2 6
tmp1 = _mm256_unpackhi_epi64(row0, row1); // 1 5 3 7
tmp2 = _mm256_unpacklo_epi64(row2, row3); // 8 c a e
tmp3 = _mm256_unpackhi_epi64(row2, row3); // 9 d b f
// now select the appropriate 128-bit lanes
row0 = _mm256_permute2x128_si256(tmp0, tmp2, 0x20); // 0 4 8 c
row1 = _mm256_permute2x128_si256(tmp1, tmp3, 0x20); // 1 5 9 d
row2 = _mm256_permute2x128_si256(tmp0, tmp2, 0x31); // 2 6 a e
row3 = _mm256_permute2x128_si256(tmp1, tmp3, 0x31); // 3 7 b f
The intrinsic
__m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm)
selects 128-bit lanes from two sources. You can read about it in the Intel Intrinsics Guide. There is a version, _mm256_permute2f128_si256, which only requires AVX and acts in the floating-point domain. I used it to check that I had the correct control words.
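If you don't have AVX2 hardware at hand either, the lane comments and control words can be checked with a plain scalar model. The following is only a sketch: the type v4u64 and all the model_* functions are hypothetical names I made up to mimic the documented behaviour of the intrinsics. They are not part of any library. In the control word, bits 1:0 pick the source of the low 128-bit lane and bits 5:4 pick the source of the high lane (0 = a low, 1 = a high, 2 = b low, 3 = b high), while bits 3 and 7 zero the corresponding lane.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar stand-in for a __m256i holding four 64-bit elements. */
typedef struct { uint64_t q[4]; } v4u64;

/* Models _mm256_unpacklo_epi64: per 128-bit lane, take the low element of a, then of b. */
static v4u64 model_unpacklo_epi64(v4u64 a, v4u64 b) {
    v4u64 r = {{ a.q[0], b.q[0], a.q[2], b.q[2] }};
    return r;
}

/* Models _mm256_unpackhi_epi64: per 128-bit lane, take the high element of a, then of b. */
static v4u64 model_unpackhi_epi64(v4u64 a, v4u64 b) {
    v4u64 r = {{ a.q[1], b.q[1], a.q[3], b.q[3] }};
    return r;
}

/* Resolves one 4-bit control field into a 128-bit lane (two 64-bit elements). */
static void select_lane(uint64_t out[2], v4u64 a, v4u64 b, int ctl) {
    const v4u64 *src = (ctl & 2) ? &b : &a;  /* bit 1: source b or a        */
    int base = (ctl & 1) ? 2 : 0;            /* bit 0: high or low lane     */
    if (ctl & 8) {                           /* bit 3: zero the output lane */
        out[0] = 0;
        out[1] = 0;
        return;
    }
    out[0] = src->q[base];
    out[1] = src->q[base + 1];
}

/* Models _mm256_permute2x128_si256 (and _mm256_permute2f128_si256). */
static v4u64 model_permute2x128(v4u64 a, v4u64 b, int imm) {
    v4u64 r;
    select_lane(&r.q[0], a, b, imm & 0xF);         /* low 128 bits  */
    select_lane(&r.q[2], a, b, (imm >> 4) & 0xF);  /* high 128 bits */
    return r;
}

/* Transposes a row-major 4x4 matrix in place with the same sequence as above. */
static void model_transpose4x4(uint64_t m[16]) {
    v4u64 row[4], tmp0, tmp1, tmp2, tmp3;
    memcpy(row, m, sizeof row);
    tmp0 = model_unpacklo_epi64(row[0], row[1]);
    tmp1 = model_unpackhi_epi64(row[0], row[1]);
    tmp2 = model_unpacklo_epi64(row[2], row[3]);
    tmp3 = model_unpackhi_epi64(row[2], row[3]);
    row[0] = model_permute2x128(tmp0, tmp2, 0x20);
    row[1] = model_permute2x128(tmp1, tmp3, 0x20);
    row[2] = model_permute2x128(tmp0, tmp2, 0x31);
    row[3] = model_permute2x128(tmp1, tmp3, 0x31);
    memcpy(m, row, sizeof row);
}

/* Self-check: after transposing 0..15, element (r,c) must hold 4*c + r. */
static void model_self_check(void) {
    uint64_t m[16];
    for (int i = 0; i < 16; i++) m[i] = (uint64_t)i;
    model_transpose4x4(m);
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            assert(m[4 * r + c] == (uint64_t)(4 * c + r));
}
```

Calling model_self_check() from your own main should pass silently if the control words 0x20 and 0x31 are right. The model only mirrors the data movement, so it says nothing about performance or about the integer versus floating-point domain.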
Z boson