Optimal SIMD algorithm for rotating or transposing an array

I am working on a data structure where I have an array of 16 uint64_t. They are laid out in memory like this (each token below represents one uint64_t):

A0 A1 A2 A3 B0 B1 B2 B3 C0 C1 C2 C3 D0 D1 D2 D3 

The desired result is to convert the array to this:

 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 

Rotating the array by 90 degrees is also an acceptable solution for a later iteration:

 D0 C0 B0 A0 D1 C1 B1 A1 D2 C2 B2 A2 D3 C3 B3 A3 

I need this in order to operate on the array quickly at a later point (stepping over it with another SIMD pass, 4 elements at a time).

So far, I have been trying to "mix" the data by loading one 4 x 64-bit vector from A, bitmasking and shuffling the elements, OR'ing the result with B, and then repeating for C... Unfortunately, this costs 5 SIMD instructions per group of 4 elements in the array (one load, one mask, one shuffle, one OR with the next element and, finally, the store). It seems I should be able to do better.

I have AVX2 and am compiling with clang.

Tags: assembly, intel, simd, avx2, transpose
2 answers
    uint64_t A[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
    __m256i row0 = _mm256_loadu_si256((__m256i*)&A[ 0]); //0 1 2 3
    __m256i row1 = _mm256_loadu_si256((__m256i*)&A[ 4]); //4 5 6 7
    __m256i row2 = _mm256_loadu_si256((__m256i*)&A[ 8]); //8 9 a b
    __m256i row3 = _mm256_loadu_si256((__m256i*)&A[12]); //c d e f

I don't have hardware to test this on right now, but something like the following should do what you want:

    __m256i tmp3, tmp2, tmp1, tmp0;
    tmp0 = _mm256_unpacklo_epi64(row0, row1); //0 4 2 6
    tmp1 = _mm256_unpackhi_epi64(row0, row1); //1 5 3 7
    tmp2 = _mm256_unpacklo_epi64(row2, row3); //8 c a e
    tmp3 = _mm256_unpackhi_epi64(row2, row3); //9 d b f
    //now select the appropriate 128-bit lanes
    row0 = _mm256_permute2x128_si256(tmp0, tmp2, 0x20); //0 4 8 c
    row1 = _mm256_permute2x128_si256(tmp1, tmp3, 0x20); //1 5 9 d
    row2 = _mm256_permute2x128_si256(tmp0, tmp2, 0x31); //2 6 a e
    row3 = _mm256_permute2x128_si256(tmp1, tmp3, 0x31); //3 7 b f

 __m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm) 

This intrinsic selects 128-bit lanes from its two sources. You can read about it in the Intel Intrinsics Guide. There is also _mm256_permute2f128_si256, a version that requires only AVX and operates in the floating-point domain; I used it to verify that I had the correct control words.


An alternative is to use gather instructions, with which you can directly load the transposed matrix. The code below compiles with gcc on an i7 Haswell.

    int32_t stride = 4 * sizeof(A[0]);
    __m128i r128_gather_idx = _mm_set_epi32(3 * stride, 2 * stride, 1 * stride, 0 * stride);
    __m256i row0 = _mm256_i32gather_epi64(reinterpret_cast<long long const *>(&A[ 0]), r128_gather_idx, 1);
    __m256i row1 = _mm256_i32gather_epi64(reinterpret_cast<long long const *>(&A[ 1]), r128_gather_idx, 1);
    __m256i row2 = _mm256_i32gather_epi64(reinterpret_cast<long long const *>(&A[ 2]), r128_gather_idx, 1);
    __m256i row3 = _mm256_i32gather_epi64(reinterpret_cast<long long const *>(&A[ 3]), r128_gather_idx, 1);
