2D Array Access Vectorization (GCC)

I understand the basic ideas of vectorization. I am thinking of converting one of my programs into a vectorized version. But it seems complicated.

There is a table (2D array) table[M][N] and two vectors X[1..4] and Y[1..4]. Can I perform the operation shown below? Any thoughts?

X[1..4] = table[X[1..4]][Y[1..4]]

(serial version: X[i] = table[X[i]][Y[i]])

In other words, is it possible to vectorize the following loop?

    for(k=0; k<4; k++) {
        tmp1 = X[k];
        tmp2 = Y[k];
        X[k] = table[tmp1][tmp2];
    }

Note: X[] always contains different values.

It is currently implemented in C.

+4
4 answers

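1) Technically it is possible, as hayesti's answer shows (using horizontal SSE/AVX instructions, or vgather on AVX2). You also do not necessarily need intrinsics or hand-encoding: GCC 4.9 and ICC can be asked to vectorize the loop with #pragma omp simd, and with ICC you may additionally need -vec-threshold0 when targeting SSE.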

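2) In practice, however, it is usually "not profitable" to vectorize this loop, because it needs "gather" (indexed) loads, implemented either with vgather or with a sequence of vinsert-s, and both are "slow".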

This can be seen in the ICC vectorization report (or in the "Intel Vectorization Advisor"), which estimates the following speedups:

  • SSE: 0.5x (i.e. a slowdown)
  • AVX: 1.1x
  • AVX2: 1.3x - 1.4x

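So vectorization pays off very little here (and the numbers may differ with GCC). For this loop as written, the best you can expect is about 1.4x, on AVX2.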

Outlook: on AVX-512 machines (KNL Xeon Phi, Xeon) vgather is implemented differently, so the results may differ from those for AVX/AVX2.

Summary: vectorizing this loop is possible (technically), but as written it gains little from SIMD.
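
For point 1), a minimal sketch of the #pragma omp simd route might look like the following. The function name `lookup_omp_simd` and the flat row-major layout of `table` are assumptions made for illustration, not part of the original question. The pragma is only a request: whether a compiler actually emits gather instructions depends on its version, the target flags, and its cost model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: X[k] = table[X[k]][Y[k]] for k in [0, count).
 * `table` is passed as a flat row-major array with n_cols ints per row. */
void lookup_omp_simd(int *X, const int *Y, const int *table,
                     int n_cols, int count)
{
    assert(X != NULL && Y != NULL && table != NULL);
    assert(n_cols > 0 && count >= 0);

    /* Request vectorization. With GCC the pragma takes effect only when
     * building with -fopenmp or -fopenmp-simd; otherwise it is ignored
     * and the loop simply runs as ordinary scalar code. */
    #pragma omp simd
    for (int k = 0; k < count; k++) {
        int row = X[k];   /* each iteration reads X[k] before writing it, */
        int col = Y[k];   /* so there is no cross-iteration dependency    */
        X[k] = table[(size_t)row * (size_t)n_cols + (size_t)col];
    }
}
```

With a table declared as `int table[M][N]`, a call would pass `&table[0][0]` and `N`, for example `lookup_omp_simd(X, Y, &table[0][0], N, 4)`.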

+2

Yes, it is possible, but it needs a gather instruction, which on x86 is available with AVX2 (Haswell). In pseudocode:

    vr1 := simd_load4(x)
    vr2 := simd_load4(y)
    vr3 := vr1 * N        // row index times the row length N (number of columns)
    vr4 := vr3 + vr2      // flat index x*N + y
    vr5 := simd_gather(base=&table, offsets=vr4)
    simd_store(x, vr5)

With SSE/AVX2 intrinsics:

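    /* Needs <immintrin.h> and AVX2; x and y are 16-byte-aligned int[4], table is int[M][N]. */
    __m128i vr1 = _mm_load_si128((const __m128i *)x);
    __m128i vr2 = _mm_load_si128((const __m128i *)y);
    __m128i vr3 = _mm_mullo_epi32(vr1, _mm_set1_epi32(N));   /* row index * row length */
    __m128i vr4 = _mm_add_epi32(vr3, vr2);                   /* flat index x*N + y */
    __m128i vr5 = _mm_i32gather_epi32(&table[0][0], vr4, 4); /* scale = sizeof(int) */
    _mm_store_si128((__m128i *)x, vr5);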
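
To try the gather approach without crashing on CPUs that lack AVX2, the intrinsics can be wrapped in a function with a scalar fallback. This is a sketch under several assumptions: the names `lookup4`, `lookup4_avx2` and `lookup4_scalar` are made up for illustration, the table holds 32-bit ints in row-major order, and the `target` attribute and `__builtin_cpu_supports` are GCC/Clang extensions.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* The target attribute lets this one function use AVX2 even when the
 * file is built without -mavx2 (GCC/Clang extension). */
__attribute__((target("avx2")))
static void lookup4_avx2(int *x, const int *y, const int *table, int n_cols)
{
    /* Unaligned loads/stores, so x and y need no special alignment. */
    __m128i vx  = _mm_loadu_si128((const __m128i *)x);
    __m128i vy  = _mm_loadu_si128((const __m128i *)y);
    /* Flat index x*n_cols + y; mullo keeps the low 32 bits per lane. */
    __m128i idx = _mm_add_epi32(_mm_mullo_epi32(vx, _mm_set1_epi32(n_cols)), vy);
    /* Scale 4 = sizeof(int): indices count elements, not bytes. */
    __m128i val = _mm_i32gather_epi32(table, idx, 4);
    _mm_storeu_si128((__m128i *)x, val);
}
#endif

/* Plain C version, identical in effect to the loop in the question. */
static void lookup4_scalar(int *x, const int *y, const int *table, int n_cols)
{
    for (int k = 0; k < 4; k++)
        x[k] = table[(size_t)x[k] * (size_t)n_cols + (size_t)y[k]];
}

/* x[k] = table[x[k]][y[k]] for k = 0..3, choosing the path at run time. */
void lookup4(int *x, const int *y, const int *table, int n_cols)
{
    assert(x != NULL && y != NULL && table != NULL && n_cols > 0);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2")) {
        lookup4_avx2(x, y, table, n_cols);
        return;
    }
#endif
    lookup4_scalar(x, y, table, n_cols);
}
```

For `int table[M][N]` the call would be `lookup4(X, Y, &table[0][0], N)`. Given the speedup estimates quoted in the other answer, it is worth timing both paths before keeping the AVX2 one.
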
+1

If you were copying adjacent memory cells, you could use memcpy() to copy the whole block at once. Since that is not the case here, you have to use a loop.

0

This can be done on ARM NEON via the VTBL instruction.

NEON can quickly process LUTs up to 32 bytes.
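
As a rough illustration of the VTBL idea, the sketch below looks up 8 byte-sized entries at once in a 4 x 8 byte table (32 bytes in total). It assumes byte elements rather than the 32-bit table of the question, and the function name `lut32_lookup8` is made up for this example. On targets without NEON it falls back to a scalar loop that mimics VTBL's behaviour of returning 0 for out-of-range indices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[k] = lut[row[k] * 8 + col[k]] for 8 lanes.
 * lut is a 4 x 8 byte table stored row-major (32 bytes). */
void lut32_lookup8(uint8_t out[8], const uint8_t lut[32],
                   const uint8_t row[8], const uint8_t col[8])
{
    assert(out != NULL && lut != NULL && row != NULL && col != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x8x4_t t;
    for (int i = 0; i < 4; i++)
        t.val[i] = vld1_u8(lut + 8 * i);          /* load the 32-byte table */
    /* Flat byte index row*8 + col, computed lane-wise (wraps modulo 256). */
    uint8x8_t idx = vadd_u8(vmul_u8(vld1_u8(row), vdup_n_u8(8)), vld1_u8(col));
    vst1_u8(out, vtbl4_u8(t, idx));               /* indices >= 32 yield 0 */
#else
    for (int k = 0; k < 8; k++) {
        uint8_t idx = (uint8_t)(row[k] * 8u + col[k]);
        out[k] = (idx < 32) ? lut[idx] : 0;       /* same out-of-range rule */
    }
#endif
}
```

For tables with wider elements or more than 32 bytes, the lookup has to be split across several table instructions or done with scalar loads.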

0
