2d Array Access Vectorization (GCC)

Question

I understand the basic ideas of vectorization. I am thinking of converting one of my programs into a vectorized version. But it seems complicated.

There is a table (2D array) table[M][N] and two vectors X[1..4] and Y[1..4]. Can I perform the operation shown below? Any thoughts?

X[1..4] = table[X[1..4]][Y[1..4]]

(serial version: X[i] = table[X[i]][Y[i]])

In other words, is it possible to vectorize the following loop?

    for(k=0; k<4; k++) {
        tmp1 = X[k];
        tmp2 = Y[k];
        X[k] = table[tmp1][tmp2];
    }

Note: X[] always contains different values.

It is currently implemented in C.

gcc arrays vectorization vector simd

Jackwm Jun 22 '15 at 5:02

4 answers

zam · Answer 1 · 2015-06-22T21:26:01+0000

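1) Yes, as hayesti's answer shows, the loop can be vectorized (with vertical/horizontal operations on SSE/AVX, or with vgather on AVX2). With GCC 4.9 or ICC the compiler can do this itself (no intrinsics/hand-coding needed), although it may take hints: #pragma omp simd for GCC, or -vec-threshold0 for ICC when targeting SSE.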

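2) However, the "vectorized" code still has to perform an indirect load (a "gather"), which is implemented with vgather or with a series of vinsert-s (depending on the instruction set), and those are "slow". So the speedup is limited.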

For reference, these are the speedups given by the ICC report (or by "Intel Vectorization Advisor"):

SSE  : 0.5x (i.e. a slowdown)
AVX  : 1.1x
AVX2 : 1.3x - 1.4x

These figures are for ICC; GCC may give different results. In any case the gain is modest: about 1.4x at best, and only with AVX2. [...]

Note: AVX-512 (KNL Xeon Phi, and Xeon) also provides vgather, along with scatter, so the picture there may differ from AVX/AVX2.

In short: the loop can be vectorized, but because of the indirect access, SIMD brings only a modest gain here.
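
As a rough illustration of the compiler-driven route mentioned in point 1, here is a minimal sketch of the question's loop generalized to n elements and annotated with #pragma omp simd. The function name lookup_inplace and the ncols parameter are made up for this example. With GCC the pragma takes effect when building with -fopenmp-simd (for instance gcc -O3 -mavx2 -fopenmp-simd); without that flag it is simply ignored and the loop runs serially. Whether a gather is actually emitted depends on the compiler version and target, so treat this as something to measure, not a guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: x[k] = table[x[k]][y[k]] for k in [0, n).
   table points at the first element of a row-major 2D array whose rows
   hold ncols ints. restrict promises the compiler that x, y and table
   do not overlap, which it needs before it may vectorize. */
static void lookup_inplace(int *restrict x, const int *restrict y,
                           const int *restrict table, int ncols, size_t n)
{
    assert(ncols > 0);
#pragma omp simd
    for (size_t k = 0; k < n; k++) {
        /* indirect (gather-style) load: the address depends on x[k], y[k] */
        x[k] = table[(size_t)x[k] * (size_t)ncols + (size_t)y[k]];
    }
}
```

GCC's -fopt-info-vec option reports whether the loop was vectorized, which is a quick way to check before timing anything.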

hayesti · Answer 2 · 2015-06-22T17:00:09+0000

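Yes, this is possible, provided the hardware supports a gather instruction. On x86 that means AVX2, which arrived with Haswell. In pseudocode: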

    vr1 := simd_load4(x)
    vr2 := simd_load4(y)
    vr3 := vr1 * N      // multiply by the row length N (number of columns)
    vr4 := vr3 + vr2    // flat index: x*N + y
    vr5 := simd_gather(base=&table[0][0], offsets=vr4)
    simd_store(x, vr5)
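
The offset arithmetic in the pseudocode relies on C storing a 2D array row by row, so table[i][j] sits at flat position i*N + j, where N is the row length. A minimal scalar sketch of that mapping follows; the sizes M and N and the helper names lookup_flat and lookup4_serial are made up for illustration.

```c
#include <assert.h>

enum { M = 3, N = 4 };  /* illustrative sizes, not taken from the question */

/* Hypothetical helper: read table[row][col] through a flat pointer. */
static int lookup_flat(const int *flat, int row, int col)
{
    assert(row >= 0 && row < M && col >= 0 && col < N);
    return flat[row * N + col];  /* N is the row length (number of columns) */
}

/* Serial version of the question's loop, written with flat indexing. */
static void lookup4_serial(int X[4], const int Y[4], const int *flat)
{
    for (int k = 0; k < 4; k++)
        X[k] = lookup_flat(flat, X[k], Y[k]);
}
```

Called as lookup4_serial(X, Y, &table[0][0]), this gives the same result as the original loop and makes the index computation that a gather needs explicit.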

With SSE/AVX2 intrinsics:

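    #include <immintrin.h>  /* AVX2 gather; build with e.g. gcc -mavx2 */

    /* Assumes int x[4], y[4] are 16-byte aligned and table is int table[M][N]. */
    __m128i vr1 = _mm_load_si128((const __m128i *)x);
    __m128i vr2 = _mm_load_si128((const __m128i *)y);
    __m128i vr3 = _mm_mullo_epi32(vr1, _mm_set1_epi32(N));    /* x[k] * N (row length) */
    __m128i vr4 = _mm_add_epi32(vr3, vr2);                    /* flat index x[k]*N + y[k] */
    __m128i vr5 = _mm_i32gather_epi32(&table[0][0], vr4, 4);  /* scale 4 = sizeof(int) */
    _mm_store_si128((__m128i *)x, vr5);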
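
For experimenting with the gather approach, here is a self-contained sketch that wraps it in a function and falls back to the serial loop when AVX2 is unavailable. It is an illustration under assumptions, not the answer's original code: the names gather4, gather4_avx2, gather4_fallback and the ncols parameter are invented here, the table is assumed to hold int elements, and the target attribute and __builtin_cpu_supports are GCC/Clang extensions.

```c
#include <assert.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define HAVE_X86_GATHER 1
#endif

/* Serial fallback: the loop from the question, with a flat table pointer. */
static void gather4_fallback(int *x, const int *y, const int *table, int ncols)
{
    for (int k = 0; k < 4; k++)
        x[k] = table[x[k] * ncols + y[k]];
}

#ifdef HAVE_X86_GATHER
/* AVX2 path. The target attribute lets GCC/Clang compile this one function
   for AVX2 even when the rest of the file is built without -mavx2. */
__attribute__((target("avx2")))
static void gather4_avx2(int *x, const int *y, const int *table, int ncols)
{
    /* Unaligned loads and stores, so no alignment is assumed for x and y. */
    __m128i vx  = _mm_loadu_si128((const __m128i *)x);
    __m128i vy  = _mm_loadu_si128((const __m128i *)y);
    __m128i row = _mm_mullo_epi32(vx, _mm_set1_epi32(ncols)); /* x[k] * ncols */
    __m128i idx = _mm_add_epi32(row, vy);                     /* + y[k] */
    __m128i val = _mm_i32gather_epi32(table, idx, 4);         /* scale 4 = sizeof(int) */
    _mm_storeu_si128((__m128i *)x, val);
}
#endif

/* Dispatcher: use the gather only if the CPU reports AVX2 at run time. */
static void gather4(int *x, const int *y, const int *table, int ncols)
{
    assert(ncols > 0);
#ifdef HAVE_X86_GATHER
    if (__builtin_cpu_supports("avx2")) {
        gather4_avx2(x, y, table, ncols);
        return;
    }
#endif
    gather4_fallback(x, y, table, ncols);
}
```

A call looks like gather4(X, Y, &table[0][0], N). Both paths should produce identical results, which makes it easy to time one against the other on the actual table sizes.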

Lundin · Answer 3 · 2015-06-22T06:11:22+0000

If you were copying adjacent memory cells, you could use memcpy() to copy the entire block of data. Since that is not the case here, you might as well use a loop.

Jake 'Alquimista' LEE · Answer 4 · 2015-06-24T02:28:57+0000

This can be done on ARM NEON via the VTBL instruction.

NEON can quickly process LUTs up to 32 bytes.
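
A minimal sketch of what that could look like with NEON intrinsics, assuming the table holds bytes and fits entirely in 32 bytes (the limit mentioned above). The function name lookup8_bytes and the ncols parameter are made up for this example; on non-ARM targets the code falls back to an equivalent serial loop so it can still be compiled and checked.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: x[k] = table[x[k]][y[k]] for 8 byte-sized lanes.
   table is a row-major byte table of at most 32 entries, ncols per row. */
static void lookup8_bytes(uint8_t x[8], const uint8_t y[8],
                          const uint8_t table[32], uint8_t ncols)
{
    assert(ncols > 0);
#if defined(__ARM_NEON)
    uint8x8x4_t lut;                      /* 4 x 8 bytes = 32-byte LUT */
    for (int i = 0; i < 4; i++)
        lut.val[i] = vld1_u8(table + 8 * i);
    /* flat index y + x * ncols, computed per lane (wraps modulo 256) */
    uint8x8_t idx = vmla_u8(vld1_u8(y), vld1_u8(x), vdup_n_u8(ncols));
    vst1_u8(x, vtbl4_u8(lut, idx));       /* VTBL: out-of-range lanes give 0 */
#else
    for (int k = 0; k < 8; k++) {
        uint8_t idx = (uint8_t)(x[k] * ncols + y[k]);
        x[k] = idx < 32 ? table[idx] : 0; /* mirror VTBL's out-of-range rule */
    }
#endif
}
```

Larger tables or wider element types do not fit this pattern directly, so this only applies when the lookup table is that small.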
