I am playing around with the new AVX-512 instruction sets and trying to understand how they work and how they can be used.
What I am trying to do is interleave specific data, selected by a mask. My little test loads x * 32 aligned values (uint32_t) from memory into two vector registers and compresses them using a dynamically created mask (Fig. 1). The resulting vector registers are scattered back to memory, so that the contents of the two vector registers are interleaved (Fig. 2).

Figure 1: Compression of two registers of a data vector using the same dynamically created mask.

Figure 2: Scatter storage for interleaving compressed data.
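To make the intended semantics concrete before showing the intrinsics, here is a scalar reference sketch of the operation (the function name and layout are my own, not from the code below): keep the lanes selected by the mask from two 16-element blocks and write the survivors alternately. Since both registers are compressed with the same mask, pairing the surviving lanes directly is equivalent to compressing first and interleaving afterwards.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar reference (hypothetical naming): take two blocks of 16 uint32_t
// values, keep only the lanes selected by `mask`, and write the survivors
// interleaved (one element from each block, alternating) into `result`.
// Returns the number of elements written.
std::size_t compressInterleaveRef( uint32_t const * const block1,
                                   uint32_t const * const block2,
                                   uint16_t const mask,
                                   uint32_t * const result ) {
    std::size_t out = 0;
    for( int lane = 0; lane < 16; ++lane ) {
        if( mask & ( 1u << lane ) ) {
            result[out++] = block1[lane];   // element from the first register
            result[out++] = block2[lane];   // element from the second register
        }
    }
    return out;
}
```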
My code is as follows:
void zipThem( uint32_t const * const data, __mmask16 const maskCompress, __m512i const vindex, uint32_t * const result ) {
   __m512i zeroVec   = _mm512_setzero_epi32();
   __m512i dataVec_1 = _mm512_load_epi32( data );        // aligned load of the first 16 values
   __m512i dataVec_2 = _mm512_load_epi32( data + 16 );   // aligned load of the next 16 values
   __m512i compVec_1 = _mm512_maskz_compress_epi32( maskCompress, dataVec_1 );
   __m512i compVec_2 = _mm512_maskz_compress_epi32( maskCompress, dataVec_2 );
   // store only the lanes that received a compressed (non-zero) element
   __mmask16 maskStore = _mm512_cmp_epi32_mask( zeroVec, compVec_1, _MM_CMPINT_NE );
   _mm512_mask_i32scatter_epi32( result,     maskStore, vindex, compVec_1, 1 );
   _mm512_mask_i32scatter_epi32( result + 1, maskStore, vindex, compVec_2, 1 );
}
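For the two scatters above to interleave, `vindex` presumably holds byte offsets (the scatter uses scale = 1): lane i of compVec_1 should land at result[2*i], and, via the `result + 1` base (4 bytes further), lane i of compVec_2 at result[2*i + 1]. A sketch of how such an index vector could be filled on the scalar side (this helper is my assumption about the caller, it is not shown in the question):

```cpp
#include <cstdint>

// Hypothetical helper: fill the 16 scatter indices so that, with scale = 1
// (byte addressing), lane i is written to result[2*i]. Each interleaved pair
// occupies 2 uint32_t slots = 8 bytes, hence the offset 8*i.
void fillInterleaveByteOffsets( int32_t * const idx /* 16 entries */ ) {
    for( int i = 0; i < 16; ++i ) {
        idx[i] = 8 * i;   // 2*i uint32_t slots, 4 bytes each
    }
}
```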
I compiled everything with
-O3 -march=knl -lmemkind -mavx512f -mavx512pf
I call the method for 100'000'000 elements. To get an overview of the behavior of the scatter store, I repeated the measurement with different values for maskCompress. I expected some correlation between the execution time and the number of bits set in maskCompress, but I noticed that the tests took roughly the same time to execute. Here is the result of a performance test:
Figure 3: Measurement results. The x axis represents the number of elements written, depending on the mask. The y axis shows performance.
As you can see, performance increases as more data is written to memory.
I did a little research and came across the following: latencies of AVX-512 instructions. According to that link, the latency of the instructions used is constant. But to be honest, this behavior confuses me a bit.
Regarding the answers of Christoph and Peter, I changed my approach a bit. Since I have no idea how I could use unpackhi / unpacklo to interleave sparse vector registers, I simply combined the AVX-512 compress with a shuffle (vpermi):
int zip_store_vpermit_cnt( uint32_t const * const data, int const compressMask, uint32_t * const result, std::ofstream & log ) {
   __m512i data1     = _mm512_undefined_epi32();
   __m512i data2     = _mm512_undefined_epi32();
   __m512i comp_vec1 = _mm512_undefined_epi32();
   __m512i comp_vec2 = _mm512_undefined_epi32();
   __m512i zero      = _mm512_setzero_epi32();
   __mmask16 comp_mask = compressMask;
   __mmask16 shuffle_mask;                      // currently unused
   uint32_t store_mask = 0;
   __m512i shuffle_idx_lo = _mm512_set_epi32( 23,  7, 22,  6, 21,  5, 20,  4, 19,  3, 18,  2, 17,  1, 16,  0 );
   __m512i shuffle_idx_hi = _mm512_set_epi32( 31, 15, 30, 14, 29, 13, 28, 12, 27, 11, 26, 10, 25,  9, 24,  8 );
   std::size_t pos = 0;
   int pcount  = 0;
   int fullVec = 0;
   for( std::size_t i = 0; i < ELEM_COUNT; i += 32 ) {
      data1 = _mm512_maskz_compress_epi32( comp_mask, _mm512_load_epi32( &(data[i]) ) );
      data2 = _mm512_maskz_compress_epi32( comp_mask, _mm512_load_epi32( &(data[i+16]) ) );
      shuffle_mask = _mm512_cmp_epi32_mask( zero, data2, _MM_CMPINT_NE );
      pcount       = 2 * __builtin_popcount( comp_mask );
      store_mask   = (uint32_t)( ( 1ull << pcount ) - 1 );   // low pcount bits set (std::pow was overkill)
      fullVec      = pcount / 17;                            // 1 if both result vectors are full, 0 otherwise
      comp_vec1 = _mm512_permutex2var_epi32( data1, shuffle_idx_lo, data2 );
      _mm512_mask_storeu_epi32( &(result[pos]), store_mask, comp_vec1 );
      pos += fullVec * 16 + ( 1 - fullVec ) * pcount;
      if( fullVec ) {
         comp_vec2 = _mm512_permutex2var_epi32( data1, shuffle_idx_hi, data2 );
         _mm512_mask_storeu_epi32( &(result[pos]), store_mask >> 16, comp_vec2 );
         pos += pcount - 16;
      }
   }
   return static_cast<int>( pos );
}
This way, the sparse data in two vector registers can be interleaved. Unfortunately, I have to compute the mask for the store manually, which seems quite expensive. One could use a LUT to avoid the computation, but I don't think that is how it is meant to be done.
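For reference, a scalar model of what the two-source permute (vpermt2d / vpermi2d) does with the shuffle_idx_lo pattern above: index values 0-15 select a lane from the first source, 16-31 from the second, so 0, 16, 1, 17, ... interleaves the low 8 lanes of both registers. This helper is my own illustration, not part of the code above.

```cpp
#include <cstdint>

// Scalar model of _mm512_permutex2var_epi32: idx entries 0..15 pick a lane
// from a, entries 16..31 pick lane (idx - 16) from b. With the pattern
// 0,16,1,17,... (shuffle_idx_lo) this interleaves the low 8 lanes of a and b.
void permutex2varRef( uint32_t const * const a, uint32_t const * const b,
                      int32_t const * const idx, uint32_t * const out ) {
    for( int lane = 0; lane < 16; ++lane ) {
        int const sel = idx[lane] & 31;   // only the low 5 bits are used
        out[lane] = ( sel < 16 ) ? a[sel] : b[sel - 16];
    }
}
```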
Figure 4: Performance test results of 4 different types of storage.
I know this is not the usual kind of question, but I have 3 questions on this topic and I hope someone can help me:
Why does a masked store with only one set bit take the same time as a masked store where all bits are set?
Does anyone have experience with, or good documentation on, the behavior of the AVX-512 scatter store?
Is there an easier or more efficient way to interleave two vector registers?
Thank you for your help!
Yours faithfully