Costs of the new AVX512 instructions

I am playing with the new AVX512 instruction set and trying to understand how the instructions work and how they can be used.

What I am trying to do is interleave certain data selected by a mask. My little test loads x * 32 bytes of aligned data from memory into two vector registers and compresses them using a dynamic mask (Fig. 1). The resulting vector registers are scattered to memory, so that the two vector registers end up interleaved (Fig. 2).

Figure 1: Compression of two vector registers using the same dynamically created mask.

Figure 2: Scatter store to interleave the compressed data.

My code is as follows:

void zipThem( uint32_t const * const data, __mmask16 const maskCompress, __m512i const vindex, uint32_t * const result ) {
   /* Initialize a vector register containing zeroes to get the store mask */
   __m512i zeroVec   = _mm512_setzero_epi32();
   /* Load data */
   __m512i dataVec_1 = _mm512_load_epi32( data );
   __m512i dataVec_2 = _mm512_load_epi32( data + 16 );
   /* Compress the data */
   __m512i compVec_1 = _mm512_maskz_compress_epi32( maskCompress, dataVec_1 );
   __m512i compVec_2 = _mm512_maskz_compress_epi32( maskCompress, dataVec_2 );
   /* Get the store mask by comparing the compressed register with the zero register (4 means !=) */
   __mmask16 maskStore = _mm512_cmp_epi32_mask( zeroVec, compVec_1, 4 );
   /* Interleave the selected data */
   _mm512_mask_i32scatter_epi32( result,     maskStore, vindex, compVec_1, 1 );
   _mm512_mask_i32scatter_epi32( result + 1, maskStore, vindex, compVec_2, 1 );
}
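For completeness, here is a minimal sketch of how such a call could look. The buffer sizes, the vindex values and the maskCompress value below are placeholders for illustration, not my real test driver; the idea is that vindex holds byte offsets 0, 8, 16, …, so that with scale 1 and the result + 1 base of the second scatter the two registers end up interleaved:

#include <immintrin.h>
#include <cstdint>

void zipThem( uint32_t const * const data, __mmask16 const maskCompress,
              __m512i const vindex, uint32_t * const result );

int main() {
    alignas(64) uint32_t input[32];
    alignas(64) uint32_t output[32] = { 0 };
    /* Avoid zero values, since the store mask is derived from a != 0 compare */
    for( int i = 0; i < 32; ++i ) input[i] = i + 1;

    /* Byte offsets 0, 8, 16, ..., 120 for the scatter */
    __m512i vindex = _mm512_set_epi32( 120, 112, 104, 96, 88, 80, 72, 64,
                                        56,  48,  40, 32, 24, 16,  8,  0 );
    /* Keep the lower 8 elements of each 16-element register */
    __mmask16 maskCompress = 0x00FF;

    zipThem( input, maskCompress, vindex, output );
    /* output now starts with 1, 17, 2, 18, 3, 19, ... */
    return 0;
}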

I compiled everything with

-O3 -march=knl -lmemkind -mavx512f -mavx512pf

I call the method for 100'000'000 elements. To get an overview of the behavior of the scatter store, I repeated this measurement with different values for maskCompress. I expected some correlation between the execution time and the number of bits set in maskCompress, but I noticed that the tests took about the same time to execute. Here is the result of a performance test:

Figure 3: Measurement results. The x axis represents the number of written elements, depending on the mask. The y axis shows the performance.

As you can see, performance gets higher when more data is written to memory.
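To make clear how the numbers in Figure 3 were obtained: the measurement loop is essentially of the following shape. This is only a sketch with names I made up here (timeZip, elemCount), not my exact benchmark code:

#include <immintrin.h>
#include <chrono>
#include <cstddef>
#include <cstdint>

void zipThem( uint32_t const * const data, __mmask16 const maskCompress,
              __m512i const vindex, uint32_t * const result );

/* Call zipThem over the whole buffer and return the elapsed wall-clock time
 * in seconds for a given maskCompress value. */
double timeZip( uint32_t const * data, __mmask16 maskCompress, __m512i vindex,
                uint32_t * result, std::size_t elemCount ) {
    /* Each call consumes 32 input elements and writes 2 * popcount(mask) of them */
    std::size_t const perBlock = 2 * __builtin_popcount( maskCompress );
    std::size_t outPos = 0;
    auto start = std::chrono::high_resolution_clock::now();
    for( std::size_t i = 0; i < elemCount; i += 32 ) {
        zipThem( data + i, maskCompress, vindex, result + outPos );
        outPos += perBlock;
    }
    auto stop = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double>( stop - start ).count();
}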

I did a little research and came across a page on the latency of AVX512 instructions. According to it, the latency of the instructions used is constant. But to be honest, I am a little confused by this behavior.

Following the answers of Christoph and Peter, I changed my approach a bit. Since I have no idea how I could use unpackhi / unpacklo to interleave sparse vector registers, I just combined the AVX512 compress with a full shuffle (vpermi2d):

int zip_store_vpermit_cnt( uint32_t const * const data, int const compressMask, uint32_t * const result, std::ofstream & log ) {
   __m512i zero        = _mm512_setzero_epi32();
   __m512i data1       = _mm512_undefined_epi32();
   __m512i data2       = _mm512_undefined_epi32();
   __m512i comp_vec1   = _mm512_undefined_epi32();
   __m512i comp_vec2   = _mm512_undefined_epi32();
   __mmask16 comp_mask = compressMask;
   __mmask16 shuffle_mask;
   uint32_t store_mask = 0;
   /* Index vectors that interleave the low / high halves of two registers */
   __m512i shuffle_idx_lo = _mm512_set_epi32( 23, 7, 22, 6, 21, 5, 20, 4, 19, 3, 18, 2, 17, 1, 16, 0 );
   __m512i shuffle_idx_hi = _mm512_set_epi32( 31, 15, 30, 14, 29, 13, 28, 12, 27, 11, 26, 10, 25, 9, 24, 8 );
   std::size_t pos = 0;
   int pcount  = 0;
   int fullVec = 0;
   for( std::size_t i = 0; i < ELEM_COUNT; i += 32 ) {
      /* Loading and compressing the current data */
      data1 = _mm512_maskz_compress_epi32( comp_mask, _mm512_load_epi32( &(data[i]) ) );
      data2 = _mm512_maskz_compress_epi32( comp_mask, _mm512_load_epi32( &(data[i+16]) ) );
      shuffle_mask = _mm512_cmp_epi32_mask( zero, data2, 4 );
      /* Interleaving the two vector registers, depending on the compressMask */
      pcount     = 2 * __builtin_popcount( comp_mask );
      store_mask = std::pow( 2, pcount ) - 1;
      fullVec    = pcount / 17;
      comp_vec1  = _mm512_permutex2var_epi32( data1, shuffle_idx_lo, data2 );
      comp_vec2  = _mm512_permutex2var_epi32( data1, shuffle_idx_hi, data2 );
      _mm512_mask_storeu_epi32( &(result[pos]), store_mask, comp_vec1 );
      pos += fullVec * 16 + ( 1 - fullVec ) * pcount;      // same as pos += ( pcount >= 16 ) ? 16 : pcount;
      _mm512_mask_storeu_epi32( &(result[pos]), ( store_mask >> 16 ), comp_vec2 );
      pos += fullVec * ( pcount - 16 );                    // same as pos += ( pcount >= 16 ) ? pcount - 16 : 0;
      // a simple _mm512_store_epi32 produces a segfault, because the memory isn't aligned anymore :(
   }
   return pos;
}

This way, the sparse data of two vector registers can be interleaved. Unfortunately, I have to compute the mask for the store manually, which seems quite expensive. A LUT could be used to avoid that computation (see the sketch below), but I don't think that is how it should be done.
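Just to illustrate what I mean by a LUT: since comp_mask has at most 16 set bits, a table with 17 entries indexed by the popcount would already be enough. The names storeMaskLUT / initStoreMaskLUT below are made up for this sketch:

#include <cstdint>

/* Precomputed store masks indexed by the popcount of comp_mask (0..16).
 * Entry k has 2*k bits set: the low 16 bits mask the first store, the
 * upper bits (shifted down by 16) mask the second one. */
static uint32_t storeMaskLUT[17];

static void initStoreMaskLUT() {
    for( int k = 0; k <= 16; ++k ) {
        int const bits = 2 * k;
        storeMaskLUT[k] = ( bits >= 32 ) ? 0xFFFFFFFFu
                                         : ( ( 1u << bits ) - 1u );
    }
}

/* Inside the loop this would replace the std::pow call:
 *    store_mask = storeMaskLUT[ __builtin_popcount( comp_mask ) ];
 */

Of course a plain ( 1u << pcount ) - 1u, with a special case for pcount == 32, would avoid both the table and std::pow, but the scalar mask computation as such still remains.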

Figure 4: Performance test results of 4 different store variants.

I know this is not the usual kind of question, but I have 3 questions related to this topic and I hope someone can help me.

• Why does a masked store with only one bit set take the same time as a masked store where all bits are set?

• Does anyone have experience with, or good documentation for, the behavior of the AVX512 scatter store?

• Is there an easier or more efficient way to interleave two vector registers?

Thank you for your help!

Yours faithfully

Tags: performance, x86, avx512, intrinsics

No one has answered this question yet.

See similar question:

AVX2 what is the most efficient way to pack left based on a mask?
