I am playing around with the new AVX-512 instruction sets and trying to understand how they work and how they can be used.
What I am trying to do is interleave specific data, selected by a mask. My little test loads x * 32 aligned values (uint32_t) from memory into two vector registers and compresses them using a dynamically created mask (Fig. 1). The resulting vector registers are scattered back to memory, so that the contents of the two vector registers are interleaved (Fig. 2).

Figure 1: Compression of two registers of a data vector using the same dynamically created mask.

Figure 2: Scatter storage for interleaving compressed data.
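To make the intended semantics concrete before showing the intrinsics, here is a scalar reference sketch of the operation (the function name and layout are my own, not from the code below): keep the lanes selected by the mask from two 16-element blocks and write the survivors alternately. Since both registers are compressed with the same mask, pairing the surviving lanes directly is equivalent to compressing first and interleaving afterwards.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar reference (hypothetical naming): take two blocks of 16 uint32_t
// values, keep only the lanes selected by `mask`, and write the survivors
// interleaved (one element from each block, alternating) into `result`.
// Returns the number of elements written.
std::size_t compressInterleaveRef( uint32_t const * const block1,
                                   uint32_t const * const block2,
                                   uint16_t const mask,
                                   uint32_t * const result ) {
    std::size_t out = 0;
    for( int lane = 0; lane < 16; ++lane ) {
        if( mask & ( 1u << lane ) ) {
            result[out++] = block1[lane];   // element from the first register
            result[out++] = block2[lane];   // element from the second register
        }
    }
    return out;
}
```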
My code is as follows:
void zipThem( uint32_t const * const data, __mmask16 const maskCompress, __m512i const vindex, uint32_t * const result ) {
   __m512i zeroVec   = _mm512_setzero_epi32();
   __m512i dataVec_1 = _mm512_load_epi32( data );        // aligned load of the first 16 values
   __m512i dataVec_2 = _mm512_load_epi32( data + 16 );   // aligned load of the next 16 values
   __m512i compVec_1 = _mm512_maskz_compress_epi32( maskCompress, dataVec_1 );
   __m512i compVec_2 = _mm512_maskz_compress_epi32( maskCompress, dataVec_2 );
   // store only the lanes that received a compressed (non-zero) element
   __mmask16 maskStore = _mm512_cmp_epi32_mask( zeroVec, compVec_1, _MM_CMPINT_NE );
   _mm512_mask_i32scatter_epi32( result,     maskStore, vindex, compVec_1, 1 );
   _mm512_mask_i32scatter_epi32( result + 1, maskStore, vindex, compVec_2, 1 );
}
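For the two scatters above to interleave, `vindex` presumably holds byte offsets (the scatter uses scale = 1): lane i of compVec_1 should land at result[2*i], and, via the `result + 1` base (4 bytes further), lane i of compVec_2 at result[2*i + 1]. A sketch of how such an index vector could be filled on the scalar side (this helper is my assumption about the caller, it is not shown in the question):

```cpp
#include <cstdint>

// Hypothetical helper: fill the 16 scatter indices so that, with scale = 1
// (byte addressing), lane i is written to result[2*i]. Each interleaved pair
// occupies 2 uint32_t slots = 8 bytes, hence the offset 8*i.
void fillInterleaveByteOffsets( int32_t * const idx /* 16 entries */ ) {
    for( int i = 0; i < 16; ++i ) {
        idx[i] = 8 * i;   // 2*i uint32_t slots, 4 bytes each
    }
}
```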
I compiled everything with
-O3 -march=knl -lmemkind -mavx512f -mavx512pf
I call the method for 100'000'000 elements. To get an overview of the behavior of the scatter store, I repeated the measurement with different values for maskCompress. I expected some correlation between the execution time and the number of bits set in maskCompress, but I noticed that the tests took roughly the same time to execute. Here is the result of a performance test:
Figure 3: Measurement results. The x axis represents the number of elements written, depending on the mask. The y axis shows performance.
As you can see, performance increases as more data is written to memory.
I did a little research and came across the following: latencies of AVX-512 instructions. According to that link, the latency of the instructions used is constant. But to be honest, this behavior confuses me a bit.
Regarding the answers of Christoph and Peter, I changed my approach a bit. Since I have no idea how I could use unpackhi / unpacklo to interleave sparse vector registers, I simply combined the AVX-512 compress with a shuffle (vpermi):
int zip_store_vpermit_cnt( uint32_t const * const data, int const compressMask, uint32_t * const result, std::ofstream & log ) {
   __m512i data1     = _mm512_undefined_epi32();
   __m512i data2     = _mm512_undefined_epi32();
   __m512i comp_vec1 = _mm512_undefined_epi32();
   __m512i comp_vec2 = _mm512_undefined_epi32();
   __m512i zero      = _mm512_setzero_epi32();
   __mmask16 comp_mask = compressMask;
   __mmask16 shuffle_mask;                      // currently unused
   uint32_t store_mask = 0;
   __m512i shuffle_idx_lo = _mm512_set_epi32( 23,  7, 22,  6, 21,  5, 20,  4, 19,  3, 18,  2, 17,  1, 16,  0 );
   __m512i shuffle_idx_hi = _mm512_set_epi32( 31, 15, 30, 14, 29, 13, 28, 12, 27, 11, 26, 10, 25,  9, 24,  8 );
   std::size_t pos = 0;
   int pcount  = 0;
   int fullVec = 0;
   for( std::size_t i = 0; i < ELEM_COUNT; i += 32 ) {
      data1 = _mm512_maskz_compress_epi32( comp_mask, _mm512_load_epi32( &(data[i]) ) );
      data2 = _mm512_maskz_compress_epi32( comp_mask, _mm512_load_epi32( &(data[i+16]) ) );
      shuffle_mask = _mm512_cmp_epi32_mask( zero, data2, _MM_CMPINT_NE );
      pcount       = 2 * __builtin_popcount( comp_mask );
      store_mask   = (uint32_t)( ( 1ull << pcount ) - 1 );   // low pcount bits set (std::pow was overkill)
      fullVec      = pcount / 17;                            // 1 if both result vectors are full, 0 otherwise
      comp_vec1 = _mm512_permutex2var_epi32( data1, shuffle_idx_lo, data2 );
      _mm512_mask_storeu_epi32( &(result[pos]), store_mask, comp_vec1 );
      pos += fullVec * 16 + ( 1 - fullVec ) * pcount;
      if( fullVec ) {
         comp_vec2 = _mm512_permutex2var_epi32( data1, shuffle_idx_hi, data2 );
         _mm512_mask_storeu_epi32( &(result[pos]), store_mask >> 16, comp_vec2 );
         pos += pcount - 16;
      }
   }
   return static_cast<int>( pos );
}
This way, the sparse data in two vector registers can be interleaved. Unfortunately, I have to compute the mask for the store manually, which seems quite expensive. One could use a LUT to avoid the computation, but I don't think that is how it is meant to be done.
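For reference, a scalar model of what the two-source permute (vpermt2d / vpermi2d) does with the shuffle_idx_lo pattern above: index values 0-15 select a lane from the first source, 16-31 from the second, so 0, 16, 1, 17, ... interleaves the low 8 lanes of both registers. This helper is my own illustration, not part of the code above.

```cpp
#include <cstdint>

// Scalar model of _mm512_permutex2var_epi32: idx entries 0..15 pick a lane
// from a, entries 16..31 pick lane (idx - 16) from b. With the pattern
// 0,16,1,17,... (shuffle_idx_lo) this interleaves the low 8 lanes of a and b.
void permutex2varRef( uint32_t const * const a, uint32_t const * const b,
                      int32_t const * const idx, uint32_t * const out ) {
    for( int lane = 0; lane < 16; ++lane ) {
        int const sel = idx[lane] & 31;   // only the low 5 bits are used
        out[lane] = ( sel < 16 ) ? a[sel] : b[sel - 16];
    }
}
```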
Figure 4: Performance test results of 4 different types of storage.
I know this is not the usual kind of question, but I have 3 questions on this topic and I hope someone can help me:
Why does a masked store with only one set bit take the same time as a masked store where all bits are set?
Does anyone have experience with, or good documentation on, the behavior of the AVX-512 scatter store?
Is there an easier or more efficient way to interleave two vector registers?
Thank you for your help!
Yours faithfully