Calculating the inner product of vectors whose scalar values are restricted to 0, 1 and 2, using AVX

I am computing the inner product of two columns with tens of thousands of dimensions. The values can only be 0, 1, or 2, so they can be stored as chars. If I vectorize the computation on a processor with the AVX flag, I expect a 32x speedup. The problem is that multiplication automatically promotes the chars to integers, which are 4 bytes, so at best I can get only an 8x speedup. Is it possible to achieve a 32x speedup?

By the way, I am using Linux (Fedora 22 at the moment) with g++ 5.1.

3 answers

Assuming you have AVX2 (and not just AVX, which only covers floating point), you can use the vpmaddubsw instruction, whose intrinsic is:

 __m256i _mm256_maddubs_epi16 (__m256i a, __m256i b) 

This performs 8 bit x 8 bit multiplies (signed x unsigned, but that doesn't matter for your case) and then adds adjacent pairs of terms to give 16 bit results. Effectively this gives you 32 x 8 bit x 8 bit multiplies in one instruction.

If you do not have AVX2, you can use the 128-bit SSE version ( _mm_maddubs_epi16 ) to get 16 x 8 bit x 8 bit multiplies in one instruction.

Note that the horizontal summation of the 16-bit terms takes several instructions, but since your input range is quite small you only need to perform this horizontal summation relatively infrequently. One possible method (for SSE):

 v = _mm_madd_epi16(v, _mm_set1_epi16(1));   // unpack/sum 16 -> 32
 v = _mm_add_epi32(v, _mm_srli_si128(v, 8)); // shift and add 32 bit terms
 v = _mm_add_epi32(v, _mm_srli_si128(v, 4));
 sum = _mm_cvtsi128_si32(v);                 // extract sum as scalar

The AVX2 implementation of the above is left as an exercise for the reader.


It seems that the AVX instruction set has no 8-bit multiplication, only addition. The Intel Intrinsics Guide lists no 8-bit operations under _mm_mul* . ( edit : there actually is an 8-bit multiply, but it has a misleading name - see @PaulR's answer)

However, there is an alternative approach. Since the only valid values are 0, 1, and 2, and you are computing an inner product, you can use bitwise operations instead of multiplications.

In the first vector A, use the following encoding:

 0 = 0b00000000 = 0x00
 1 = 0b00010011 = 0x13
 2 = 0b00001111 = 0x0F

In the second vector B, use the following encoding:

 0 = 0b00000000 = 0x00
 1 = 0b00011100 = 0x1C
 2 = 0b00001111 = 0x0F

Now compute popcount(A & B) . The AND leaves each corresponding pair of cells with exactly 0, 1, 2, or 4 set bits - equal to the product of the two encoded values - and popcount adds them all together. A single value fits into 5 bits, so if you can use 256-bit ANDs you can process 51 values per operation.


I think it is worth trying bit operations.

Suppose all numbers are either 0 or 1. Then you can pack both vectors into bitmaps, and the inner product is computed by:

 for (int i = 0; i < N; i += 256)
     res += popcount(A[i..i+255] & B[i..i+255]);

The AND operation is, of course, present in AVX/AVX2. The trickiest part is computing popcount for YMM registers quickly.

Now suppose we are given 0, 1, and 2. For each vector A of integers, build two bit vectors A1 and A2:

 A1[i] = (A[i] >= 1);
 A2[i] = (A[i] >= 2);

Now we can notice that:

 A[i] * B[i] = A1[i] * B1[i] + A1[i] * B2[i] + A2[i] * B1[i] + A2[i] * B2[i]; 

So we can compute the inner product with the following pseudocode:

 for (int i = 0; i < N; i += 256) {
     res += popcount(A1[i..i+255] & B1[i..i+255]);
     res += popcount(A2[i..i+255] & B1[i..i+255]);
     res += popcount(A1[i..i+255] & B2[i..i+255]);
     res += popcount(A2[i..i+255] & B2[i..i+255]);
 }

This processes 256 elements per iteration, but each iteration becomes 4 times slower, so effectively it is 64 elements per operation. Since popcount is likely to be the slowest part of the computation, we can say that N / 64 popcount_256 operations are required to compute the inner product.

EDIT: I decided to add a small example for this idea:

 A  = {01212012210};  // input array A
 B  = {21221100120};  // input array B
 A1 = {01111011110};  // A is stored as two bit-planes like this
 A2 = {00101001100};
 B1 = {11111100110};  // B is stored as two similar bit-planes
 B2 = {10110000010};
 A1 & B1 = {01111000110}, popcount = 6;  // pairwise ANDs + popcounts
 A1 & B2 = {00110000010}, popcount = 3;
 A2 & B1 = {00101000100}, popcount = 3;
 A2 & B2 = {00100000000}, popcount = 1;
 res = 6 + 3 + 3 + 1 = 13                // sum of all the popcounts
