I guess it's worth trying bit operations here.
Suppose all numbers are either 0 or 1. Then you can pack both vectors into bitmaps, and the inner product is calculated by:
for (int i = 0; i < N; i += 256) res += popcount(A[i..i+255] & B[i..i+255]);
The AND operation is, of course, available in AVX/AVX2. The hardest question is how to compute popcount for YMM registers quickly.
Now suppose the elements can be 0, 1, or 2. For each integer vector A, build two bit vectors A1 and A2:
A1[i] = (A[i] >= 1); A2[i] = (A[i] >= 2);
Now we can notice that:
A[i] * B[i] = A1[i] * B1[i] + A1[i] * B2[i] + A2[i] * B1[i] + A2[i] * B2[i];
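This identity can be sanity-checked exhaustively over all nine value pairs (a quick check I'm adding for illustration; the helper name is mine):

```cpp
// Check the decomposition A*B == A1*B1 + A1*B2 + A2*B1 + A2*B2
// for every pair of values in {0, 1, 2}.
bool identity_holds() {
    for (int a = 0; a <= 2; ++a)
        for (int b = 0; b <= 2; ++b) {
            int a1 = (a >= 1), a2 = (a >= 2);
            int b1 = (b >= 1), b2 = (b >= 2);
            if (a * b != a1 * b1 + a1 * b2 + a2 * b1 + a2 * b2)
                return false;
        }
    return true;
}
```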
So we can calculate the inner product with the following pseudo-code:
for (int i = 0; i < N; i += 256) {
    res += popcount(A1[i..i+255] & B1[i..i+255]);
    res += popcount(A2[i..i+255] & B1[i..i+255]);
    res += popcount(A1[i..i+255] & B2[i..i+255]);
    res += popcount(A2[i..i+255] & B2[i..i+255]);
}
This processes 256 elements per iteration, but each iteration does 4 times as much work, so effectively it handles 64 elements per popcount. Since popcount is likely to be the slowest part of the calculation, we can say that about N / 64 popcount_256 operations are required to compute the inner product.
EDIT: I decided to add a small example for this idea:
A  = {01212012210};  // input array A
B  = {21221100120};  // input array B
A1 = {01111011110};  // A is stored in two halves like this
A2 = {00101001100};
B1 = {11111100110};  // B is stored in similar halves
B2 = {10110000010};
A1 & B1 = {01111000110}, popcount = 6;  // computing pairwise ANDs + popcounts
A1 & B2 = {00110000010}, popcount = 3;
A2 & B1 = {00101000100}, popcount = 3;
A2 & B2 = {00100000000}, popcount = 1;
res = 6 + 3 + 3 + 1 = 13  // summing all the popcounts