I guess it's worth trying bit operations here.
Suppose all numbers are either 0 or 1. Then you can pack both vectors into bitmaps, and the inner product is calculated by:
for (int i = 0; i < N; i += 256) res += popcount(A[i..i+255] & B[i..i+255]);
The AND operation is, of course, available in AVX/AVX2. The hardest question is how to compute popcount for YMM registers quickly.
Now suppose the elements can be 0, 1, or 2. For each integer vector A, build two bit vectors A1 and A2:
A1[i] = (A[i] >= 1); A2[i] = (A[i] >= 2);
Now we can notice that:
A[i] * B[i] = A1[i] * B1[i] + A1[i] * B2[i] + A2[i] * B1[i] + A2[i] * B2[i];
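This identity can be sanity-checked exhaustively over all nine value pairs (a quick check I'm adding for illustration; the helper name is mine):

```cpp
// Check the decomposition A*B == A1*B1 + A1*B2 + A2*B1 + A2*B2
// for every pair of values in {0, 1, 2}.
bool identity_holds() {
    for (int a = 0; a <= 2; ++a)
        for (int b = 0; b <= 2; ++b) {
            int a1 = (a >= 1), a2 = (a >= 2);
            int b1 = (b >= 1), b2 = (b >= 2);
            if (a * b != a1 * b1 + a1 * b2 + a2 * b1 + a2 * b2)
                return false;
        }
    return true;
}
```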
So we can calculate the inner product with the following pseudo-code:
for (int i = 0; i < N; i += 256) {
    res += popcount(A1[i..i+255] & B1[i..i+255]);
    res += popcount(A2[i..i+255] & B1[i..i+255]);
    res += popcount(A1[i..i+255] & B2[i..i+255]);
    res += popcount(A2[i..i+255] & B2[i..i+255]);
}
This processes 256 elements per iteration, but each iteration does 4 times as much work, so effectively it handles 64 elements per popcount. Since popcount is likely to be the slowest part of the calculation, we can say that about N / 64 popcount_256 operations are required to compute the inner product.
EDIT: I decided to add a small example for this idea:
A  = {01212012210};  // input array A
B  = {21221100120};  // input array B
A1 = {01111011110};  // A is stored in two halves like this
A2 = {00101001100};
B1 = {11111100110};  // B is stored in similar halves
B2 = {10110000010};
A1 & B1 = {01111000110}, popcount = 6;  // computing pairwise ANDs + popcounts
A1 & B2 = {00110000010}, popcount = 3;
A2 & B1 = {00101000100}, popcount = 3;
A2 & B2 = {00100000000}, popcount = 1;
res = 6 + 3 + 3 + 1 = 13  // summing all the popcounts