It will be like O (n), so best of all you can reduce the constant. One quick fix is to delete a branch. This gives the result as fast as my SSE version below if the zeros are randomly distributed. This is probably due to the fact that GCC vectorizes this cycle. However, for long starts of zeros or for a random zero density of less than 1%, the SSE version below is still faster.
int countZeroBytes_fix(char* values, int length) { int zeroCount = 0; for(int i=0; i<length; i++) { zeroCount += values[i] == 0; } return zeroCount; }
Initially, I thought that the density of zeros would matter. This is not the case, at least with SSE. Using SSE is much faster regardless of density.
Edit: in fact, it depends on the density, since the density of zeros should be less than I expected. 1/64 zeros (1.5% zeros) - one zero in 1/4 SSE is recorded, so branch prediction does not work very well. However, 1/1024 zeros (0.1% zeros) are faster (see the Times table).
SIMD is even faster if the data has long strings of zeros.
You can pack 16 bytes into the SSE register. Then you can compare all 16 bytes at the same time with zero using _mm_cmpeq_epi8 . Then, for processing zeros, you can use _mm_movemask_epi8 for the result, and most of the time it will be zero. You can get speeds of up to 16 in this case (for the first half of 1 and the second half of zero, I got more than 12 times acceleration).
Here is a table of times in seconds for 2 ^ 16 bytes (with a repeat of 10000).
1.5% zeros 50% zeros 0.1% zeros 1st half 1, 2nd half 0 countZeroBytes 0.8s 0.8s 0.8s 0.95s countZeroBytes_fix 0.16s 0.16s 0.16s 0.16s countZeroBytes_SSE 0.2s 0.15s 0.10s 0.07s
You can see the results for the last 1/2 zeros at http://coliru.stacked-crooked.com/a/67a169ddb03d907a
#include <stdio.h> #include <stdlib.h> #include <emmintrin.h> // SSE2 #include <omp.h> int countZeroBytes(char* values, int length) { int zeroCount = 0; for(int i=0; i<length; i++) { if (!values[i]) ++zeroCount; } return zeroCount; } int countZeroBytes_SSE(char* values, int length) { int zeroCount = 0; __m128i zero16 = _mm_set1_epi8(0); __m128i and16 = _mm_set1_epi8(1); for(int i=0; i<length; i+=16) { __m128i values16 = _mm_loadu_si128((__m128i*)&values[i]); __m128i cmp = _mm_cmpeq_epi8(values16, zero16); int mask = _mm_movemask_epi8(cmp); if(mask) { if(mask == 0xffff) zeroCount += 16; else { cmp = _mm_and_si128(and16, cmp); //change -1 values to 1 //hortiontal sum of 16 bytes __m128i sum1 = _mm_sad_epu8(cmp,zero16); __m128i sum2 = _mm_shuffle_epi32(sum1,2); __m128i sum3 = _mm_add_epi16(sum1,sum2); zeroCount += _mm_cvtsi128_si32(sum3); } } } return zeroCount; } int main() { const int n = 1<<16; const int repeat = 10000; char *values = (char*)_mm_malloc(n, 16); for(int i=0; i<n; i++) values[i] = rand()%64; //1.5% zeros //for(int i=0; i<n/2; i++) values[i] = 1; //for(int i=n/2; i<n; i++) values[i] = 0; int zeroCount = 0; double dtime; dtime = omp_get_wtime(); for(int i=0; i<repeat; i++) zeroCount = countZeroBytes(values,n); dtime = omp_get_wtime() - dtime; printf("zeroCount %d, time %f\n", zeroCount, dtime); dtime = omp_get_wtime(); for(int i=0; i<repeat; i++) zeroCount = countZeroBytes_SSE(values,n); dtime = omp_get_wtime() - dtime; printf("zeroCount %d, time %f\n", zeroCount, dtime); }