Fast count of zero bytes in an array

What is a quick way to count the number of zero-valued bytes in a large contiguous array? (Or, conversely, the number of non-zero bytes.) By large, I mean 2^16 bytes or more. The array's position and length can have any byte alignment.

Naive way:

    int countZeroBytes(const unsigned char* values, int length) {
        int zeroCount = 0;
        for (int i = 0; i < length; ++i)
            if (!values[i])
                ++zeroCount;
        return zeroCount;
    }

For my problem, I usually maintain zeroCount incrementally, updating it as specific changes are made to values. However, I would like a fast general method for recomputing zeroCount after an arbitrary bulk change to values. I'm sure there is a clever bit-twiddling way to do this faster, but alas, I'm a novice twiddler.
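For illustration, a minimal sketch of the incremental bookkeeping I mean (the setter name and signature are hypothetical, shown only to clarify the setup):

    /* Hypothetical single-byte setter that keeps zeroCount consistent. */
    void setByte(unsigned char* values, int* zeroCount, int i, unsigned char v) {
        *zeroCount += (v == 0) - (values[i] == 0);  /* +1 if writing a zero, -1 if overwriting one */
        values[i] = v;
    }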

EDIT: Several people have asked about the nature of the data being scanned for zeros, so I will describe it. (It would be nice if solutions remained general, though.)

Basically, imagine a world made of voxels (e.g. Minecraft), with procedurally generated terrain segmented into cubic chunks, or memory pages indexed as three-dimensional arrays. Each voxel holds a single byte identifying its material (air, stone, water, etc.). Many chunks contain only air or water, while others contain large quantities of 2-4 voxel types (dirt, sand, etc.), with roughly 2-10% of voxels being random outliers. Voxels that occur in large quantities tend to be highly clustered along each axis.
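For a concrete picture of the layout (the chunk dimensions here are hypothetical):

    /* One cubic chunk of material IDs; 0 could be the "air" material. */
    unsigned char chunk[16][16][16];
    /* Counting air voxels then reduces to counting zero bytes in the
       flat 4096-byte array at (&chunk[0][0][0], sizeof(chunk)). */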

A zero-byte counting method seems like it would be useful in a number of unrelated scenarios, hence the pursuit of a general solution.

+7
c++ c bit-manipulation
6 answers

I came up with this OpenMP implementation, which may take advantage of the array being in the local cache of each processor to actually read it in parallel.

    /* nzeros_total, i, v, and NDATA are assumed to be declared elsewhere. */
    nzeros_total = 0;
    #pragma omp parallel for reduction(+:nzeros_total)
    for (i = 0; i < NDATA; i++) {
        if (v[i] == 0)
            nzeros_total++;
    }
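Note that OpenMP support has to be enabled when compiling (e.g. /openmp in Visual Studio, -fopenmp with GCC).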

A quick benchmark, running the naive implementation (the same one the OP wrote in the question) in a loop 1000 times versus the OpenMP implementation 1000 times, taking the best time for each method, on an array of 65536 ints with a 50% probability of a zero-valued element, using Windows 7 on a quad-core processor and compiled with VStudio 2012 Ultimate, gives the following numbers:

                      DEBUG               RELEASE
    Naive method:     580 microseconds    341 microseconds
    OpenMP method:    159 microseconds     99 microseconds

NOTE: I tried #pragma loop(hint_parallel(4)), but apparently it did not make the naive version perform any better, so I assume the compiler had already applied that optimization, or that it could not be applied at all. Also, #pragma loop(no_vector) did not make the naive version perform worse.

+4

This is going to be O(n), so the best you can do is reduce the constant. One quick fix is to remove the branch. This gives a result as fast as my SSE version below if the zeros are randomly distributed, presumably because GCC vectorizes this loop. However, for long runs of zeros, or for a random zero density below 1%, the SSE version below is still faster.

    int countZeroBytes_fix(char* values, int length) {
        int zeroCount = 0;
        for (int i = 0; i < length; i++) {
            zeroCount += values[i] == 0;
        }
        return zeroCount;
    }

Initially, I thought the density of zeros would matter. It turns out it does not, at least not with SSE: using SSE is much faster regardless of the density.

Edit: actually, it does depend on the density; the density of zeros just has to be lower than I expected. With 1/64 zeros (1.5% zeros) there is on average one zero per four SSE registers, so branch prediction does not work well. However, with 1/1024 zeros (0.1% zeros) it is faster (see the table of times).

SIMD is even faster if the data has long runs of zeros.

You can pack 16 bytes into an SSE register and compare all 16 bytes against zero at once with _mm_cmpeq_epi8. You can then call _mm_movemask_epi8 on the result; most of the time the mask will be zero (no zero bytes in the block). You can get speedups of up to 16x this way (for data whose first half is 1s and second half is 0s, I got more than a 12x speedup).

Here is a table of times in seconds for 2^16 bytes (repeated 10000 times).

                          1.5% zeros   50% zeros   0.1% zeros   1st half 1, 2nd half 0
    countZeroBytes        0.8s         0.8s        0.8s         0.95s
    countZeroBytes_fix    0.16s        0.16s       0.16s        0.16s
    countZeroBytes_SSE    0.2s         0.15s       0.10s        0.07s

You can see the results for the last case (first half 1s, second half 0s) at http://coliru.stacked-crooked.com/a/67a169ddb03d907a

    #include <stdio.h>
    #include <stdlib.h>
    #include <emmintrin.h> // SSE2
    #include <omp.h>

    int countZeroBytes(char* values, int length) {
        int zeroCount = 0;
        for (int i = 0; i < length; i++) {
            if (!values[i])
                ++zeroCount;
        }
        return zeroCount;
    }

    int countZeroBytes_SSE(char* values, int length) {
        int zeroCount = 0;
        __m128i zero16 = _mm_set1_epi8(0);
        __m128i and16  = _mm_set1_epi8(1);
        for (int i = 0; i < length; i += 16) {
            __m128i values16 = _mm_loadu_si128((__m128i*)&values[i]);
            __m128i cmp = _mm_cmpeq_epi8(values16, zero16);
            int mask = _mm_movemask_epi8(cmp);
            if (mask) {
                if (mask == 0xffff)
                    zeroCount += 16;
                else {
                    cmp = _mm_and_si128(and16, cmp); // change -1 values to 1
                    // horizontal sum of 16 bytes
                    __m128i sum1 = _mm_sad_epu8(cmp, zero16);
                    __m128i sum2 = _mm_shuffle_epi32(sum1, 2);
                    __m128i sum3 = _mm_add_epi16(sum1, sum2);
                    zeroCount += _mm_cvtsi128_si32(sum3);
                }
            }
        }
        return zeroCount;
    }

    int main() {
        const int n = 1 << 16;
        const int repeat = 10000;
        char *values = (char*)_mm_malloc(n, 16);
        for (int i = 0; i < n; i++) values[i] = rand() % 64; // 1.5% zeros
        //for (int i = 0;   i < n/2; i++) values[i] = 1;
        //for (int i = n/2; i < n;   i++) values[i] = 0;

        int zeroCount = 0;
        double dtime;

        dtime = omp_get_wtime();
        for (int i = 0; i < repeat; i++)
            zeroCount = countZeroBytes(values, n);
        dtime = omp_get_wtime() - dtime;
        printf("zeroCount %d, time %f\n", zeroCount, dtime);

        dtime = omp_get_wtime();
        for (int i = 0; i < repeat; i++)
            zeroCount = countZeroBytes_SSE(values, n);
        dtime = omp_get_wtime() - dtime;
        printf("zeroCount %d, time %f\n", zeroCount, dtime);
    }
+4

In situations where zeros are common, it would be faster to check 64 bytes at a time, and only check individual bytes when the range is not all zero. If zeros are rare, this will be more expensive. This code assumes that the block length is divisible by 64. It also assumes that memcmp is as efficient as you can get.

    #include <cstring>

    int countZeroBytes(const unsigned char* values, int length) {
        static const unsigned char zeros[64] = {};
        int zeroCount = 0;
        for (int i = 0; i < length; i += 64) {
            if (::memcmp(values + i, zeros, 64) == 0) {
                zeroCount += 64;
            } else {
                for (int j = i; j < i + 64; ++j) {
                    if (!values[j])
                        ++zeroCount;
                }
            }
        }
        return zeroCount;
    }
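If the length is not a multiple of 64, a minimal sketch of one way to handle the tail (the function name is mine; it reuses the function above for the blocked part):

    int countZeroBytesAnyLength(const unsigned char* values, int length) {
        int blocked = length - (length % 64);       /* largest multiple of 64 */
        int zeroCount = countZeroBytes(values, blocked);
        for (int i = blocked; i < length; ++i)      /* naive scan of the remainder */
            zeroCount += (values[i] == 0);
        return zeroCount;
    }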
+1

You can also use the POPCNT instruction, which returns the number of set bits in a word. This lets you further simplify the code and speed it up by eliminating unnecessary branches. Here is an example with AVX2 and POPCNT:

    #include <stdint.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <immintrin.h>

    int countZeroes(uint8_t* bytes, int length) {
        const __m256i vZero = _mm256_setzero_si256();
        int count = 0;
        for (int n = 0; n < length; n += 32) {
            __m256i v = _mm256_load_si256((const __m256i*)&bytes[n]);
            v = _mm256_cmpeq_epi8(v, vZero);
            int k = _mm256_movemask_epi8(v);
            count += _mm_popcnt_u32(k);
        }
        return count;
    }

    #define SIZE 1024

    int main() {
        uint8_t bytes[SIZE] __attribute__((aligned(32)));
        for (int z = 0; z < SIZE; ++z)
            bytes[z] = z % 2;
        int n = countZeroes(bytes, SIZE);
        printf("%d\n", n);
        return 0;
    }
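Note that _mm256_load_si256 requires 32-byte alignment (hence the aligned attribute on bytes); _mm256_loadu_si256 would lift that requirement. The loop also assumes length is a multiple of 32, and the code has to be built with AVX2 enabled (e.g. -mavx2 on GCC/Clang).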
+1

It might be faster to avoid the condition and trade it for a lookup and an add:

    char isCharZeroLUT[256] = { 1 };  /* 1, 0, 0, ...: only index 0 maps to 1 */

    int zeroCount = 0;
    for (int i = 0; i < length; ++i) {
        zeroCount += isCharZeroLUT[(unsigned char)values[i]];  /* cast guards against negative indices when char is signed */
    }

I did not measure the difference, though. It is also worth noting that some compilers happily vectorize sufficiently simple loops.

0

Brute force for counting zero bytes: use a vector compare instruction, which sets each vector byte to 1 if that byte is 0, and to 0 if that byte is non-zero.

Do this up to 255 times to process up to 255 x 64 bytes (if you have 512-bit instructions; or 255 x 32 or 255 x 16 bytes if you only have 256-bit or 128-bit vectors), then simply add up the 255 result vectors. Since each byte after a compare holds 0 or 1, each per-byte sum is at most 255, so you have now reduced roughly 16,000 / 8,000 / 4,000 bytes down to a single vector of 64 / 32 / 16 byte counts, which can then be summed horizontally.
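A minimal SSE2 sketch of this scheme (names are mine; it assumes the length is a multiple of 16, and since _mm_cmpeq_epi8 actually yields 0xFF rather than 1 for equal bytes, the inner loop subtracts the mask, which adds 1 per zero byte):

    #include <emmintrin.h>  /* SSE2 */
    #include <stdint.h>

    /* Counts zero bytes; assumes length is a multiple of 16. */
    int countZeroBytes_accum(const uint8_t* values, int length) {
        const __m128i vzero = _mm_setzero_si128();
        int count = 0;
        int i = 0;
        while (i < length) {
            /* Accumulate at most 255 compare results so no byte lane overflows. */
            int blocks = (length - i) / 16;
            if (blocks > 255) blocks = 255;
            __m128i acc = _mm_setzero_si128();
            for (int j = 0; j < blocks; ++j, i += 16) {
                __m128i v   = _mm_loadu_si128((const __m128i*)&values[i]);
                __m128i cmp = _mm_cmpeq_epi8(v, vzero);  /* 0xFF where byte == 0 */
                acc = _mm_sub_epi8(acc, cmp);            /* acc -= -1  =>  acc += 1 */
            }
            /* Horizontal sum of the 16 byte counters via sum-of-absolute-differences. */
            __m128i sums = _mm_sad_epu8(acc, vzero);
            count += _mm_cvtsi128_si32(sums)
                   + _mm_cvtsi128_si32(_mm_srli_si128(sums, 8));
        }
        return count;
    }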

0
