Counting occurrences of numbers in a CUDA array

I have an array of unsigned integers stored on a GPU with CUDA (usually 1000000 elements). I would like to count the occurrence of each number in an array. There are only a few individual numbers (about 10 ), but these numbers can vary from 1 to 1000000 . About 9/10 numbers are 0 , I do not need their number. The result looks something like this:

 58458 -> 1000 occurrences 15 -> 412 occurrences 

I have an implementation using atomicAdd s, but it is too slow (many threads write to the same address). Does anyone know a quick / efficient method?

+7
source share
3 answers

You can implement a histogram by first sorting the numbers, and then doing a reduced snap.

The easiest way is to use thrust::sort and then thrust::reduce_by_key . It is also often much faster than ad hoc binning based on atomism. Here is an example .

+7
source

I suppose you can find help with CUDA examples, in particular histogram examples. They are part of the GPU computing SDK. You can find it here http://developer.nvidia.com/cuda-cc-sdk-code-samples#histogram . They even have a document explaining the algorithms.

+1
source

I compare the two approaches proposed on the duplicated question of how the tremors are counted , namely

  • Using thrust::counting_iterator and thrust::upper_bound , following the example of the Thrust histogram,
  • Using thrust::unique_copy and thrust::upper_bound .

Below is a complete example.

 #include <time.h> // --- time #include <stdlib.h> // --- srand, rand #include <iostream> #include <thrust\host_vector.h> #include <thrust\device_vector.h> #include <thrust\sort.h> #include <thrust\iterator\zip_iterator.h> #include <thrust\unique.h> #include <thrust/binary_search.h> #include <thrust\adjacent_difference.h> #include "Utilities.cuh" #include "TimingGPU.cuh" //#define VERBOSE #define NO_HISTOGRAM /********/ /* MAIN */ /********/ int main() { const int N = 1048576; //const int N = 20; //const int N = 128; TimingGPU timerGPU; // --- Initialize random seed srand(time(NULL)); thrust::host_vector<int> h_code(N); for (int k = 0; k < N; k++) { // --- Generate random numbers between 0 and 9 h_code[k] = (rand() % 10); } thrust::device_vector<int> d_code(h_code); //thrust::device_vector<unsigned int> d_counting(N); thrust::sort(d_code.begin(), d_code.end()); h_code = d_code; timerGPU.StartCounter(); #ifdef NO_HISTOGRAM // --- The number of d_cumsum bins is equal to the maximum value plus one int num_bins = d_code.back() + 1; thrust::device_vector<int> d_code_unique(num_bins); thrust::unique_copy(d_code.begin(), d_code.end(), d_code_unique.begin()); thrust::device_vector<int> d_counting(num_bins); thrust::upper_bound(d_code.begin(), d_code.end(), d_code_unique.begin(), d_code_unique.end(), d_counting.begin()); #else thrust::device_vector<int> d_cumsum; // --- The number of d_cumsum bins is equal to the maximum value plus one int num_bins = d_code.back() + 1; // --- Resize d_cumsum storage d_cumsum.resize(num_bins); // --- Find the end of each bin of values - Cumulative d_cumsum thrust::counting_iterator<int> search_begin(0); thrust::upper_bound(d_code.begin(), d_code.end(), search_begin, search_begin + num_bins, d_cumsum.begin()); // --- Compute the histogram by taking differences of the cumulative d_cumsum //thrust::device_vector<int> d_counting(num_bins); //thrust::adjacent_difference(d_cumsum.begin(), d_cumsum.end(), d_counting.begin()); #endif printf("Timing GPU = %f\n", timerGPU.GetCounter()); #ifdef VERBOSE thrust::host_vector<int> h_counting(d_counting); printf("After\n"); for (int k = 0; k < N; k++) printf("code = %i\n", h_code[k]); #ifndef NO_HISTOGRAM thrust::host_vector<int> h_cumsum(d_cumsum); printf("\nCounting\n"); for (int k = 0; k < num_bins; k++) printf("element = %i; counting = %i; cumsum = %i\n", k, h_counting[k], h_cumsum[k]); #else thrust::host_vector<int> h_code_unique(d_code_unique); printf("\nCounting\n"); for (int k = 0; k < N; k++) printf("element = %i; counting = %i\n", h_code_unique[k], h_counting[k]); #endif #endif } 

The first approach was the fastest. On the NVIDIA GTX 960 board, I had the following timings for several elements of the N = 1048576 array:

 First approach: 2.35ms First approach without thrust::adjacent_difference: 1.52 Second approach: 4.67ms 

Please note that there is no strict need to explicitly calculate the adjacent difference, since this operation can be performed manually during kernel processing, if necessary.

+1
source

All Articles