CUDA: getting the maximum value and index in an array

Question

I have several blocks, each of which works on a separate part of an integer array. For example, block one covers array[0] to array[9] and block two covers array[10] to array[20].

What is the best way to get the index of the maximum array value for each block?

For example, block one (array[0] to array[9]) holds the following values:
5 10 2 3 4 34 56 3 9 10

So 56 is the largest value, at index 6.

I cannot use shared memory because the array can be very large, so it would not fit. Are there any libraries that let me do this quickly?

I know about the reduction algorithm, but I think my case is different because I want the index of the largest element.

cuda

lina Apr 19 '11 at 17:27

5 answers

fabrizioM · Answer 1 · 2011-04-19T19:43:12+0000

If I understand correctly, what you want is the index of the maximum value in an array A.

If so, I suggest you use the Thrust library.

Here's how you would do it:

 #include <thrust/device_vector.h>
 #include <thrust/host_vector.h>
 #include <thrust/tuple.h>
 #include <thrust/reduce.h>
 #include <thrust/sequence.h>
 #include <thrust/iterator/counting_iterator.h>
 #include <thrust/iterator/zip_iterator.h>
 #include <iostream>

 using namespace thrust;

 // return the biggest of two tuples (compared by value first, then by index)
 template <class T>
 struct bigger_tuple {
     __device__ __host__
     tuple<T,int> operator()(const tuple<T,int> &a, const tuple<T,int> &b)
     {
         if (a > b) return a;
         else return b;
     }
 };

 template <class T>
 int max_index(device_vector<T>& vec) {
     // create implicit index sequence [0, 1, 2, ... )
     counting_iterator<int> begin(0);
     counting_iterator<int> end(vec.size());
     tuple<T,int> init(vec[0], 0);
     tuple<T,int> largest;

     // reduce over (value, index) pairs, keeping the bigger pair each time
     largest = reduce(make_zip_iterator(make_tuple(vec.begin(), begin)),
                      make_zip_iterator(make_tuple(vec.end(), end)),
                      init,
                      bigger_tuple<T>());
     return get<1>(largest);
 }

 int main() {
     thrust::host_vector<int> h_vec(1024);
     thrust::sequence(h_vec.begin(), h_vec.end()); // values = indices

     // transfer data to the device
     thrust::device_vector<int> d_vec = h_vec;

     int index = max_index(d_vec);

     std::cout << "Max index is: " << index << std::endl;
     std::cout << "Value is: " << h_vec[index] << std::endl;

     return 0;
 }

Leo · Answer 2 · 2012-07-13T14:28:24+0000

This will not help the original poster, but for those who come to this page looking for an answer: I would second the recommendation to use Thrust, which already has a function thrust::max_element that does exactly that. It returns an iterator to the largest element, so subtracting the begin iterator from the result gives the index. The functions min_element and minmax_element are also provided. See the Thrust documentation for more details.

Edric · Answer 3 · 2011-04-21T07:31:23+0000

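In addition to the suggestion to use Thrust, you could also use the CUBLAS function cublasIsamax. Note that it operates on single-precision floats, finds the element of largest absolute value, and reports its index using 1-based indexing.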

William pursell · Answer 4 · 2011-04-22T13:58:07+0000

The size of your array is almost irrelevant as far as shared memory is concerned, since the limiting factor is the number of threads in each block, not the size of the array. One solution is to have each thread block work on a section of the array that is the same size as the thread block. That is, if you have 512 threads, then block n looks at array[512 * n] through array[512 * n + 511]. Each block performs a reduction to find the highest element in its section of the array. Then you bring the maximum of each section back to the host and do a simple linear search to find the highest value in the overall array. Each reduction on the GPU shrinks the linear search by a factor of 512. Depending on the size of the array, you may want to do more reductions before bringing the data back. (If your array is 3 * 512^10 in size, you could do 10 reductions on the GPU and have the host search through the 3 remaining data points.)
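The two-stage scheme described above can be sketched on the host in plain C++. This is only an illustration of the data flow, not a GPU implementation: the names MaxEntry, chunk_maxima and overall_max are hypothetical, and the inner loop stands in for the reduction that one thread block would perform over its own section.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: a value together with its position in the full array.
struct MaxEntry {
    int value;
    std::size_t index;
};

// Stage 1: one partial result per chunk. On a GPU each chunk would be handled
// by one thread block; here a plain loop stands in for the in-block reduction.
std::vector<MaxEntry> chunk_maxima(const std::vector<int>& data, std::size_t chunk) {
    assert(chunk > 0);
    std::vector<MaxEntry> partial;
    for (std::size_t start = 0; start < data.size(); start += chunk) {
        std::size_t stop = std::min(start + chunk, data.size());
        MaxEntry best{data[start], start};
        for (std::size_t i = start + 1; i < stop; ++i) {
            if (data[i] > best.value) best = MaxEntry{data[i], i};
        }
        partial.push_back(best);
    }
    return partial;
}

// Stage 2: linear search over the per-chunk results (the host-side step).
MaxEntry overall_max(const std::vector<MaxEntry>& partial) {
    assert(!partial.empty());
    MaxEntry best = partial[0];
    for (const MaxEntry& e : partial) {
        if (e.value > best.value) best = e;
    }
    return best;
}
```

Because every partial result carries the index into the full array, the final search yields the global index directly. With a chunk size of 512, the second stage scans roughly 1/512 of the original element count.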

Robotbugs · Answer 5 · 2012-03-22T23:09:59+0000

One thing to consider when performing a maximum-value-plus-index reduction is that if your array contains the maximum value more than once (in your example, if there were 2 or more values equal to 56), the index that is returned will not be unique and may differ each time the code is run, because the order in which threads execute on the GPU is not deterministic.

To work around this problem, you can use a unique ordering index, such as threadid + threadsperblock * blockid, or the element's index location if that is unique. The maximum test then goes along these lines:

 if (a > max_so_far || (a == max_so_far && order_a > order_max_so_far)) {
     max_so_far = a;
     index_max_so_far = index_a;
     order_max_so_far = order_a;
 }

(index and order can be the same variable, depending on the application.)
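As a minimal sketch of why this tie-break makes the result independent of evaluation order, the test can be packaged as a combining function and applied to the same candidates in two different orders. The names Candidate, combine, fold and the HOST_DEVICE macro are assumptions made for this illustration. The code runs on the host as plain C++, and the macro only adds the CUDA qualifiers when compiled with nvcc.

```cpp
#include <cassert>
#include <vector>

// Hypothetical macro: expands to the CUDA qualifiers only under nvcc.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// A value plus its unique ordering index (here the same as the element index).
struct Candidate {
    int value;
    int order;
};

// Same test as above: larger value wins, and on equal values the larger
// ordering index wins, so the outcome never depends on argument order.
HOST_DEVICE inline Candidate combine(Candidate a, Candidate b) {
    if (a.value > b.value || (a.value == b.value && a.order > b.order)) return a;
    return b;
}

// Sequential stand-in for a reduction: folds the candidates left to right.
Candidate fold(const std::vector<Candidate>& items) {
    assert(!items.empty());
    Candidate best = items[0];
    for (const Candidate& c : items) best = combine(best, c);
    return best;
}
```

Folding {56, 6}, {10, 1}, {56, 2} in that order or in reverse selects the same candidate, the one with value 56 and order 6. Swapping the comparison to order_a < order_max_so_far would instead return the first occurrence of the maximum.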
