I'm using CUDA to tackle a problem where I need to evaluate a complex equation on many input matrices. Each matrix has an ID depending on its set (IDs run from 1 to 30, and there are 100,000 matrices), and the result for each matrix is stored in a float[N] array, where N is the number of input matrices.
After that, the result I want is the sum, per ID, of the floats in this array, so with 30 IDs there are 30 resulting floats.
Any suggestions on how I should do this?
Right now I read the float array (400 KB) back from the device to the host and run this on the host:
    // Allocate result_array for 100,000 floats on the device
    // CUDA processes the input matrices
    // Read result_array back from the device to the host
    float result[30] = { 0 };   // one accumulator per ID
    for (int i = 0; i < N; i++)
    {
        // IDs run from 1 to 30, so shift to a 0-based index
        result[input[i].ID - 1] += result_array[i];
    }
But I wonder if there is a better way.
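One direction I can imagine is doing the per-ID sum on the device and copying back only the 30 totals instead of the whole 400 KB array. Below is a minimal sketch of that idea, not a tuned implementation. It assumes the IDs are available on the device as a plain int array (`d_ids`, separate from the `input` structs above) and that every ID lies in 1..30. The names `sumByIdKernel`, `sumById`, `d_values`, `d_ids` and `h_totals` are made up for illustration. Each block accumulates partial sums in shared memory with `atomicAdd`, then adds them to the global totals.

```cuda
#include <cuda_runtime.h>

#define NUM_IDS 30  // IDs run from 1 to 30, as described above

// Each block builds per-ID partial sums in shared memory, then merges
// them into the global totals with one atomicAdd per ID.
// Assumes every ids[i] is in the range 1..NUM_IDS.
__global__ void sumByIdKernel(const float* values, const int* ids,
                              int n, float* totals)
{
    __shared__ float partial[NUM_IDS];

    // Zero this block's partial sums.
    for (int k = threadIdx.x; k < NUM_IDS; k += blockDim.x)
        partial[k] = 0.0f;
    __syncthreads();

    // Grid-stride loop: each thread handles several elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&partial[ids[i] - 1], values[i]);  // shift ID to 0-based
    __syncthreads();

    // Merge the block's partial sums into the global totals.
    for (int k = threadIdx.x; k < NUM_IDS; k += blockDim.x)
        atomicAdd(&totals[k], partial[k]);
}

// Hypothetical host wrapper. d_values and d_ids are device pointers of
// length n; h_totals is a host array with room for NUM_IDS floats.
cudaError_t sumById(const float* d_values, const int* d_ids,
                    int n, float* h_totals)
{
    float* d_totals = nullptr;
    cudaError_t err = cudaMalloc(&d_totals, NUM_IDS * sizeof(float));
    if (err != cudaSuccess) return err;

    err = cudaMemset(d_totals, 0, NUM_IDS * sizeof(float));
    if (err != cudaSuccess) { cudaFree(d_totals); return err; }

    const int threads = 256;
    int blocks = (n + threads - 1) / threads;
    if (blocks > 64) blocks = 64;  // the grid-stride loop covers the rest
    if (blocks < 1) blocks = 1;

    sumByIdKernel<<<blocks, threads>>>(d_values, d_ids, n, d_totals);

    // Only NUM_IDS floats travel back to the host; this copy also waits
    // for the kernel to finish.
    err = cudaMemcpy(h_totals, d_totals, NUM_IDS * sizeof(float),
                     cudaMemcpyDeviceToHost);
    cudaFree(d_totals);
    return err;
}
```

With this, the host would call something like `sumById(d_result_array, d_ids, N, result)` after the main kernel finishes. Because the atomic additions happen in no fixed order, the float totals may differ slightly from the sequential host loop.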