CUDA Add Matrix Rows

I am trying to add together the rows of a 4800x9600 matrix, so that the result is a 1x9600 matrix.

What I did was split the 4800x9600 matrix into 9,600 arrays of 4,800 elements each, and then perform a reduction over each one.

The problem is that it is very slow ...

Anyone have any suggestions?

Basically, I am trying to implement MATLAB's sum(...) function.

Here is the code, which I have checked and it works fine; it is just very slow:

    void reduceRows(Matrix Dresult, Matrix DA)
    {
        // Split DA into chunks, one per column
        Matrix Dchunk;
        Dchunk.h = 1;
        Dchunk.w = DA.h;
        cudaMalloc((void**)&Dchunk.data, Dchunk.h * Dchunk.w * sizeof(float));

        Matrix DcolSum;
        DcolSum.h = 1;
        DcolSum.w = 1;
        //cudaMalloc((void**)&DcolSum.data, DcolSum.h * DcolSum.w * sizeof(float));

        int i;
        for (i = 0; i < DA.w; i++)    // loop over each column
        {
            //printf("%d ", i);
            cudaMemcpy(Dchunk.data, &DA.data[i * DA.h], DA.h * sizeof(float),
                       cudaMemcpyDeviceToDevice);
            DcolSum.data = &Dresult.data[i];
            reduceTotal(DcolSum, Dchunk);
        }

        cudaFree(Dchunk.data);
    }

The matrix is defined as:

    typedef struct {
        long w;
        long h;
        float* data;
    } Matrix;

reduceTotal() simply calls the standard NVIDIA reduction kernel: it sums all the elements in Dchunk and puts the result in DcolSum.

I am going to do all this on the CPU if I cannot find an answer... :(

Thank you very much in advance,

+2
3 answers

Instead of iterating over each column, parallelize over the columns. Each of the 9,600 threads sums the 4,800 entries in its own column and puts the sum in the appropriate place in the result vector.
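
As a rough illustration of that idea, here is a minimal sketch of a one-thread-per-column kernel, assuming the column-major layout used in the question (column i starts at DA.data + i*DA.h); the kernel name and launch configuration are my own choices:

    __global__ void sumColumnsNaive(const float* data, float* result, int h, int w)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (col >= w) return;

        // Each thread walks down its own column and accumulates the sum.
        float sum = 0.0f;
        for (int row = 0; row < h; ++row)
            sum += data[col * h + row];

        result[col] = sum;
    }

    // Launched with enough threads to cover all columns, e.g.:
    //   int threads = 256;
    //   int blocks  = (DA.w + threads - 1) / threads;
    //   sumColumnsNaive<<<blocks, threads>>>(DA.data, Dresult.data, DA.h, DA.w);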

If you're looking for a library to make working with CUDA easier, I highly recommend Thrust: http://code.google.com/p/thrust/

Using Thrust, I would create a functor that holds a pointer to the matrix in device memory, and then map it over a sequence of column indices. The functor's operator() takes an index, sums everything in that column of the matrix, and returns the sum. You would then have your sums sitting in a thrust::device_vector without any memory copies (or even direct CUDA calls).

Your functor might look something like this:

    struct ColumnSumFunctor {
        const Matrix matrix;

        // Make a functor to sum a column of the matrix
        ColumnSumFunctor(const Matrix& matrix);

        // Compute and return the sum of the specified column
        __device__ float operator()(const int& column) const;
    };
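
One possible way to fill that declaration in and wire it up with thrust::transform over a counting iterator is sketched below; the operator() body, the sumColumnsThrust helper, and the variable names are my own assumptions, not the answerer's code:

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/iterator/counting_iterator.h>

    // Matrix struct from the question, repeated so the sketch is self-contained;
    // data points to device memory in column-major order (h elements per column).
    typedef struct {
        long w;
        long h;
        float* data;
    } Matrix;

    struct ColumnSumFunctor {
        Matrix matrix;

        ColumnSumFunctor(const Matrix& m) : matrix(m) {}

        // Sum all h elements of the given column.
        __device__ float operator()(const int& column) const {
            const float* col = matrix.data + column * matrix.h;
            float sum = 0.0f;
            for (long i = 0; i < matrix.h; ++i)
                sum += col[i];
            return sum;
        }
    };

    void sumColumnsThrust(const Matrix& DA, thrust::device_vector<float>& colSums)
    {
        colSums.resize(DA.w);
        // Map the functor over column indices 0 .. DA.w-1; Thrust runs
        // operator() on the device and writes each sum into colSums.
        thrust::transform(thrust::counting_iterator<int>(0),
                          thrust::counting_iterator<int>((int)DA.w),
                          colSums.begin(),
                          ColumnSumFunctor(DA));
    }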
+3

Reduction is a very simple operation in GPGPU; it should be fast, and doing the reduction 9,600 times should not be slow either.

Which graphics card are you using?

I suggest you split it into 9,600 arrays, each time reducing an array of 4,800 elements into one result. Instead of reduceTotal, I suggest you use CUDPP to perform the reduction; CUDPP is something like the STL for CUDA, and it is implemented with performance in mind.

http://code.google.com/p/cudpp/

+1

I think your problem is that you are launching 9,600×2 kernels. This should be a simple algorithm to express in a single kernel.

The most naive way to implement it will not have coalesced memory access, but it may well be faster than what you are doing now.

Once you have the naive version working, coalesce your memory reads: have each thread in the block read 16 consecutive floats into shared memory, __syncthreads(), then accumulate the relevant 16 floats into a register, __syncthreads() again, and repeat.
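
One way to realize that coalescing scheme is sketched below; the kernel name, the 16-wide tile, and the launch configuration are my own assumptions, and the matrix is again taken to be column-major as in the question:

    #define TILE 16

    __global__ void sumColumnsCoalesced(const float* data, float* result,
                                        int h, int w)
    {
        // Each block handles TILE columns. threadIdx.y selects the column,
        // threadIdx.x walks the rows so that consecutive threads read
        // consecutive addresses (coalesced) within that column.
        __shared__ float partial[TILE][TILE + 1];   // +1 pad avoids bank conflicts

        int col = blockIdx.x * TILE + threadIdx.y;
        float sum = 0.0f;

        if (col < w) {
            for (int row = threadIdx.x; row < h; row += TILE)
                sum += data[col * h + row];          // coalesced 16-float reads
        }

        // Each column now has TILE partial sums, one per threadIdx.x;
        // combine them in shared memory.
        partial[threadIdx.y][threadIdx.x] = sum;
        __syncthreads();

        if (threadIdx.x == 0 && col < w) {
            float total = 0.0f;
            for (int i = 0; i < TILE; ++i)
                total += partial[threadIdx.y][i];
            result[col] = total;
        }
    }

    // Launch sketch:
    //   dim3 block(TILE, TILE);
    //   dim3 grid((DA.w + TILE - 1) / TILE);
    //   sumColumnsCoalesced<<<grid, block>>>(DA.data, Dresult.data, DA.h, DA.w);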

There are lots of examples of reduction techniques in the GPU Computing SDK.

0
