I am trying to sum the rows of a 4800x9600 matrix together, resulting in a 1x9600 matrix.
What I did was split the 4800x9600 matrix into 9,600 column vectors of 4800 elements each, and then perform a reduction over the 4800 elements of each one.
The problem is that it is very slow ...
Anyone have any suggestions?
Basically, I am trying to implement MATLAB's sum(...) function.
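For reference, here is the result I am after written as a plain CPU loop (colSumsCPU is just an illustrative name, not part of my code; the indexing assumes column-major storage, which matches the &DA.data[i*DA.h] access in reduceRows below):

    // Reference: sum the h rows of each of the w columns into result[0..w-1].
    // Element (r, c) of an h x w matrix is assumed to live at data[c*h + r].
    void colSumsCPU(float* result, const float* data, long h, long w)
    {
        for (long c = 0; c < w; c++) {      // one output per column
            float sum = 0.0f;
            for (long r = 0; r < h; r++)
                sum += data[c*h + r];
            result[c] = sum;
        }
    }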
Here is the code; I have checked that it produces the correct result, it is just very slow:
    void reduceRows(Matrix Dresult, Matrix DA)
    {
        //split DA into chunks
        Matrix Dchunk;
        Dchunk.h=1; Dchunk.w=DA.h;
        cudaMalloc((void**)&Dchunk.data, Dchunk.h*Dchunk.w*sizeof(float));

        Matrix DcolSum;
        DcolSum.h=1; DcolSum.w=1;
        //cudaMalloc((void**)&DcolSum.data, DcolSum.h*DcolSum.w*sizeof(float));

        int i;
        for(i=0; i<DA.w; i++)   //loop over each column
        {
            //printf("%d ",i);
            cudaMemcpy(Dchunk.data, &DA.data[i*DA.h], DA.h*sizeof(float), cudaMemcpyDeviceToDevice);
            DcolSum.data = &Dresult.data[i];
            reduceTotal(DcolSum, Dchunk);
        }
        cudaFree(Dchunk.data);
    }
The matrix is defined as:
    typedef struct {
        long w;
        long h;
        float* data;
    } Matrix;
reduceTotal() simply calls the standard NVIDIA reduction; it sums all the elements in Dchunk and puts the result into DcolSum.
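To be clear about what reduceTotal does, it is roughly equivalent to the sketch below: a single block grid-strides over the chunk, reduces in shared memory, and thread 0 writes the total (sumKernel is illustrative only; the real code uses the NVIDIA SDK reduction sample):

    // Simplified sketch of reduceTotal: sums n floats into out[0].
    __global__ void sumKernel(float* out, const float* in, int n)
    {
        __shared__ float sdata[256];
        float acc = 0.0f;
        for (int i = threadIdx.x; i < n; i += blockDim.x)   // strided accumulation
            acc += in[i];
        sdata[threadIdx.x] = acc;
        __syncthreads();

        for (int s = blockDim.x/2; s > 0; s >>= 1) {        // tree reduction in shared memory
            if (threadIdx.x < s)
                sdata[threadIdx.x] += sdata[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            out[0] = sdata[0];
    }

    void reduceTotal(Matrix DcolSum, Matrix Dchunk)
    {
        // Dchunk is 1 x 4800 in device memory; DcolSum.data points at one float of Dresult.
        sumKernel<<<1, 256>>>(DcolSum.data, Dchunk.data, (int)(Dchunk.w * Dchunk.h));
    }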
I am going to do all of this on the CPU if I cannot find an answer... ;(
Thank you very much in advance,