I have a serious problem with the code I'm writing. I am not an expert, and I asked many people before coming here; a lot has been fixed along the way. So I think I am ready to show you the code and ask my questions. I will put all the code here so that you can see exactly what my problem is. What I want to do is this: if ARRAY_SIZE is too big for THREAD_SIZE, I copy the data of the large array into a smaller array created specifically with a size of THREAD_SIZE. Then I send it to the kernel and do whatever I need. But I have a problem at the line
isub_matrix[x*THREAD_SIZE+y]=big_matrix[x*ARRAY_SIZE+y];
where the code stops due to a stack overflow. At first I made big_matrix a double pointer. But people in the #cuda channel on the freenode IRC network told me that it was too large for the CPU memory to handle that way, and that I had to use a linear pointer instead. I did this, but I still had a stack overflow problem. So, here it is ... updated after some changes that have not yet worked (the stack overflow has stopped, but the linking and manifest problem is still not resolved):
#define ARRAY_SIZE 2048
#define THREAD_SIZE 32
#define PI 3.14

int main(int argc, char** argv)
{
    int array_plus=0,x,y;
    float time;
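For reference, here is a minimal host-side sketch of the tiling step described above. It assumes the earlier stack overflow came from declaring the 2048 x 2048 float matrix (16 MiB) as a local variable, which the post does not confirm, so the matrix is allocated on the heap instead. The helper names `alloc_big_matrix` and `copy_tile` and the `tile_row`/`tile_col` parameters are hypothetical, not part of the original code. Note that the copy line quoted above always reads the top-left tile; the sketch adds a row and column offset so that any tile can be selected.

```c
#include <assert.h>
#include <stdlib.h>

#define ARRAY_SIZE 2048
#define THREAD_SIZE 32

/* Hypothetical helper: allocate the big matrix on the heap, zero-filled.
   ARRAY_SIZE * ARRAY_SIZE floats is 16 MiB, more than a typical default
   stack. Returns NULL on failure; release with free(). */
float *alloc_big_matrix(void)
{
    return (float *)calloc((size_t)ARRAY_SIZE * ARRAY_SIZE, sizeof(float));
}

/* Hypothetical helper: copy the THREAD_SIZE x THREAD_SIZE tile at tile
   coordinates (tile_row, tile_col) of the linear big matrix into sub. */
void copy_tile(const float *big, float *sub, int tile_row, int tile_col)
{
    assert(tile_row >= 0 && tile_row < ARRAY_SIZE / THREAD_SIZE);
    assert(tile_col >= 0 && tile_col < ARRAY_SIZE / THREAD_SIZE);

    size_t row0 = (size_t)tile_row * THREAD_SIZE; /* first row of the tile */
    size_t col0 = (size_t)tile_col * THREAD_SIZE; /* first column of the tile */

    for (size_t x = 0; x < THREAD_SIZE; x++) {
        for (size_t y = 0; y < THREAD_SIZE; y++) {
            sub[x * THREAD_SIZE + y] = big[(row0 + x) * ARRAY_SIZE + (col0 + y)];
        }
    }
}
```

The small tile (32 x 32 floats, 4 KiB) can safely live on the stack or in another heap buffer; only the big matrix needs the heap.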
Another question is about the parallel part. The compiler (Visual Studio) says that I am using too many pow() and exp() parameters at once. How do I solve this problem?
if((xIndex<THREAD_SIZE)&&(yIndex<THREAD_SIZE))
{
    block[xIndex][yIndex]=exp(sum_sin[xIndex][yIndex])+exp(sum_cos[xIndex][yIndex]);
}
The kernel source is here. I commented part of it out because I wanted to check that my code at least produces some value on the GPU. But it didn't even start the kernel ... so sad.
__global__ void twiddle_factor(float *isub_matrix, float *osub_matrix)
{
    __shared__ float block[THREAD_SIZE][THREAD_SIZE];
    // int x,y,z;
    unsigned int xIndex = threadIdx.x;
    unsigned int yIndex = threadIdx.y;
    /*
    int sum_sines=0.0;
    int sum_cosines=0.0;
    float sum_sin[THREAD_SIZE],sum_cos[THREAD_SIZE];
    float angle=(2*PI)/THREAD_SIZE;
    //put into shared memory the FFT calculation (F(u))
    for(x=0;x<THREAD_SIZE;x++)
    {
        for(y=0;y<THREAD_SIZE;y++)
        {
            for(z=0;z<THREAD_SIZE;z++)
            {
                sum_sines=sum_sin+sin(isub_matrix[y*THREAD_SIZE+z]*(angle*z));
                sum_cosines=sum_cos+cos(isub_matrix[y*THREAD_SIZE+z]*(angle*z));
            }
            sum_sin[x][y]=sum_sines/THREAD_SIZE;
            sum_cos[x][y]=sum_cosines/THREAD_SIZE;
        }
    }
    */
    if((xIndex<THREAD_SIZE)&&(yIndex<THREAD_SIZE))
        block[xIndex][yIndex]=pow(THREAD_SIZE,0.5);
    //block[xIndex][yIndex]=pow(exp(sum_sin[xIndex*THREAD_SIZE+yIndex])+exp(sum_cos[xIndex*THREAD_SIZE+yIndex]),0.5);
    __syncthreads();
    //transposition X x Y
    //transfer back the results into another sub-matrix that is allocated in CPU
    if((xIndex<THREAD_SIZE)&&(yIndex<THREAD_SIZE))
        osub_matrix[yIndex*THREAD_SIZE+xIndex]=block[xIndex][yIndex];
    __syncthreads();
}
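With the FFT part commented out, the kernel as posted writes the same value, the square root of THREAD_SIZE, into every element of osub_matrix. That makes it easy to tell on the host whether the kernel ran at all. The sketch below assumes the output tile has already been copied back into a host buffer; the function name `count_mismatches` and the `tol` parameter are hypothetical.

```c
#include <math.h>
#include <stddef.h>

#define THREAD_SIZE 32

/* Hypothetical helper: count the elements of the THREAD_SIZE x THREAD_SIZE
   output tile that differ from sqrt(THREAD_SIZE) by more than tol.
   NaN entries are counted as mismatches too. */
size_t count_mismatches(const float *osub, float tol)
{
    const float expected = sqrtf((float)THREAD_SIZE);
    size_t bad = 0;

    for (size_t i = 0; i < (size_t)THREAD_SIZE * THREAD_SIZE; i++) {
        if (!(fabsf(osub[i] - expected) <= tol)) {
            bad++;
        }
    }
    return bad;
}
```

A result of 0 suggests the kernel launched and wrote the whole tile; a count equal to THREAD_SIZE * THREAD_SIZE suggests the buffer was never touched, which points at the launch itself rather than at the arithmetic.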
Thanks for reading all this!
Below is the whole code:
#include <stdlib.h>
Tobio takona