cuFFT Inverse FFT Scaling

Whenever I plot the values obtained by my program using cuFFT and compare them with the Matlab results, I get graphs of the same shape, and the maxima and minima fall at the same points. However, the values obtained with cuFFT are much larger than those obtained with Matlab. The Matlab code is:

    fs = 1000;                          % sampling frequency
    D = [0:1:4]';                       % pulse delay times
    t = 0 : 1/fs : 4000/fs;             % signal evaluation time
    w = 0.5;                            % width of each pulse
    yp = pulstran(t, D, 'rectpuls', w); % train of rectangular pulses
    filt = conj(fliplr(yp));            % matched filter
    xx = fft(yp, 1024) .* fft(filt, 1024);
    xx = abs(ifft(xx));

and the CUDA code with the same input is as follows:

    cufftExecC2C(plan, (cufftComplex *)d_signal,        (cufftComplex *)d_signal,        CUFFT_FORWARD);
    cufftExecC2C(plan, (cufftComplex *)d_filter_signal, (cufftComplex *)d_filter_signal, CUFFT_FORWARD);
    ComplexPointwiseMul<<<blocksPerGrid, threadsPerBlock>>>(d_signal, d_filter_signal, NX);
    cufftExecC2C(plan, (cufftComplex *)d_signal,        (cufftComplex *)d_signal,        CUFFT_INVERSE);

cuFFT likewise performs 1024-point FFTs, with a batch size of 2.

Even after dividing by a scaling factor of NX = 1024, the values are not correct. Please tell me what to do.

3 Answers

This is a late answer, provided to remove this question from the unanswered list.

You are not providing enough information to diagnose your problem, since you do not show how you set up the cuFFT plan. You have not even indicated whether the Matlab and cuFFT signals have exactly the same shape (so that you have a scaling problem only) or only approximately the same shape. However, let me make the following two observations:

  • the yp vector has 4001 elements; by contrast, with fft(yp,1024) you perform an FFT that truncates the signal to its first 1024 elements;
  • the inverse cuFFT does not perform scaling by the number of vector elements, so you have to divide the result by N yourself (see the sketch right below).
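A minimal sketch of that missing normalization as a plain CUDA kernel, assuming the in-place 1024-point setup of the question (the kernel name and launch configuration are illustrative, not taken from the question's code):

    #include <cufft.h>

    // Divide each complex sample by the transform length N: this reproduces
    // Matlab's ifft() convention after cuFFT's unnormalized inverse transform.
    __global__ void scale_by_N(cufftComplex *data, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) {
            data[i].x /= (float)N;
            data[i].y /= (float)N;
        }
    }

    // Launched right after the inverse transform, e.g.:
    // scale_by_N<<<(NX + 255) / 256, 256>>>(d_signal, NX);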

For convenience (it may be useful to other users), I report below a simple FFT-IFFT scheme that also includes the scaling, implemented with the CUDA Thrust library.

    #include <stdio.h>

    #include <cufft.h>
    #include <cuComplex.h>

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/transform.h>

    /*********************/
    /* SCALE BY CONSTANT */
    /*********************/
    class Scale_by_constant
    {
        private:
            float c_;

        public:
            Scale_by_constant(float c) { c_ = c; };

            __host__ __device__ float2 operator()(float2 &a) const
            {
                float2 output;
                output.x = a.x / c_;
                output.y = a.y / c_;
                return output;
            }
    };

    int main(void)
    {
        const int N = 4;

        // --- Setting up input device vector
        thrust::device_vector<float2> d_vec(N, make_cuComplex(1.f, 2.f));

        cufftHandle plan;
        cufftPlan1d(&plan, N, CUFFT_C2C, 1);

        // --- Perform in-place direct Fourier transform
        cufftExecC2C(plan, thrust::raw_pointer_cast(d_vec.data()), thrust::raw_pointer_cast(d_vec.data()), CUFFT_FORWARD);

        // --- Perform in-place inverse Fourier transform
        cufftExecC2C(plan, thrust::raw_pointer_cast(d_vec.data()), thrust::raw_pointer_cast(d_vec.data()), CUFFT_INVERSE);

        // --- Scale by 1/N to undo cuFFT's unnormalized inverse transform
        thrust::transform(d_vec.begin(), d_vec.end(), d_vec.begin(), Scale_by_constant((float)(N)));

        // --- Setting up output host vector
        thrust::host_vector<float2> h_vec(d_vec);

        for (int i = 0; i < N; i++)
            printf("Element #%i; Real part = %f; Imaginary part: %f\n", i, h_vec[i].x, h_vec[i].y);

        getchar();
    }
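As an alternative to a separate scaling pass, the normalization can be folded into the pointwise spectral multiplication the question already performs, in the spirit of the ComplexPointwiseMulAndScale kernel of the CUDA simpleCUFFT sample. A sketch, assuming the same argument layout as the question's ComplexPointwiseMul kernel:

    // Multiply the two spectra element by element and apply the scale
    // factor (e.g. 1.0f / NX) in the same pass, saving one kernel launch.
    __global__ void ComplexPointwiseMulAndScale(cufftComplex *a, const cufftComplex *b, int n, float scale)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            cufftComplex c;
            c.x = (a[i].x * b[i].x - a[i].y * b[i].y) * scale;
            c.y = (a[i].x * b[i].y + a[i].y * b[i].x) * scale;
            a[i] = c;
        }
    }

By linearity of the transform, scaling the spectrum by 1/NX before the inverse FFT is equivalent to scaling the output afterwards.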

With the introduction of cuFFT callback functions, the normalization required by the inverse FFT performed by cuFFT can be embedded directly into the cufftExecC2C call by defining the normalization operation as a __device__ function.

In addition to the cuFFT User Guide, for cuFFT callback functions see

CUDA Pro Tip: Use cuFFT Callbacks for Custom Data Processing

The following is an example of implementing IFFT normalization using a cuFFT callback.

    #include <stdio.h>
    #include <assert.h>

    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"

    #include <cufft.h>
    #include <cufftXt.h>

    /********************/
    /* CUDA ERROR CHECK */
    /********************/
    #define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
    inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
    {
        if (code != cudaSuccess)
        {
            fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
            if (abort) exit(code);
        }
    }

    /*********************/
    /* CUFFT ERROR CHECK */
    /*********************/
    // See http://stackoverflow.com/questions/16267149/cufft-error-handling
    #ifdef _CUFFT_H_
    static const char *_cudaGetErrorEnum(cufftResult error)
    {
        switch (error)
        {
            case CUFFT_SUCCESS:        return "CUFFT_SUCCESS";
            case CUFFT_INVALID_PLAN:   return "CUFFT_INVALID_PLAN";
            case CUFFT_ALLOC_FAILED:   return "CUFFT_ALLOC_FAILED";
            case CUFFT_INVALID_TYPE:   return "CUFFT_INVALID_TYPE";
            case CUFFT_INVALID_VALUE:  return "CUFFT_INVALID_VALUE";
            case CUFFT_INTERNAL_ERROR: return "CUFFT_INTERNAL_ERROR";
            case CUFFT_EXEC_FAILED:    return "CUFFT_EXEC_FAILED";
            case CUFFT_SETUP_FAILED:   return "CUFFT_SETUP_FAILED";
            case CUFFT_INVALID_SIZE:   return "CUFFT_INVALID_SIZE";
            case CUFFT_UNALIGNED_DATA: return "CUFFT_UNALIGNED_DATA";
        }
        return "<unknown>";
    }
    #endif

    #define cufftSafeCall(err) __cufftSafeCall(err, __FILE__, __LINE__)
    inline void __cufftSafeCall(cufftResult err, const char *file, const int line)
    {
        if (CUFFT_SUCCESS != err) {
            fprintf(stderr, "CUFFT error in file '%s', line %d\nerror %d: %s\nterminating!\n",
                    file, line, err, _cudaGetErrorEnum(err));
            cudaDeviceReset();
            assert(0);
        }
    }

    /**********************************/
    /* IFFT SCALING CALLBACK FUNCTION */
    /**********************************/
    __device__ void IFFT_Scaling(void *dataOut, size_t offset, cufftComplex element, void *callerInfo, void *sharedPtr)
    {
        float *scaling_factor = (float*)callerInfo;

        float2 output;
        output.x = cuCrealf(element);
        output.y = cuCimagf(element);

        output.x = output.x / scaling_factor[0];
        output.y = output.y / scaling_factor[0];

        ((float2*)dataOut)[offset] = output;
    }

    __device__ cufftCallbackStoreC d_storeCallbackPtr = IFFT_Scaling;

    /********/
    /* MAIN */
    /********/
    int main() {

        const int N = 16;

        cufftHandle plan;

        float2 *h_input   = (float2*)malloc(N*sizeof(float2));
        float2 *h_output1 = (float2*)malloc(N*sizeof(float2));
        float2 *h_output2 = (float2*)malloc(N*sizeof(float2));

        float2 *d_input;   gpuErrchk(cudaMalloc((void**)&d_input,   N*sizeof(float2)));
        float2 *d_output1; gpuErrchk(cudaMalloc((void**)&d_output1, N*sizeof(float2)));
        float2 *d_output2; gpuErrchk(cudaMalloc((void**)&d_output2, N*sizeof(float2)));

        // --- Callback function parameters
        float *h_scaling_factor = (float*)malloc(sizeof(float));
        h_scaling_factor[0] = 16.0f;
        float *d_scaling_factor; gpuErrchk(cudaMalloc((void**)&d_scaling_factor, sizeof(float)));
        gpuErrchk(cudaMemcpy(d_scaling_factor, h_scaling_factor, sizeof(float), cudaMemcpyHostToDevice));

        // --- Initializing the input on the host and moving it to the device
        for (int i = 0; i < N; i++) {
            h_input[i].x = 1.0f;
            h_input[i].y = 0.f;
        }
        gpuErrchk(cudaMemcpy(d_input, h_input, N*sizeof(float2), cudaMemcpyHostToDevice));

        cufftSafeCall(cufftPlan1d(&plan, N, CUFFT_C2C, 1));

        // --- Execute direct FFT on the device and move the results to the host
        cufftSafeCall(cufftExecC2C(plan, d_input, d_output1, CUFFT_FORWARD));
        gpuErrchk(cudaMemcpy(h_output1, d_output1, N*sizeof(float2), cudaMemcpyDeviceToHost));
        for (int i = 0; i < N; i++)
            printf("Direct transform - %d - (%f, %f)\n", i, h_output1[i].x, h_output1[i].y);

        // --- Setup store callback and execute inverse FFT
        cufftCallbackStoreC h_storeCallbackPtr;
        gpuErrchk(cudaMemcpyFromSymbol(&h_storeCallbackPtr, d_storeCallbackPtr, sizeof(h_storeCallbackPtr)));
        cufftSafeCall(cufftXtSetCallback(plan, (void **)&h_storeCallbackPtr, CUFFT_CB_ST_COMPLEX, (void **)&d_scaling_factor));

        cufftSafeCall(cufftExecC2C(plan, d_output1, d_output2, CUFFT_INVERSE));
        gpuErrchk(cudaMemcpy(h_output2, d_output2, N*sizeof(float2), cudaMemcpyDeviceToHost));
        for (int i = 0; i < N; i++)
            printf("Inverse transform - %d - (%f, %f)\n", i, h_output2[i].x, h_output2[i].y);

        cufftSafeCall(cufftDestroy(plan));

        gpuErrchk(cudaFree(d_input));
        gpuErrchk(cudaFree(d_output1));
        gpuErrchk(cudaFree(d_output2));

        return 0;
    }
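Note that cuFFT callbacks require compiling with relocatable device code and linking against the static cuFFT library, so the example above must be built along these lines (file names are illustrative):

    nvcc -arch=sm_35 -dc ifft_callback.cu -o ifft_callback.o
    nvcc -arch=sm_35 ifft_callback.o -lcufft_static -lculibos -o ifft_callback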

EDIT

The moment at which the callback operation is performed (load or store) is specified by the callback type, CUFFT_CB_ST_COMPLEX in this case, passed to cufftXtSetCallback. Please note that load and store callbacks can coexist on the same cuFFT plan.
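For instance, a load callback would be registered the same way; this sketch assumes a __device__ symbol d_loadCallbackPtr of type cufftCallbackLoadC, defined analogously to the store callback above:

    // CUFFT_CB_LD_COMPLEX selects the "load" moment (data read in before the
    // transform), CUFFT_CB_ST_COMPLEX the "store" moment (results written out).
    cufftCallbackLoadC h_loadCallbackPtr;
    gpuErrchk(cudaMemcpyFromSymbol(&h_loadCallbackPtr, d_loadCallbackPtr, sizeof(h_loadCallbackPtr)));
    cufftSafeCall(cufftXtSetCallback(plan, (void **)&h_loadCallbackPtr, CUFFT_CB_LD_COMPLEX, (void **)&d_scaling_factor));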


PERFORMANCE

I am adding a further answer to compare, for this particular case of IFFT scaling, the performance of the callback version against the non-callback version of the same code. The code I am using is:

    #include <stdio.h>
    #include <assert.h>

    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"

    #include <cufft.h>
    #include <cufftXt.h>

    #include <thrust/device_vector.h>
    #include <thrust/device_ptr.h>
    #include <thrust/transform.h>

    #include "Utilities.cuh"
    #include "TimingGPU.cuh"

    //#define DISPLAY

    /*******************************/
    /* THRUST FUNCTOR IFFT SCALING */
    /*******************************/
    class Scale_by_constant
    {
        private:
            float c_;

        public:
            Scale_by_constant(float c) { c_ = c; };

            __host__ __device__ float2 operator()(float2 &a) const
            {
                float2 output;
                output.x = a.x / c_;
                output.y = a.y / c_;
                return output;
            }
    };

    /**********************************/
    /* IFFT SCALING CALLBACK FUNCTION */
    /**********************************/
    __device__ void IFFT_Scaling(void *dataOut, size_t offset, cufftComplex element, void *callerInfo, void *sharedPtr)
    {
        float *scaling_factor = (float*)callerInfo;

        float2 output;
        output.x = cuCrealf(element);
        output.y = cuCimagf(element);

        output.x = output.x / scaling_factor[0];
        output.y = output.y / scaling_factor[0];

        ((float2*)dataOut)[offset] = output;
    }

    __device__ cufftCallbackStoreC d_storeCallbackPtr = IFFT_Scaling;

    /********/
    /* MAIN */
    /********/
    int main() {

        const int N = 100000000;

        cufftHandle plan;
        cufftSafeCall(cufftPlan1d(&plan, N, CUFFT_C2C, 1));

        TimingGPU timerGPU;

        float2 *h_input   = (float2*)malloc(N*sizeof(float2));
        float2 *h_output1 = (float2*)malloc(N*sizeof(float2));
        float2 *h_output2 = (float2*)malloc(N*sizeof(float2));

        float2 *d_input;   gpuErrchk(cudaMalloc((void**)&d_input,   N*sizeof(float2)));
        float2 *d_output1; gpuErrchk(cudaMalloc((void**)&d_output1, N*sizeof(float2)));
        float2 *d_output2; gpuErrchk(cudaMalloc((void**)&d_output2, N*sizeof(float2)));

        // --- Callback function parameters (scale by the transform length N)
        float *h_scaling_factor = (float*)malloc(sizeof(float));
        h_scaling_factor[0] = (float)N;
        float *d_scaling_factor; gpuErrchk(cudaMalloc((void**)&d_scaling_factor, sizeof(float)));
        gpuErrchk(cudaMemcpy(d_scaling_factor, h_scaling_factor, sizeof(float), cudaMemcpyHostToDevice));

        // --- Initializing the input on the host and moving it to the device
        for (int i = 0; i < N; i++) {
            h_input[i].x = 1.0f;
            h_input[i].y = 0.f;
        }
        gpuErrchk(cudaMemcpy(d_input, h_input, N * sizeof(float2), cudaMemcpyHostToDevice));

        // --- Execute direct FFT on the device and move the results to the host
        cufftSafeCall(cufftExecC2C(plan, d_input, d_output1, CUFFT_FORWARD));
    #ifdef DISPLAY
        gpuErrchk(cudaMemcpy(h_output1, d_output1, N * sizeof(float2), cudaMemcpyDeviceToHost));
        for (int i = 0; i < N; i++)
            printf("Direct transform - %d - (%f, %f)\n", i, h_output1[i].x, h_output1[i].y);
    #endif

        // --- Execute inverse FFT with subsequent scaling on the device and move the results to the host
        timerGPU.StartCounter();
        cufftSafeCall(cufftExecC2C(plan, d_output1, d_output2, CUFFT_INVERSE));
        thrust::transform(thrust::device_pointer_cast(d_output2), thrust::device_pointer_cast(d_output2) + N,
                          thrust::device_pointer_cast(d_output2), Scale_by_constant((float)(N)));
    #ifdef DISPLAY
        gpuErrchk(cudaMemcpy(h_output2, d_output2, N * sizeof(float2), cudaMemcpyDeviceToHost));
        for (int i = 0; i < N; i++)
            printf("Inverse transform - %d - (%f, %f)\n", i, h_output2[i].x, h_output2[i].y);
    #endif
        printf("Timing NO callback %f\n", timerGPU.GetCounter());

        // --- Setup store callback
        // timerGPU.StartCounter();
        cufftCallbackStoreC h_storeCallbackPtr;
        gpuErrchk(cudaMemcpyFromSymbol(&h_storeCallbackPtr, d_storeCallbackPtr, sizeof(h_storeCallbackPtr)));
        cufftSafeCall(cufftXtSetCallback(plan, (void **)&h_storeCallbackPtr, CUFFT_CB_ST_COMPLEX, (void **)&d_scaling_factor));

        // --- Execute inverse callback FFT on the device and move the results to the host
        timerGPU.StartCounter();
        cufftSafeCall(cufftExecC2C(plan, d_output1, d_output2, CUFFT_INVERSE));
    #ifdef DISPLAY
        gpuErrchk(cudaMemcpy(h_output2, d_output2, N * sizeof(float2), cudaMemcpyDeviceToHost));
        for (int i = 0; i < N; i++)
            printf("Inverse transform - %d - (%f, %f)\n", i, h_output2[i].x, h_output2[i].y);
    #endif
        printf("Timing callback %f\n", timerGPU.GetCounter());

        cufftSafeCall(cufftDestroy(plan));

        gpuErrchk(cudaFree(d_input));
        gpuErrchk(cudaFree(d_output1));
        gpuErrchk(cudaFree(d_output2));

        return 0;
    }

For such a large 1D array and such simple processing (a scaling), the timings on a Kepler K20c are the following:

    Non-callback 69.029762 ms
    Callback     65.868607 ms

So, there is not much of an improvement. I expect that the improvement one sees is due to sparing a separate kernel call in the callback case. For smaller 1D arrays, the callback brings either no improvement or even worse performance.

