Accelerating FFTW pruning to avoid massive zero padding

Suppose I have a sequence x(n) whose length is K * N and in which only the first N elements are nonzero. I assume that N << K, say, for example, N = 10 and K = 100000. I want to calculate the FFT of such a sequence with FFTW. This is equivalent to having a sequence of length N and zero-padding it to K * N. Since N and K may be "large", I have a significant amount of zero padding. I am investigating whether I can save some computation time by avoiding the explicit zero padding.

The case K = 2

Let us start by considering the case K = 2. In this case, the DFT of x(n) can be written as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2 \pi k n / (2N)}, \qquad k = 0, \ldots, 2N - 1$$

If k is even, namely k = 2 * m, then

$$X(2m) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2 \pi m n / N}, \qquad m = 0, \ldots, N - 1$$

which means that those DFT values can be calculated by an FFT of length N, and not 2 * N.

If k is odd, namely k = 2 * m + 1, then

$$X(2m + 1) = \sum_{n=0}^{N-1} \left[ x(n)\, e^{-j \pi n / N} \right] e^{-j 2 \pi m n / N}, \qquad m = 0, \ldots, N - 1$$

which means that those DFT values can again be calculated by an FFT of length N, and not 2 * N.

So, in conclusion, I can exchange one FFT of length 2 * N for 2 FFTs of length N.
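As a quick numerical check of the K = 2 splitting, here is a minimal sketch (my own test harness, not part of the timing code below; the buffer names are chosen only for illustration) that compares the even bins against a zero-padded FFT of length 2 * N and the odd bins against a length-N FFT of the twiddled input:

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <fftw3.h>

    #define PI_d 3.141592653589793

    int main() {
        const int N = 8;

        fftw_complex *x   = fftw_alloc_complex(N);        // nonzero part of the sequence
        fftw_complex *xzp = fftw_alloc_complex(2 * N);     // zero-padded copy
        fftw_complex *Xzp = fftw_alloc_complex(2 * N);     // reference spectrum
        fftw_complex *xtw = fftw_alloc_complex(N);         // twiddled input for the odd bins
        fftw_complex *Xe  = fftw_alloc_complex(N);         // even bins
        fftw_complex *Xo  = fftw_alloc_complex(N);         // odd bins

        // --- Plans are created before the arrays are initialized (recommended FFTW order)
        fftw_plan pzp = fftw_plan_dft_1d(2 * N, xzp, Xzp, FFTW_FORWARD, FFTW_ESTIMATE);
        fftw_plan pe  = fftw_plan_dft_1d(N, x,   Xe,  FFTW_FORWARD, FFTW_ESTIMATE);
        fftw_plan po  = fftw_plan_dft_1d(N, xtw, Xo,  FFTW_FORWARD, FFTW_ESTIMATE);

        srand(0);
        for (int n = 0; n < 2 * N; n++) { xzp[n][0] = 0.; xzp[n][1] = 0.; }
        for (int n = 0; n < N; n++) {
            x[n][0] = xzp[n][0] = (double)rand() / RAND_MAX;
            x[n][1] = xzp[n][1] = (double)rand() / RAND_MAX;
            // --- Twiddle: x(n) * exp(-j * pi * n / N), used for the odd-indexed bins
            double c = cos(-PI_d * n / N), s = sin(-PI_d * n / N);
            xtw[n][0] = x[n][0] * c - x[n][1] * s;
            xtw[n][1] = x[n][0] * s + x[n][1] * c;
        }

        fftw_execute(pzp);      // reference: FFT of length 2 * N
        fftw_execute(pe);       // even bins: FFT of length N of x(n)
        fftw_execute(po);       // odd bins:  FFT of length N of the twiddled x(n)

        double err = 0.;
        for (int m = 0; m < N; m++) {
            err += fabs(Xzp[2 * m][0]     - Xe[m][0]) + fabs(Xzp[2 * m][1]     - Xe[m][1]);
            err += fabs(Xzp[2 * m + 1][0] - Xo[m][0]) + fabs(Xzp[2 * m + 1][1] - Xo[m][1]);
        }
        printf("total absolute error = %e\n", err);   // should be at round-off level

        fftw_destroy_plan(pzp); fftw_destroy_plan(pe); fftw_destroy_plan(po);
        fftw_free(x); fftw_free(xzp); fftw_free(Xzp); fftw_free(xtw); fftw_free(Xe); fftw_free(Xo);
        return 0;
    }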

The case of arbitrary K

In this case, we have

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2 \pi k n / (K N)}, \qquad k = 0, \ldots, K N - 1$$

Writing k = m * K + t, with m = 0, ..., N - 1 and t = 0, ..., K - 1, we have

$$X(m K + t) = \sum_{n=0}^{N-1} \left[ x(n)\, e^{-j 2 \pi t n / (K N)} \right] e^{-j 2 \pi m n / N}, \qquad m = 0, \ldots, N - 1, \quad t = 0, \ldots, K - 1$$

So, in conclusion, I can exchange one FFT of length K * N for K FFTs of length N. Since FFTW provides fftw_plan_many_dft, I can expect some speedup over the single-FFT case.
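For reference, the batched plan used in the code below boils down to a single fftw_plan_many_dft call that performs K independent, contiguous length-N transforms (the buffer names match those of the listing that follows):

    // --- One plan for K contiguous, independent length-N complex DFTs:
    //     transform k reads from h_xpruning + k * N and writes to h_xhatpruning_temp + k * N
    fftw_plan plan_pruning = fftw_plan_many_dft(
        1, &N, K,                            // rank 1, size N, K transforms
        h_xpruning, NULL, 1, N,              // input:  stride 1, distance N between transforms
        h_xhatpruning_temp, NULL, 1, N,      // output: stride 1, distance N between transforms
        FFTW_FORWARD, FFTW_ESTIMATE);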

To verify this, I set up the following code:

    #include <stdio.h>
    #include <stdlib.h>     /* srand, rand */
    #include <string.h>     /* memcpy */
    #include <time.h>       /* time */
    #include <math.h>
    #include <fstream>
    #include <fftw3.h>
    #include "TimingCPU.h"

    #define PI_d 3.141592653589793

    int main() {

        const int N = 10;
        const int K = 100000;

        // --- Host memory allocations
        fftw_complex *h_x                = (fftw_complex *)malloc(N     * sizeof(fftw_complex));
        fftw_complex *h_xzp              = (fftw_complex *)calloc(N * K, sizeof(fftw_complex));
        fftw_complex *h_xpruning         = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));
        fftw_complex *h_xhatpruning      = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));
        fftw_complex *h_xhatpruning_temp = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));
        fftw_complex *h_xhat             = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));

        // --- Random number generation of the data sequence
        srand(time(NULL));
        for (int k = 0; k < N; k++) {
            h_x[k][0] = (double)rand() / (double)RAND_MAX;
            h_x[k][1] = (double)rand() / (double)RAND_MAX;
        }

        memcpy(h_xzp, h_x, N * sizeof(fftw_complex));

        // --- Plans: one zero-padded FFT of length K * N, and K batched FFTs of length N
        fftw_plan plan_zp      = fftw_plan_dft_1d(N * K, h_xzp, h_xhat, FFTW_FORWARD, FFTW_ESTIMATE);
        fftw_plan plan_pruning = fftw_plan_many_dft(1, &N, K, h_xpruning, NULL, 1, N,
                                                    h_xhatpruning_temp, NULL, 1, N,
                                                    FFTW_FORWARD, FFTW_ESTIMATE);

        TimingCPU timerCPU;

        // --- Standard approach: single FFT of the zero-padded sequence
        timerCPU.StartCounter();
        fftw_execute(plan_zp);
        printf("Standard %f\n", timerCPU.GetCounter());

        // --- Step #1: multiplication by the twiddle factors
        timerCPU.StartCounter();
        double factor = -2. * PI_d / (K * N);
        for (int k = 0; k < K; k++) {
            double arg1 = factor * k;
            for (int n = 0; n < N; n++) {
                double arg    = arg1 * n;
                double cosarg = cos(arg);
                double sinarg = sin(arg);
                h_xpruning[k * N + n][0] = h_x[n][0] * cosarg - h_x[n][1] * sinarg;
                h_xpruning[k * N + n][1] = h_x[n][0] * sinarg + h_x[n][1] * cosarg;
            }
        }
        printf("Optimized first step %f\n", timerCPU.GetCounter());

        // --- Step #2: K batched FFTs of length N
        timerCPU.StartCounter();
        fftw_execute(plan_pruning);
        printf("Optimized second step %f\n", timerCPU.GetCounter());

        // --- Step #3: reordering of the results
        timerCPU.StartCounter();
        for (int k = 0; k < K; k++) {
            for (int p = 0; p < N; p++) {
                h_xhatpruning[p * K + k][0] = h_xhatpruning_temp[p + k * N][0];
                h_xhatpruning[p * K + k][1] = h_xhatpruning_temp[p + k * N][1];
            }
        }
        printf("Optimized third step %f\n", timerCPU.GetCounter());

        // --- Accuracy check against the zero-padded reference
        double rmserror = 0., norm = 0.;
        for (int n = 0; n < N; n++) {
            rmserror = rmserror + (h_xhatpruning[n][0] - h_xhat[n][0]) * (h_xhatpruning[n][0] - h_xhat[n][0])
                                + (h_xhatpruning[n][1] - h_xhat[n][1]) * (h_xhatpruning[n][1] - h_xhat[n][1]);
            norm     = norm + h_xhat[n][0] * h_xhat[n][0] + h_xhat[n][1] * h_xhat[n][1];
        }
        printf("rmserror %f\n", 100. * sqrt(rmserror / norm));

        fftw_destroy_plan(plan_zp);
        fftw_destroy_plan(plan_pruning);

        return 0;
    }

The approach that I developed consists of three steps:

  • Multiplication of the input sequence by the "twiddle" complex exponentials;
  • Execution of fftw_plan_many_dft;
  • Reordering of the results.

Running fftw_plan_many_dft is faster than a single FFTW call on K * N input points. However, steps #1 and #3 completely destroy this gain. I would expect steps #1 and #3 to be computationally much lighter than step #2.

My questions:

  • How is it possible that steps #1 and #3 are so much more computationally demanding than step #2?
  • How can I improve steps #1 and #3 to obtain a net gain over the "standard" approach?

Thanks so much for any hint.

EDIT

I work with Visual Studio 2013 and compile in release mode.

+7
c++ c fft fftw
3 answers

For the third step, you may need to switch the order of the loops:

    for (int p = 0; p < N; p++) {
        for (int k = 0; k < K; k++) {
            h_xhatpruning[p * K + k][0] = h_xhatpruning_temp[p + k * N][0];
            h_xhatpruning[p * K + k][1] = h_xhatpruning_temp[p + k * N][1];
        }
    }

since it is usually more advantageous for the store addresses to be contiguous than the load addresses.

Either way, you have a cache-unfriendly access pattern. You could try working in blocks to improve this, e.g., assuming N is a multiple of 4:

    for (int p = 0; p < N; p += 4) {
        for (int k = 0; k < K; k++) {
            for (int p0 = 0; p0 < 4; p0++) {
                h_xhatpruning[(p + p0) * K + k][0] = h_xhatpruning_temp[(p + p0) + k * N][0];
                h_xhatpruning[(p + p0) * K + k][1] = h_xhatpruning_temp[(p + p0) + k * N][1];
            }
        }
    }

This should help to somewhat reduce the cache line thrashing. If it does, then also experiment with block sizes other than 4 to see whether there is a "sweet spot".
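A further variant along these lines, sketched below, tiles the copy in both directions so that each tile of h_xhatpruning_temp and h_xhatpruning stays resident in cache while it is being copied. The helper name transpose_blocked and the block sizes BP and BK are illustrative only and should be tuned experimentally:

    #include <fftw3.h>

    // --- Sketch of a transpose blocked in both directions; BP and BK are
    //     illustrative block sizes to be tuned experimentally
    void transpose_blocked(fftw_complex * __restrict dst,
                           const fftw_complex * __restrict src,
                           const int N, const int K) {
        const int BP = 4;       // block size along the length-N dimension
        const int BK = 64;      // block size along the length-K dimension
        for (int p = 0; p < N; p += BP) {
            for (int k = 0; k < K; k += BK) {
                const int pmax = (p + BP < N) ? p + BP : N;
                const int kmax = (k + BK < K) ? k + BK : K;
                // --- Copy one BP x BK tile: the writes to dst are contiguous in the inner
                //     loop, and the strided reads from src revisit the same small set of
                //     cache lines for every p0 within the tile
                for (int p0 = p; p0 < pmax; p0++) {
                    for (int k0 = k; k0 < kmax; k0++) {
                        dst[p0 * K + k0][0] = src[p0 + k0 * N][0];
                        dst[p0 * K + k0][1] = src[p0 + k0 * N][1];
                    }
                }
            }
        }
    }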

+2

A few options for running faster:

  • Go multi-threaded if you are currently single-threaded and have multiple cores available (a minimal setup sketch follows after this list).

  • Generate and save an FFTW wisdom file, especially if the FFT sizes are known in advance. Use FFTW_EXHAUSTIVE, and reload the FFTW wisdom rather than recomputing it every time. This also matters if you want your results to be consistent: since FFTW can compute the FFTs in different ways with different computed wisdom, and the wisdom results will not necessarily always be the same, different runs of your process can produce different results when given identical input data.

  • If you are on x86, run a 64-bit build. The FFTW algorithms are extremely register intensive, and an x86 CPU running in 64-bit mode has many more general-purpose registers available than when running in 32-bit mode.

  • Since the FFTW algorithms are so register intensive, I have had good success improving FFTW performance by compiling FFTW with compiler options that prevent the use of prefetching and prevent the implicit inlining of functions.
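A minimal sketch of the first two points (it assumes FFTW was built with thread support, i.e. you also link against the FFTW threads library; the helper names and the file name "fftw_wisdom.dat" are chosen here only for illustration):

    #include <fftw3.h>

    // --- Call once at program start-up, before creating any plans
    void setupFFTW(const int nthreads) {
        fftw_init_threads();                               // enable the multi-threaded FFTW routines
        fftw_plan_with_nthreads(nthreads);                 // applies to all plans created afterwards
        // --- Reuse previously accumulated wisdom if available; otherwise the
        //     planner (e.g. with FFTW_EXHAUSTIVE) will generate it from scratch
        fftw_import_wisdom_from_filename("fftw_wisdom.dat");
    }

    // --- Call once after all plans have been created
    void saveWisdom() {
        fftw_export_wisdom_to_filename("fftw_wisdom.dat");
    }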

+5

Also following Paul R's comments, I improved my code. Now the alternative approach is faster than the standard (zero-padded) one. Below is the full C++ script. For steps #1 and #3, I have left in, commented out, the other solutions I tried, which turned out to be slower than or no faster than the uncommented ones. I have favored non-nested for loops, also with a view towards simpler future CUDA parallelization. I am not yet using multithreading for FFTW.

    #include <stdio.h>
    #include <stdlib.h>     /* srand, rand */
    #include <string.h>     /* memcpy */
    #include <time.h>       /* time */
    #include <math.h>
    #include <fstream>
    #include <omp.h>
    #include <fftw3.h>
    #include "TimingCPU.h"

    #define PI_d 3.141592653589793

    /******************/
    /* STEP #1 ON CPU */
    /******************/
    void step1CPU(fftw_complex * __restrict h_xpruning, const fftw_complex * __restrict h_x, const int N, const int K) {

        // --- Tried alternative: nested OpenMP, k outer / n inner
        //double factor = -2. * PI_d / (K * N);
        //int n;
        //omp_set_nested(1);
        //#pragma omp parallel for private(n) num_threads(4)
        //for (int k = 0; k < K; k++) {
        //    double arg1 = factor * k;
        //    #pragma omp parallel for num_threads(4)
        //    for (n = 0; n < N; n++) {
        //        double arg = arg1 * n;
        //        double cosarg = cos(arg);
        //        double sinarg = sin(arg);
        //        h_xpruning[k * N + n][0] = h_x[n][0] * cosarg - h_x[n][1] * sinarg;
        //        h_xpruning[k * N + n][1] = h_x[n][0] * sinarg + h_x[n][1] * cosarg;
        //    }
        //}

        // --- Tried alternative: nested OpenMP, n outer / k inner
        //double factor = -2. * PI_d / (K * N);
        //int k;
        //omp_set_nested(1);
        //#pragma omp parallel for private(k) num_threads(4)
        //for (int n = 0; n < N; n++) {
        //    double arg1 = factor * n;
        //    #pragma omp parallel for num_threads(4)
        //    for (k = 0; k < K; k++) {
        //        double arg = arg1 * k;
        //        double cosarg = cos(arg);
        //        double sinarg = sin(arg);
        //        h_xpruning[k * N + n][0] = h_x[n][0] * cosarg - h_x[n][1] * sinarg;
        //        h_xpruning[k * N + n][1] = h_x[n][0] * sinarg + h_x[n][1] * cosarg;
        //    }
        //}

        // --- Tried alternative: sequential, k outer / n inner
        //double factor = -2. * PI_d / (K * N);
        //for (int k = 0; k < K; k++) {
        //    double arg1 = factor * k;
        //    for (int n = 0; n < N; n++) {
        //        double arg = arg1 * n;
        //        double cosarg = cos(arg);
        //        double sinarg = sin(arg);
        //        h_xpruning[k * N + n][0] = h_x[n][0] * cosarg - h_x[n][1] * sinarg;
        //        h_xpruning[k * N + n][1] = h_x[n][0] * sinarg + h_x[n][1] * cosarg;
        //    }
        //}

        // --- Tried alternative: sequential, n outer / k inner
        //double factor = -2. * PI_d / (K * N);
        //for (int n = 0; n < N; n++) {
        //    double arg1 = factor * n;
        //    for (int k = 0; k < K; k++) {
        //        double arg = arg1 * k;
        //        double cosarg = cos(arg);
        //        double sinarg = sin(arg);
        //        h_xpruning[k * N + n][0] = h_x[n][0] * cosarg - h_x[n][1] * sinarg;
        //        h_xpruning[k * N + n][1] = h_x[n][0] * sinarg + h_x[n][1] * cosarg;
        //    }
        //}

        // --- Adopted solution: single non-nested loop over all K * N elements
        double factor = -2. * PI_d / (K * N);
        #pragma omp parallel for num_threads(8)
        for (int n = 0; n < K * N; n++) {
            int row = n / N;
            int col = n % N;
            double arg = factor * row * col;
            double cosarg = cos(arg);
            double sinarg = sin(arg);
            h_xpruning[n][0] = h_x[col][0] * cosarg - h_x[col][1] * sinarg;
            h_xpruning[n][1] = h_x[col][0] * sinarg + h_x[col][1] * cosarg;
        }
    }

    /******************/
    /* STEP #3 ON CPU */
    /******************/
    void step3CPU(fftw_complex * __restrict h_xhatpruning, const fftw_complex * __restrict h_xhatpruning_temp, const int N, const int K) {

        // --- Tried alternative: nested OpenMP, p outer / k inner
        //int k;
        //omp_set_nested(1);
        //#pragma omp parallel for private(k) num_threads(4)
        //for (int p = 0; p < N; p++) {
        //    #pragma omp parallel for num_threads(4)
        //    for (k = 0; k < K; k++) {
        //        h_xhatpruning[p * K + k][0] = h_xhatpruning_temp[p + k * N][0];
        //        h_xhatpruning[p * K + k][1] = h_xhatpruning_temp[p + k * N][1];
        //    }
        //}

        // --- Tried alternative: nested OpenMP, k outer / p inner
        //int p;
        //omp_set_nested(1);
        //#pragma omp parallel for private(p) num_threads(4)
        //for (int k = 0; k < K; k++) {
        //    #pragma omp parallel for num_threads(4)
        //    for (p = 0; p < N; p++) {
        //        h_xhatpruning[p * K + k][0] = h_xhatpruning_temp[p + k * N][0];
        //        h_xhatpruning[p * K + k][1] = h_xhatpruning_temp[p + k * N][1];
        //    }
        //}

        // --- Tried alternative: sequential, p outer / k inner
        //for (int p = 0; p < N; p++) {
        //    for (int k = 0; k < K; k++) {
        //        h_xhatpruning[p * K + k][0] = h_xhatpruning_temp[p + k * N][0];
        //        h_xhatpruning[p * K + k][1] = h_xhatpruning_temp[p + k * N][1];
        //    }
        //}

        // --- Tried alternative: sequential, k outer / p inner
        //for (int k = 0; k < K; k++) {
        //    for (int p = 0; p < N; p++) {
        //        h_xhatpruning[p * K + k][0] = h_xhatpruning_temp[p + k * N][0];
        //        h_xhatpruning[p * K + k][1] = h_xhatpruning_temp[p + k * N][1];
        //    }
        //}

        // --- Adopted solution: single non-nested loop over all K * N elements;
        //     row = p / N and col = p % N so that col + row * N == p
        #pragma omp parallel for num_threads(8)
        for (int p = 0; p < K * N; p++) {
            int col = p % N;
            int row = p / N;
            h_xhatpruning[col * K + row][0] = h_xhatpruning_temp[col + row * N][0];
            h_xhatpruning[col * K + row][1] = h_xhatpruning_temp[col + row * N][1];
        }

        // --- Tried alternative: blocked reordering
        //for (int p = 0; p < N; p += 2) {
        //    for (int k = 0; k < K; k++) {
        //        for (int p0 = 0; p0 < 2; p0++) {
        //            h_xhatpruning[(p + p0) * K + k][0] = h_xhatpruning_temp[(p + p0) + k * N][0];
        //            h_xhatpruning[(p + p0) * K + k][1] = h_xhatpruning_temp[(p + p0) + k * N][1];
        //        }
        //    }
        //}
    }

    /********/
    /* MAIN */
    /********/
    int main() {

        int N = 10;
        int K = 100000;

        // --- CPU memory allocations
        fftw_complex *h_x                = (fftw_complex *)malloc(N     * sizeof(fftw_complex));
        fftw_complex *h_xzp              = (fftw_complex *)calloc(N * K, sizeof(fftw_complex));
        fftw_complex *h_xpruning         = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));
        fftw_complex *h_xhatpruning      = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));
        fftw_complex *h_xhatpruning_temp = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));
        fftw_complex *h_xhat             = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));
        //double2 *h_xhatGPU = (double2 *)malloc(N * K * sizeof(double2));

        // --- Random number generation of the data sequence on the CPU - moving the data from CPU to GPU
        srand(time(NULL));
        for (int k = 0; k < N; k++) {
            h_x[k][0] = (double)rand() / (double)RAND_MAX;
            h_x[k][1] = (double)rand() / (double)RAND_MAX;
        }
        //gpuErrchk(cudaMemcpy(d_x, h_x, N * sizeof(double2), cudaMemcpyHostToDevice));

        memcpy(h_xzp, h_x, N * sizeof(fftw_complex));

        // --- FFTW and cuFFT plans
        fftw_plan h_plan_zp      = fftw_plan_dft_1d(N * K, h_xzp, h_xhat, FFTW_FORWARD, FFTW_ESTIMATE);
        fftw_plan h_plan_pruning = fftw_plan_many_dft(1, &N, K, h_xpruning, NULL, 1, N,
                                                      h_xhatpruning_temp, NULL, 1, N,
                                                      FFTW_FORWARD, FFTW_ESTIMATE);

        double totalTimeCPU = 0., totalTimeGPU = 0.;
        double partialTimeCPU, partialTimeGPU;

        /****************************/
        /* STANDARD APPROACH ON CPU */
        /****************************/
        printf("Number of processors available = %i\n", omp_get_num_procs());
        printf("Number of threads = %i\n", omp_get_max_threads());

        TimingCPU timerCPU;
        timerCPU.StartCounter();
        fftw_execute(h_plan_zp);
        printf("\nStandard on CPU: \t \t %f\n", timerCPU.GetCounter());

        /******************/
        /* STEP #1 ON CPU */
        /******************/
        timerCPU.StartCounter();
        step1CPU(h_xpruning, h_x, N, K);
        partialTimeCPU = timerCPU.GetCounter();
        totalTimeCPU = totalTimeCPU + partialTimeCPU;
        printf("\nOptimized first step CPU: \t %f\n", totalTimeCPU);

        /******************/
        /* STEP #2 ON CPU */
        /******************/
        timerCPU.StartCounter();
        fftw_execute(h_plan_pruning);
        partialTimeCPU = timerCPU.GetCounter();
        totalTimeCPU = totalTimeCPU + partialTimeCPU;
        printf("Optimized second step CPU: \t %f\n", timerCPU.GetCounter());

        /******************/
        /* STEP #3 ON CPU */
        /******************/
        timerCPU.StartCounter();
        step3CPU(h_xhatpruning, h_xhatpruning_temp, N, K);
        partialTimeCPU = timerCPU.GetCounter();
        totalTimeCPU = totalTimeCPU + partialTimeCPU;
        printf("Optimized third step CPU: \t %f\n", partialTimeCPU);
        printf("Total time CPU: \t \t %f\n", totalTimeCPU);

        // --- Accuracy check against the zero-padded reference
        double rmserror = 0., norm = 0.;
        for (int n = 0; n < N; n++) {
            rmserror = rmserror + (h_xhatpruning[n][0] - h_xhat[n][0]) * (h_xhatpruning[n][0] - h_xhat[n][0])
                                + (h_xhatpruning[n][1] - h_xhat[n][1]) * (h_xhatpruning[n][1] - h_xhat[n][1]);
            norm     = norm + h_xhat[n][0] * h_xhat[n][0] + h_xhat[n][1] * h_xhat[n][1];
        }
        printf("\nrmserror %f\n", 100. * sqrt(rmserror / norm));

        fftw_destroy_plan(h_plan_zp);
        fftw_destroy_plan(h_plan_pruning);

        return 0;
    }

For the case

    N = 10
    K = 100000

my timings are the following:

    Standard on CPU:            23.895417
    Optimized first step CPU:    4.472087
    Optimized second step CPU:   4.926603
    Optimized third step CPU:    2.394958
    Total time CPU:             11.793648
+1
