Suppose I have a sequence x(n) of length K * N in which only the first N elements are nonzero. I assume that N << K, say, for example, N = 10 and K = 100000. I want to calculate the FFT, by FFTW, of such a sequence. This is equivalent to having a sequence of length N and zero-padding it to K * N. Since N and K may be "large," I have a significant zero padding. I am exploring whether I can save some computation time by avoiding the explicit zero padding.
Case K = 2
We start by considering the case K = 2. In this case, the DFT of x(n) can be written as

$$X(k) = \sum_{n=0}^{2N-1} x(n) e^{-j \frac{2 \pi k n}{2N}} = \sum_{n=0}^{N-1} x(n) e^{-j \frac{2 \pi k n}{2N}}, \quad k = 0, \ldots, 2N - 1,$$

where the second equality holds because only the first N elements of x(n) are nonzero.
If k is even, namely k = 2 * m, then

$$X(2m) = \sum_{n=0}^{N-1} x(n) e^{-j \frac{2 \pi m n}{N}},$$
which means that such DFT values can be calculated through an FFT of a sequence of length N, and not of length 2 * N.
If k is odd, namely k = 2 * m + 1, then

$$X(2m+1) = \sum_{n=0}^{N-1} \left[ x(n) e^{-j \frac{\pi n}{N}} \right] e^{-j \frac{2 \pi m n}{N}},$$
which means that such DFT values can again be calculated through an FFT of a sequence of length N, and not of length 2 * N.
So, in conclusion, I can exchange one FFT of length 2 * N for two FFTs of length N.
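To illustrate, here is a minimal sketch of the K = 2 case (the helper name fft_zeropad_K2 and the buffer handling are my own and are not part of the benchmark further below), assuming FFTW 3:

```cpp
#include <math.h>
#include <fftw3.h>

#define PI_d 3.141592653589793

// Sketch of the K = 2 case: even bins X(2m) come from a length-N FFT of x,
// odd bins X(2m+1) from a length-N FFT of x(n) * exp(-j * pi * n / N).
// Planning inside the helper is wasteful, but keeps the sketch short.
void fft_zeropad_K2(const fftw_complex *x, fftw_complex *X, int N) {
    fftw_complex *buf  = (fftw_complex *)fftw_malloc(N * sizeof(fftw_complex));
    fftw_complex *even = (fftw_complex *)fftw_malloc(N * sizeof(fftw_complex));
    fftw_complex *odd  = (fftw_complex *)fftw_malloc(N * sizeof(fftw_complex));

    // --- Even bins: X(2m) = FFT_N{ x(n) }
    for (int n = 0; n < N; n++) { buf[n][0] = x[n][0]; buf[n][1] = x[n][1]; }
    fftw_plan p = fftw_plan_dft_1d(N, buf, even, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(p);
    fftw_destroy_plan(p);

    // --- Odd bins: X(2m+1) = FFT_N{ x(n) * exp(-j * pi * n / N) }
    for (int n = 0; n < N; n++) {
        double c =  cos(PI_d * n / N);
        double s = -sin(PI_d * n / N);
        buf[n][0] = x[n][0] * c - x[n][1] * s;
        buf[n][1] = x[n][0] * s + x[n][1] * c;
    }
    p = fftw_plan_dft_1d(N, buf, odd, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(p);
    fftw_destroy_plan(p);

    // --- Interleave the two half-spectra into the full length-2N spectrum
    for (int m = 0; m < N; m++) {
        X[2 * m][0]     = even[m][0]; X[2 * m][1]     = even[m][1];
        X[2 * m + 1][0] = odd[m][0];  X[2 * m + 1][1] = odd[m][1];
    }

    fftw_free(buf); fftw_free(even); fftw_free(odd);
}
```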
The case of arbitrary K
In this case, we have
$$X(k) = \sum_{n=0}^{N-1} x(n) e^{-j \frac{2 \pi k n}{K N}}, \quad k = 0, \ldots, K N - 1.$$
Writing k = m * K + t, with m = 0, ..., N - 1 and t = 0, ..., K - 1, we have

$$X(mK + t) = \sum_{n=0}^{N-1} \left[ x(n) e^{-j \frac{2 \pi t n}{K N}} \right] e^{-j \frac{2 \pi m n}{N}},$$

so each fixed t gives a length-N DFT of the twiddled sequence x(n) e^{-j 2 \pi t n / (K N)}.
So, in conclusion, I can exchange one FFT of length K * N for K FFTs of length N. Since FFTW has fftw_plan_many_dft, I can expect some speedup over the single-FFT case.
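To make the reordering concrete, here is the output-index mapping for a small example of my own, N = 2 and K = 3, where Y_t denotes the length-N DFT of the twiddled sequence x(n) e^{-j 2 \pi t n / (K N)}:

$$\begin{aligned}
X(0) &= Y_0(0), & X(1) &= Y_1(0), & X(2) &= Y_2(0), \\
X(3) &= Y_0(1), & X(4) &= Y_1(1), & X(5) &= Y_2(1).
\end{aligned}$$

In general, the m-th bin of the t-th length-N FFT lands at position m * K + t of the full spectrum, which is exactly the reordering performed by the third step of the code below.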
To check this, I set up the following code:
#include <stdio.h>
#include <stdlib.h>     /* srand, rand */
#include <string.h>     /* memcpy */
#include <time.h>       /* time */
#include <math.h>
#include <fftw3.h>
#include "TimingCPU.h"

#define PI_d 3.141592653589793

int main() {

    const int N = 10;
    const int K = 100000;

    fftw_complex *h_x                = (fftw_complex *)malloc(N     * sizeof(fftw_complex));
    fftw_complex *h_xzp              = (fftw_complex *)calloc(N * K,  sizeof(fftw_complex));
    fftw_complex *h_xpruning         = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));
    fftw_complex *h_xhatpruning      = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));
    fftw_complex *h_xhatpruning_temp = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));
    fftw_complex *h_xhat             = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));

    // --- Random number generation of the data sequence
    srand(time(NULL));
    for (int k = 0; k < N; k++) {
        h_x[k][0] = (double)rand() / (double)RAND_MAX;
        h_x[k][1] = (double)rand() / (double)RAND_MAX;
    }

    // --- "Standard" approach: zero-pad to length N * K and run one FFT
    memcpy(h_xzp, h_x, N * sizeof(fftw_complex));
    fftw_plan plan_zp = fftw_plan_dft_1d(N * K, h_xzp, h_xhat, FFTW_FORWARD, FFTW_ESTIMATE);

    // --- "Pruning" approach: K batched FFTs of length N
    fftw_plan plan_pruning = fftw_plan_many_dft(1, &N, K, h_xpruning, NULL, 1, N,
                                                h_xhatpruning_temp, NULL, 1, N,
                                                FFTW_FORWARD, FFTW_ESTIMATE);

    TimingCPU timerCPU;

    timerCPU.StartCounter();
    fftw_execute(plan_zp);
    printf("Standard %f\n", timerCPU.GetCounter());

    // --- Step #1: multiply the input sequence by the twiddle factors
    timerCPU.StartCounter();
    double factor = -2. * PI_d / (K * N);
    for (int k = 0; k < K; k++) {
        double arg1 = factor * k;
        for (int n = 0; n < N; n++) {
            double arg    = arg1 * n;
            double cosarg = cos(arg);
            double sinarg = sin(arg);
            h_xpruning[k * N + n][0] = h_x[n][0] * cosarg - h_x[n][1] * sinarg;
            h_xpruning[k * N + n][1] = h_x[n][0] * sinarg + h_x[n][1] * cosarg;
        }
    }
    printf("Optimized first step %f\n", timerCPU.GetCounter());

    // --- Step #2: run the K batched FFTs of length N
    timerCPU.StartCounter();
    fftw_execute(plan_pruning);
    printf("Optimized second step %f\n", timerCPU.GetCounter());

    // --- Step #3: reorder the results, X(p * K + k) = Y_k(p)
    timerCPU.StartCounter();
    for (int k = 0; k < K; k++) {
        for (int p = 0; p < N; p++) {
            h_xhatpruning[p * K + k][0] = h_xhatpruning_temp[p + k * N][0];
            h_xhatpruning[p * K + k][1] = h_xhatpruning_temp[p + k * N][1];
        }
    }
    printf("Optimized third step %f\n", timerCPU.GetCounter());

    // --- Accuracy check on the first N output bins
    double rmserror = 0., norm = 0.;
    for (int n = 0; n < N; n++) {
        rmserror = rmserror + (h_xhatpruning[n][0] - h_xhat[n][0]) * (h_xhatpruning[n][0] - h_xhat[n][0])
                            + (h_xhatpruning[n][1] - h_xhat[n][1]) * (h_xhatpruning[n][1] - h_xhat[n][1]);
        norm     = norm + h_xhat[n][0] * h_xhat[n][0] + h_xhat[n][1] * h_xhat[n][1];
    }
    printf("rmserror %f\n", 100. * sqrt(rmserror / norm));

    fftw_destroy_plan(plan_zp);
    fftw_destroy_plan(plan_pruning);

    return 0;
}
The approach that I developed consists of three steps:
- Multiplication of the input sequence by the "twiddle" complex exponentials;
- Running fftw_many;
- Reordering of the results.
fftw_many is faster than a single FFTW call on K * N input points. However, steps #1 and #3 completely destroy this gain. I would expect steps #1 and #3 to be computationally much lighter than step #2.
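For reference, this is the kind of rewrite of step #1 I have been considering (a sketch only: the phase-recurrence variant below trades the per-element cos/sin calls for repeated complex multiplications, and I have not verified its speed or its accumulated rounding error):

```cpp
#include <math.h>
#include <fftw3.h>

#define PI_d 3.141592653589793

// Sketch only (my own variant, unbenchmarked): step #1 rewritten so that the
// twiddle w(k, n) = exp(-j * 2 * pi * k * n / (K * N)) is generated by the
// recurrence w(k, n + 1) = w(k, n) * exp(-j * 2 * pi * k / (K * N)),
// instead of calling cos/sin for every (k, n) pair.
void twiddle_multiply(const fftw_complex *h_x, fftw_complex *h_xpruning,
                      int N, int K) {
    const double factor = -2. * PI_d / (K * N);
    for (int k = 0; k < K; k++) {
        // Per-k phase step exp(-j * 2 * pi * k / (K * N)); cos/sin are now
        // called 2 * K times instead of 2 * K * N times.
        const double stepc = cos(factor * k);
        const double steps = sin(factor * k);
        double wc = 1., ws = 0.;  // w(k, 0) = 1
        for (int n = 0; n < N; n++) {
            h_xpruning[k * N + n][0] = h_x[n][0] * wc - h_x[n][1] * ws;
            h_xpruning[k * N + n][1] = h_x[n][0] * ws + h_x[n][1] * wc;
            const double t = wc * stepc - ws * steps;  // advance the phasor
            ws = wc * steps + ws * stepc;
            wc = t;
        }
    }
}
```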
My questions:
- How can steps #1 and #3 be so much more computationally demanding than step #2?
- How can I improve steps #1 and #3 to obtain a net gain over the "standard" approach?
Thanks so much for any hint.
EDIT
I work with Visual Studio 2013 and compile in release mode.