PGFFT: A Pretty Good FFT


PGFFT is C++ code for performing a complex FFT, distributed under a FreeBSD license.

PGFFT requires C++11.

PGFFT provides a very simple interface. For example:

   #include "PGFFT.h"
     ...
   PGFFT fft(n); 
   // Initialize data structures for n-point FFT.
   // n can be any positive integer.
   // Uses Bluestein's algorithm with truncated FFT
   // for non-powers-of-two.
     ...
   const std::complex<double> *src;
   std::complex<double> *dst;
     ...
   fft.apply(src, dst); 
   // Apply n-point FFT to src[0..n-1], storing result in dst[0..n-1].
   // That is,
   //   dst[i] = \sum_{j=0}^{n-1} src[j] W^{ij}
   // and where W is the nth root of unity exp(-2*pi*\sqrt{-1}/n).
   // src and dst may be equal, but should not otherwise overlap.
     ...
   fft.apply(dst);
   // same as fft.apply(dst, dst);

PGFFT is written in pure C++, except for the core inner loops, which may use specialized SIMD intrinsics for better performance. Currently, the only architectures with SIMD support are AVX and AVX2. The code automatically makes this choice when compiling PGFFT.cpp, based on pre-defined macros, but you will typically have to pass a compiler option like -march=native to enable the use of SIMD intrinsics. You may also completely disable the use of these intrinsics by passing the option -DPGFFT_DISABLE_SIMD when compiling PGFFT.cpp.

Downloading and using PGFFT. Just download and unpack the latest tarball PGFFT-xxx.tar, and unpack.

You will end up with several files into a directory PGFFT-xxx. In that directory, the only files you really need are PGFFT.cpp and PGFFT.h. There is no configuring or making necessary: just include these two files in your own project.

For best performance, it is recommended to compile PGFFT.cpp with the compiler options -O2 -march=native. Surprisingly, on the system I used to run the above performance tests, -O2 yielded significantly faster code than -O3. Your mileage may vary. You may pass the option -DPGFFT_DISABLE_SIMD to disable SIMD intrinsics. You may need to pass the option -std=c++11 to get C++11.

The directory PGFFT-xxx contains a few other files that may be helpful:

I have tested PGFFT on macOS using CLANG, Linux using GCC (versions 4.8.5 and 7.3.1), and Linux using ICC.

Performance: comparison against FFTW. I compared its performance to that of FFTW, which nowadays is the "gold standard" for FFT implementations. Note that FFTW is available only under a GPL license, or under less restrictive licenses that cost $$.

In this comparison, I used GCC 7.3.1 on a Haswell (AVX2) processor (Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz). PGFFT was compiled with compiler options --march=native -O2 while FFTW (version 3.3.8) was configured with --enable-avx2 . The timings were done using the program PGFFT_time1.cpp, which is included in the downloads. For each n, and each algorithm, the best time was taken from three runs of an FFT of size n. The first four sizes are powers of two, the last four sizes are primes. Times are in microseconds, as measured by getrusage (user plus system). Times do not include pre-computation of tables based on the transform size.

n               1024     4096    16384    65536     1409     6007    24001    96001
PGFFT (ms)   6.2e-03  3.2e-02  1.7e-01  8.1e-01  4.4e-02  2.3e-01  1.1e+00  5.4e+00
FFTW  (ms)   2.2e-03  1.5e-02  1.0e-01  4.7e-01  3.2e-02  2.2e-01  1.0e+00  4.9e+00
Here are times for larger power-of-two sizes (256K, 1M, 4M, 16MB):
n             262144  1048576  4194304 16777216
PGFFT (ms)   3.8e+00  2.2e+01  1.2e+02  5.7e+02
FFTW  (ms)   2.1e+00  1.5e+01  1.1e+02  5.0e+02

So one can see that for sizes that are powers of two and primes, PGFFT is competitive with FFTW.

In fairness, FFTW performs much better than PGFFT on "very smooth" sizes that are not powers of two:

n              19683    30030
PGFFT (ms)   1.0e+00  1.5e+00
FFTW  (ms)   1.7e-01  2.5e-01

Here, the first size is 39, and the second is 2 · 3 · 5 · 7 · 11 · 13. For these sizes, PGFFT is using Bluestein and FFTW is (presumably) using a direct recursion, which explains the increased relative performance of FFTW.

In contrast, here are times for some not-very-smooth composite sizes:

n              18631    21845    28679
PGFFT (ms)   1.0e+00  1.1e+00  1.4e+00
FFTW  (ms)   8.0e-01  7.4e-01  1.0e+00
These sizes factor as: 31·601, 5·17·257, 7·17·241. So PGFFT is competitive with FFTW on these types of sizes.

Also in fairness, FFTW is built to exploit SIMD instructions on a much wider array of machines than PGFFT. If anyone out there wants to contribute SIMD code for other machines, let me know!

Also in fairness, FFTW provides much more functionality than PGFFT.

All of the above FFTW times are based on "plans" built with the FFTW_MEASURE option, which determines the best algorithm based on run-time measurements. This process takes some time (usually a few seconds), depending on the machine and on the size of the transform. Here are some corresponding times based on plans built with the FFTW_ESTIMATE option, which uses compile-time heuristics to determine which algorithm to use.

n               1024     4096    16384    65536     1409     6007    24001    96001
PGFFT (ms)   6.1e-03  3.1e-02  1.6e-01  8.0e-01  4.3e-02  2.3e-01  1.1e+00  5.4e+00
FFTW  (ms)   4.1e-03  3.0e-02  1.7e-01  7.8e-01  5.7e-02  3.4e-01  1.2e+00  6.2e+00

n             262144  1048576  4194304 16777216
PGFFT (ms)   3.7e+00  2.2e+01  1.2e+02  5.7e+02
FFTW  (ms)   4.9e+00  2.2e+01  2.0e+02  9.5e+02

n              19683    30030
PGFFT (ms)   1.0e+00  1.4e+00
FFTW  (ms)   2.3e-01  3.3e-01

n              18631    21845    28679
PGFFT (ms)   1.0e+00  1.1e+00  1.4e+00
FFTW  (ms)   1.0e+00  1.0e+00  1.4e+00
One can see that with this option, PGFFT is very competitive with (and sometimes faster than) FFTW for power-of-two and prime transform sizes. However, FFTW still does much better than PGFFT for very smooth sizes that are not powers of two.

Why does this software exist? For another project, I needed a pretty good FFT with performance approaching that of FFTW, but without a restrictive license like GPL. For that project, I was mainly interested in transforms that are of size in the range 15,000 to 60,000, which is either a power of two, or some number that typically has two or three prime factors. After some looking around on the Internet, I failed to find any such code.

I developed this code by re-purposing NTL's small-prime FFT (a.k.a., NTT), which itself is based in part of some earlier code written by David Harvey (see the PGFFT source code more more details).

PocketFFT is another FFT implementation that is comparable to PGFFT, in terms of performance and licensing. I compared PGFFT's performance with that of PocketFFT, and found PGFFT (under the same test conditions as above) to be a bit faster than PocketFFT for the transform sizes I was mainly interested in, while PGFFT is somewhat slower than PocketFFT for sizes that are very smooth non-powers-of-two. I did not know about PocketFFT when I started working on PGFFT. If I had, I may just have used it instead.


Back to Victor Shoup's Home Page