Home

Awesome

meow_fft

My Easy Oresome Wonderfull FFT

By Richard Maxwell

A simple, C99, header only, 0-Clause BSD Licensed, fast fourier transform (FFT).

Example

    #define MEOW_FFT_IMPLEMENTATION
    #include <meow_fft.h>

    #include <malloc.h>

    void main(char** argv, int argv)
    {
        (void) argv;
        (void) argc;

        unsigned          N    = 1024;
        unsigned          N_2  = ((N + 1) / 2);
        float*            in   = malloc(sizeof(float) * N);
        Meow_FFT_Complex* out  = malloc(sizeof(Meow_FFT_Complex) * N_2);
        Meow_FFT_Complex* temp = malloc(sizeof(Meow_FFT_Complex) * N_2);
        // Real only FFTs output half the amount of inputs (N/2 rounded up)
        // Full complex FFTs output all of them (N)
        // This aaplies to the temp array as well (N/2 rounded up)

        // prepare data for "in" array.
        // ...

        size_t workset_bytes = meow_fft_generate_workset_real(N, NULL);
        // Get size for a N point fft working on non-complex (real) data.

        Meow_FFT_Workset_Real* fft_real =
            (Meow_FFT_Workset_Real*) malloc(workset_bytes);

        meow_fft_generate_workset_real(N, fft_real);

        meow_fft_real(fft_real, in, out);
        // out[0].r == out[0  ].r
        // out[0].j == out[N/2].r

        meow_fft_real_i(fft_real, in, temp, out);
        // result is not scaled, need to divide all values by N

        free(fft_real);
        free(out);
        free(temp);
        free(in);
    }

Usage

Since this is a single header library, just make a C file with the lines:

    #define MEOW_FFT_IMPLEMENTATION
    #include <meow_fft.h>

There are two sets of functions. Ones dealing with sequential interleaved floating point complex numbers, and ones dealing with sequential floating point real numbers (postfixed with _real).

Forward FFTs are labelled _fft while reverse FFTs are labelled _fft_i.

The function _is_slow can be used to tell if you have a non-optimised radix calculation in your fft (ie the slow DFT is called). This will also increase the memory requirements required by the workset.

All functions are namespaced with meow_ and all Types by Meow_.

Why?

I thought I could write a faster FFT that kiss_fft, since I couldn't use FFTW3 due to its GPL license. LOL, If I knew about the pffft library, I would have just used that instead.

¯\_(ツ)_/¯

Performance

I found changing compiler flags can make the FFT go faster or slower depending on what you want to do. For example, using gcc with -march=native on my i7 resulted in the > 2048 FFTs going faster, but the < 2048 FFTs going twice as slow. I also got mixed results with -ffast-math. Basically, you need to figure out what FFTs you are going to use, and then benchmark various compiler options for your target platforms in order to get any useful compiler based performance increases.

Reading List

FFT Implementation

I have implemented a non-scaled, float based decimation in time, mixed-radix, out of place, in order result fast fourier transform with sequentially accessed twiddle factors per stage, with separate forward and reverse functions. It has custom codelets for radices: 2,3,4,5 and 8, as well as a slow general discrete fourier transform (DFT) for all other prime number radices.

Secondly, I have also a real only FFT that uses symmetrical based mixing in order to do a two for one normal FFT using real data.

I wrote my FFT using kiss_fft, and engineeringproductivitytools as a guide, as well as many days and nights going "wtf, I have no idea what I'm doing". I used FFTW's fft codelet compilers to generate my radix-8 codelet, as doing code simplification by hand would have taken me another six months.

All in all it took me one year of part time coding to get this releasable.

Could be faster if

Benchmarks

Test Platforms

PlatformGCC version
Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz6.3.1
ARMv7 Processor rev 10 (v7l) @ 1.00 GHz4.8.1
Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz6.3.0

Build Line

All builds used GCC with -O2 -g. The ARM build use the command line options:

    -march=armv7-a -mthumb-interwork -mfloat-abi=hard -mfpu=neon
    -mtune=cortex-a9

Measurement Procedure

The time taken to do an N point FFT every 32 samples of a 5 second 16 bit mono 44.1Khz buffer (signed 16 bit values) was measured. This was then divided by the number of FFT calculations performed to give a value of microseconds per FFT. This was done 5 times and the median value was reported.

Results for meow_fft, kiss_fft, fftw3 and pffft were taken. fftw was not tested on the ARM platform. Some tests for pffftw were skipped due to lack of support for certain values of N. pffft uses SSE/NEON vector CPU instructions.

NOTE FFTW3 results are currently wrong as its using hartly instead of real FFT transform. Updated benchmarks are pending...

Results

Values are microseconds per FFT (1,000,000 microseconds are in one second)

Power of Two values of N

Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz

Nmeowkisspffftfftw3
640.730.580.150.15
2562.323.490.441.45
5124.365.091.023.35
10248.8913.122.197.29
204823.0024.615.2716.26
409642.3060.0411.0935.49
819284.72119.3839.4984.72
16384232.20290.2282.47189.40
32768411.35562.56208.15417.32

Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz

Nmeowkisspffft
640.871.450.29
2565.526.971.16
51210.0411.932.47
102419.3932.954.81
204853.4759.1911.43
409699.23150.4024.55
8192196.56283.2472.81
16384523.36708.37150.36
32768975.281357.65337.54

ARMv7 Processor rev 10 (v7l) @ 1.00 GHz

Nmeowkisspffft
644.947.693.77
25625.8632.2612.64
51244.3754.8428.22
102484.14146.8457.45
2048255.79280.98147.82
4096497.34758.80326.23
81921025.471502.71847.75
163842822.833891.501831.14
327685434.379110.814220.76

Non Power of Two values of N

Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz

Nmeowkisspffftfftw3
1000.870.870.58
2001.741.891.16
5005.965.823.78
100010.7912.108.02
120013.7215.189.78
576077.6588.9720.7256.78
10000130.74160.23119.95

Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz

Nmeowkisspffft
1002.472.03
2004.504.36
50015.2712.95
100028.1427.26
120034.4436.04
5760194.78208.9447.24
10000341.90350.11

ARMv7 Processor rev 10 (v7l) @ 1.00 GHz

Nmeowkisspffft
10011.9110.45
20019.4719.76
50067.9358.33
1000117.51116.93
1200156.87175.98
5760923.851163.04612.37
100001677.871985.10

Accuracy

In a perfect world, doing an FFT then an inverse FFT on the same data, then scaling the result by 1/N should result in an identical buffer to the source data. However in real life, floating point errors can accumulate.

The first 32 values of the input data were compared to a scaled result buffer and then the difference multiplied by 65536 to simulate what the error would be for a 16 bit audio stream using 32 bit floating point FFT maths. The worst error of the first 32 values was recorded for each FFT tested for size N.

FFTMinMax
meow0.0160.031
kiss0.0160.029
pffft0.0120.043
fftw30.0000.000

Other FFT Implementations