— shorttitle: The FFT title: The Fast Fourier Transform —
The fast Fourier transform is, in my opinion one of the most beautiful algorithms in all of computer science. The most direct and obvious application of the fast Fourier transform may is to signal and image processing. However, the efficiency of the FFT and the usefulness of Fourier analysis make the algorithm almost ubiquitous in many areas of engineering and applied mathematics.
Before we go on to discuss the FFT in detail, we will begin with the following motivating example. Suppose we have two polynomials p(x) = a0 + a1x + ⋯ + anxn and q(x) = b0 + b1x + ⋯ + bmxm. Then we know that their product is a polynomial and we can write p(x)q(x) = c0 + c1x + ⋯ + cn + mxn + m. In particular, we can do this by foiling out the two polynomials: $$p(x)q(x) = \left( \sum_{i=0}^n a_i x^i \right) \left( \sum_{j=0}^m b_j x^j \right) = \sum_{i=0}^n \sum_{j=0}^m a_i b_j x^{i + j}$$
To compute the product, we can write a fairly simple algorithm that computes the ci in terms of the ai and bi:
void polynomial_multiply(complex *c, const complex *a, size_t n,
        const complex *b, size_t m)
{
    int i, j;
    for (i = 0; i <= m + n; ++i)
        c[i] = 0;
    for (i = 0; i < n; ++i)
        for (j = 0; j < m; ++j)
            c[i + j] += a[i] * b[j];
}
By inspection, we can see that this algorithm runs in O(nm) time.
Now we know that multiplying polynomials algebraically is equivalent to multiplying them as functions pointwise. Therefore, if we pick n + m different points xi and evaluate p(x) and q(x) at each of these points, then we can use Lagrange interpolation on the points p(xi)q(xi) to get back the polynomial p(x)q(x). However, evaluating p(x) at a point is a O(n) operation and evaluating q(x) at a point is a O(m) operation so simply getting the points p(xi)q(xi) is a O((m + n)2) operation before we even perform Lagrange interpolation.
We will come back to this discussion once we have discussed the FFT and show that this operation can be done in O((m + n)log (m + n)) (or, if we assume that m = n, O(nlog n)) time.
Before we discuss the fast Fourier transform, we will briefly review of the discrete Fourier transform. If you are already familiar with the DFT, you may easily skip on to the next section and come back here only for reference.
Let z0, z1, …, zN − 1 be a finite sequence of complex numbers. For the sake of this discussion, we shall call such a sequence a signal. We define the discrete Fourier of zn by $$\hat z_k = \sum_{n=0}^{N-1} z_n e^{-\frac{2 \pi ikn}{N}}$$ for 0 ≤ k < N. In order to simplify the notation a bit, let $\omega_N = e^{\frac{2 \pi i}{N}}$ be the first primitive $N^{\rm th}$ root of unity. Then the formula above becomes $$\hat z_k = \sum_{n=0}^{N-1} z_n \omega_N^{-kn} \text{.}$$
The ẑk that we get from the above two formulas are called the Fourier coefficients of the signal (zk). In order to do anything useful with the Fourier coefficients, we need to be able to write the original signal in terms of its Fourier coefficients. Fix some 0 ≤ l < N and consider the following sum:
$$\sum_{k=0}^{N-1} \hat z_k \omega_N^{kl} = \sum_{k=0}^{N-1} \sum_{n=0}^{N-1} \hat z_n \omega_N^{kn} \omega_N^{kl} = \sum_{n=0}^{N-1} z_n \sum_{k=0}^{N-1} \omega_N^{k(l - n)}$$ In the case where l = n, $\sum_{k=0}^{N-1} \omega_n^{k(l - n)} = N$ because l − n = 0. We now consider the case where l ≠ n. Observe that if ω is any $N^{\rm th}$ root of unity, 0 = 1 − ωN = (1 − ω)(1 + ω + ω2 + ⋯ + ωN − 1). If ω ≠ 1, then we have $\sum_{n=0}^{N-1} \omega^n = 0$. In particular, (ωnl − n)N = e2πi(l − n) = 1. Since |l − n| < N and l ≠ n, ωl − n ≠ 1. Then, by what we just asserted, $\sum_{k=0}^{N-1} \omega_N^{k(l-n)} = 0$. Therefore, putting these pieces together, we have that $$\sum_{k=0}^{N-1} \omega_N^{k(l - n)} = \begin{cases} N &\text{ if } l = n \\ 0 &\text{ if } l \ne n \end{cases}$$ and $$\sum_{k=0}^{N-1} \hat z_k \omega_N^{kl} = N z_l \quad \text{or} \quad z_n = \frac{1}{N} \sum_{k=0}^{N-1} \hat z_k \omega_N^{kn}$$
This gives us a very nice formula for computing the original signal in terms of its Fourier coefficients. We can also see from the above formula one of the primary purposes of the DFT. Namely, that it allows us to write any signal as a sum of sinusoids. For this reason, the Fourier coefficients ẑk are frequently referred to as the frequency domain because they provide the coefficients for the $k^{\rm th}$ frequency in the sum.
We can directly use the definition to compute the DFT of a signal as follows:
/* Compute exp(2 PI i k / N) */
complex omega(size_t N, int k);
void dft(complex *z_hat, const complex *z, size_t N)
{
    int n, k;
    for (k = 0; k < N; ++k) {
        z_hat[k] = 0;
        for (n = 0; n < n; ++n) {
            z_hat[k] += z[n] * omega(N, -k * n);
        }
    }
}
By inspection, we can see that the running time of the above algorithm is O(N2). As we will see shortly, under the right restrictions, this can be significantly reduced.
There are many algorithms that all bare the name "fast Fourier transform". The particular version we will discuss is the radix-2 form of the Cooley-Tucky algorithm. While not always the most efficient or least restrictive, this is by far the most common version of the FFT.
Before we discuss the actual fast Fourier transform algorithm, we should say a word about the inverse DFT. Notice that the only difference between the formula for the DFT and the formula for the inverse DFT is the presence of a factor of $\frac{1}{N}$ and a minus sign on the exponent of ωN. Since the Fourier coefficients are N-periodic, this means that we can compute the inverse DFT by computing the DFT, dividing by N and reversing the sequence. Therefore, if we can find an efficient algorithm for computing the DFT we will also have an efficient algorithm for computing the inverse DFT.
The first key observation that makes the FFT work was originally made by Gauss in 1805 and then independently rediscovered by Tucky in 1965. They observed that, if N is even, you can rewrite the DFT as the sum of two smaller DFT’s as follows: $$\begin{aligned} \hat z_k = \sum_{n=0}^{N-1} z_n \omega_N^{-kn} &= \sum_{m=0}^{N/2-1} z_{2m} \omega_N^{-2kn} + \omega_N^{-k} \sum_{m=0}^{N/2-1} z_{2m} \omega_N^{-2kn} \\ &= \underbrace{ \sum_{m=0}^{N/2-1} z_{2m} \omega_{N/2}^{-kn} }_\text{DFT of even n} + \omega_N^{-k} \underbrace{ \sum_{m=0}^{N/2-1} z_{2m} \omega_{N/2}^{-kn} }_\text{DFT of odd n} \end{aligned}$$
Applying the above directly is not significantly faster than computing the DFT from the original formula. It may save a few multiplication operations but it is still O(N) per coefficient and O(N2) overall. Achieving better asymptotic running time requires a second key observation.
Let em = z2m be the even subsequence of zn and om = z2m + 1 be the odd subsequence of zn. What we have shown above is that ẑk = êk + ωN−kôk. Since the sequences en and on are of length $\frac{N}{2}$, their Fourier coefficients are $\frac{N}{2}$-periodic. (This can be easily seen by directly applying the definition to get ôk + N/2 = ôk.) Also, ωNN/2 = eπi = −1 so ωN−k = −ωn−k + N/2. Therefore, putting these pieces together, we have: $$\hat z_k = \begin{cases} \hat e_k + \omega_N^{-k} \hat o_k &\text{ if } 0 \le k < N/2 \\ \hat e_{k - N/2} - \omega_N^{-k + N/2} \hat o_{k - N/2} &\text{ if } N/2 \le k < N \end{cases}$$
What is important about this observation is that it means we only need the first N/2 Fourier coefficients of the subsequences em and om in order to compute all of the Fourier coefficients of zn. If we assume that N is a power of 2, we can use this to create a recursive algorithm for computing the discrete Fourier transform. In the base case where N = 1, we have ẑ0 = z0. For larger N, you first compute the Fourier coefficients for om = z2m and em = z2m + 1 and then compute the Fourier coefficients of zn using the piecewise-defined formula above.
Assuming the existence of a complex number data type, we can implement the algorithm as follows:
/* Compute exp(2 PI i k / N) */
complex omega(size_t N, int k);
void fft(complex *z_hat, const complex *z, size_t N, size_t stride)
{
    int k;
    complex ek_hat, ok_hat;
    if (N == 1) {
        *z_hat = *z;
    } else {
        fft(z_hat, z, N / 2, 2 * stride);
        fft(z_hat + (N / 2), z + stride, N / 2, 2 * stride);
        for (k = 0; k < (N / 2); ++k) {
            ek_hat = z_hat[k];
            ok_hat = z_hat[k + (N / 2)];
            z_hat[k] = ek_hat + omega(N, -k) * ok_hat;
            z_hat[k + (N / 2)] = ek_hat - omega(N, -k) * ok_hat;
        }
    }
}
The implementation above deserves a little more discussion. First,
observe the stride argument. This argument tells the
fft function that instead of taking every element of the
array, it should skip stride many elements each time. For
example, for a stride of 2, it would only consider every other element
instead of every element. Symbolically, this means that if we have a
stride of s and zn is our input
sequence, we are taking the subsequence zsm.
This allows us to easily specify simple subsequences. In particular, it
is used to sort out the even and odd subsequences.
The first thing the above implementation does is to check the base case:
if (N == 1) {
    *z_hat = *z;
}
As stated above, the base-case for this algorithm is to simply set ẑ0 = z0. The next step is to make the recursive call:
fft(z_hat, z, N / 2, 2 * stride);
fft(z_hat + (N / 2), z + stride, N / 2, 2 * stride);
We will ignore the first parameter for the moment and focus on the
last three. The first call computes the DFT of the even sequence. If our
input sequence is zsm
(where s is the stride) and
has length N, then the even
sequence is given by em = z2sm
and has length N/2. The second
call computes the DFT of the odd sequence. The odd sequence is given by
om = zs(2m + 1) = z2sm + s
and has length N/2. In order
to take the sequence z2sm + s
instead of z2sm we
pass in z + stride as our sequence. Since
z + strice points to zs instead of
z0, the recursive
call will take zs as the start
of the sequence. Then simply specifying a stride of 2s is sufficient for it to pick all
of the odd elements.
Now consider the first parameter to each of our recursive calls. In
the first, we are telling the recursive call to fft to
store the Fourier coefficients of the even subsequence in our output
array. However, the recursive call will only produce N/2 Fourier coefficients, so half of
the array is empty. We then tell the second recursive call to
fft to store the Fourier coefficients of the odd
subsequence in the second half of the output array. A natural question
to ask would be, “If we put the output from the recursive call in the
output array, where do we put the final Fourier coefficients?” As we
will see shortly, we can do the rest of the computation in-place. This
allows us to save significantly on memory usage.
The final step of the recursive algorithm is computing the Fourier coefficients of zn from those of z2m and z2m + 1:
for (k = 0; k < (N / 2); ++k) {
    ek_hat = z_hat[k];
    ok_hat = z_hat[k + (N / 2)];
    z_hat[k] = ek_hat + omega(N, -k) * ok_hat;
    z_hat[k + (N / 2)] = ek_hat - omega(N, -k) * ok_hat;
}
The first thing we do in our loop is to save off the kth Fourier coefficients
of em and
om.
Because we have already computed the fourier coefficients of em and om and put them
in the first and second halves (respectively) of z_hat, we
can find êk at
z_hat[k] and ôk at
z_hat[k + (N / 2)]. The second thing we do in our loop is
to compute ẑk and ẑk + N/2.
We know from above that, if 0 ≤ k < N/2, ẑk = êk + ωn−kôk.
Also, from our piecewise definition, replacing k with k + N/2, we have that ẑk + N/2 = êk − ωn−kôk.
Finally observe that, in all our calculations inside the loop, we only
ever use the k and k + N/2 entries of
z_hat. Therefore, since 0 ≤ k < N/2, each
iteration of the loop uses different elements of z_hat and
the iterations are therefore independant. In particular, this means that
there is no problem in storing êk and ôk in the first
and second halves of z_hat.
Consider the implementation above. In all but the base case (which is constant time) there are really two things that happen. First is a pair of recursive calls and second is a loop. If we ignore saving off the êk and ôk (a safe assumption given compiler optimization), then the only thing that happens in the loop the calculation êk ± ωN−kôk is worked twice. Since we loop N/2 times, the time consumed in the loop is the time required to work the calculation êk ± ωN−kôk N-times.
Now consider the recursive call. The recursive call is made twice, but each time it is made on a sequence of length N/2 so the cost of the recursive call is N times the cost of the calculation êk ± ωN−kôk plus the cost of the next level of recursive calls. Therefore, the cost of the entire computation is dN times the cost of êk ± ωN−kôk where d is the depth of the recursion. Since we divide the sequence in half each time, the depth will be log2(N) so this gives us a total running time of O(Nlog N).
While the implementation provided above works, is not the best possible implementation. There are a number if simple optimizations that you will commonly see done when implementing a fast Fourier transform. First is that we can precompute all of the ωNk. Computing all of the ωNk needed for a given transform is a O(N) operation because we need only compute the Nth roots of unity for the largest N (at each of the recursive steps, a factor of N is used). While this consumes more memory, it can significantly increase performance because the power series expansion required to compute a root of unity is far more costly than the simple multiplication and addition operations required for the rest of the FFT algorithm.
Along these same lines, many high-performance FFT implementations require a two-stage process. First, you specify the size of the input sequence and it computes all of the required roots of unity. After that, you give it a sequence of the given size and the actual FFT computation is performed. This way you can work the same FFT operation on many different sequences without ever recomputing the roots of unity. This technique is essential to get the performance required for any sort of real-time signal processing.
This variant of the fast Fourier transform is called radix-2 because we split each DFT into two smaller DFT operations. If you go back and look at the initial observation made by Gauss, it is easy to generalize. If d is any number that divides N then you can split the DFT of a sequence of length N into d DFTs on sequences of length N/d. The algorithm for other radixes is basically the same as the radix-2 algorithm except that more recursive calls are made at each step and the formula for computing the Fourier coefficients from the Fourier coefficients of the subsequences is more complicated. Sometimes other radixes are used either for performance reasons on certain hardware or to allow for signals of a length other than a power of two.
We will finally return to the example we gave at the beginning. Let p(x) and q(x) be polynomials and let N be a power of two so that N ≥ deg p(x) + deg q(x) + 1. We can write p(x) = a0 + a1x + ⋯ + aN − 1xN − 1 and q(x) = b0 + b1x + ⋯bN − 1xN − 1 by letting some of the an and bn be zero if needed. Then, for all 0 ≤ k < N, $$\hat a_k = \sum_{n=0}^N-1 a_n \omega_N^{-kn} = \sum_{n=0}^N-1 a_n (\omega_N^{-k})^n = p(\omega_N^{-k})$$ and by a similar calculation, b̂k = q(ωN−k).
If pq(x) denotes the product of the polynomials p(x) and q(x) then we know that pq(ωN−k) = p(ωN−k)q(ωN−k) = âkb̂k. If we write pq(x) = c0 + c1x + ⋯cN − 1xN − 1 (we can do this because we chose N ≥ deg p(x) + deg q(x) + 1) then, repeating the above computation, ĉk = pq(ωN−k) = âkb̂k. Therefore, we can get the cn by applying the inverse discrete Fourier transform: $$c_n = \frac{1}{N} \sum_{k=0}^{N-1} \hat c_k \omega_N^{kn} = \frac{1}{N} \sum_{k=0}^{N-1} \hat a_k \hat b_k \omega_N^{kn}$$
How does this help us computationally? We can compute the discrete Fourier transform of an and bn in O(Nlog N) time using the FFT. We can then compute ĉk = âkb̂k in linear time. Finally, we can get the cn by computing the inverse Fourier transform which is, again, a O(Nlog N) operation. This means that, all told, we can multiply two polynomials in O(Nlog N) time which is significantly better than O(N2) which was our original running time. On small polynomials the overhead of doing 3 Fourier transforms will make the original naive algorithm faster, but on large polynomials, the better asymptotic running time will yield better performance.