In this section, you will learn how to improve the performance of fast convolution by improving temporal locality, i.e., by making accesses to the same memory locations occur near the same time.
The code in Section 7.1, “Fast Convolution” performs a FFT on
each row of the matrix. Then, after all the rows have been
processed, it multiplies each row of the matrix by the
replica. Suppose that there are a large number
of rows, so that data is too large to fit in
cache. In that case, while the results of the first FFT will be in
cache immediately after the FFT is complete, that data will likey
have been purged from the cache by the time the vector-matrix
multiply needs the data.
Explicitly iterating over the rows of the matrix (performing a
forward FFT, elementwise multiplication, and an inverse FFT on each
row before going on to the next one) will improve temporal
locality. You can use this approach by using an explicit loop,
rather than the implicit parallelism of Fftm
and vmmul, to take better advantage of the
cache.
You must make a few changes to the application in order to
implement this approach. Because the application will be operating
on only a single row at a time, Fftm must be
replaced with the simpler Fft. Similarly,
vmmul must be replaced with
*, which performs element-wise multiplication
of its operands. Finally, tmp can now be a
vector, rather than a matrix. (As a consequence, in addition to
being faster, this new version of the application will require less
memory.) Here is the revised program:
// Create the data cube.
Matrix<value_type> data(npulse, nrange);
Vector<value_type> tmp(nrange); // tmp is now a vector
// Create the pulse replica
Vector<value_type> replica(nrange);
// Define the FFT typedefs.
typedef Fft<const_Vector, value_type, value_type, fft_fwd, by_reference>
for_fft_type;
typedef Fft<const_Vector, value_type, value_type, fft_inv, by_reference>
inv_fft_type;
// Create the FFT objects.
for_fft_type for_fft(Domain<1>(nrange), 1.0);
inv_fft_type inv_fft(Domain<1>(nrange), 1.0/(nrange));
// Initialize data to zero
data = value_type();
replica = value_type();
// Before fast convolution, convert the replica into the
// frequency domain
for_fft(replica);
// Perform fast convolution:
for (index_type r=0; r < nrange; ++r)
{
for_fft(data.row(r), tmp);
tmp *= replica;
inv_fft(tmp, data.row(r));
}The following graph shows that the new "interleaved" formulation is faster than the original "phased" approach for large data sets. For smaller data sets (where all of the data fits in the cache anyhow), the original method is faster because performing all of the FFTs at once is faster than performing them one by one.
