7.2. Serial Optimization: Temporal Locality

In this section, you will learn how to improve the performance of fast convolution by improving temporal locality, i.e., by making accesses to the same memory locations occur near the same time.

The code in Section 7.1, “Fast Convolution” performs a FFT on each row of the matrix. Then, after all the rows have been processed, it multiplies each row of the matrix by the replica. Suppose that there are a large number of rows, so that data is too large to fit in cache. In that case, while the results of the first FFT will be in cache immediately after the FFT is complete, that data will likey have been purged from the cache by the time the vector-matrix multiply needs the data.

Explicitly iterating over the rows of the matrix (performing a forward FFT, elementwise multiplication, and an inverse FFT on each row before going on to the next one) will improve temporal locality. You can use this approach by using an explicit loop, rather than the implicit parallelism of Fftm and vmmul, to take better advantage of the cache.

You must make a few changes to the application in order to implement this approach. Because the application will be operating on only a single row at a time, Fftm must be replaced with the simpler Fft. Similarly, vmmul must be replaced with *, which performs element-wise multiplication of its operands. Finally, tmp can now be a vector, rather than a matrix. (As a consequence, in addition to being faster, this new version of the application will require less memory.) Here is the revised program:

  // Create the data cube.
  Matrix<value_type> data(npulse, nrange);
  Vector<value_type> tmp(nrange);            // tmp is now a vector

  // Create the pulse replica
  Vector<value_type> replica(nrange);

  // Define the FFT typedefs.
  typedef Fft<const_Vector, value_type, value_type, fft_fwd, by_reference>
		for_fft_type;
  typedef Fft<const_Vector, value_type, value_type, fft_inv, by_reference>
		inv_fft_type;

  // Create the FFT objects.
  for_fft_type  for_fft(Domain<1>(nrange), 1.0);
  inv_fft_type  inv_fft(Domain<1>(nrange), 1.0/(nrange));

  // Initialize data to zero
  data    = value_type();
  replica = value_type();

  // Before fast convolution, convert the replica into the
  // frequency domain
  for_fft(replica);

  // Perform fast convolution:
  for (index_type r=0; r < nrange; ++r)
  {
    for_fft(data.row(r), tmp);
    tmp *= replica;
    inv_fft(tmp, data.row(r));
  }

The following graph shows that the new "interleaved" formulation is faster than the original "phased" approach for large data sets. For smaller data sets (where all of the data fits in the cache anyhow), the original method is faster because performing all of the FFTs at once is faster than performing them one by one.