8.2. Improving Parallel Temporal Locality

In the previous chapter, you improved the performance of the fast convolution program by exploiting temporary cache locality to process data while it was "hot" in the cache. In this section, you will convert that program to run efficiently in parallel.

If we apply maps (as in Section 8.1, “Parallel Fast Convolution”), but do not adjust the algorithm in use, the code in Section 7.2, “Serial Optimization: Temporal Locality” will not run faster when deployed on multiple processors. In particular, every processor will want to update tmp for every row. Therefore, all processors will perform the forward FFT and vector-multiply for each row of the matrix.

VSIPL++ provides local subviews to solve this problem. For a given processor and view, the local subview is that portion of the view located on the processor. You can obtain the local subview of any view by invoking its local member function:

  view_type::local_type     l_data    = data.local();)

Every view class defines a type (local_type) which is the type of a local subview. The local_type is the same kind of view as the view containing it, so, in this case, l_data is a matrix. There is virtually no overhead in creating a local subview like l_data. In particular, no data is copied; instead, l_data just refers to the local portion of data. We can now use the same cache-friendly algorithm from Section 7.2, “Serial Optimization: Temporal Locality”on the local subview:

 rep_view_type::local_type l_replica = replica.local();

 for (index_type l_r=0; l_r < l_data.size(0); ++l_r)
 {
   for_fft(l_data.row(l_r), tmp);
   tmp *= l_replica;
   inv_fft(tmp, l_data.row(l_r));
 }

Because each processor now iterates over only the rows of the matrix that are local, there is no longer any duplicated effort. Applying maps, as in Section 8.1, “Parallel Fast Convolution” above, results in the following complete program:

/***********************************************************************
  Included Files
***********************************************************************/

#include <vsip/initfin.hpp>
#include <vsip/support.hpp>
#include <vsip/signal.hpp>
#include <vsip/math.hpp>
#include <vsip/map.hpp>

using namespace vsip;



/***********************************************************************
  Main Program
***********************************************************************/

int
main(int argc, char** argv)
{
  // Initialize the library.
  vsipl vpp(argc, argv);

  typedef complex<float> value_type;

  typedef Map<Block_dist, Whole_dist>               map_type;
  typedef Dense<2, value_type, row2_type, map_type> block_type;
  typedef Matrix<value_type, block_type>            view_type;

  typedef Dense<1, value_type, row1_type, Replicated_map<1> >
                                                    replica_block_type;
  typedef Vector<value_type, replica_block_type>    replica_view_type;

  // Parameters.
  length_type npulse = 64;	// number of pulses
  length_type nrange = 256;	// number of range cells

  // Maps.
  map_type          map = map_type(num_processors(), 1);
  Replicated_map<1> replica_map;

  // Views.
  replica_view_type  replica(nrange, replica_map);
  view_type          data(npulse, nrange, map);
  Vector<value_type> tmp(nrange); 

  // A forward Fft for converting the time-domain data to the
  // frequency domain.
  typedef Fft<const_Vector, value_type, value_type, fft_fwd, by_reference>
		for_fft_type;
  for_fft_type  for_fft(Domain<1>(nrange), 1.0);

  // An inverse Fft for converting the frequency-domain data back to
  // the time-domain.
  typedef Fft<const_Vector, value_type, value_type, fft_inv, by_reference>
	  	inv_fft_type;
  inv_fft_type  inv_fft(Domain<1>(nrange), 1.0/nrange);

  // Initialize data to zero.
  data    = value_type();
  replica = value_type();

  // Before fast convolution, convert the replica into the
  // frequency domain
  for_fft(replica.local());

  view_type::local_type         l_data    = data.local();
  replica_view_type::local_type l_replica = replica.local();

  for (index_type l_r=0; l_r < l_data.size(0); ++l_r)
  {
    for_fft(l_data.row(l_r), tmp);
    tmp *= l_replica;
    inv_fft(tmp, l_data.row(l_r));
  }
}
 

8.2.1. Implicit Parallelism: Parallel Foreach

You may feel that the original formulation using implicitly parallel operators was simpler and more intuitive than the more-efficient variant using explicit loops. Sourcery VSIPL++ provides an extension to the VSIPL++ API that allows you to retain the elegance of that formulation while still obtaining good temporal locality.

In particular, Sourcery VSIPL++ provides a "parallel foreach" operator. This operator applies an arbitrary user-defined function (or an object that behaves like a function) to each of the rows or columns of a matrix. In this section, you will see how to use this approach.

First, declare a Fast_convolution template class. The template parameter T is used to indicate the value type of the fast convolution computation (such as complex<float>):

template <typename T>
 class Fast_convolution
 {

This class will perform the forward FFT and inverse FFTs on each row, so you must declare the FFTs:

  typedef Fft<const_Vector, T, T, fft_fwd, by_reference> for_fft_type;
   typedef Fft<const_Vector, T, T, fft_inv, by_reference> inv_fft_type;

   Vector<T>    replica_;
   Vector<T>    tmp_;
   for_fft_type for_fft_;
   inv_fft_type inv_fft_;

Next, define a constructor for Fast_convolution. The constructor stores a copy of the replica, and also uses the replica to determine the number of elements required for the FFTs and temporary vector.

template <typename Block>
   Fast_convolution(
     Vector<T, Block> replica)
     : replica_(replica.size()),
       tmp_    (replica.size()),
       for_fft_(Domain<1>(replica.size()), 1.0),
       inv_fft_(Domain<1>(replica.size()), 1.0/replica.size())
   {
     replica_ = replica;
   }

The most important part of the Fast_convolution class is the operator() function. This function performs a fast convolution for a single row of the matrix:

  template <typename       Block1,
	     typename       Block2,
	     dimension_type Dim>
   void operator()(
     Vector<T, Block1> in,
     Vector<T, Block2> out,
     Index<Dim>        /*idx*/)
   {
     for_fft_(in, tmp_);
     tmp_ *= replica_;
     inv_fft_(tmp_, out);
   }

The foreach_vector template will apply the new class you have just defined to the rows of the matrix:

  Fast_convolution<value_type> fconv(replica.local());
   foreach_vector<tuple<0, 1> >(fconv, data);

The resulting program contains no explicit loops, but still has good temporal locality. Here is the complete program, using the parallel foreach operator:

/***********************************************************************
  Included Files
***********************************************************************/

#include <vsip/initfin.hpp>
#include <vsip/support.hpp>
#include <vsip/signal.hpp>
#include <vsip/math.hpp>
#include <vsip/map.hpp>
#include <vsip/parallel.hpp>

using namespace vsip;



/***********************************************************************
  Main Program
***********************************************************************/

template <typename T>
class Fast_convolution
{
  typedef Fft<const_Vector, T, T, fft_fwd, by_reference> for_fft_type;
  typedef Fft<const_Vector, T, T, fft_inv, by_reference> inv_fft_type;

public:
  template <typename Block>
  Fast_convolution(
    Vector<T, Block> replica)
    : replica_(replica.size()),
      tmp_    (replica.size()),
      for_fft_(Domain<1>(replica.size()), 1.0),
      inv_fft_(Domain<1>(replica.size()), 1.0/replica.size())
  {
    replica_ = replica;
  }

  template <typename       Block1,
	    typename       Block2,
	    dimension_type Dim>
  void operator()(
    Vector<T, Block1> in,
    Vector<T, Block2> out,
    Index<Dim>        /*idx*/)
  {
    for_fft_(in, tmp_);
    tmp_ *= replica_;
    inv_fft_(tmp_, out);
  }

  // Member data.
private:
  Vector<T>    replica_;
  Vector<T>    tmp_;
  for_fft_type for_fft_;
  inv_fft_type inv_fft_;
};



int
main(int argc, char** argv)
{
  // Initialize the library.
  vsipl vpp(argc, argv);

  typedef complex<float> value_type;

  typedef Map<Block_dist, Whole_dist>               map_type;
  typedef Dense<2, value_type, row2_type, map_type> block_type;
  typedef Matrix<value_type, block_type>            view_type;

  typedef Dense<1, value_type, row1_type, Replicated_map<1> >
                                                    replica_block_type;
  typedef Vector<value_type, replica_block_type>    replica_view_type;

  // Parameters.
  length_type npulse = 64;	// number of pulses
  length_type nrange = 256;	// number of range cells

  // Maps.
  map_type          map = map_type(num_processors(), 1);
  Replicated_map<1> replica_map;

  // Views.
  replica_view_type replica(nrange, replica_map);
  view_type         data(npulse, nrange, map);
  view_type         tmp (npulse, nrange, map);

  // A forward Fft for computing the frequency-domain version of
  // the replica.
  typedef Fft<const_Vector, value_type, value_type, fft_fwd, by_reference>
		for_fft_type;
  for_fft_type  for_fft (Domain<1>(nrange), 1.0);

  Fast_convolution<value_type> fconv(replica.local());

  // Initialize data to zero.
  data    = value_type();
  replica = value_type();

  // Before fast convolution, convert the replica into the
  // frequency domain
  for_fft(replica.local());

  // Perform fast convolution.
  foreach_vector<tuple<0, 1> >(fconv, data);
}