In the previous chapter, you improved the performance of the fast convolution program by exploiting temporary cache locality to process data while it was "hot" in the cache. In this section, you will convert that program to run efficiently in parallel.
If we apply maps (as in Section 8.1, “Parallel Fast Convolution”),
but do not adjust the algorithm in use, the code in Section 7.2, “Serial Optimization: Temporal Locality” will not run
faster when deployed on multiple processors. In particular, every
processor will want to update tmp for every row.
Therefore, all processors will perform the forward FFT and
vector-multiply for each row of the matrix.
VSIPL++ provides local subviews to solve
this problem. For a given processor and view, the local subview
is that portion of the view located on the processor. You can
obtain the local subview of any view by invoking its
local member function:
view_type::local_type l_data = data.local();)
Every view class defines a type
(local_type) which is the type of a local
subview. The local_type is the same kind of
view as the view containing it, so, in this case,
l_data is a matrix. There is virtually no
overhead in creating a local subview like
l_data. In particular, no data is copied;
instead, l_data just refers to the local portion of
data. We can now use the same cache-friendly
algorithm from Section 7.2, “Serial Optimization: Temporal Locality”on the
local subview:
rep_view_type::local_type l_replica = replica.local();
for (index_type l_r=0; l_r < l_data.size(0); ++l_r)
{
for_fft(l_data.row(l_r), tmp);
tmp *= l_replica;
inv_fft(tmp, l_data.row(l_r));
}Because each processor now iterates over only the rows of the matrix that are local, there is no longer any duplicated effort. Applying maps, as in Section 8.1, “Parallel Fast Convolution” above, results in the following complete program:
/***********************************************************************
Included Files
***********************************************************************/
#include <vsip/initfin.hpp>
#include <vsip/support.hpp>
#include <vsip/signal.hpp>
#include <vsip/math.hpp>
#include <vsip/map.hpp>
using namespace vsip;
/***********************************************************************
Main Program
***********************************************************************/
int
main(int argc, char** argv)
{
// Initialize the library.
vsipl vpp(argc, argv);
typedef complex<float> value_type;
typedef Map<Block_dist, Whole_dist> map_type;
typedef Dense<2, value_type, row2_type, map_type> block_type;
typedef Matrix<value_type, block_type> view_type;
typedef Dense<1, value_type, row1_type, Replicated_map<1> >
replica_block_type;
typedef Vector<value_type, replica_block_type> replica_view_type;
// Parameters.
length_type npulse = 64; // number of pulses
length_type nrange = 256; // number of range cells
// Maps.
map_type map = map_type(num_processors(), 1);
Replicated_map<1> replica_map;
// Views.
replica_view_type replica(nrange, replica_map);
view_type data(npulse, nrange, map);
Vector<value_type> tmp(nrange);
// A forward Fft for converting the time-domain data to the
// frequency domain.
typedef Fft<const_Vector, value_type, value_type, fft_fwd, by_reference>
for_fft_type;
for_fft_type for_fft(Domain<1>(nrange), 1.0);
// An inverse Fft for converting the frequency-domain data back to
// the time-domain.
typedef Fft<const_Vector, value_type, value_type, fft_inv, by_reference>
inv_fft_type;
inv_fft_type inv_fft(Domain<1>(nrange), 1.0/nrange);
// Initialize data to zero.
data = value_type();
replica = value_type();
// Before fast convolution, convert the replica into the
// frequency domain
for_fft(replica.local());
view_type::local_type l_data = data.local();
replica_view_type::local_type l_replica = replica.local();
for (index_type l_r=0; l_r < l_data.size(0); ++l_r)
{
for_fft(l_data.row(l_r), tmp);
tmp *= l_replica;
inv_fft(tmp, l_data.row(l_r));
}
}
You may feel that the original formulation using implicitly parallel operators was simpler and more intuitive than the more-efficient variant using explicit loops. Sourcery VSIPL++ provides an extension to the VSIPL++ API that allows you to retain the elegance of that formulation while still obtaining good temporal locality.
In particular, Sourcery VSIPL++ provides a "parallel foreach" operator. This operator applies an arbitrary user-defined function (or an object that behaves like a function) to each of the rows or columns of a matrix. In this section, you will see how to use this approach.
First, declare a Fast_convolution template class. The
template parameter T is used to indicate the value
type of the fast convolution computation (such as
complex<float>):
template <typename T>
class Fast_convolution
{This class will perform the forward FFT and inverse FFTs on each row, so you must declare the FFTs:
typedef Fft<const_Vector, T, T, fft_fwd, by_reference> for_fft_type; typedef Fft<const_Vector, T, T, fft_inv, by_reference> inv_fft_type; Vector<T> replica_; Vector<T> tmp_; for_fft_type for_fft_; inv_fft_type inv_fft_;
Next, define a constructor for
Fast_convolution. The constructor stores a
copy of the replica, and also uses the replica to determine the
number of elements required for the FFTs and temporary vector.
template <typename Block>
Fast_convolution(
Vector<T, Block> replica)
: replica_(replica.size()),
tmp_ (replica.size()),
for_fft_(Domain<1>(replica.size()), 1.0),
inv_fft_(Domain<1>(replica.size()), 1.0/replica.size())
{
replica_ = replica;
}
The most important part of the
Fast_convolution class is the
operator() function. This function performs a fast
convolution for a single row of the matrix:
template <typename Block1,
typename Block2,
dimension_type Dim>
void operator()(
Vector<T, Block1> in,
Vector<T, Block2> out,
Index<Dim> /*idx*/)
{
for_fft_(in, tmp_);
tmp_ *= replica_;
inv_fft_(tmp_, out);
}
The foreach_vector template will apply the new
class you have just defined to the rows of the matrix:
Fast_convolution<value_type> fconv(replica.local()); foreach_vector<tuple<0, 1> >(fconv, data);
The resulting program contains no explicit loops, but still has good temporal locality. Here is the complete program, using the parallel foreach operator:
/***********************************************************************
Included Files
***********************************************************************/
#include <vsip/initfin.hpp>
#include <vsip/support.hpp>
#include <vsip/signal.hpp>
#include <vsip/math.hpp>
#include <vsip/map.hpp>
#include <vsip/parallel.hpp>
using namespace vsip;
/***********************************************************************
Main Program
***********************************************************************/
template <typename T>
class Fast_convolution
{
typedef Fft<const_Vector, T, T, fft_fwd, by_reference> for_fft_type;
typedef Fft<const_Vector, T, T, fft_inv, by_reference> inv_fft_type;
public:
template <typename Block>
Fast_convolution(
Vector<T, Block> replica)
: replica_(replica.size()),
tmp_ (replica.size()),
for_fft_(Domain<1>(replica.size()), 1.0),
inv_fft_(Domain<1>(replica.size()), 1.0/replica.size())
{
replica_ = replica;
}
template <typename Block1,
typename Block2,
dimension_type Dim>
void operator()(
Vector<T, Block1> in,
Vector<T, Block2> out,
Index<Dim> /*idx*/)
{
for_fft_(in, tmp_);
tmp_ *= replica_;
inv_fft_(tmp_, out);
}
// Member data.
private:
Vector<T> replica_;
Vector<T> tmp_;
for_fft_type for_fft_;
inv_fft_type inv_fft_;
};
int
main(int argc, char** argv)
{
// Initialize the library.
vsipl vpp(argc, argv);
typedef complex<float> value_type;
typedef Map<Block_dist, Whole_dist> map_type;
typedef Dense<2, value_type, row2_type, map_type> block_type;
typedef Matrix<value_type, block_type> view_type;
typedef Dense<1, value_type, row1_type, Replicated_map<1> >
replica_block_type;
typedef Vector<value_type, replica_block_type> replica_view_type;
// Parameters.
length_type npulse = 64; // number of pulses
length_type nrange = 256; // number of range cells
// Maps.
map_type map = map_type(num_processors(), 1);
Replicated_map<1> replica_map;
// Views.
replica_view_type replica(nrange, replica_map);
view_type data(npulse, nrange, map);
view_type tmp (npulse, nrange, map);
// A forward Fft for computing the frequency-domain version of
// the replica.
typedef Fft<const_Vector, value_type, value_type, fft_fwd, by_reference>
for_fft_type;
for_fft_type for_fft (Domain<1>(nrange), 1.0);
Fast_convolution<value_type> fconv(replica.local());
// Initialize data to zero.
data = value_type();
replica = value_type();
// Before fast convolution, convert the replica into the
// frequency domain
for_fft(replica.local());
// Perform fast convolution.
foreach_vector<tuple<0, 1> >(fconv, data);
}