Table of Contents
This chapter explains how to use Sourcery VSIPL++ to perform
parallel computations. You will see how to transform the
fast convolution program from the previous chapter to run
in parallel. First you will convert the Fftm
based version. Then you will convert the improved cache
locality version. Finally, you will learn how to handle
input and output when working in parallel.
The first fast convolution program in the previous chapter makes
use of two implicitly parallel operators: Fftm and
vmmul. These operators are implicitly parallel
in the sense that they process each row of the matrix
independently. If you had enough processors, you could put each
row on a separate processor and then perform the entire
computation in parallel.
In the VSIPL++ API, you have explicit control of the number of processors used for a computation. Since the default is to use just a single processor, the program in Section 7.1, “Fast Convolution” will not run in parallel, even on a multi-processor system. This section will show you how to use maps to take advantage of multiple processors. Using a map tells Sourcery VSIPL++ to distribute a single block of data across multiple processors. Then, Sourcery VSIPL++ will automatically move data between processors as necessary.
The VSIPL++ API uses the Single-Program Multiple-Data (SPMD) model for parallelism. In this model, every processor runs the same program, but operates on different sets of data. For example, in the fast convolution example, multiple processors perform FFTs at the same time, but each processor handles different rows in the matrix.
Every map has both compile-time and run-time properties. At
compile-time, you specify the distribution
that will be applied to each dimension. In this example, you will
use a block distribution to distribute the
rows of the matrix. A block distribution divides a view into
continguous chunks. For example, suppose that you have a
4-processor system. Since there are 64 rows in the matrix
data, there will be 16 rows on each processor.
The block distribution will place the first 16 rows (rows 0 through
15) on processor 0, the next 16 rows (rows 16 through 31) on
processor 1, and so forth. You do not want to distribute the
columns of the matrix at all, so you will use a whole
distribution for the columns.
Although the distributions are selected at compile-time, the number
of processors to use in each dimension is not specified until
run-time. By specifying the number of processors at run-time, you
can adapt your program to the configuration of the machine on which
your application is running. The VSIPL++ API provides a
num_processors function to tell you the total
number of processors available. Of course, since each row should
be kept on a single processor, the number of processors used in the
column dimension is just one. So, here is the code required to
create the map:
typedef Map<Block_dist, Whole_dist> map_type;
map_type map = map_type(/*rows=*/num_processors(),
/*columns=*/1);
Next, you have to tell Sourcery VSIPL++ to use this map for the relevant
views. Every view has an underlying block.
The block indicates how the view's data is stored. Until this
point, you have been using the default Dense
block, which stores data in a continguous array on a single
processor. Now, you want to use a continguous array on
multiple processors, so you must explicitly
distribute the block. Then, when declaring views, you must
explicitly indicate that the view should use the distributed block:
typedef Dense<2, value_type, row2_type, map_type> block_type; typedef Matrix<value_type, block_type> view_type; view_type data(npulse, nrange, map); view_type tmp(npulse, nrange, map);
Performing the vector-matrix multiply requires a complete copy of
replica on each processor. An ordinary map
divides data among processors, but, here, the goal is to copy the
same data to multiple processors. Sourcery VSIPL++ provides a
special Replicated_map class to use in this
situation. So, you should declare replica as
follows:
Replicated_map<1> replica_map;
typedef Dense<1, value_type, row1_type, Replicated_map<1> >
replica_block_type;
typedef Vector<value_type, replica_block_type> replica_view_type;
replica_view_type replica(nrange, replica_map);Because the application already uses implicitly parallel operators, no further changes are required. The entire algorithm (i.e., the part of the code that performs FFTs and vector-matrix multiplication) remains unchanged.
The complete parallel program is:
/***********************************************************************
Included Files
***********************************************************************/
#include <vsip/initfin.hpp>
#include <vsip/support.hpp>
#include <vsip/signal.hpp>
#include <vsip/math.hpp>
#include <vsip/map.hpp>
using namespace vsip;
/***********************************************************************
Main Program
***********************************************************************/
int
main(int argc, char** argv)
{
// Initialize the library.
vsipl vpp(argc, argv);
typedef complex<float> value_type;
typedef Map<Block_dist, Whole_dist> map_type;
typedef Dense<2, value_type, row2_type, map_type> block_type;
typedef Matrix<value_type, block_type> view_type;
typedef Dense<1, value_type, row1_type, Replicated_map<1> >
replica_block_type;
typedef Vector<value_type, replica_block_type> replica_view_type;
// Parameters.
length_type npulse = 64; // number of pulses
length_type nrange = 256; // number of range cells
// Maps.
map_type map = map_type(num_processors(), 1);
Replicated_map<1> replica_map;
// Views.
replica_view_type replica(nrange, replica_map);
view_type data(npulse, nrange, map);
view_type tmp (npulse, nrange, map);
// A forward Fft for computing the frequency-domain version of
// the replica.
typedef Fft<const_Vector, value_type, value_type, fft_fwd, by_reference>
for_fft_type;
for_fft_type for_fft (Domain<1>(nrange), 1.0);
// A forward Fftm for converting the time-domain data matrix to the
// frequency domain.
typedef Fftm<value_type, value_type, row, fft_fwd, by_reference>
for_fftm_type;
for_fftm_type for_fftm(Domain<2>(npulse, nrange), 1.0);
// An inverse Fftm for converting the frequency-domain data back to
// the time-domain.
typedef Fftm<value_type, value_type, row, fft_inv, by_reference>
inv_fftm_type;
inv_fftm_type inv_fftm(Domain<2>(npulse, nrange), 1.0/(nrange));
// Initialize data to zero.
data = value_type();
replica = value_type();
// Before fast convolution, convert the replica to the the
// frequency domain
for_fft(replica);
// Perform fast convolution:
// Convert to the frequency domain.
for_fftm(data, tmp);
// Perform element-wise multiply for each pulse.
tmp = vmmul<0>(replica, tmp);
// Convert back to the time domain.
inv_fftm(tmp, data);
}
The following graph shows the parallel speedup of the fast convolution program from 1 to 32 processors using a 3.0 GHz Pentium cluster system. As you can see, increasing the number of processors also increases the performance of the program.
