Chapter 8. Parallel Fast Convolution

Abstract

This chapter describes how to create and run parallel VSIPL++ programs with Sourcery VSIPL++. You can modify the programs to develop your own parallel applications.

Table of Contents

8.1. Parallel Fast Convolution
8.2. Improving Parallel Temporal Locality
8.3. Performing I/O

This chapter explains how to use Sourcery VSIPL++ to perform parallel computations. You will see how to transform the fast convolution program from the previous chapter to run in parallel. First you will convert the Fftm based version. Then you will convert the improved cache locality version. Finally, you will learn how to handle input and output when working in parallel.

8.1. Parallel Fast Convolution

The first fast convolution program in the previous chapter makes use of two implicitly parallel operators: Fftm and vmmul. These operators are implicitly parallel in the sense that they process each row of the matrix independently. If you had enough processors, you could put each row on a separate processor and then perform the entire computation in parallel.

In the VSIPL++ API, you have explicit control of the number of processors used for a computation. Since the default is to use just a single processor, the program in Section 7.1, “Fast Convolution” will not run in parallel, even on a multi-processor system. This section will show you how to use maps to take advantage of multiple processors. Using a map tells Sourcery VSIPL++ to distribute a single block of data across multiple processors. Then, Sourcery VSIPL++ will automatically move data between processors as necessary.

The VSIPL++ API uses the Single-Program Multiple-Data (SPMD) model for parallelism. In this model, every processor runs the same program, but operates on different sets of data. For example, in the fast convolution example, multiple processors perform FFTs at the same time, but each processor handles different rows in the matrix.

Every map has both compile-time and run-time properties. At compile-time, you specify the distribution that will be applied to each dimension. In this example, you will use a block distribution to distribute the rows of the matrix. A block distribution divides a view into continguous chunks. For example, suppose that you have a 4-processor system. Since there are 64 rows in the matrix data, there will be 16 rows on each processor. The block distribution will place the first 16 rows (rows 0 through 15) on processor 0, the next 16 rows (rows 16 through 31) on processor 1, and so forth. You do not want to distribute the columns of the matrix at all, so you will use a whole distribution for the columns.

Although the distributions are selected at compile-time, the number of processors to use in each dimension is not specified until run-time. By specifying the number of processors at run-time, you can adapt your program to the configuration of the machine on which your application is running. The VSIPL++ API provides a num_processors function to tell you the total number of processors available. Of course, since each row should be kept on a single processor, the number of processors used in the column dimension is just one. So, here is the code required to create the map:

  typedef Map<Block_dist, Whole_dist>               map_type;
  map_type map = map_type(/*rows=*/num_processors(), 
                          /*columns=*/1);

Next, you have to tell Sourcery VSIPL++ to use this map for the relevant views. Every view has an underlying block. The block indicates how the view's data is stored. Until this point, you have been using the default Dense block, which stores data in a continguous array on a single processor. Now, you want to use a continguous array on multiple processors, so you must explicitly distribute the block. Then, when declaring views, you must explicitly indicate that the view should use the distributed block:

  typedef Dense<2, value_type, row2_type, map_type> block_type;
  typedef Matrix<value_type, block_type>            view_type;
  view_type data(npulse, nrange, map);
  view_type tmp(npulse, nrange, map);

Performing the vector-matrix multiply requires a complete copy of replica on each processor. An ordinary map divides data among processors, but, here, the goal is to copy the same data to multiple processors. Sourcery VSIPL++ provides a special Replicated_map class to use in this situation. So, you should declare replica as follows:

  Replicated_map<1> replica_map;
  typedef Dense<1, value_type, row1_type, Replicated_map<1> >
                                                    replica_block_type;
  typedef Vector<value_type, replica_block_type>    replica_view_type;
  replica_view_type replica(nrange, replica_map);

Because the application already uses implicitly parallel operators, no further changes are required. The entire algorithm (i.e., the part of the code that performs FFTs and vector-matrix multiplication) remains unchanged.

The complete parallel program is:

/***********************************************************************
  Included Files
***********************************************************************/

#include <vsip/initfin.hpp>
#include <vsip/support.hpp>
#include <vsip/signal.hpp>
#include <vsip/math.hpp>
#include <vsip/map.hpp>

using namespace vsip;



/***********************************************************************
  Main Program
***********************************************************************/

int
main(int argc, char** argv)
{
  // Initialize the library.
  vsipl vpp(argc, argv);

  typedef complex<float> value_type;

  typedef Map<Block_dist, Whole_dist>               map_type;
  typedef Dense<2, value_type, row2_type, map_type> block_type;
  typedef Matrix<value_type, block_type>            view_type;

  typedef Dense<1, value_type, row1_type, Replicated_map<1> >
                                                    replica_block_type;
  typedef Vector<value_type, replica_block_type>    replica_view_type;

  // Parameters.
  length_type npulse = 64;	// number of pulses
  length_type nrange = 256;	// number of range cells

  // Maps.
  map_type          map = map_type(num_processors(), 1);
  Replicated_map<1> replica_map;

  // Views.
  replica_view_type replica(nrange, replica_map);
  view_type         data(npulse, nrange, map);
  view_type         tmp (npulse, nrange, map);

  // A forward Fft for computing the frequency-domain version of
  // the replica.
  typedef Fft<const_Vector, value_type, value_type, fft_fwd, by_reference>
		for_fft_type;
  for_fft_type  for_fft (Domain<1>(nrange), 1.0);

  // A forward Fftm for converting the time-domain data matrix to the
  // frequency domain.
  typedef Fftm<value_type, value_type, row, fft_fwd, by_reference>
	  	for_fftm_type;
  for_fftm_type for_fftm(Domain<2>(npulse, nrange), 1.0);

  // An inverse Fftm for converting the frequency-domain data back to
  // the time-domain.
  typedef Fftm<value_type, value_type, row, fft_inv, by_reference>
	  	inv_fftm_type;
  inv_fftm_type inv_fftm(Domain<2>(npulse, nrange), 1.0/(nrange));

  // Initialize data to zero.
  data    = value_type();
  replica = value_type();

  // Before fast convolution, convert the replica to the the
  // frequency domain
  for_fft(replica);


  // Perform fast convolution:

  // Convert to the frequency domain.
  for_fftm(data, tmp);

  // Perform element-wise multiply for each pulse.
  tmp = vmmul<0>(replica, tmp);

  // Convert back to the time domain.
  inv_fftm(tmp, data);
}
 

The following graph shows the parallel speedup of the fast convolution program from 1 to 32 processors using a 3.0 GHz Pentium cluster system. As you can see, increasing the number of processors also increases the performance of the program.