GHC/Data Parallel Haskell/Package NDP

NDP stands for Nested Data Parallellism.

Speed with less convenience: package ndp

The following explains how to install and use the concurrent, high-performance library of strict, segmented arrays called package ndp. Here, strict means that when one element of an array is evaluated, all of them are - we also call this a parallel evaluation semantics. Moreover, segmented means that all operations come in two flavours: one for plain, flat arrays and one for segmented arrays, which are an optimised representation of arrays with one level of nesting. For example, a sum of a segmented array of numbers computes one sum for each segment. Package ndp has a purely functional interface, but internally uses monadic low-level array operations, array fusion, and a many standard GHC optimisations to produce highly optimised code. The library uses GHC's SMP concurrency to transparently parallelise array processing on hardware with multiple cores and/or CPUs.

Installation

Package ndp is only available in source form. The simplest way to build it is to first get and build a source distribution of GHC (preferably the current development version) - see the docu on how to get the sources and how to build them. Then, in the source tree, do the following

% cd libraries
% darcs get http://darcs.haskell.org/packages/ndp/
% make make.library.ndp
% cd ndp/examples
% make

If you don't have darcs, you can alternatively download the ndp-0.1 tar ball. Now, the option -package ndp is available for use with the inplace compiler (i.e., compiler/ghc-inplace). Alternatively, you can install it by invoking make install on the GHC source root and within libraries/ndp/. Then, the option -package ndp can be used in the installed compiler.

A small example

The following module implements the dot product of two vectors with package ndp:

module DotP_ndp (dotp)
where

import Data.Array.Parallel.Unlifted

dotp :: (Num a, UA a) => UArr a -> UArr a -> a
dotp xs ys = sumU (zipWithU (*) xs ys)

We can also use that in an interactive GHCi session:

Prelude> :set -package ndp
package flags have changed, ressetting and loading new packages...
Loading package ndp-1.0 ... linking ... done.
Prelude> :l /home/chak/code/haskell/DotP_ndp
[1 of 1] Compiling DotP_ndp         ( /home/chak/code/haskell/DotP_ndp.hs, interpreted )
Ok, modules loaded: DotP_ndp.
*DotP_ndp> dotp (toU [1..3]) (toU [4..6])
Loading package haskell98 ... linking ... done.
32.0
*DotP_ndp>

Most of the functions under are still purely sequential, albeit much more efficient than GHC.PArr. In addition, the (currently only few) functions from Data.Array.Parallel.Unlifted.Parallel transparently use multiple processing elements if GHC was compiled with SMP multiprocessor support.

More examples

A number of examples of using package ndp are in the examples directory; for details refer to the README file.

Writing NDP programs

You need to import Data.Array.Parallel.Unlifted (sequential combinators) and/or Data.Array.Parallel.Unlifted.Parallel (parallel combinators). The abovementioned examples should prove useful to get a feel for how these combinators are used.

Before invoking any parallel combinators you must initialise the gang threads, usually by calling

Data.Array.Parallel.Unlifted.Distributed.setGang <T>

where <T> is the number of threads. In addition, you need to pass the command line option +RTS -N<T> to the executable; otherwise, you won't get any parallelism. This is a regrettable hack which will go away eventually.

Compiling NDP programs

The file ndp.mk shows the options you'll usually want for compiling NDP programs. You don't need -fbang-patterns unless you actually use them. You might get away with omitting -fmax-simplifier-iterations6; however, sometimes GHC will really need that many iterations to fully optimise NDP programs.

NDP on NUMA machines

NUMA machines, for instance multiprocessors based on AMD Opterons, might benefit from interleaved memory allocation. Under Linux, this can be achieved by running

  numactl --interleave=all <cmd>

where <cmd> is the NDP program you want to run. Also, explicitly setting the processor affinity (taskset in Linux) might be helpful.

Implemented functionality

At the moment, the library only really supports flat arrays. Segmented arrays are provided but you must really know what you are doing to get efficient code. A lot of combinators (especially for segmented arrays) are still missing.

Haddock documentation

Haddock doesn't support various language features used by the library at the moment and can't be used to generate documentation. In principle, haddock.ghc should be able to process the library but it doesn't seem to work at the moment.