https://wiki.haskell.org/api.php?action=feedcontributions&user=Chak&feedformat=atomHaskellWiki - User contributions [en]2015-05-29T08:48:01ZUser contributionsMediaWiki 1.19.14+dfsg-1https://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2012-02-10T01:34:13Z<p>Chak: /* Overview */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus on utilising multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell].<br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
Data Parallel Haskell (DPH) is available as an add-on for [http://haskell.org/ghc/download_ghc_7_4_1 GHC 7.4] in the form of a few separate cabal packages. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude for vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed or no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code.<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://repa.ouroborus.net/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Note:''' This page describes version 0.6.* of the DPH libraries. We only support this version of DPH as well as the current development version.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, install [http://haskell.org/ghc/download_ghc_7_4_1 GHC 7.4] and then install the DPH libraries with <code>cabal install</code> as follows:<br />
<blockquote><br />
<code>$ cabal update</code><br><br />
<code>$ cabal install dph-examples</code><br />
</blockquote><br />
This will install all DPH packages, including a set of simple examples; see [http://hackage.haskell.org/package/dph-examples dph-examples]. (The package [http://hackage.haskell.org/package/dph-examples dph-examples] depends on OpenGL and Gloss, as both are used in a visualiser for the n-body example.)<br />
<br />
'''WARNING:''' The vanilla GHC distribution does '''not''' include <code>cabal install</code>. This is in contrast to the Haskell Platform, which does include <code>cabal install</code>. If you want to avoid installing the <code>cabal-install</code> package and its dependencies explicitly, simply install GHC 7.4.1 in addition to your current Haskell Platform installation. (How to do that depends on your platform and personal preferences. One option is to install a bindist into your home directory with symbolic links to the binaries including the version number.) Then, install DPH with the following command:<br />
<blockquote><br />
<code>cabal install --with-compiler=`which ghc-7.4.1` --with-hc-pkg=`which ghc-pkg-7.4.1` dph-examples</code><br />
</blockquote><br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, only that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by using a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
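<br />
To illustrate why an undirected fold needs an associative function, the following sketch models a tree-shaped reduction sequentially on ordinary lists. It is not DPH code: <hask>pairUp</hask> and <hask>foldTree</hask> are hypothetical names chosen for this illustration, and the actual evaluation order used by <hask>foldP</hask> is not specified here. Each pass combines neighbouring elements, so about log ''n'' passes are needed, and a parallel implementation could perform all combinations within one pass at the same time.<br />
<haskell><br />
module Main where<br />
<br />
-- Combine neighbouring elements pairwise; each pass roughly halves the length.<br />
pairUp :: (a -> a -> a) -> [a] -> [a]<br />
pairUp f (x : y : rest) = f x y : pairUp f rest<br />
pairUp _ xs             = xs<br />
<br />
-- Repeat the pairwise pass until a single element is left.<br />
-- The neutral element z is only returned for the empty list.<br />
foldTree :: (a -> a -> a) -> a -> [a] -> a<br />
foldTree _ z []  = z<br />
foldTree _ _ [x] = x<br />
foldTree f z xs  = foldTree f z (pairUp f xs)<br />
<br />
main :: IO ()<br />
main = do<br />
  let xs = [1 .. 10] :: [Int]<br />
  -- (+) is associative, so the grouping does not matter.<br />
  print (foldTree (+) 0 xs == foldr (+) 0 xs)<br />
  -- (-) is not associative, so the two groupings can disagree.<br />
  print (foldTree (-) 0 xs == foldr (-) 0 xs)<br />
</haskell><br />
With an associative function the tree-shaped and the right-nested grouping agree; with a non-associative function such as subtraction they need not, which is why such functions are unsuitable arguments for an undirected fold.<br />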
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
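<br />
For readers who have not used parallel list comprehensions before, the same two formulations can be tried out on ordinary lists with a standard GHC. This is only a sequential sketch of the idea, not DPH code; the names <hask>dotpComp</hask> and <hask>dotpZip</hask> are chosen for this illustration.<br />
<haskell><br />
{-# LANGUAGE ParallelListComp #-}<br />
module Main where<br />
<br />
-- Dot product with a parallel list comprehension:<br />
-- both generators advance in lockstep.<br />
dotpComp :: [Double] -> [Double] -> Double<br />
dotpComp xs ys = sum [x * y | x <- xs | y <- ys]<br />
<br />
-- The same computation with an explicit zip.<br />
dotpZip :: [Double] -> [Double] -> Double<br />
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]<br />
<br />
main :: IO ()<br />
main = do<br />
  let xs = [1, 2, 3]<br />
      ys = [4, 5, 6]<br />
  print (dotpComp xs ys)<br />
  print (dotpComp xs ys == dotpZip xs ys)<br />
</haskell><br />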
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
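<br />
The idea behind turning nested into flat data parallelism can be sketched on ordinary lists: a nested structure is stored as one flat sequence of elements plus the lengths of its segments, and operations on the inner structures become a single pass over the flat data. The following is a simplified model under that assumption, not GHC's actual vectorised representation; <hask>Segmented</hask>, <hask>flatten</hask>, and <hask>sumSegments</hask> are hypothetical names.<br />
<haskell><br />
module Main where<br />
<br />
-- A nested array stored flat: segment lengths plus all elements in one list.<br />
data Segmented a = Segmented [Int] [a]<br />
  deriving Show<br />
<br />
-- Record the length of every inner list and concatenate the elements.<br />
flatten :: [[a]] -> Segmented a<br />
flatten xss = Segmented (map length xss) (concat xss)<br />
<br />
-- Sum every segment by walking the flat data once.<br />
sumSegments :: Num a => Segmented a -> [a]<br />
sumSegments (Segmented lens xs) = go lens xs<br />
  where<br />
    go []       _    = []<br />
    go (n : ns) rest = let (seg, rest') = splitAt n rest<br />
                       in  sum seg : go ns rest'<br />
<br />
main :: IO ()<br />
main = do<br />
  let nested = [[1, 2], [3], [], [4, 5, 6]] :: [[Int]]<br />
  print (flatten nested)<br />
  -- The flat version agrees with summing each inner list separately.<br />
  print (sumSegments (flatten nested) == map sum nested)<br />
</haskell><br />
Because the flat data can be split evenly regardless of how irregular the segment lengths are, a representation along these lines is what makes load balancing simpler.<br />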
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules that each support one numeric type: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS_GHC -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code and <code>-fdph-par</code> selects the standard parallel DPH backend library. (This is currently the only relevant backend, but there may be others in the future.)<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel<br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = let v      = fromList [1..10]    -- convert lists...<br />
        w      = fromList [1,2..20]  -- ...to parallel arrays<br />
        result = dotp_wrapper v w    -- invoke vectorised code<br />
    in<br />
    print result                     -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded -fdph-par -rtsopts DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime and <code>-fdph-par</code> to link with the standard parallel DPH backend. We include <code>-rtsopts</code> to be able to explicitly determine the number of OS threads used to execute our code.<br />
<br />
==== Generating input data ====<br />
<br />
To see any benefit from parallel execution, a data-parallel program needs to operate on a sufficiently large data set. Hence, instead of two small constant vectors, we might want to generate some larger input data:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile and link the program as described above.<br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose. For a variant of the dot-product example code that determines the CPU time consumed by <hask>dotp_wrapper</hask>, see [[GHC/Data Parallel Haskell/MainTimed|timed dot product]].<br />
<br />
==== Parallel execution ====<br />
<br />
By default, a Haskell program uses only one OS thread, and hence, also only one CPU core for execution. To use multiple cores, we need to invoke the executable with an explicit RTS command line option — e.g., <code>./dotp +RTS -N2</code> uses two cores. (Strictly speaking, it uses two OS threads, which will be scheduled on two separate cores if available.) To determine the runtime of parallel code, measuring CPU time, as demonstrated in the [[GHC/Data Parallel Haskell/MainTimed|timed variant of the dot product example]], is not sufficient anymore. We need to measure wall clock times instead. For proper benchmarking, it is advisable to use a library, such as [http://hackage.haskell.org/package/criterion criterion].<br />
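<br />
As a rough illustration of measuring wall clock time by hand, the following sketch times a stand-in computation. It assumes the <code>time</code> package that ships with GHC; <hask>workload</hask> is a hypothetical placeholder for the call to <hask>dotp_wrapper</hask>, and for serious measurements a benchmarking library remains the better choice.<br />
<haskell><br />
module Main where<br />
<br />
import Control.Exception (evaluate)<br />
import Data.Time.Clock   (diffUTCTime, getCurrentTime)<br />
import GHC.Conc          (getNumCapabilities)<br />
<br />
-- Stand-in workload; in the real program this would be the vectorised call.<br />
workload :: Int -> Double<br />
workload n = sum [fromIntegral i * 0.5 | i <- [1 .. n]]<br />
<br />
main :: IO ()<br />
main = do<br />
  caps   <- getNumCapabilities           -- value selected with +RTS -N<br />
  start  <- getCurrentTime<br />
  result <- evaluate (workload 1000000)  -- force the result inside the timed region<br />
  end    <- getCurrentTime<br />
  putStrLn ("capabilities: " ++ show caps)<br />
  putStrLn ("result:       " ++ show result)<br />
  putStrLn ("wall clock:   " ++ show (diffUTCTime end start))<br />
</haskell><br />
The call to <hask>evaluate</hask> matters: without it, laziness could defer the computation until the result is printed, that is, until after the second time stamp has been taken.<br />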
<br />
A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with just one core and to move to multiple cores only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is often bad as GHC's runtime makes no effort at optimising placement.<br />
<br />
=== Further examples and documentation ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/packages/dph/dph-examples/ examples directory of the DPH package source]. This code also includes reference implementations for some of the examples, which are useful for benchmarking.<br />
<br />
The interfaces of the various components of the DPH library are in the [http://hackage.haskell.org/package/dph-par library documentation] on Hackage.<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as when does a computation have sufficient parallelism to make it worthwhile to exploit that parallelism) are still with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
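<br />
The shape of such traversals can be tried out on a list-based stand-in for the tree. The sketch below replaces the parallel array by an ordinary list, so it runs sequentially; in DPH, the uses of <hask>map</hask> and <hask>sum</hask> over the children would correspond to array operations such as <hask>mapP</hask> and <hask>sumP</hask>. The functions <hask>size</hask> and <hask>depth</hask> are hypothetical examples, not part of any DPH library.<br />
<haskell><br />
module Main where<br />
<br />
-- List-based stand-in for the parallel rose tree from the text.<br />
data RTree a = RNode [RTree a]<br />
<br />
-- Number of nodes: the recursion descends level by level,<br />
-- while the children of one node are processed as a group.<br />
size :: RTree a -> Int<br />
size (RNode ts) = 1 + sum (map size ts)<br />
<br />
-- Height of the tree; a node without children has depth 1.<br />
depth :: RTree a -> Int<br />
depth (RNode ts) = 1 + maximum (0 : map depth ts)<br />
<br />
main :: IO ()<br />
main = do<br />
  let leaf = RNode []<br />
      tree = RNode [RNode [leaf, leaf], leaf] :: RTree ()<br />
  print (size tree)<br />
  print (depth tree)<br />
</haskell><br />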
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://www.cse.unsw.edu.au/~benl/ Ben Lippmeier]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2012-02-06T03:57:58Z<p>Chak: /* Project status */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus to utilise multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
Data Parallel Haskell (DPH) is available as an add-on for [http://haskell.org/ghc/download_ghc_7_4_1 GHC 7.4] in the form of a few separate cabal package. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude for vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed or no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code.<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://repa.ouroborus.net/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Note:''' This page describes version 0.6.* of the DPH libraries. We only support this version of DPH as well as the current development version.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, install [http://haskell.org/ghc/download_ghc_7_4_1 GHC 7.4] and then install the DPH libraries with <code>cabal install</code> as follows:<br />
<blockquote><br />
<code>$ cabal update</code><br><br />
<code>$ cabal install dph-examples</code><br />
</blockquote><br />
This will install all DPH packages, including a set of simple examples, see [http://hackage.haskell.org/package/dph-examples dph-examples]. (The package [http://hackage.haskell.org/package/dph-examples dph-examples] does depend on OpenGL and Gloss as both are used in a visualiser for the nobody example.)<br />
<br />
'''WARNING:''' The vanilla GHC distribution does '''not''' include <code>cabal install</code>. This is in contrast to the Haskell Platform, which does include <code>cabal install</code>. If you want to avoid installing the <code>cabal-intstall</code> package and its dependencies explicitly, simply install GHC 7.4.1 in addition to your current Haskell Platform installation. (How to do that depends on your platform and personal preferences. One option is to install a bindist into your home directory with symbolic links to the binaries including the version number.) Then, install DPH with the following command:<br />
<blockquote><br />
<code>cabal install --with-compiler=`which ghc-7.4.1` --with-hc-pkg=`which ghc-pkg-7.4.1` dph-examples</code><br />
</blockquote><br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, only that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analog manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus three extra modules to support one numeric type each [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
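If a module needs two numeric types at once, the type-specific modules can be imported qualified, as described above. The following is an untested sketch: it assumes that both <hask>Data.Array.Parallel.Prelude.Double</hask> and <hask>Data.Array.Parallel.Prelude.Int</hask> export their type name, <hask>(*)</hask>, and <hask>sumP</hask>. The module name <hask>DotP2</hask>, the aliases <hask>D</hask> and <hask>I</hask>, and the functions <hask>dotp_int</hask> and <hask>dotp_int_wrapper</hask> are names chosen for illustration only.<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS_GHC -fvectorise #-}<br />
<br />
module DotP2 (dotp_int_wrapper)   -- hypothetical module name<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel<br />
import Data.Array.Parallel.Prelude<br />
import qualified Data.Array.Parallel.Prelude.Double as D   -- assumed exports: Double, (*), sumP<br />
import qualified Data.Array.Parallel.Prelude.Int    as I   -- assumed exports: Int, (*), sumP<br />
<br />
-- the dot product on doubles, with qualified operations<br />
dotp_double :: [: D.Double :] -> [: D.Double :] -> D.Double<br />
dotp_double xs ys = D.sumP [:x D.* y | x <- xs | y <- ys:]<br />
<br />
-- hypothetical Int counterpart; it needs its own definition because<br />
-- overloading is not available in vectorised code<br />
dotp_int :: [: I.Int :] -> [: I.Int :] -> I.Int<br />
dotp_int xs ys = I.sumP [:x I.* y | x <- xs | y <- ys:]<br />
<br />
-- wrapper in the style of dotp_wrapper<br />
dotp_int_wrapper :: PArray I.Int -> PArray I.Int -> I.Int<br />
{-# NOINLINE dotp_int_wrapper #-}<br />
dotp_int_wrapper v w = dotp_int (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />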
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS_GHC -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code and <code>-fdph-par</code> selects the standard parallel DPH backend library. (This is currently the only relevant backend, but there may be others in the future.)<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel<br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = let v      = fromList [1..10]      -- convert lists...<br />
        w      = fromList [2,4..20]    -- ...to parallel arrays of equal length<br />
        result = dotp_wrapper v w      -- invoke vectorised code<br />
    in<br />
    print result                       -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded -fdph-par -rtsopts DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime and <code>-fdph-par</code> to link with the standard parallel DPH backend. We include <code>-rtsopts</code> to be able to explicitly determine the number of OS threads used to execute our code.<br />
<br />
==== Generating input data ====<br />
<br />
To see any benefit from parallel execution, a data-parallel program needs to operate on a sufficiently large data set. Hence, instead of two small constant vectors, we might want to generate some larger input data:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile and link the program as described above.<br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose. For a variant of the dot-product example code that determines the CPU time consumed by <hask>dotp_wrapper</hask>, see [[GHC/Data Parallel Haskell/MainTimed|timed dot product]].<br />
<br />
==== Parallel execution ====<br />
<br />
By default, a Haskell program uses only one OS thread, and hence, also only one CPU core for execution. To use multiple cores, we need to invoke the executable with an explicit RTS command line option — e.g., <code>./dotp +RTS -N2</code> uses two cores. (Strictly speaking, it uses two OS threads, which will be scheduled on two separate cores if available.) To determine the runtime of parallel code, measuring CPU time, as demonstrated in the [[GHC/Data Parallel Haskell/MainTimed|timed variant of the dot product example]], is not sufficient anymore. We need to measure wall clock times instead. For proper benchmarking, it is advisable to use a library, such as [http://hackage.haskell.org/package/criterion criterion].<br />
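<br />
For a quick measurement without a benchmarking library, wall-clock time can be taken with the <code>time</code> package. The following is a minimal sketch in plain Haskell, not part of DPH: the helper name <hask>timeWallClock</hask> is made up for this example, and the summation is only a stand-in workload. In the dot-product program, the timed action would be <hask>evaluate (dotp_wrapper v w)</hask> with both input vectors already fully evaluated.<br />
<haskell><br />
module Main (main) where<br />
<br />
import Control.Exception (evaluate)<br />
import Data.List         (foldl')<br />
import Data.Time.Clock   (NominalDiffTime, diffUTCTime, getCurrentTime)<br />
<br />
-- | Run an action and return its result together with the elapsed wall-clock time.<br />
timeWallClock :: IO a -> IO (a, NominalDiffTime)<br />
timeWallClock action = do<br />
  start  <- getCurrentTime<br />
  result <- action<br />
  end    <- getCurrentTime<br />
  return (result, end `diffUTCTime` start)<br />
<br />
main :: IO ()<br />
main = do<br />
  -- stand-in workload; 'evaluate' forces the result inside the timed region<br />
  (result, elapsed) <- timeWallClock (evaluate (foldl' (+) 0 [1 .. 1000000 :: Int]))<br />
  putStrLn ("result:  " ++ show result)<br />
  putStrLn ("elapsed: " ++ show elapsed)<br />
</haskell><br />
A single measurement like this is noisy; criterion repeats the run and reports statistics, which is why it remains the better choice for proper benchmarks.<br />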
<br />
A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with just one core and to move to multiple cores only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is often bad as GHC's runtime makes no effort at optimising placement.<br />
<br />
=== Further examples and documentation ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/packages/dph/dph-examples/ examples directory of the dph package source]. This code also includes reference implementations for some of the examples that are useful for benchmarking.<br />
<br />
The interfaces of the various components of the DPH library are in the [http://hackage.haskell.org/package/dph-par library documentation] on Hackage.<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and for many performance considerations (such as whether a computation has sufficient parallelism to make exploiting it worthwhile) still lies with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://www.cse.unsw.edu.au/~benl/ Ben Lippmeier]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2012-02-06T03:54:54Z<p>Chak: /* Where to get it */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus to utilise multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
Data Parallel Haskell (DPH) is available as an add-on for [http://haskell.org/ghc/download_ghc_7_4_1 GHC 7.4] in the form of a few separate cabal packages. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude for vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed or no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code.<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://repa.ouroborus.net/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, install [http://haskell.org/ghc/download_ghc_7_4_1 GHC 7.4] and then install the DPH libraries with <code>cabal install</code> as follows:<br />
<blockquote><br />
<code>$ cabal update</code><br><br />
<code>$ cabal install dph-examples</code><br />
</blockquote><br />
This will install all DPH packages, including a set of simple examples; see [http://hackage.haskell.org/package/dph-examples dph-examples]. (The package [http://hackage.haskell.org/package/dph-examples dph-examples] depends on OpenGL and Gloss, as both are used in a visualiser for the n-body example.)<br />
<br />
'''WARNING:''' The vanilla GHC distribution does '''not''' include <code>cabal install</code>. This is in contrast to the Haskell Platform, which does include <code>cabal install</code>. If you want to avoid installing the <code>cabal-install</code> package and its dependencies explicitly, simply install GHC 7.4.1 in addition to your current Haskell Platform installation. (How to do that depends on your platform and personal preferences. One option is to install a bindist into your home directory with symbolic links to the binaries including the version number.) Then, install DPH with the following command:<br />
<blockquote><br />
<code>cabal install --with-compiler=`which ghc-7.4.1` --with-hc-pkg=`which ghc-pkg-7.4.1` dph-examples</code><br />
</blockquote><br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, only that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that parallel arrays (1) are a strict data structure and (2) are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions).<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
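<br />
The link between associativity and the logarithmic step count can be illustrated with ordinary lists. The following plain-Haskell sketch is not how DPH implements <hask>foldP</hask>; the names <hask>pairUp</hask> and <hask>treeFold</hask> are made up for this example. Each round combines neighbouring elements pairwise, so a list of ''n'' elements is reduced in about log ''n'' rounds, and within a round every combination is independent of the others. The regrouping only preserves the result if the function is associative.<br />
<haskell><br />
module Main (main) where<br />
<br />
-- | One reduction round: combine neighbouring elements pairwise.<br />
pairUp :: (a -> a -> a) -> [a] -> [a]<br />
pairUp f (x : y : rest) = f x y : pairUp f rest<br />
pairUp _ xs             = xs   -- zero or one element left over<br />
<br />
-- | Reduce a list in balanced rounds; the seed is only returned for an empty list.<br />
treeFold :: (a -> a -> a) -> a -> [a] -> a<br />
treeFold _ z []  = z<br />
treeFold _ _ [x] = x<br />
treeFold f z xs  = treeFold f z (pairUp f xs)<br />
<br />
main :: IO ()<br />
main = do<br />
  let xs = [1 .. 8] :: [Int]<br />
  print (treeFold (+) 0 xs, foldl1 (+) xs)  -- (36,36): (+) is associative<br />
  print (treeFold (-) 0 xs, foldl1 (-) xs)  -- (0,-34): (-) is not<br />
</haskell><br />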
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
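<br />
For readers without a DPH installation, the same comprehension can be tried on ordinary lists. This is a sequential list analogue, not DPH code; it only needs GHC's <hask>ParallelListComp</hask> extension, and the names <hask>dotpList</hask> and <hask>dotpZip</hask> are chosen for this example. It also checks that the parallel comprehension and the <hask>zip</hask> formulation agree.<br />
<haskell><br />
{-# LANGUAGE ParallelListComp #-}<br />
module Main (main) where<br />
<br />
-- list analogue of dotp, using a parallel list comprehension<br />
dotpList :: Num a => [a] -> [a] -> a<br />
dotpList xs ys = sum [x * y | x <- xs | y <- ys]<br />
<br />
-- the equivalent formulation with an explicit zip<br />
dotpZip :: Num a => [a] -> [a] -> a<br />
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]<br />
<br />
main :: IO ()<br />
main = do<br />
  let xs = [1, 2, 3] :: [Double]<br />
      ys = [4, 5, 6] :: [Double]<br />
  print (dotpList xs ys)                  -- 32.0<br />
  print (dotpList xs ys == dotpZip xs ys) -- True<br />
</haskell><br />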
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules to support one numeric type each: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS_GHC -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code and <code>-fdph-par</code> selects the standard parallel DPH backend library. (This is currently the only relevant backend, but there may be others in the future.)<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel<br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = let v      = fromList [1..10]      -- convert lists...<br />
        w      = fromList [2,4..20]    -- ...to parallel arrays of equal length<br />
        result = dotp_wrapper v w      -- invoke vectorised code<br />
    in<br />
    print result                       -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded -fdph-par -rtsopts DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime and <code>-fdph-par</code> to link with the standard parallel DPH backend. We include <code>-rtsopts</code> to be able to explicitly determine the number of OS threads used to execute our code.<br />
<br />
==== Generating input data ====<br />
<br />
To see any benefit from parallel execution, a data-parallel program needs to operate on a sufficiently large data set. Hence, instead of two small constant vectors, we might want to generate some larger input data:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile and link the program as described above.<br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose. For a variant of the dot-product example code that determines the CPU time consumed by <hask>dotp_wrapper</hask>, see [[GHC/Data Parallel Haskell/MainTimed|timed dot product]].<br />
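<br />
As a rough, untested sketch of the forcing step (the linked timed variant is the authoritative version), the main module could force both vectors with <hask>nf</hask> before the timed region. The sketch assumes only that evaluating <hask>nf v</hask> fully evaluates <hask>v</hask>; the exact result type of <hask>nf</hask> is not relied upon, which is why the result is discarded.<br />
<haskell><br />
import Control.Exception (evaluate)<br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs, nf)<br />
<br />
import DotP (dotp_wrapper)   -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      _ <- evaluate (nf v)                    -- force the inputs before timing<br />
      _ <- evaluate (nf w)<br />
      -- start the clock here<br />
      result <- evaluate (dotp_wrapper v w)   -- only this step should be timed<br />
      -- stop the clock here<br />
      print result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />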
<br />
==== Parallel execution ====<br />
<br />
By default, a Haskell program uses only one OS thread, and hence, also only one CPU core for execution. To use multiple cores, we need to invoke the executable with an explicit RTS command line option — e.g., <code>./dotp +RTS -N2</code> uses two cores. (Strictly speaking, it uses two OS threads, which will be scheduled on two separate cores if available.) To determine the runtime of parallel code, measuring CPU time, as demonstrated in the [[GHC/Data Parallel Haskell/MainTimed|timed variant of the dot product example]], is not sufficient anymore. We need to measure wall clock times instead. For proper benchmarking, it is advisable to use a library, such as [http://hackage.haskell.org/package/criterion criterion].<br />
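<br />
To confirm that the <code>-N</code> option took effect, a program can ask the runtime how many capabilities it is using. This small sketch is plain GHC Haskell and independent of DPH; it relies on <hask>getNumCapabilities</hask> from <hask>Control.Concurrent</hask>. When linked with <code>-threaded -rtsopts</code> and started with <code>+RTS -N2</code>, it should report two capabilities; without the option it reports one.<br />
<haskell><br />
module Main (main) where<br />
<br />
import Control.Concurrent (getNumCapabilities)<br />
<br />
main :: IO ()<br />
main = do<br />
  -- number of Haskell execution contexts the RTS was started with<br />
  caps <- getNumCapabilities<br />
  putStrLn ("capabilities in use: " ++ show caps)<br />
</haskell><br />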
<br />
A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with just one core and to move to multiple cores only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is often bad as GHC's runtime makes no effort at optimising placement.<br />
<br />
=== Further examples and documentation ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/packages/dph/dph-examples/ examples directory of the dph package source]. This code also includes reference implementations for some of the examples that are useful for benchmarking.<br />
<br />
The interfaces of the various components of the DPH library are in the [http://hackage.haskell.org/package/dph-par library documentation] on Hackage.<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and for many performance considerations (such as whether a computation has sufficient parallelism to make exploiting it worthwhile) still lies with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
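<br />
The shape of such a traversal can be tried out with ordinary lists. The sketch below is a sequential analogue, not DPH code: <hask>RTreeL</hask>, <hask>sizeT</hask>, and <hask>depthT</hask> are names invented for this example. In a vectorised module, the list of children would be a parallel array and the <hask>map</hask>/<hask>sum</hask> over the children would presumably become an array comprehension with <hask>sumP</hask>, so that the recursion descends level by level while each level's children are processed in parallel.<br />
<haskell><br />
module Main (main) where<br />
<br />
-- list analogue of the rose tree: children held in a list instead of [: :]<br />
data RTreeL a = RNodeL [RTreeL a]<br />
<br />
-- | Number of nodes: one for this node plus the sizes of all subtrees.<br />
sizeT :: RTreeL a -> Int<br />
sizeT (RNodeL ts) = 1 + sum (map sizeT ts)<br />
<br />
-- | Number of levels; a node without children has depth 1.<br />
depthT :: RTreeL a -> Int<br />
depthT (RNodeL ts) = 1 + maximum (0 : map depthT ts)<br />
<br />
main :: IO ()<br />
main = do<br />
  let leaf = RNodeL []<br />
      tree = RNodeL [leaf, RNodeL [leaf, leaf]] :: RTreeL ()<br />
  print (sizeT tree)   -- 5<br />
  print (depthT tree)  -- 3<br />
</haskell><br />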
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC user's mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://www.cse.unsw.edu.au/~benl/ Ben Lippmeier]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2012-02-06T03:51:04Z<p>Chak: /* Project status */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus to utilise multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
Data Parallel Haskell (DPH) is available as an add-on for [http://haskell.org/ghc/download_ghc_7_4_1 GHC 7.4] in the form of a few separate Cabal packages. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude for vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed or no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code.<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://repa.ouroborus.net/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, install [http://haskell.org/ghc/download_ghc_7_2_1 GHC 7.2] and then install the DPH libraries with <code>cabal install</code> as follows:<br />
<blockquote><br />
<code>$ cabal update</code><br><br />
<code>$ cabal install dph-examples</code><br />
</blockquote><br />
This will install all DPH packages, including a set of simple examples; see [http://hackage.haskell.org/package/dph-examples dph-examples]. (The package [http://hackage.haskell.org/package/dph-examples dph-examples] depends on OpenGL and Gloss, as both are used in a visualiser for the ''N''-body example.)<br />
<br />
'''WARNING:''' The vanilla GHC distribution does '''not''' include <code>cabal install</code>. This is in contrast to the Haskell Platform, which does include <code>cabal install</code>. If you want to avoid installing the <code>cabal-install</code> package and its dependencies explicitly, simply install GHC 7.2.1 in addition to your current Haskell Platform installation. (How to do that depends on your platform and personal preferences. One option is to install a bindist into your home directory with symbolic links to the binaries including the version number.) Then, install DPH with the following command:<br />
<blockquote><br />
<code>cabal install --with-compiler=`which ghc-7.2.1` --with-hc-pkg=`which ghc-pkg-7.2.1` dph-examples</code><br />
</blockquote><br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
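<br />
The associativity requirement can be illustrated without DPH. The following is a minimal sketch in plain Haskell on ordinary lists; <hask>treeFold</hask> is a hypothetical helper written for this illustration and is not part of the DPH library. It combines neighbouring elements pairwise, so the list halves in each round, which mirrors the logarithmic number of steps mentioned above. With an associative operator the result agrees with a left fold; with a non-associative one it need not.<br />
<haskell><br />
-- Pairwise (tree-shaped) reduction over a plain list.<br />
-- Hypothetical stand-in for an undirected fold; not DPH code.<br />
treeFold :: (a -> a -> a) -> a -> [a] -> a<br />
treeFold _ z []  = z<br />
treeFold _ _ [x] = x<br />
treeFold f z xs  = treeFold f z (pairUp xs)<br />
  where<br />
    -- combine neighbours; each round roughly halves the list<br />
    pairUp (a : b : rest) = f a b : pairUp rest<br />
    pairUp rest           = rest<br />
<br />
main :: IO ()<br />
main = do<br />
  let xs = [1 .. 8] :: [Int]<br />
  print (treeFold (+) 0 xs == foldl (+) 0 xs)  -- (+) is associative: results agree<br />
  print (treeFold (-) 0 xs == foldl (-) 0 xs)  -- (-) is not associative: results differ<br />
</haskell><br />
Because the order in which elements are combined is left unspecified, only associative functions give a well-defined result.<br />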
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
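<br />
For readers unfamiliar with parallel comprehensions, the same equivalence can be tried on ordinary lists. This is a sketch using GHC's <hask>ParallelListComp</hask> extension with list functions standing in for their array counterparts; the names <hask>dotpComp</hask> and <hask>dotpZip</hask> are made up for this illustration.<br />
<haskell><br />
{-# LANGUAGE ParallelListComp #-}<br />
<br />
-- List analogue of the array comprehension: both generators advance in lockstep.<br />
dotpComp :: [Double] -> [Double] -> Double<br />
dotpComp xs ys = sum [x * y | x <- xs | y <- ys]<br />
<br />
-- The same computation written with an explicit zip.<br />
dotpZip :: [Double] -> [Double] -> Double<br />
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]<br />
<br />
main :: IO ()<br />
main = print (dotpComp [1, 2, 3] [4, 5, 6] == dotpZip [1, 2, 3] [4, 5, 6])<br />
</haskell><br />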
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules, each supporting one numeric type: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS_GHC -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code and <code>-fdph-par</code> selects the standard parallel DPH backend library. (This is currently the only relevant backend, but there may be others in the future.)<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel<br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = let v      = fromList [1..10]    -- convert lists...<br />
        w      = fromList [1,2..20]  -- ...to parallel arrays<br />
        result = dotp_wrapper v w    -- invoke vectorised code<br />
    in<br />
    print result                     -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded -fdph-par -rtsopts DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime and <code>-fdph-par</code> to link with the standard parallel DPH backend. We include <code>-rtsopts</code> to be able to explicitly determine the number of OS threads used to execute our code.<br />
<br />
==== Generating input data ====<br />
<br />
To see any benefit from parallel execution, a data-parallel program needs to operate on a sufficiently large data set. Hence, instead of two small constant vectors, we might want to generate some larger input data:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile and link the program as described above.<br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose. For a variant of the dot-product example code that determines the CPU time consumed by <hask>dotp_wrapper</hask>, see [[GHC/Data Parallel Haskell/MainTimed|timed dot product]].<br />
<br />
==== Parallel execution ====<br />
<br />
By default, a Haskell program uses only one OS thread, and hence, also only one CPU core for execution. To use multiple cores, we need to invoke the executable with an explicit RTS command line option — e.g., <code>./dotp +RTS -N2</code> uses two cores. (Strictly speaking, it uses two OS threads, which will be scheduled on two separate cores if available.) To determine the runtime of parallel code, measuring CPU time, as demonstrated in the [[GHC/Data Parallel Haskell/MainTimed|timed variant of the dot product example]], is not sufficient anymore. We need to measure wall clock times instead. For proper benchmarking, it is advisable to use a library, such as [http://hackage.haskell.org/package/criterion criterion].<br />
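<br />
The difference between the two kinds of measurement can be sketched as follows. This is an illustrative, hand-rolled harness in plain Haskell, not the code from the timed variant linked above; the summation is only a placeholder workload, and a benchmarking library remains the better choice for real measurements. In a run with several OS threads, the CPU time reported for the process typically accumulates over all threads, so it can exceed the elapsed wall clock time.<br />
<haskell><br />
import Control.Exception (evaluate)<br />
import Data.Time.Clock (diffUTCTime, getCurrentTime)<br />
import System.CPUTime (getCPUTime)<br />
<br />
main :: IO ()<br />
main = do<br />
  wallStart <- getCurrentTime<br />
  cpuStart  <- getCPUTime                         -- picoseconds of CPU time so far<br />
  _ <- evaluate (sum [1 .. 5000000 :: Int])       -- placeholder workload, forced here<br />
  cpuEnd    <- getCPUTime<br />
  wallEnd   <- getCurrentTime<br />
  putStrLn ("wall clock: " ++ show (diffUTCTime wallEnd wallStart))<br />
  putStrLn ("CPU time:   " ++ show (fromIntegral (cpuEnd - cpuStart) / 1.0e12 :: Double) ++ "s")<br />
</haskell><br />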
<br />
A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with just one core and to move to multiple cores only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is often bad as GHC's runtime makes no effort at optimising placement.<br />
<br />
=== Further examples and documentation ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/packages/dph/dph-examples/ examples directory of the package dph source]. This code also includes reference implementations for some of the examples that are useful for benchmarking. <br />
<br />
The interfaces of the various components of the DPH library are in the [http://hackage.haskell.org/package/dph-par library documentation] on Hackage.<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as when does a computation have sufficient parallelism to make it worthwhile to exploit that parallelism) are still with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
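<br />
To make the traversal pattern concrete, here is a sketch that uses ordinary lists as a stand-in for parallel arrays, so that it runs in plain Haskell. The functions <hask>size</hask> and <hask>depth</hask> are invented for this illustration. In a DPH version one would expect the per-node <hask>map</hask> and <hask>sum</hask> to become <hask>mapP</hask> and <hask>sumP</hask>, so that the children of each node are processed in parallel while the recursion still descends level by level.<br />
<haskell><br />
-- List-based stand-in: [RTree a] takes the place of [:RTree a:].<br />
data RTree a = RNode [RTree a]<br />
<br />
-- Number of nodes: one for this node plus the sizes of all children.<br />
size :: RTree a -> Int<br />
size (RNode ts) = 1 + sum (map size ts)<br />
<br />
-- Number of levels: one for this node plus the deepest child (0 if none).<br />
depth :: RTree a -> Int<br />
depth (RNode ts) = 1 + maximum (0 : map depth ts)<br />
<br />
main :: IO ()<br />
main = do<br />
  let leaf = RNode []<br />
      t    = RNode [RNode [leaf, leaf], leaf] :: RTree ()<br />
  print (size t)   -- 5<br />
  print (depth t)  -- 3<br />
</haskell><br />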
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC user's mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://www.cse.unsw.edu.au/~benl/ Ben Lippmeier]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/HakkuTaikai/AttendeesHakkuTaikai/Attendees2011-08-30T00:34:12Z<p>Chak: /* HakkuTaikai Attendees */</p>
<hr />
<div>= HakkuTaikai Attendees =<br />
<br />
The venue is not open to the public on the day of the Hackathon, so we must submit a list of names to security. If your name is not on the list, you may not be admitted.<br />
<br />
If you do not wish to announce your attendance in public, please email haskathon★liyang.hu instead.<br />
<br />
{| class="wikitable"<br />
!Nickname<br />
!Real Name<br />
!Affiliation<br />
!Mobile<br />
|-<br />
| liyang<br />
| Liyang HU<br />
| Tsuru Capital LLC<br />
| +81 80 4361 1307<br />
|-<br />
| kfish<br />
| [[User:ConradParker|Conrad Parker]]<br />
| Tsuru Capital SG Pte Ltd<br />
| +81 80 4162 1307<br />
|-<br />
| erikde/m3ga<br />
| [[User:Erik de Castro Lopo|Erik de Castro Lopo]]<br />
| bCODE Pty Ltd<br />
| +61 400 912 480<br />
|-<br />
| kazu<br />
| Kazu Yamamoto<br />
| IIJ<br />
| not public<br />
|-<br />
| tibbe<br />
| Johan Tibell<br />
| Google<br />
| not public<br />
|-<br />
| lpeterse<br />
| Lars Petersen<br />
| -<br />
| not public<br />
|-<br />
| TacticalGrace<br />
| [[User:chak|Manuel Chakravarty]]<br />
| University of New South Wales<br />
| not public<br />
|}</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2011-08-11T13:42:46Z<p>Chak: /* Where to get it */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus to utilise multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
Data Parallel Haskell (DPH) is available as an add-on for [http://haskell.org/ghc/download_ghc_7_2_1 GHC 7.2] in the form of a few separate Cabal packages. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used and certain idioms are avoided. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, install [http://haskell.org/ghc/download_ghc_7_2_1 GHC 7.2] and then install the DPH libraries with <code>cabal install</code> as follows:<br />
<blockquote><br />
<code>$ cabal update</code><br><br />
<code>$ cabal install dph-examples</code><br />
</blockquote><br />
This will install all DPH packages, including a set of simple examples; see [http://hackage.haskell.org/package/dph-examples dph-examples]. (The package [http://hackage.haskell.org/package/dph-examples dph-examples] depends on OpenGL and Gloss, as both are used in a visualiser for the ''N''-body example.)<br />
<br />
'''WARNING:''' The vanilla GHC distribution does '''not''' include <code>cabal install</code>. This is in contrast to the Haskell Platform, which does include <code>cabal install</code>. If you want to avoid installing the <code>cabal-install</code> package and its dependencies explicitly, simply install GHC 7.2.1 in addition to your current Haskell Platform installation. (How to do that depends on your platform and personal preferences. One option is to install a bindist into your home directory with symbolic links to the binaries including the version number.) Then, install DPH with the following command:<br />
<blockquote><br />
<code>cabal install --with-compiler=`which ghc-7.2.1` --with-hc-pkg=`which ghc-pkg-7.2.1` dph-examples</code><br />
</blockquote><br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
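<br />
The idea of turning nested into flat data can be pictured with the following simplified model on ordinary lists. It is only a sketch: <hask>Segmented</hask>, <hask>flatten</hask>, and <hask>segSum</hask> are hypothetical names, and the representation that DPH actually uses is more elaborate.<br />
<haskell><br />
-- Simplified model of a flattened nested array: one length per inner array<br />
-- plus all elements stored contiguously. All names here are illustrative.<br />
data Segmented a = Segmented { segLens :: [Int], segData :: [a] }<br />
  deriving (Eq, Show)<br />
<br />
-- Replace a nested structure by segment lengths and flat data.<br />
flatten :: [[a]] -> Segmented a<br />
flatten xss = Segmented (map length xss) (concat xss)<br />
<br />
-- Sum every segment of the flat data, yielding one result per inner array.<br />
segSum :: Num a => Segmented a -> [a]<br />
segSum (Segmented lens xs) = go lens xs<br />
  where<br />
    go []     _  = []<br />
    go (n:ns) ys = let (seg, rest) = splitAt n ys<br />
                   in  sum seg : go ns rest<br />
</haskell><br />
In this model, <hask>flatten [[1,2,3],[],[4,5]]</hask> stores the lengths <hask>[3,0,2]</hask> next to the data <hask>[1,2,3,4,5]</hask>. The flat data can be divided evenly among threads regardless of how irregular the inner arrays are, which is one way to understand why flattening helps with load balancing.<br />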
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules that each support one numeric type: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types); if a program needs to use more than one of them, they must be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS_GHC -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code and <code>-fdph-par</code> selects the standard parallel DPH backend library. (This is currently the only relevant backend, but there may be others in the future.)<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = let v      = fromList [1..10]    -- convert lists...<br />
        w      = fromList [1,2..20]  -- ...to parallel arrays<br />
        result = dotp_wrapper v w    -- invoke vectorised code<br />
    in<br />
    print result                     -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded -fdph-par -rtsopts DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime and <code>-fdph-par</code> to link with the standard parallel DPH backend. We include <code>-rtsopts</code> to be able to explicitly determine the number of OS threads used to execute our code.<br />
<br />
==== Generating input data ====<br />
<br />
To see any benefit from parallel execution, a data-parallel program needs to operate on a sufficiently large data set. Hence, instead of two small constant vectors, we might want to generate some larger input data:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile and link the program as described above.<br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose. For a variant of the dot-product example code that determines the CPU time consumed by <hask>dotp_wrapper</hask>, see [[GHC/Data Parallel Haskell/MainTimed|timed dot product]].<br />
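<br />
The general pattern of forcing the input before starting the clock can be sketched with ordinary lists and functions from <hask>base</hask> only. This is not the linked timed variant and does not use <hask>nf</hask>; <hask>forceList</hask> and <hask>timeCPU</hask> are hypothetical helper names.<br />
<haskell><br />
import Control.Exception (evaluate)<br />
import System.CPUTime    (getCPUTime)<br />
<br />
-- Force every element of a list of Doubles by summing it (illustrative helper).<br />
forceList :: [Double] -> IO ()<br />
forceList xs = evaluate (sum xs) >> return ()<br />
<br />
-- Run an action and return its result together with the CPU time it used,<br />
-- in seconds. getCPUTime reports picoseconds.<br />
timeCPU :: IO a -> IO (a, Double)<br />
timeCPU act = do<br />
  start <- getCPUTime<br />
  r     <- act<br />
  end   <- getCPUTime<br />
  return (r, fromIntegral (end - start) / 1.0e12)<br />
</haskell><br />
A benchmark would first call <hask>forceList</hask> on each input and only then wrap the computation of interest in <hask>timeCPU</hask>, using <hask>evaluate</hask> so that the result is actually computed inside the timed region.<br />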
<br />
==== Parallel execution ====<br />
<br />
By default, a Haskell program uses only one OS thread, and hence, also only one CPU core for execution. To use multiple cores, we need to invoke the executable with an explicit RTS command line option — e.g., <code>./dotp +RTS -N2</code> uses two cores. (Strictly speaking, it uses two OS threads, which will be scheduled on two separate cores if available.) To determine the runtime of parallel code, measuring CPU time, as demonstrated in the [[GHC/Data Parallel Haskell/MainTimed|timed variant of the dot product example]], is not sufficient anymore. We need to measure wall clock times instead. For proper benchmarking, it is advisable to use a library, such as [http://hackage.haskell.org/package/criterion criterion].<br />
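<br />
As a minimal sketch of wall-clock measurement without an external library, the following program uses <hask>getMonotonicTime</hask> from <hask>GHC.Clock</hask>, which is part of <hask>base</hask> in recent GHC releases (an assumption about your toolchain; it is not available in GHC 7.4). The helper name <hask>timeWall</hask> and the summed list are purely illustrative and involve no DPH code.<br />
<haskell><br />
import Control.Concurrent (getNumCapabilities)<br />
import Control.Exception  (evaluate)<br />
import GHC.Clock          (getMonotonicTime)<br />
<br />
-- Run an action and return its result with the elapsed wall-clock seconds.<br />
-- 'timeWall' is a hypothetical helper, not part of DPH.<br />
timeWall :: IO a -> IO (a, Double)<br />
timeWall act = do<br />
  start <- getMonotonicTime<br />
  r     <- act<br />
  end   <- getMonotonicTime<br />
  return (r, end - start)<br />
<br />
main :: IO ()<br />
main = do<br />
  caps   <- getNumCapabilities   -- reflects the -N RTS option<br />
  (r, t) <- timeWall (evaluate (sum [1 .. 1000000 :: Int]))<br />
  putStrLn ("capabilities: " ++ show caps)<br />
  putStrLn ("result: " ++ show r ++ ", wall-clock seconds: " ++ show t)<br />
</haskell><br />
When linked with <code>-threaded -rtsopts</code> and run with <code>+RTS -N2</code>, the program reports two capabilities. For serious measurements, a benchmarking library remains the better choice, as it repeats runs and analyses the variance.<br />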
<br />
A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with just one core and to move to multiple cores only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is often bad as GHC's runtime makes no effort at optimising placement.<br />
<br />
=== Further examples and documentation ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/packages/dph/dph-examples/ examples directory of the package dph source]. This code also includes reference implementations for some of the examples that are useful for benchmarking.<br />
<br />
The interfaces of the various components of the DPH library are in the [http://hackage.haskell.org/package/dph-par library documentation] on Hackage.<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as whether a computation has sufficient parallelism to make exploiting it worthwhile) still rests with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
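<br />
To make the division of labour concrete, here is a list-based stand-in for the rose tree that compiles with plain GHC. It is a sequential sketch under the assumption that the list of children plays the role of the parallel array; <hask>size</hask> and <hask>height</hask> are illustrative functions of our own.<br />
<haskell><br />
-- List-based stand-in for the parallel rose tree; the list of children<br />
-- takes the place of the parallel array [:RTree a:].<br />
data RTree a = RNode [RTree a]<br />
<br />
-- Number of nodes: the recursion descends through the inductive structure,<br />
-- while the map and sum over the children are aggregate operations.<br />
size :: RTree a -> Int<br />
size (RNode cs) = 1 + sum (map size cs)<br />
<br />
-- Number of levels in the tree.<br />
height :: RTree a -> Int<br />
height (RNode cs) = 1 + maximum (0 : map height cs)<br />
</haskell><br />
In a DPH version, the <hask>map</hask> and <hask>sum</hask> over the children would become operations on parallel arrays such as <hask>mapP</hask> and <hask>sumP</hask>, so the subtrees of a node could be processed in parallel while the descent from level to level stays sequential.<br />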
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://www.cse.unsw.edu.au/~benl/ Ben Lippmeier]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2011-08-11T12:52:27Z<p>Chak: /* Further examples and documentation */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus to utilise multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
Data Parallel Haskell (DPH) is available as an add-on for [http://haskell.org/ghc/download_ghc_7_2_1 GHC 7.2] in the form of a few separate cabal packages. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used and certain idioms are avoided. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, install [http://haskell.org/ghc/download_ghc_7_2_1 GHC 7.2] and then install the DPH libraries with <code>cabal install</code> as follows:<br />
<blockquote><br />
<code>$ cabal update</code><br><br />
<code>$ cabal install dph-examples</code><br />
</blockquote><br />
This will install all DPH packages, including a set of simple examples; see [http://hackage.haskell.org/package/dph-examples dph-examples]. (The package [http://hackage.haskell.org/package/dph-examples dph-examples] depends on OpenGL and Gloss, as both are used in a visualiser for the n-body example.)<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) they are not inductively defined. Parallel arrays are strict in that demanding a single element demands all elements of the array. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions).<br />
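<br />
The strictness property can be imitated on ordinary lists, as in the following sketch. It only models the behaviour described above and says nothing about how DPH implements it; <hask>strictIndex</hask> is a hypothetical name.<br />
<haskell><br />
-- Model of "demanding one element demands all": force every element<br />
-- before returning the one that was asked for. Illustrative name only.<br />
strictIndex :: [a] -> Int -> a<br />
strictIndex xs i = foldr seq (xs !! i) xs<br />
</haskell><br />
With lazy lists, <hask>[1, undefined] !! 0</hask> yields <hask>1</hask> because the second element is never inspected. In contrast, <hask>strictIndex [1, undefined] 0</hask> fails, since every element is forced first. This is also why an infinite structure cannot be accommodated: forcing all of its elements would never terminate.<br />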
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules that each support one numeric type: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types); if a program needs to use more than one of them, they must be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS_GHC -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code and <code>-fdph-par</code> selects the standard parallel DPH backend library. (This is currently the only relevant backend, but there may be others in the future.)<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = let v      = fromList [1..10]    -- convert lists...<br />
        w      = fromList [1,2..20]  -- ...to parallel arrays<br />
        result = dotp_wrapper v w    -- invoke vectorised code<br />
    in<br />
    print result                     -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded -fdph-par -rtsopts DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime and <code>-fdph-par</code> to link with the standard parallel DPH backend. We include <code>-rtsopts</code> to be able to explicitly determine the number of OS threads used to execute our code.<br />
<br />
==== Generating input data ====<br />
<br />
To see any benefit from parallel execution, a data-parallel program needs to operate on a sufficiently large data set. Hence, instead of two small constant vectors, we might want to generate some larger input data:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile and link the program as described above.<br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose. For a variant of the dot-product example code that determines the CPU time consumed by <hask>dotp_wrapper</hask>, see [[GHC/Data Parallel Haskell/MainTimed|timed dot product]].<br />
<br />
==== Parallel execution ====<br />
<br />
By default, a Haskell program uses only one OS thread, and hence, also only one CPU core for execution. To use multiple cores, we need to invoke the executable with an explicit RTS command line option — e.g., <code>./dotp +RTS -N2</code> uses two cores. (Strictly speaking, it uses two OS threads, which will be scheduled on two separate cores if available.) To determine the runtime of parallel code, measuring CPU time, as demonstrated in the [[GHC/Data Parallel Haskell/MainTimed|timed variant of the dot product example]], is not sufficient anymore. We need to measure wall clock times instead. For proper benchmarking, it is advisable to use a library, such as [http://hackage.haskell.org/package/criterion criterion].<br />
<br />
A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with just one core and to move to multiple cores only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is often bad as GHC's runtime makes no effort at optimising placement.<br />
<br />
=== Further examples and documentation ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/packages/dph/dph-examples/ examples directory of the package dph source]. This code also includes reference implementations for some of the examples that are useful for benchmarking.<br />
<br />
The interfaces of the various components of the DPH library are in the [http://hackage.haskell.org/package/dph-par library documentation] on Hackage.<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as whether a computation has sufficient parallelism to make exploiting it worthwhile) still rests with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://www.cse.unsw.edu.au/~benl/ Ben Lippmeier]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2011-08-11T12:49:32Z<p>Chak: /* Where to get it */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus on utilising multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
Data Parallel Haskell (DPH) is available as an add-on for [http://haskell.org/ghc/download_ghc_7_2_1 GHC 7.2] in the form of a few separate cabal packages. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used and certain idioms are avoided. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, install [http://haskell.org/ghc/download_ghc_7_2_1 GHC 7.2] and then install the DPH libraries with <code>cabal install</code> as follows:<br />
<blockquote><br />
<code>$ cabal update</code><br><br />
<code>$ cabal install dph-examples</code><br />
</blockquote><br />
This will install all DPH packages, including a set of simple examples; see [http://hackage.haskell.org/package/dph-examples dph-examples]. (The package [http://hackage.haskell.org/package/dph-examples dph-examples] depends on OpenGL and Gloss, as both are used in a visualiser for the ''N''-body example.)<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, only that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
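To see why an undirected fold needs an associative function and where the O(log ''n'') step count comes from, the reduction can be modelled on ordinary lists. This is only an illustrative sketch and not how <code>foldP</code> is implemented; <code>reduceRound</code> and <code>treeFold</code> are hypothetical names. Each round combines neighbouring pairs, and all pairs within a round are independent of each other, so a parallel implementation could process them at the same time.

```haskell
-- | One round of a tree-shaped reduction: combine neighbouring pairs.
-- A trailing unpaired element is carried over unchanged.
reduceRound :: (a -> a -> a) -> [a] -> [a]
reduceRound f (x : y : rest) = f x y : reduceRound f rest
reduceRound _ xs             = xs

-- | Reduce a list pairwise and report the number of rounds needed.
-- Returns Nothing for the empty list.
treeFold :: (a -> a -> a) -> [a] -> Maybe (a, Int)
treeFold _ [] = Nothing
treeFold f xs = Just (go 0 xs)
  where
    go rounds [x] = (x, rounds)
    go rounds ys  = go (rounds + 1) (reduceRound f ys)

main :: IO ()
main = do
  print (treeFold (+) [1 .. 8 :: Int])  -- Just (36,3): same sum as a left fold
  print (treeFold (-) [1 .. 8 :: Int])  -- Just (0,3)
  print (foldl1 (-) [1 .. 8 :: Int])    -- -34: differs, (-) is not associative
```

Eight elements are reduced in three rounds. With a non-associative function such as subtraction, the result depends on how the elements are grouped, so the outcome of an undirected fold would not be well defined.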
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
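For readers who have not used parallel comprehensions before, the same function can be written on ordinary lists with GHC's <code>ParallelListComp</code> extension. This list version is only an analogy that runs sequentially; the names <code>dotpList</code> and <code>dotpZip</code> are introduced here for illustration.

```haskell
{-# LANGUAGE ParallelListComp #-}

-- | Dot product on ordinary lists, written with a parallel list
-- comprehension: the two generators are traversed in lock step.
dotpList :: Num a => [a] -> [a] -> a
dotpList xs ys = sum [x * y | x <- xs | y <- ys]

-- | The same function with an explicit zip instead of two generators.
dotpZip :: Num a => [a] -> [a] -> a
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]

main :: IO ()
main = do
  print (dotpList [1, 2, 3] [4, 5, 6 :: Int])  -- 4 + 10 + 18 = 32
  print (dotpZip  [1, 2, 3] [4, 5, 6 :: Int])  -- 32 as well
```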
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules to support one numeric type each: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS_GHC -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code and <code>-fdph-par</code> selects the standard parallel DPH backend library. (This is currently the only relevant backend, but there may be others in the future.)<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = let v      = fromList [1..10]      -- convert lists...<br />
        w      = fromList [2,4..20]    -- ...to parallel arrays<br />
        result = dotp_wrapper v w      -- invoke vectorised code<br />
    in<br />
    print result                       -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded -fdph-par -rtsopts DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime and <code>-fdph-par</code> to link with the standard parallel DPH backend. We include <code>-rtsopts</code> to be able to explicitly determine the number of OS threads used to execute our code.<br />
<br />
==== Generating input data ====<br />
<br />
To see any benefit from parallel execution, a data-parallel program needs to operate on a sufficiently large data set. Hence, instead of two small constant vectors, we might want to generate some larger input data:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile and link the program as described above.<br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose. For a variant of the dot-product example code that determines the CPU time consumed by <hask>dotp_wrapper</hask>, see [[GHC/Data Parallel Haskell/MainTimed|timed dot product]].<br />
<br />
==== Parallel execution ====<br />
<br />
By default, a Haskell program uses only one OS thread, and hence, also only one CPU core for execution. To use multiple cores, we need to invoke the executable with an explicit RTS command line option — e.g., <code>./dotp +RTS -N2</code> uses two cores. (Strictly speaking, it uses two OS threads, which will be scheduled on two separate cores if available.) To determine the runtime of parallel code, measuring CPU time, as demonstrated in the [[GHC/Data Parallel Haskell/MainTimed|timed variant of the dot product example]], is not sufficient anymore. We need to measure wall clock times instead. For proper benchmarking, it is advisable to use a library, such as [http://hackage.haskell.org/package/criterion criterion].<br />
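When experimenting with <code>+RTS -N</code>, it can be helpful to let the program report how the runtime was actually configured. The following minimal sketch uses only functions from <code>Control.Concurrent</code> in <code>base</code>; the helper <code>describeRts</code> is a hypothetical name introduced here and is not part of DPH.

```haskell
import Control.Concurrent (getNumCapabilities, rtsSupportsBoundThreads)

-- | Summarise the runtime configuration in one line.
-- The first argument says whether the program was linked with -threaded,
-- the second is the number of capabilities (the value given to -N).
describeRts :: Bool -> Int -> String
describeRts threaded caps
  | not threaded = "non-threaded runtime: relink with -threaded to use +RTS -N"
  | caps == 1    = "threaded runtime, 1 capability: try +RTS -N2"
  | otherwise    = "threaded runtime, " ++ show caps ++ " capabilities"

main :: IO ()
main = do
  caps <- getNumCapabilities
  putStrLn (describeRts rtsSupportsBoundThreads caps)
```

Calling this at program start makes it easy to confirm that a benchmark run really used the intended number of OS threads.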
<br />
A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with just one core and to move to multiple cores only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is often bad as GHC's runtime makes no effort at optimising placement.<br />
<br />
=== Further examples and documentation ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/packages/dph/dph-examples/ examples directory of the dph package source]. This code also includes reference implementations for some of the examples that are useful for benchmarking. <br />
<br />
The interfaces of the various components of the DPH library are in the [http://hackage.haskell.org/package/dph library documentation] on Hackage (which will be uploaded with the GHC 7.2 DPH release).<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language for coding parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as deciding whether a computation has enough parallelism to make exploiting it worthwhile) remains with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
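The traversal pattern can be sketched with ordinary lists standing in for the parallel arrays. This is an assumption-laden analogy rather than DPH code: the children are held in a list, so everything runs sequentially, and the label field as well as the functions <code>sumTree</code> and <code>depth</code> are additions for illustration only.

```haskell
-- | List-based stand-in for the parallel rose tree. In DPH the children
-- would be held in a parallel array; the label field is an addition
-- made for this sketch.
data RTree a = RNode a [RTree a]

-- | Sum all labels. The recursion descends level by level; the map over
-- the children is the part that a parallel array could evaluate in parallel.
sumTree :: Num a => RTree a -> a
sumTree (RNode x kids) = x + sum (map sumTree kids)

-- | Number of levels in the tree, i.e. the number of sequential steps.
depth :: RTree a -> Int
depth (RNode _ kids) = 1 + maximum (0 : map depth kids)

-- | A small example tree with three levels.
example :: RTree Int
example = RNode 1 [RNode 2 [RNode 4 []], RNode 3 []]

main :: IO ()
main = do
  print (sumTree example)  -- 1 + 2 + 4 + 3 = 10
  print (depth example)    -- 3
```

The depth bounds the number of sequential steps, while the width of each level determines how much parallelism is available, which is the trade-off to keep in mind when designing such a structure.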
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://www.cse.unsw.edu.au/~benl/ Ben Lippmeier]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2011-08-11T12:43:17Z<p>Chak: /* Where to get it */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus on utilising multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
Data Parallel Haskell (DPH) is available as an add-on for [http://haskell.org/ghc/download_ghc_7_2_1 GHC 7.2] in the form of a few separate cabal packages. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used and certain idioms are avoided. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, install [http://haskell.org/ghc/download_ghc_7_2_1 GHC 7.2] and then install the DPH libraries with <hask>cabal install</hask> as follows:<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, only that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules, each supporting one numeric type: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS_GHC -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code and <code>-fdph-par</code> selects the standard parallel DPH backend library. (This is currently the only relevant backend, but there may be others in the future.)<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = let v      = fromList [1..10]      -- convert lists...<br />
        w      = fromList [2,4..20]    -- ...to parallel arrays of equal length<br />
        result = dotp_wrapper v w      -- invoke vectorised code<br />
    in<br />
    print result                       -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded -fdph-par -rtsopts DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime and <code>-fdph-par</code> to link with the standard parallel DPH backend. We include <code>-rtsopts</code> to be able to explicitly determine the number of OS threads used to execute our code.<br />
<br />
==== Generating input data ====<br />
<br />
To see any benefit from parallel execution, a data-parallel program needs to operate on a sufficiently large data set. Hence, instead of two small constant vectors, we might want to generate some larger input data:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile and link the program as described above.<br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose. For a variant of the dot-product example code that determines the CPU time consumed by <hask>dotp_wrapper</hask>, see [[GHC/Data Parallel Haskell/MainTimed|timed dot product]].<br />
<br />
==== Parallel execution ====<br />
<br />
By default, a Haskell program uses only one OS thread, and hence, also only one CPU core for execution. To use multiple cores, we need to invoke the executable with an explicit RTS command line option — e.g., <code>./dotp +RTS -N2</code> uses two cores. (Strictly speaking, it uses two OS threads, which will be scheduled on two separate cores if available.) To determine the runtime of parallel code, measuring CPU time, as demonstrated in the [[GHC/Data Parallel Haskell/MainTimed|timed variant of the dot product example]], is not sufficient anymore. We need to measure wall clock times instead. For proper benchmarking, it is advisable to use a library, such as [http://hackage.haskell.org/package/criterion criterion].<br />
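<br />
The gap between the two clocks can be observed with a small harness such as the one sketched below. It is an illustration only: it assumes a GHC recent enough to ship <hask>GHC.Clock</hask> in <code>base</code> (newer than the releases this page was written for), the helper name <hask>timed</hask> is made up, and the summation is merely a placeholder workload standing in for a call such as <hask>dotp_wrapper v w</hask>. When a program runs on several cores, the CPU time is accumulated over all threads, so it can exceed the wall-clock time considerably.<br />
<haskell><br />
module Main where<br />
<br />
import Control.Exception (evaluate)<br />
import GHC.Clock         (getMonotonicTime)  -- wall clock, in seconds<br />
import System.CPUTime    (getCPUTime)        -- CPU time, in picoseconds<br />
<br />
-- Run an action and return its result together with the elapsed<br />
-- wall-clock time and the consumed CPU time, both in seconds.<br />
timed :: IO a -> IO (a, Double, Double)<br />
timed action = do<br />
  wall0  <- getMonotonicTime<br />
  cpu0   <- getCPUTime<br />
  result <- action<br />
  cpu1   <- getCPUTime<br />
  wall1  <- getMonotonicTime<br />
  return (result, wall1 - wall0, fromIntegral (cpu1 - cpu0) * 1e-12)<br />
<br />
main :: IO ()<br />
main = do<br />
  -- Placeholder workload; evaluate forces the sum inside the timed region.<br />
  (s, wall, cpu) <- timed (evaluate (sum [1 .. 1000000 :: Int]))<br />
  putStrLn ("result = " ++ show s)<br />
  putStrLn ("wall   = " ++ show wall ++ " s")<br />
  putStrLn ("cpu    = " ++ show cpu ++ " s")<br />
</haskell><br />
A harness like this is good enough for a quick sanity check; for numbers that are meant to be reported, a benchmarking library remains the better choice.<br />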
<br />
A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with just one core and to move to multiple cores only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is often bad as GHC's runtime makes no effort at optimising placement.<br />
<br />
=== Further examples and documentation ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/packages/dph/dph-examples/ examples directory of the package dph source]. This code also includes reference implementations for some of the examples that are useful for benchmarking. <br />
<br />
The interfaces of the various components of the DPH library are in the [http://hackage.haskell.org/package/dph library documentation] on Hackage (which will be uploaded with the GHC 7.2 DPH release).<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the programmer remains fully responsible for parallel algorithm design and for many performance considerations (such as whether a computation has enough parallelism to make exploiting it worthwhile).<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
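<br />
The shape of such a traversal can be tried out with a list-based stand-in for the tree. The following sketch is plain, sequential Haskell in which <hask>[RTree a]</hask> replaces <hask>[:RTree a:]</hask>; the functions <hask>size</hask> and <hask>depth</hask> are illustrative helpers, not part of DPH. The recursion descends through the levels one after another, whereas the <hask>map</hask> over the children marks the place where a parallel array would allow the subtrees to be processed in parallel.<br />
<haskell><br />
module Main where<br />
<br />
-- List-based stand-in for the parallel rose tree. As in the definition<br />
-- above, nodes carry no payload.<br />
data RTree a = RNode [RTree a]<br />
<br />
-- Count all nodes of a tree.<br />
size :: RTree a -> Int<br />
size (RNode kids) = 1 + sum (map size kids)<br />
<br />
-- Number of levels, i.e. the length of the sequential part of a traversal.<br />
depth :: RTree a -> Int<br />
depth (RNode kids) = 1 + maximum (0 : map depth kids)<br />
<br />
main :: IO ()<br />
main = do<br />
  let leaf = RNode []<br />
      tree = RNode [RNode [leaf, leaf], leaf] :: RTree ()<br />
  print (size tree, depth tree)  -- prints (5,3)<br />
</haskell><br />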
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://www.cse.unsw.edu.au/~benl/ Ben Lippmeier]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2011-08-11T12:36:52Z<p>Chak: /* Project status */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus on utilising multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
Data Parallel Haskell (DPH) is available as an add-on for [http://haskell.org/ghc/download_ghc_7_2_1 GHC 7.2] in the form of a few separate cabal packages. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used and certain idioms are avoided. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, only that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
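<br />
The strictness difference can be made tangible in plain Haskell. The sketch below contrasts an ordinary lazy list with a hypothetical list type <hask>SList</hask> whose fields are all strict; <hask>SList</hask> and <hask>sHead</hask> are invented for this illustration and only model the rule that demanding one element demands them all. They are not how parallel arrays are represented in DPH.<br />
<haskell><br />
module Main where<br />
<br />
import Control.Exception (SomeException, evaluate, try)<br />
<br />
-- A list with strict fields: building a cell evaluates its element and<br />
-- the whole remainder of the list.<br />
data SList a = SNil | SCons !a !(SList a)<br />
<br />
sHead :: SList a -> a<br />
sHead (SCons x _) = x<br />
sHead SNil        = error "sHead: empty list"<br />
<br />
main :: IO ()<br />
main = do<br />
  -- Lazy list: the undefined element is never demanded.<br />
  print (take 1 [1, undefined, 3 :: Int])<br />
  -- Strict structure: asking for the first element forces all of them.<br />
  r <- try (evaluate (sHead (SCons 1 (SCons undefined (SCons 3 SNil)))))<br />
  case r of<br />
    Left e  -> putStrLn ("every element was demanded: " ++ show (e :: SomeException))<br />
    Right x -> print (x :: Int)<br />
</haskell><br />
The first <hask>print</hask> succeeds, whereas the second computation fails on the undefined element even though only the first element was requested.<br />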
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules, each supporting one numeric type: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS_GHC -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code and <code>-fdph-par</code> selects the standard parallel DPH backend library. (This is currently the only relevant backend, but there may be others in the future.)<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = let v      = fromList [1..10]      -- convert lists...<br />
        w      = fromList [2,4..20]    -- ...to parallel arrays of equal length<br />
        result = dotp_wrapper v w      -- invoke vectorised code<br />
    in<br />
    print result                       -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded -fdph-par -rtsopts DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime and <code>-fdph-par</code> to link with the standard parallel DPH backend. We include <code>-rtsopts</code> to be able to explicitly determine the number of OS threads used to execute our code.<br />
<br />
==== Generating input data ====<br />
<br />
To see any benefit from parallel execution, a data-parallel program needs to operate on a sufficiently large data set. Hence, instead of two small constant vectors, we might want to generate some larger input data:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile and link the program as described above.<br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose. For a variant of the dot-product example code that determines the CPU time consumed by <hask>dotp_wrapper</hask>, see [[GHC/Data Parallel Haskell/MainTimed|timed dot product]].<br />
<br />
==== Parallel execution ====<br />
<br />
By default, a Haskell program uses only one OS thread, and hence, also only one CPU core for execution. To use multiple cores, we need to invoke the executable with an explicit RTS command line option — e.g., <code>./dotp +RTS -N2</code> uses two cores. (Strictly speaking, it uses two OS threads, which will be scheduled on two separate cores if available.) To determine the runtime of parallel code, measuring CPU time, as demonstrated in the [[GHC/Data Parallel Haskell/MainTimed|timed variant of the dot product example]], is not sufficient anymore. We need to measure wall clock times instead. For proper benchmarking, it is advisable to use a library, such as [http://hackage.haskell.org/package/criterion criterion].<br />
<br />
A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with just one core and to move to multiple cores only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is often bad as GHC's runtime makes no effort at optimising placement.<br />
<br />
=== Further examples and documentation ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/packages/dph/dph-examples/ examples directory of the package dph source]. This code also includes reference implementations for some of the examples that are useful for benchmarking. <br />
<br />
The interfaces of the various components of the DPH library are in the [http://hackage.haskell.org/package/dph library documentation] on Hackage (which will be uploaded with the GHC 7.2 DPH release).<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the programmer remains fully responsible for parallel algorithm design and for many performance considerations (such as whether a computation has enough parallelism to make exploiting it worthwhile).<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know the implementation details and internals of the Data Parallel Haskell project, much of this is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC user's mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://www.cse.unsw.edu.au/~benl/ Ben Lippmeier]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2011-03-31T09:46:10Z<p>Chak: /* Compiling vectorised code */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus to utilise multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.2. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, only that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
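<br />
The following is a minimal sketch, in plain Haskell on ordinary lists, of why an undirected fold needs an associative function and why it takes about log ''n'' rounds. The names <hask>pairUp</hask> and <hask>treeFold</hask> are hypothetical and not part of the DPH library; the code runs sequentially and only models the shape of the computation, not how <hask>foldP</hask> is implemented.<br />

```haskell
-- Plain-list model of an undirected fold. It runs sequentially and only
-- illustrates the shape of the computation, not the DPH implementation.
module Main (main) where

-- | Combine neighbouring elements pairwise; an odd trailing element is kept.
pairUp :: (a -> a -> a) -> [a] -> [a]
pairUp f (x : y : rest) = f x y : pairUp f rest
pairUp _ xs             = xs

-- | Reduce a list in rounds of pairwise combination. All combinations within
-- one round are independent of each other, and each round halves the length.
-- The second argument is only returned for the empty list.
treeFold :: (a -> a -> a) -> a -> [a] -> a
treeFold _ z []  = z
treeFold _ _ [x] = x
treeFold f z xs  = treeFold f z (pairUp f xs)

main :: IO ()
main = do
  -- (+) is associative, so the grouping does not matter.
  print (treeFold (+) 0 [1 .. 10 :: Int] == sum [1 .. 10])
  -- (-) is not associative, so pairwise grouping disagrees with foldl1.
  print (foldl1 (-) [1 .. 4 :: Int])
  print (treeFold (-) 0 [1 .. 4 :: Int])
```

Because the grouping of the elements is left to the implementation, only an associative function gives a result that is independent of that grouping.<br />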
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
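<br />
As a point of comparison, here is a sketch of the same function on ordinary lists, which compiles with a stock GHC and can serve as a sequential reference when checking results. The names <hask>dotpList</hask> and <hask>dotpZip</hask> are hypothetical; the two definitions mirror the comprehension form and the <hask>zipP</hask> form mentioned above.<br />

```haskell
{-# LANGUAGE ParallelListComp #-}
-- Sequential list analogue of dotp; not DPH code.
module Main (main) where

-- | Dot product written with a parallel list comprehension.
dotpList :: Num a => [a] -> [a] -> a
dotpList xs ys = sum [x * y | x <- xs | y <- ys]

-- | The same function written with an explicit zip.
dotpZip :: Num a => [a] -> [a] -> a
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]

main :: IO ()
main = do
  let v = [1, 2, 3] :: [Double]
      w = [4, 5, 6] :: [Double]
  print (dotpList v w)
  print (dotpList v w == dotpZip v w)
```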
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules, each supporting one numeric type: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS_GHC -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code and <code>-fdph-par</code> selects the standard parallel DPH backend library. (This is currently the only relevant backend, but there may be others in the future.)<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = let v      = fromList [1..10]       -- convert lists...<br />
        w      = fromList [2,4..20]     -- ...to parallel arrays<br />
        result = dotp_wrapper v w       -- invoke vectorised code<br />
    in<br />
    print result                        -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded -fdph-par -rtsopts DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime and <code>-fdph-par</code> to link with the standard parallel DPH backend. We include <code>-rtsopts</code> to be able to explicitly determine the number of OS threads used to execute our code.<br />
<br />
==== Generating input data ====<br />
<br />
To see any benefit from parallel execution, a data-parallel program needs to operate on a sufficiently large data set. Hence, instead of two small constant vectors, we might want to generate some larger input data:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do <br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile and link the program as described above.<br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose. For a variant of the dot-product example code that determines the CPU time consumed by <hask>dotp_wrapper</hask>, see [[GHC/Data Parallel Haskell/MainTimed|timed dot product]].<br />
<br />
==== Parallel execution ====<br />
<br />
By default, a Haskell program uses only one OS thread, and hence, also only one CPU core for execution. To use multiple cores, we need to invoke the executable with an explicit RTS command line option — e.g., <code>./dotp +RTS -N2</code> uses two cores. (Strictly speaking, it uses two OS threads, which will be scheduled on two separate cores if available.) To determine the runtime of parallel code, measuring CPU time, as demonstrated in the [[GHC/Data Parallel Haskell/MainTimed|timed variant of the dot product example]], is not sufficient anymore. We need to measure wall clock times instead. For proper benchmarking, it is advisable to use a library, such as [http://hackage.haskell.org/package/criterion criterion].<br />
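<br />
To illustrate the difference between the two kinds of measurement, here is a small sketch that uses only the <code>base</code> library. The helper <hask>timed</hask> is hypothetical, and the summation is a stand-in workload; a real program would call <hask>dotp_wrapper</hask> at that point. We assume a <code>base</code> version that provides <hask>GHC.Clock</hask> (4.11 or later).<br />

```haskell
-- Sketch: measuring wall-clock time next to CPU time with base only.
module Main (main) where

import Control.Exception (evaluate)
import GHC.Clock (getMonotonicTime)   -- monotonic wall-clock time in seconds
import System.CPUTime (getCPUTime)    -- CPU time of the program in picoseconds

-- | Run an action and return its result together with the elapsed
-- wall-clock seconds and the consumed CPU seconds.
timed :: IO a -> IO (a, Double, Double)
timed action = do
  wall0  <- getMonotonicTime
  cpu0   <- getCPUTime
  result <- action
  cpu1   <- getCPUTime
  wall1  <- getMonotonicTime
  let wallSecs = wall1 - wall0
      cpuSecs  = fromIntegral (cpu1 - cpu0) / 1e12
  return (result, wallSecs, cpuSecs)

main :: IO ()
main = do
  -- Stand-in workload (an assumption); evaluate forces the Int result
  -- inside the timed region.
  (s, wall, cpu) <- timed (evaluate (sum [1 .. 1000000 :: Int]))
  putStrLn ("result:             " ++ show s)
  putStrLn ("wall-clock seconds: " ++ show wall)
  putStrLn ("CPU seconds:        " ++ show cpu)
```

When several OS threads are busy at once, the CPU time can exceed the wall-clock time, which is why only the latter reflects the speed-up from parallel execution. For serious measurements a benchmarking library remains the better choice.<br />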
<br />
A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with just one core and to move to multiple cores only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is often bad as GHC's runtime makes no effort at optimising placement.<br />
<br />
=== Further examples and documentation ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/packages/dph/dph-examples/ examples directory of the dph package source]. This code also includes reference implementations for some of the examples, which are useful for benchmarking. <br />
<br />
The interfaces of the various components of the DPH library are in the [http://hackage.haskell.org/package/dph library documentation] on Hackage (which will be uploaded with the GHC 7.2 DPH release).<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language for coding parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and for many performance considerations (such as whether a computation has enough parallelism to make exploiting it worthwhile) still rests with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
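<br />
The traversal pattern described above can be sketched on ordinary lists. The type <hask>RTreeL</hask>, its payload field, and the functions <hask>sumTree</hask> and <hask>depth</hask> are hypothetical stand-ins chosen for illustration; with a parallel array in place of the list, the map over the children is the part that could run in parallel, while the recursion still descends one level at a time.<br />

```haskell
-- List-based stand-in for a parallel rose tree; not DPH code.
module Main (main) where

-- | A rose tree whose children are held in a list. The payload field is an
-- assumption added so that there is something to compute.
data RTreeL a = RNodeL a [RTreeL a]

-- | Sum all payloads. The map over the children corresponds to the
-- per-node parallel step; the recursion itself is sequential per level.
sumTree :: Num a => RTreeL a -> a
sumTree (RNodeL x kids) = x + sum (map sumTree kids)

-- | Number of levels, that is, the length of the sequential part of a traversal.
depth :: RTreeL a -> Int
depth (RNodeL _ kids) = 1 + maximum (0 : map depth kids)

main :: IO ()
main = do
  let t = RNodeL 1 [RNodeL 2 [], RNodeL 3 [RNodeL 4 []]] :: RTreeL Int
  print (sumTree t)
  print (depth t)
```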
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know the implementation details and internals of the Data Parallel Haskell project, much of this is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC user's mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://www.cse.unsw.edu.au/~benl/ Ben Lippmeier]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2011-01-25T05:50:02Z<p>Chak: /* Parallel execution */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus to utilise multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.2. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, only that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules, each supporting one numeric type: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS_GHC -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code and <code>-fdph-par</code> selects the standard parallel DPH backend library. (This is currently the only relevant backend, but there may be others in the future.)<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
= let v = fromList [1..10] -- convert lists...<br />
w = fromList [1,2..20] -- ...to parallel arrays<br />
result = dotp_wrapper v w -- invoke vectorised code<br />
in<br />
print result -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded -fdph-par -rtsopts DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime and <code>-fdph-par</code> to link with the standard parallel DPH backend. We include <code>-rtsopts</code> to be able to explicitly determine the number of OS threads used to execute our code.<br />
<br />
==== Generating input data ====<br />
<br />
To see any benefit from parallel execution, a data-parallel program needs to operate on a sufficiently large data set. Hence, instead of two small constant vectors, we might want to generate some larger input data:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
= do <br />
gen1 <- newStdGen<br />
gen2 <- newStdGen<br />
let v = randomRs n range gen1<br />
w = randomRs n range gen2<br />
print $ dotp_wrapper v w -- invoke vectorised code and print the result<br />
where<br />
n = 10000 -- vector length<br />
range = (-100, 100) -- range of vector elements<br />
</haskell><br />
We compile and link the program as described above.<br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose. For a variant of the dot-product example code that determines the CPU time consumed by <hask>dotp_wrapper</hask>, see [[GHC/Data Parallel Haskell/MainTimed|timed dot product]].<br />
<br />
==== Parallel execution ====<br />
<br />
By default, a Haskell program uses only one OS thread, and hence, also only one CPU core for execution. To use multiple cores, we need to invoke the executable with an explicit RTS command line option — e.g., <code>./dotp +RTS -N2</code> uses two cores. (Strictly speaking, it uses two OS threads, which will be scheduled on two separate cores if available.) To determine the runtime of parallel code, measuring CPU time, as demonstrated in the [[GHC/Data Parallel Haskell/MainTimed|timed variant of the dot product example]], is not sufficient anymore. We need to measure wall clock times instead. For proper benchmarking, it is advisable to use a library, such as [http://hackage.haskell.org/package/criterion criterion].<br />
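<br />
The difference between CPU time and wall-clock time can be made visible with a small helper. The following is only a sketch and not part of DPH: the name <hask>timed</hask> is hypothetical, and it assumes a <code>base</code> library recent enough to provide <hask>GHC.Clock.getMonotonicTime</hask>. When several OS threads are busy, the reported CPU time can exceed the wall-clock time, which is why CPU time alone says little about parallel speed-up.<br />

```haskell
import Control.Exception (evaluate)
import GHC.Clock (getMonotonicTime)
import System.CPUTime (getCPUTime)

-- Hypothetical helper (not part of DPH): run an action and report
-- (result, wall-clock seconds, CPU seconds).
-- The result is only forced to weak head normal form.
timed :: IO a -> IO (a, Double, Double)
timed action = do
  wall0  <- getMonotonicTime          -- monotonic wall clock, in seconds
  cpu0   <- getCPUTime                -- process CPU time, in picoseconds
  result <- action >>= evaluate
  cpu1   <- getCPUTime
  wall1  <- getMonotonicTime
  let wallSecs = wall1 - wall0
      cpuSecs  = fromIntegral (cpu1 - cpu0) / 1e12  -- picoseconds to seconds
  pure (result, wallSecs, cpuSecs)
```

A call such as <hask>timed (pure (dotp_wrapper v w))</hask> would time the dot product, provided that <hask>v</hask> and <hask>w</hask> have been fully evaluated beforehand.<br />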
<br />
A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with just one core and to move to multiple cores only for performance debugging.<br />
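<br />
This independence of the result from the degree of parallelism can be illustrated with a purely sequential list model. The sketch below is not DPH code and does not reflect how the DPH backend actually distributes work; <hask>chunksOf</hask> and <hask>dotpChunked</hask> are names made up for this illustration. It splits the input into one chunk per assumed worker and adds up the partial dot products. <hask>Integer</hask> elements are used so that regrouping the additions is exact.<br />

```haskell
-- Split a list into chunks of at most n elements (assumes n > 0).
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = let (chunk, rest) = splitAt n xs in chunk : chunksOf n rest

-- Dot product computed as a sum of per-chunk partial dot products,
-- imitating a division of the work among `workers` threads (workers > 0).
dotpChunked :: Int -> [Integer] -> [Integer] -> Integer
dotpChunked workers xs ys = sum (map partial (chunksOf size (zip xs ys)))
  where
    size    = max 1 ((length xs + workers - 1) `div` workers)
    partial = sum . map (uncurry (*))
```

For any positive number of workers, <hask>dotpChunked k xs ys</hask> yields the same value as <hask>sum (zipWith (*) xs ys)</hask>; only the grouping of the work changes.<br />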
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is often bad as GHC's runtime makes no effort at optimising placement.<br />
<br />
=== Further examples and documentation ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/packages/dph/dph-examples/ examples directory of the DPH package source]. This code also includes reference implementations for some of the examples, which are useful for benchmarking. <br />
<br />
The interfaces of the various components of the DPH library are in the [http://hackage.haskell.org/package/dph library documentation] on Hackage (which will be uploaded with the GHC 7.2 DPH release).<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language for coding parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and for many performance considerations (such as whether a computation has sufficient parallelism to make exploiting it worthwhile) remains with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
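<br />
To make the traversal pattern concrete, here is a sequential model of the rose tree that uses ordinary lists in place of parallel arrays. It is a sketch for illustration only: <hask>size</hask> and <hask>height</hask> are hypothetical functions, and in vectorised code the <hask>map</hask> and <hask>sum</hask> over the children would correspond to array operations such as <hask>mapP</hask> and <hask>sumP</hask>.<br />

```haskell
-- List-based stand-in for the rose tree above: [RTree a] replaces [:RTree a:].
data RTree a = RNode [RTree a]

-- Number of nodes. The recursion descends one level at a time; the map over
-- the children is the step that could be executed in parallel.
size :: RTree a -> Int
size (RNode children) = 1 + sum (map size children)

-- Height of the tree, i.e. the number of levels that are visited sequentially.
height :: RTree a -> Int
height (RNode children) = 1 + maximum (0 : map height children)
```

The height bounds the number of sequential steps of such a traversal, whereas the work at each level grows with the number of children, which is where the parallelism comes from.<br />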
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://www.cse.unsw.edu.au/~benl/ Ben Lippmeier]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2011-01-25T05:37:26Z<p>Chak: /* Further examples */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus on utilising multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.2. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching work in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
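<br />
The logarithmic step complexity of an undirected fold can be seen in a sequential model on lists. The sketch below is not the DPH implementation of <hask>foldP</hask>; <hask>pairUp</hask> and <hask>treeFold</hask> are hypothetical names. Each round combines neighbouring elements pairwise, and all combinations within one round are independent of each other, so they could run in parallel. Because the elements are regrouped but never reordered, the combining function has to be associative for the result to agree with a left-to-right fold.<br />

```haskell
-- One round: combine neighbouring elements pairwise.
-- All applications of f within a round are independent of each other.
pairUp :: (a -> a -> a) -> [a] -> [a]
pairUp f (x : y : rest) = f x y : pairUp f rest
pairUp _ xs             = xs

-- Tree-shaped fold; returns the result together with the number of rounds.
-- Assumes that f is associative and that z is its unit.
treeFold :: (a -> a -> a) -> a -> [a] -> (a, Int)
treeFold _ z []  = (z, 0)
treeFold f _ xs0 = go 0 xs0
  where
    go rounds [x] = (x, rounds)
    go rounds xs  = go (rounds + 1) (pairUp f xs)
```

For a list of eight elements, <hask>treeFold (+) 0 [1..8]</hask> needs three rounds, that is, log<sub>2</sub> 8, whereas a directed fold performs seven dependent additions one after the other.<br />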
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
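<br />
For readers who want to experiment without a DPH installation, the same computation can be written on ordinary lists. This is a sequential analogue, not DPH code: it uses GHC's <hask>ParallelListComp</hask> extension, and the names <hask>dotpList</hask> and <hask>dotpZip</hask> are chosen only for this illustration.<br />

```haskell
{-# LANGUAGE ParallelListComp #-}

-- Sequential list analogue of dotp, using a parallel list comprehension.
dotpList :: Num a => [a] -> [a] -> a
dotpList xs ys = sum [x * y | x <- xs | y <- ys]

-- The same computation written with an explicit zip.
dotpZip :: Num a => [a] -> [a] -> a
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]
```

Both definitions compute the same value; the parallel comprehension merely avoids the intermediate tuple pattern.<br />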
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules that support one numeric type each: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types); if a program needs to use more than one of them, they must be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS_GHC -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code and <code>-fdph-par</code> selects the standard parallel DPH backend library. (This is currently the only relevant backend, but there may be others in the future.)<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
= let v = fromList [1..10] -- convert lists...<br />
w = fromList [1,2..20] -- ...to parallel arrays<br />
result = dotp_wrapper v w -- invoke vectorised code<br />
in<br />
print result -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded -fdph-par -rtsopts DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime and <code>-fdph-par</code> to link with the standard parallel DPH backend. We include <code>-rtsopts</code> to be able to explicitly determine the number of OS threads used to execute our code.<br />
<br />
==== Generating input data ====<br />
<br />
To see any benefit from parallel execution, a data-parallel program needs to operate on a sufficiently large data set. Hence, instead of two small constant vectors, we might want to generate some larger input data:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
= do <br />
gen1 <- newStdGen<br />
gen2 <- newStdGen<br />
let v = randomRs n range gen1<br />
w = randomRs n range gen2<br />
print $ dotp_wrapper v w -- invoke vectorised code and print the result<br />
where<br />
n = 10000 -- vector length<br />
range = (-100, 100) -- range of vector elements<br />
</haskell><br />
We compile and link the program as described above.<br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose. For a variant of the dot-product example code that determines the CPU time consumed by <hask>dotp_wrapper</hask>, see [[GHC/Data Parallel Haskell/MainTimed|timed dot product]].<br />
<br />
==== Parallel execution ====<br />
<br />
By default, a Haskell program uses only one OS thread, and hence, also only one CPU core for execution. To use multiple cores, we need to invoke the executable with an explicit RTS command line option — e.g., <code>./dotp +RTS -N2</code> uses two cores. (Strictly speaking, it uses two OS threads, which will be scheduled on two separate cores if available.)<br />
<br />
A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with just one core and to move to multiple cores only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is often bad as GHC's runtime makes no effort at optimising placement.<br />
<br />
=== Further examples and documentation ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/packages/dph/dph-examples/ examples directory of the DPH package source]. This code also includes reference implementations for some of the examples, which are useful for benchmarking. <br />
<br />
The interfaces of the various components of the DPH library are in the [http://hackage.haskell.org/package/dph library documentation] on Hackage (which will be uploaded with the GHC 7.2 DPH release).<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language for coding parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and for many performance considerations (such as whether a computation has sufficient parallelism to make exploiting it worthwhile) remains with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://www.cse.unsw.edu.au/~benl/ Ben Lippmeier]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2011-01-25T05:27:48Z<p>Chak: /* Parallel execution */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus on utilising multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.2. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
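<br />
The logarithmic step count can be illustrated with ordinary lists. The following is only a sketch of the idea in plain Haskell, not the way DPH implements <hask>foldP</hask>; the names <hask>pairUp</hask> and <hask>treeFold</hask> are made up for this illustration. Each pass combines adjacent pairs, and the combinations within one pass are independent of each other, so a parallel implementation could perform a whole pass in a single step. Roughly log ''n'' passes reduce ''n'' elements to one.<br />
<haskell><br />
-- Combine adjacent pairs; an odd trailing element is carried over unchanged.<br />
pairUp :: (a -> a -> a) -> [a] -> [a]<br />
pairUp f (x : y : rest) = f x y : pairUp f rest<br />
pairUp _ xs             = xs<br />
<br />
-- Reduce a list by repeated pairwise combination (a balanced reduction tree).<br />
-- The second argument is only returned for the empty list.<br />
treeFold :: (a -> a -> a) -> a -> [a] -> a<br />
treeFold _ z []  = z<br />
treeFold _ _ [x] = x<br />
treeFold f z xs  = treeFold f z (pairUp f xs)<br />
<br />
main :: IO ()<br />
main = do<br />
  print (treeFold (+) 0 [1 .. 10 :: Int])  -- (+) is associative: agrees with sum<br />
  print (treeFold (-) 0 [1 .. 4 :: Int])   -- (-) is not: differs from foldl (-) 0<br />
</haskell><br />
Because the tree-shaped grouping differs from that of a left-to-right fold, the result is only well defined when the operator is associative, which is why <hask>foldP</hask> demands it.<br />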
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
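<br />
For readers who have not used parallel comprehensions before, the equivalence of the two forms can be tried out on ordinary lists with a stock GHC. This is a list-based analogue only: it runs sequentially and uses none of the DPH machinery, and the names <hask>dotpComp</hask> and <hask>dotpZip</hask> are chosen just for this sketch.<br />
<haskell><br />
{-# LANGUAGE ParallelListComp #-}<br />
<br />
-- Dot product via a parallel list comprehension (both lists are traversed in lock-step).<br />
dotpComp :: [Double] -> [Double] -> Double<br />
dotpComp xs ys = sum [x * y | x <- xs | y <- ys]<br />
<br />
-- The same computation with an explicit zip.<br />
dotpZip :: [Double] -> [Double] -> Double<br />
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]<br />
<br />
main :: IO ()<br />
main = do<br />
  print (dotpComp [1, 2, 3] [4, 5, 6])<br />
  print (dotpZip  [1, 2, 3] [4, 5, 6])<br />
</haskell><br />
Both calls print the same value, since the parallel comprehension pairs up the elements exactly as <hask>zip</hask> does.<br />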
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules, each supporting one numeric type: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types), so a program that uses more than one of them needs to import them qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS_GHC -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code and <code>-fdph-par</code> selects the standard parallel DPH backend library. (This is currently the only relevant backend, but there may be others in the future.)<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = let v      = fromList [1..10]    -- convert lists...<br />
        w      = fromList [1,2..20]  -- ...to parallel arrays<br />
        result = dotp_wrapper v w    -- invoke vectorised code<br />
    in<br />
    print result                     -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded -fdph-par -rtsopts DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime and <code>-fdph-par</code> to link with the standard parallel DPH backend. We include <code>-rtsopts</code> to be able to explicitly determine the number of OS threads used to execute our code.<br />
<br />
==== Generating input data ====<br />
<br />
To see any benefit from parallel execution, a data-parallel program needs to operate on a sufficiently large data set. Hence, instead of two small constant vectors, we might want to generate some larger input data:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile and link the program as described above.<br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose. For a variant of the dot-product example code that determines the CPU time consumed by <hask>dotp_wrapper</hask>, see [[GHC/Data Parallel Haskell/MainTimed|timed dot product]].<br />
<br />
==== Parallel execution ====<br />
<br />
By default, a Haskell program uses only one OS thread, and hence, also only one CPU core for execution. To use multiple cores, we need to invoke the executable with an explicit RTS command line option — e.g., <code>./dotp +RTS -N2</code> uses two cores. (Strictly speaking, it uses two OS threads, which will be scheduled on two separate cores if available.)<br />
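<br />
To check that the <code>-N</code> option took effect, a non-vectorised main module can ask the runtime how many capabilities it is using. The following is a minimal sketch that assumes a version of the base library in which <hask>Control.Concurrent</hask> exports <hask>getNumCapabilities</hask>; the helper <hask>describe</hask> is a name made up for this example.<br />
<haskell><br />
import Control.Concurrent (getNumCapabilities)<br />
<br />
-- Format the capability count for display.<br />
describe :: Int -> String<br />
describe n = "running on " ++ show n ++ " capabilities"<br />
<br />
main :: IO ()<br />
main = do<br />
  n <- getNumCapabilities   -- reflects the value passed via +RTS -N<br />
  putStrLn (describe n)<br />
</haskell><br />
The program must be linked with <code>-threaded</code> and <code>-rtsopts</code>, as described above, for the RTS option to be accepted.<br />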
<br />
A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with just one core and to move to multiple cores only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is often bad as GHC's runtime makes no effort at optimising placement.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as when does a computation have sufficient parallelism to make it worthwhile to exploit that parallelism) are still with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
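<br />
The shape of such a traversal can be sketched with ordinary lists standing in for parallel arrays. This is an illustration in plain Haskell under two assumptions that are not part of the definition above: each node carries a payload, and the children are held in a list. In DPH code, the list of children would be a parallel array and <hask>sum</hask> and <hask>map</hask> would be replaced by <hask>sumP</hask> and <hask>mapP</hask>.<br />
<haskell><br />
-- List-based stand-in for a parallel rose tree (payload added for illustration).<br />
data RTree a = RNode a [RTree a]<br />
<br />
-- Sum all payloads: the recursion descends level by level, while the<br />
-- children of one node are processed by a single aggregate operation.<br />
sumTree :: RTree Int -> Int<br />
sumTree (RNode x ts) = x + sum (map sumTree ts)<br />
<br />
main :: IO ()<br />
main = print (sumTree (RNode 1 [RNode 2 [], RNode 3 [RNode 4 []]]))<br />
</haskell><br />
The only sequential dependency is the depth of the tree; the work on the children of any one node is expressed as a single aggregate operation over all of them.<br />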
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know the implementation details and internals of the Data Parallel Haskell project, much of this is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://www.cse.unsw.edu.au/~benl/ Ben Lippmeier]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2011-01-25T04:54:44Z<p>Chak: /* Using vectorised code */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus on utilising multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.2. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules, each supporting one numeric type: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types), so a program that uses more than one of them needs to import them qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS_GHC -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code and <code>-fdph-par</code> selects the standard parallel DPH backend library. (This is currently the only relevant backend, but there may be others in the future.)<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
= let v = fromList [1..10] -- convert lists...<br />
w = fromList [1,2..20] -- ...to parallel arrays<br />
result = dotp_wrapper v w -- invoke vectorised code<br />
in<br />
print result -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded -fdph-par -rtsopts DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime and <code>-fdph-par</code> to link with the standard parallel DPH backend. We include <code>-rtsopts</code> to be able to explicitly determine the number of OS threads used to execute our code.<br />
<br />
==== Generating input data ====<br />
<br />
To see any benefit from parallel execution, a data-parallel program needs to operate on a sufficiently large data set. Hence, instead of two small constant vectors, we might want to generate some larger input data:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
= do <br />
gen1 <- newStdGen<br />
gen2 <- newStdGen<br />
let v = randomRs n range gen1<br />
w = randomRs n range gen2<br />
print $ dotp_wrapper v w -- invoke vectorised code and print the result<br />
where<br />
n = 10000 -- vector length<br />
range = (-100, 100) -- range of vector elements<br />
</haskell><br />
We compile and link the program as described above.<br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose. For a variant of the dot-product example code that determines the CPU time consumed by <hask>dotp_wrapper</hask>, see [[GHC/Data Parallel Haskell/MainTimed|timed dot product]].<br />
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocations, we used the option <code>-fdph-par</code> to select the multi-threaded <code>dph-par</code> flavour. We can instead compile with <code>-fdph-seq</code> to select the sequential <code>dph-seq</code> flavour. By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute a program built with <code>dph-par</code>. A beautiful property of DPH is that the number of threads used to execute a program affects only its performance, not its result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
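<br />
As a rough illustration of this property, the following sketch models it with ordinary lists instead of parallel arrays. The names <hask>chunksOf</hask> and <hask>chunkedDotp</hask> are hypothetical and not part of DPH; the chunks merely stand in for the work assigned to each thread, and <hask>Integer</hask> elements are used so that the comparison is exact.<br />
<haskell><br />
-- Split a list into consecutive chunks of at most k elements (k >= 1).<br />
chunksOf :: Int -> [a] -> [[a]]<br />
chunksOf _ [] = []<br />
chunksOf k xs = let (h, t) = splitAt k xs in h : chunksOf k t<br />
<br />
-- Dot product computed as at most 'parts' independent partial sums,<br />
-- which are then combined. Plain lists stand in for parallel arrays.<br />
chunkedDotp :: Int -> [Integer] -> [Integer] -> Integer<br />
chunkedDotp parts xs ys = sum (map sum (chunksOf size products))<br />
  where<br />
    products = zipWith (*) xs ys<br />
    p        = max 1 parts<br />
    size     = max 1 ((length products + p - 1) `div` p)  -- ceiling division<br />
</haskell><br />
Under these assumptions, <hask>chunkedDotp p [1..10] [1..10]</hask> evaluates to <hask>385</hask> for every choice of <hask>p</hask>; only the shape of the computation changes.<br />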
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language for coding parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and for many performance considerations (such as whether a computation has enough parallelism to make exploiting it worthwhile) remains with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
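<br />
The traversal pattern described here can be sketched with ordinary lists standing in for parallel arrays. This is only a model under that assumption: the functions <hask>treeSize</hask> and <hask>treeDepth</hask> are hypothetical, and in DPH the <hask>map</hask> over the children would be the place where <hask>mapP</hask> introduces parallelism.<br />
<haskell><br />
-- List-based stand-in for the parallel rose tree; [RTree a] models [:RTree a:].<br />
data RTree a = RNode [RTree a]<br />
<br />
-- Number of nodes: the recursion descends level by level (sequential),<br />
-- while the children of one node are processed by a single 'map'.<br />
treeSize :: RTree a -> Int<br />
treeSize (RNode ts) = 1 + sum (map treeSize ts)<br />
<br />
-- Number of levels in the tree; a node without children has depth 1.<br />
treeDepth :: RTree a -> Int<br />
treeDepth (RNode ts) = 1 + maximum (0 : map treeDepth ts)<br />
<br />
-- A small tree: a root with a leaf and an inner node that has two leaves.<br />
example :: RTree ()<br />
example = RNode [RNode [], RNode [RNode [], RNode []]]<br />
</haskell><br />
For this <hask>example</hask>, <hask>treeSize</hask> yields <hask>5</hask> and <hask>treeDepth</hask> yields <hask>3</hask>.<br />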
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know the implementation details and internals of the Data Parallel Haskell project, much of this is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://www.cse.unsw.edu.au/~benl/ Ben Lippmeier]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_Haskell/MainTimedGHC/Data Parallel Haskell/MainTimed2011-01-25T04:34:57Z<p>Chak: New page: The following variant of the main module for the dot product example determines and prints the runtime of the dot product kernel in microseconds. <haskell> import System.CPUTime (getCPUTim...</p>
<hr />
<div>The following variant of the main module for the dot product example determines and prints the runtime of the dot product kernel in microseconds.<br />
<haskell><br />
import System.CPUTime (getCPUTime)<br />
import System.Random (newStdGen)<br />
import Control.Exception (evaluate)<br />
import Data.Array.Parallel.PArray (PArray, randomRs, nf)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
= do <br />
-- generate random input vectors<br />
gen1 <- newStdGen<br />
gen2 <- newStdGen<br />
let v = randomRs n range gen1<br />
w = randomRs n range gen2<br />
<br />
-- force the evaluation of the input vectors<br />
evaluate $ nf v<br />
evaluate $ nf w<br />
<br />
-- timed computations<br />
start <- getCPUTime<br />
let result = dotp_wrapper v w<br />
evaluate result<br />
end <- getCPUTime<br />
<br />
-- print the result<br />
putStrLn $ show result ++ " in " ++ show ((end - start) `div` 1000000) ++ "us"<br />
where<br />
n = 10000 -- vector length<br />
range = (-100, 100) -- range of vector elements<br />
</haskell></div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2011-01-25T04:32:51Z<p>Chak: /* Generating input data */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus on utilising multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.2. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, only that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
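<br />
To see where the logarithmic step count comes from, consider the following list-based sketch. It is not how <hask>foldP</hask> is implemented; <hask>pairwise</hask> and <hask>treeFold</hask> are hypothetical names, and each round of <hask>pairwise</hask> models one parallel step in which all neighbouring pairs could be combined at the same time. Regrouping the operands in this way only preserves the result if the combining function is associative.<br />
<haskell><br />
-- One round: combine neighbouring elements pairwise, keeping their order.<br />
-- A trailing element without a partner is carried over unchanged.<br />
pairwise :: (a -> a -> a) -> [a] -> [a]<br />
pairwise f (x : y : rest) = f x y : pairwise f rest<br />
pairwise _ xs             = xs<br />
<br />
-- Tree-shaped fold that also counts the rounds it needs.<br />
-- 'z' is only returned for the empty list.<br />
treeFold :: (a -> a -> a) -> a -> [a] -> (a, Int)<br />
treeFold f z = go 0<br />
  where<br />
    go n []  = (z, n)<br />
    go n [x] = (x, n)<br />
    go n xs  = go (n + 1) (pairwise f xs)<br />
</haskell><br />
For instance, <hask>treeFold (+) 0 [1..8]</hask> yields <hask>(36,3)</hask>: eight elements are reduced in three rounds.<br />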
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
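<br />
The same equivalence can be tried out on ordinary lists, where GHC's <hask>ParallelListComp</hask> extension provides the corresponding comprehension syntax. The following is a list-based sketch only; <hask>dotpPar</hask> and <hask>dotpZip</hask> are hypothetical names, and <hask>sum</hask> and <hask>zip</hask> take the place of <hask>sumP</hask> and <hask>zipP</hask>.<br />
<haskell><br />
{-# LANGUAGE ParallelListComp #-}<br />
<br />
-- Dot product with a parallel list comprehension: xs and ys are<br />
-- traversed in lockstep, as in the array comprehension above.<br />
dotpPar :: [Double] -> [Double] -> Double<br />
dotpPar xs ys = sum [x * y | x <- xs | y <- ys]<br />
<br />
-- The same computation written with an explicit zip.<br />
dotpZip :: [Double] -> [Double] -> Double<br />
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]<br />
</haskell><br />
Both <hask>dotpPar [1,2,3] [4,5,6]</hask> and <hask>dotpZip [1,2,3] [4,5,6]</hask> evaluate to <hask>32.0</hask>.<br />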
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules, each supporting one numeric type: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types); if a program needs more than one of them, they must be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS_GHC -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code and <code>-fdph-par</code> selects the standard parallel DPH backend library. (This is currently the only relevant backend, but there may be others in the future.)<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
= let v = fromList [1..10] -- convert lists...<br />
w = fromList [1,2..20] -- ...to parallel arrays<br />
result = dotp_wrapper v w -- invoke vectorised code<br />
in<br />
print result -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded -fdph-par DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime and <code>-fdph-par</code> to link with the standard parallel DPH backend.<br />
<br />
==== Generating input data ====<br />
<br />
To see any benefit from parallel execution, a data-parallel program needs to operate on a sufficiently large data set. Hence, instead of two small constant vectors, we might want to generate some larger input data:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
= do <br />
gen1 <- newStdGen<br />
gen2 <- newStdGen<br />
let v = randomRs n range gen1<br />
w = randomRs n range gen2<br />
print $ dotp_wrapper v w -- invoke vectorised code and print the result<br />
where<br />
n = 10000 -- vector length<br />
range = (-100, 100) -- range of vector elements<br />
</haskell><br />
We compile and link the program as described above.<br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose. For a variant of the dot-product example code that determines the CPU time consumed by <hask>dotp_wrapper</hask>, see [[GHC/Data Parallel Haskell/MainTimed|timed dot product]].<br />
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocations, we used the option <code>-fdph-par</code> to select the multi-threaded <code>dph-par</code> flavour. We can instead compile with <code>-fdph-seq</code> to select the sequential <code>dph-seq</code> flavour. By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute a program built with <code>dph-par</code>. A beautiful property of DPH is that the number of threads used to execute a program affects only its performance, not its result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language for coding parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and for many performance considerations (such as whether a computation has enough parallelism to make exploiting it worthwhile) remains with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://www.cse.unsw.edu.au/~benl/ Ben Lippmeier]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2011-01-25T04:30:07Z<p>Chak: /* Generating input data */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus on utilising multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.2. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
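<br />
The associativity requirement can be illustrated without DPH. The following is a minimal sketch in plain Haskell that uses ordinary lists as a stand-in for parallel arrays; the names <hask>pairUp</hask> and <hask>treeFold</hask> are hypothetical and not part of the DPH library. Each round combines neighbouring elements, so an input of length ''n'' is reduced in roughly log ''n'' rounds, and all combinations within one round are independent of each other.<br />

```haskell
-- Combine neighbouring elements pairwise; a leftover element at the end is kept.
pairUp :: (a -> a -> a) -> [a] -> [a]
pairUp f (x : y : rest) = f x y : pairUp f rest
pairUp _ xs             = xs

-- Reduce a list by repeated pairwise combination.
-- Returns the final value together with the number of rounds that were needed.
-- The second argument is only used as the result for an empty list.
treeFold :: (a -> a -> a) -> a -> [a] -> (a, Int)
treeFold _ z []  = (z, 0)
treeFold f _ xs0 = go 0 xs0
  where
    go rounds [x] = (x, rounds)
    go rounds xs  = go (rounds + 1) (pairUp f xs)
```

With an associative function such as <hask>(+)</hask>, <hask>treeFold</hask> agrees with a sequential fold. With a non-associative function such as <hask>(-)</hask>, the two generally differ, which is why an undirected fold cannot accept arbitrary functions.<br />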
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
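<br />
For readers unfamiliar with parallel comprehensions, the same two formulations can be tried on ordinary lists in stock GHC. This is only a sketch of the list analogue: <hask>dotpList</hask> and <hask>dotpListZip</hask> are hypothetical names, and the code runs sequentially on lists rather than on parallel arrays.<br />

```haskell
{-# LANGUAGE ParallelListComp #-}

-- List analogue of dotp, written with a parallel list comprehension:
-- the two generators are drawn from in lock step.
dotpList :: Num a => [a] -> [a] -> a
dotpList xs ys = sum [x * y | x <- xs | y <- ys]

-- The same function written with an explicit zip.
dotpListZip :: Num a => [a] -> [a] -> a
dotpListZip xs ys = sum [x * y | (x, y) <- zip xs ys]
```

Both definitions pair up elements at equal positions, in contrast to a comprehension with two ordinary generators, which would range over all combinations.<br />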
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
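<br />
To give a rough intuition for what turning nested into flat data parallelism means, the following sketch in plain Haskell stores a nested collection as one flat data vector plus a list of segment lengths, and then sums each segment. It uses ordinary lists, and the names <hask>Segmented</hask>, <hask>flatten</hask>, and <hask>sumSegments</hask> are hypothetical; the representation that GHC's vectoriser actually uses is more involved and is not shown here.<br />

```haskell
-- A nested collection stored flat: segment lengths plus all elements in order.
data Segmented a = Segmented
  { segLengths :: [Int]  -- length of each inner collection
  , segData    :: [a]    -- concatenation of all inner collections
  } deriving (Eq, Show)

-- Flatten a nested list into the segmented representation.
flatten :: [[a]] -> Segmented a
flatten xss = Segmented (map length xss) (concat xss)

-- Sum every segment by walking the flat data according to the segment lengths.
sumSegments :: Num a => Segmented a -> [a]
sumSegments (Segmented lens xs) = go lens xs
  where
    go []       _    = []
    go (n : ns) rest =
      let (seg, rest') = splitAt n rest
      in sum seg : go ns rest'
```

The point of such a representation is that the total amount of work is visible in the length of the flat data, independently of how unevenly it is distributed over the segments.<br />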
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules to support one numeric type each: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS_GHC -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code and <code>-fdph-par</code> selects the standard parallel DPH backend library. (This is currently the only relevant backend, but there may be others in the future.)<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = let v      = fromList [1..10]    -- convert lists...<br />
        w      = fromList [2,4..20]  -- ...to parallel arrays<br />
        result = dotp_wrapper v w    -- invoke vectorised code<br />
    in<br />
    print result                     -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded -fdph-par DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime and <code>-fdph-par</code> to link with the standard parallel DPH backend.<br />
<br />
==== Generating input data ====<br />
<br />
To see any benefit from parallel execution, a data-parallel program needs to operate on a sufficiently large data set. Hence, instead of two small constant vectors, we might want to generate some larger input data:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile and link the program as described above.<br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking, as we would not want to measure the purely sequential random number generation (which dominates this simple program). For benchmarking, we need to guarantee that the generated vectors are fully evaluated before starting the timer. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose. A variant of the dot-product example code that determines the CPU time consumed by <hask>dotp_wrapper</hask> is at [[GHC/Data Parallel Haskell/MainTimed]].<br />
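<br />
The general pattern of forcing the input before timing can be sketched with functions from <code>base</code> alone. The code below does not use DPH or <hask>nf</hask>: it works on ordinary lists, <hask>dotpSeq</hask> merely stands in for <hask>dotp_wrapper</hask>, and all function names are hypothetical.<br />

```haskell
import Control.Exception (evaluate)
import Data.List (foldl')
import System.CPUTime (getCPUTime)

-- Force every element of a list of Doubles by summing it strictly.
forceDoubles :: [Double] -> IO ()
forceDoubles xs = do
  _ <- evaluate (foldl' (+) 0 xs)
  return ()

-- Run an action and report the CPU time it consumed, in picoseconds.
timeCPU :: IO a -> IO (a, Integer)
timeCPU action = do
  start  <- getCPUTime
  result <- action
  end    <- getCPUTime
  return (result, end - start)

-- Sequential list-based dot product standing in for dotp_wrapper.
dotpSeq :: [Double] -> [Double] -> Double
dotpSeq xs ys = foldl' (+) 0 (zipWith (*) xs ys)

-- Build the inputs, force them, and only then time the dot product.
timedDotp :: Int -> IO (Double, Integer)
timedDotp n = do
  let v = map fromIntegral [1 .. n]
      w = map fromIntegral [1 .. n]
  forceDoubles v
  forceDoubles w
  timeCPU (evaluate (dotpSeq v w))
```

Without the two calls to <hask>forceDoubles</hask>, laziness would defer the construction of the input lists until the dot product demands them, so their cost would be included in the measurement.<br />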
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocations, we used the option <code>-fdph-par</code> to select the <code>dph-par</code> flavour, which generates multi-threaded code. We can just as well compile and link with <code>-fdph-seq</code> to select the sequential <code>dph-seq</code> flavour instead. By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute the multi-threaded program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language for coding parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as whether a computation has sufficient parallelism to make exploiting it worthwhile) still rests with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
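<br />
The level-by-level traversal can be sketched in plain Haskell with lists in place of parallel arrays. The type below is a hypothetical list-based analogue that also stores a payload at every node, which is an assumption made for the sake of the example and not part of the definition above. In a DPH version, the children would be held in a parallel array, and the aggregate operations on one level are the places where parallelism could be exploited.<br />

```haskell
-- List-based analogue of a rose tree, with a payload at every node.
data RoseTree a = RoseNode a [RoseTree a]

-- Sum the payloads level by level. The recursion over levels is sequential;
-- the work within one level consists of aggregate operations over all of
-- that level's nodes at once.
sumByLevel :: Num a => RoseTree a -> [a]
sumByLevel root = go [root]
  where
    go []    = []
    go level =
      let payloads = [x | RoseNode x _ <- level]
          children = concat [cs | RoseNode _ cs <- level]
      in sum payloads : go children
```

The number of sequential steps is the depth of the tree, whereas the amount of work available in each step is the width of the corresponding level.<br />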
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://www.cse.unsw.edu.au/~benl/ Ben Lippmeier]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2011-01-25T03:41:06Z<p>Chak: /* Project status */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus on utilising multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.2. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules to support one numeric type each: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS_GHC -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code and <code>-fdph-par</code> selects the standard parallel DPH backend library. (This is currently the only relevant backend, but there may be others in the future.)<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = let v      = fromList [1..10]    -- convert lists...<br />
        w      = fromList [1,2..20]  -- ...to parallel arrays<br />
        result = dotp_wrapper v w    -- invoke vectorised code<br />
    in<br />
    print result                     -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded -fdph-par DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime and <code>-fdph-par</code> to link with the standard parallel DPH backend.<br />
<br />
==== Generating input data ====<br />
<br />
To see any benefit from parallel execution, a data-parallel program needs to operate on a sufficiently large data set. Hence, instead of two small constant vectors, we might want to generate some larger input data:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile and link the program as described above.<br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
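<br />
The pattern of forcing the input before starting the clock can be sketched in plain Haskell without DPH. This is only an illustration under stated assumptions: ordinary lists stand in for <hask>PArray</hask>, a strict sum stands in for <hask>nf</hask>, and <hask>pseudoRandoms</hask>, <hask>forceList</hask>, and <hask>dotp</hask> are hypothetical helpers that are not part of any DPH module:<br />
<haskell><br />
import Control.Exception (evaluate)<br />
import Data.List (foldl')<br />
import System.CPUTime (getCPUTime)<br />
<br />
-- Hypothetical stand-in for random input: a simple linear congruential<br />
-- sequence, scaled into the range -100..100.<br />
pseudoRandoms :: Int -> Int -> [Double]<br />
pseudoRandoms n seed = take n (map scale (tail (iterate step seed)))<br />
  where<br />
    step s  = (s * 1103515245 + 12345) `mod` 2147483648<br />
    scale s = fromIntegral (s `mod` 201) - 100<br />
<br />
-- Force every element of the list by summing it strictly.<br />
forceList :: [Double] -> IO ()<br />
forceList xs = evaluate (foldl' (+) 0 xs) >> return ()<br />
<br />
-- Sequential list-based dot product, standing in for the vectorised code.<br />
dotp :: [Double] -> [Double] -> Double<br />
dotp xs ys = foldl' (+) 0 (zipWith (*) xs ys)<br />
<br />
main :: IO ()<br />
main = do<br />
  let v = pseudoRandoms 10000 1<br />
      w = pseudoRandoms 10000 2<br />
  forceList v                        -- input generation happens here...<br />
  forceList w                        -- ...and is excluded from the timing<br />
  start  <- getCPUTime<br />
  result <- evaluate (dotp v w)      -- only this computation is timed<br />
  end    <- getCPUTime<br />
  putStrLn ("result: " ++ show result)<br />
  putStrLn ("CPU time in picoseconds: " ++ show (end - start))<br />
</haskell><br />
Without the two <hask>forceList</hask> calls, laziness would defer the generation of <hask>v</hask> and <hask>w</hask> until <hask>dotp</hask> demands them, so the measured interval would include the sequential input generation.<br />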
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocations, we used the option <code>-fdph-par</code> to select the multi-threaded <code>dph-par</code> flavour. We can just as well compile and link with <code>-fdph-seq</code> to select the <code>dph-seq</code> flavour and generate single-threaded code. With the <code>dph-par</code> build, invoking <code>./dotp +RTS -N2</code> uses two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language for coding parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and for many performance considerations (such as whether a computation has enough parallelism to make exploiting it worthwhile) still rests with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
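<br />
The traversal pattern can be tried out in plain Haskell. The following is a minimal sketch under stated assumptions: ordinary lists stand in for the parallel array of children, each node carries a label (which the definition above does not have), and <hask>sumTree</hask>, <hask>depth</hask>, and <hask>example</hask> are hypothetical names chosen for illustration:<br />
<haskell><br />
-- List-based stand-in: in DPH the children would be held in [:RTree a:].<br />
data RTree a = RNode a [RTree a]<br />
<br />
-- Sum of all labels. The recursion descends sequentially, level by level;<br />
-- the map over the children is the step that DPH could run in parallel.<br />
sumTree :: RTree Int -> Int<br />
sumTree (RNode x kids) = x + sum (map sumTree kids)<br />
<br />
-- Number of levels in the tree, i.e. the length of the sequential part.<br />
depth :: RTree a -> Int<br />
depth (RNode _ kids) = 1 + maximum (0 : map depth kids)<br />
<br />
example :: RTree Int<br />
example = RNode 1 [RNode 2 [RNode 4 []], RNode 3 []]<br />
</haskell><br />
Here <hask>sumTree example</hask> is <hask>10</hask> and <hask>depth example</hask> is <hask>3</hask>. In a DPH version, <hask>map</hask> and <hask>sum</hask> would be replaced by their array counterparts <hask>mapP</hask> and <hask>sumP</hask>, so the work available in parallel grows with the width of the tree while the number of sequential steps grows with its depth.<br />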
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know the implementation details and internals of the Data Parallel Haskell project, much of this is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://www.cse.unsw.edu.au/~benl/ Ben Lippmeier]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2011-01-25T03:23:18Z<p>Chak: /* Using vectorised code */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus to utilise multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.0. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching work in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) they are not inductively defined. Parallel arrays are strict in that demanding a single element demands all elements of the array. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions).<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
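<br />
The logarithmic step count can be illustrated without DPH. The following is a minimal sketch in plain Haskell, using ordinary lists as a stand-in for parallel arrays; <hask>pairUp</hask> and <hask>treeFold</hask> are hypothetical helper names, not part of the DPH library, and the sketch makes no claim about how <hask>foldP</hask> is actually implemented. Each round combines neighbouring elements independently of all other pairs, so with enough processors a round is one parallel step, and an array of length ''n'' needs about log<sub>2</sub> ''n'' rounds:<br />
<haskell><br />
-- Combine neighbouring elements pairwise; one call models one parallel step.<br />
pairUp :: (a -> a -> a) -> [a] -> [a]<br />
pairUp f (x : y : rest) = f x y : pairUp f rest<br />
pairUp _ xs             = xs<br />
<br />
-- Reduce with an operator f (assumed associative) and a unit z, returning<br />
-- the result together with the number of pairwise rounds that were needed.<br />
treeFold :: (a -> a -> a) -> a -> [a] -> (a, Int)<br />
treeFold _ z []  = (z, 0)<br />
treeFold _ _ [x] = (x, 0)<br />
treeFold f z xs  = let (r, n) = treeFold f z (pairUp f xs) in (r, n + 1)<br />
</haskell><br />
Under these definitions, <hask>treeFold (+) 0 [1..8]</hask> evaluates to <hask>(36, 3)</hask>: eight elements are reduced in three rounds. With a non-associative operator such as <hask>(-)</hask>, the pairwise grouping gives <hask>0</hask> for the same input, whereas <hask>foldl (-) 0 [1..8]</hask> gives <hask>-36</hask>. This is why an undirected fold is only well defined for associative functions.<br />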
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
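<br />
For readers who want to experiment without a DPH installation, the same two formulations can be written over ordinary lists. This is a sketch only: it runs sequentially, and the names <hask>dotpComp</hask> and <hask>dotpZip</hask> are hypothetical. It relies on GHC's <hask>ParallelListComp</hask> extension for the first variant:<br />
<haskell><br />
{-# LANGUAGE ParallelListComp #-}<br />
<br />
-- List analogue of dotp, using a parallel list comprehension.<br />
dotpComp :: [Double] -> [Double] -> Double<br />
dotpComp xs ys = sum [x * y | x <- xs | y <- ys]<br />
<br />
-- The same computation written with an explicit zip.<br />
dotpZip :: [Double] -> [Double] -> Double<br />
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]<br />
</haskell><br />
Both functions return <hask>32.0</hask> for the inputs <hask>[1, 2, 3]</hask> and <hask>[4, 5, 6]</hask>. For lists, both variants stop at the end of the shorter input.<br />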
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules, each supporting one numeric type: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types), and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS_GHC -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code and <code>-fdph-par</code> selects the standard parallel DPH backend library. (This is currently the only relevant backend, but there may be others in the future.)<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = let v      = fromList [1..10]    -- convert lists...<br />
        w      = fromList [1,2..20]  -- ...to parallel arrays<br />
        result = dotp_wrapper v w    -- invoke vectorised code<br />
    in<br />
    print result                     -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded -fdph-par DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime and <code>-fdph-par</code> to link with the standard parallel DPH backend.<br />
<br />
==== Generating input data ====<br />
<br />
To see any benefit from parallel execution, a data-parallel program needs to operate on a sufficiently large data set. Hence, instead of two small constant vectors, we might want to generate some larger input data:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile and link the program as described above.<br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocations, we used the option <code>-fdph-par</code> to select the multi-threaded <code>dph-par</code> flavour. We can just as well compile and link with <code>-fdph-seq</code> to select the <code>dph-seq</code> flavour and generate single-threaded code. With the <code>dph-par</code> build, invoking <code>./dotp +RTS -N2</code> uses two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language for coding parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and for many performance considerations (such as whether a computation has enough parallelism to make exploiting it worthwhile) still rests with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know the implementation details and internals of the Data Parallel Haskell project, much of this is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://www.cse.unsw.edu.au/~benl/ Ben Lippmeier]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2011-01-25T03:15:11Z<p>Chak: /* Compiling vectorised code */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus on utilising multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.0. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions).<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
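<br />
The role of associativity can be illustrated on ordinary lists. The following is only a sketch of a tree-shaped reduction, not the DPH implementation of <hask>foldP</hask>; the names <hask>pairUp</hask> and <hask>treeFold</hask> are hypothetical. Each round combines neighbouring elements, so the number of rounds grows logarithmically with the length, and the result agrees with a sequential fold only if the function is associative.<br />

```haskell
-- Plain-list model of an undirected (tree-shaped) fold.
-- Hypothetical helper names; this is not part of the DPH library.

-- | Combine neighbouring elements pairwise: one "parallel step".
pairUp :: (a -> a -> a) -> [a] -> [a]
pairUp f (x : y : rest) = f x y : pairUp f rest
pairUp _ xs             = xs   -- zero or one element left over

-- | Reduce a list by repeated pairwise combination.
-- Returns the result together with the number of rounds taken.
-- The default value 'z' is only used for the empty list.
treeFold :: (a -> a -> a) -> a -> [a] -> (a, Int)
treeFold _ z [] = (z, 0)
treeFold f _ xs = go 0 xs
  where
    go rounds [x] = (x, rounds)
    go rounds ys  = go (rounds + 1) (pairUp f ys)
```

With an associative function such as <hask>(+)</hask>, <hask>treeFold</hask> yields the same value as <hask>foldr</hask> or <hask>foldl</hask>; with a non-associative one such as <hask>(-)</hask>, the grouping changes the result, which is why <hask>foldP</hask> requires associativity.<br />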
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
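<br />
For readers without a DPH-enabled GHC, the two formulations can be compared on ordinary lists using GHC's <hask>ParallelListComp</hask> extension. This is a list analogue only (it runs sequentially and is not DPH code), and the names <hask>dotpComp</hask> and <hask>dotpZip</hask> are hypothetical.<br />

```haskell
{-# LANGUAGE ParallelListComp #-}

-- List analogue of the parallel-array dot product (sequential, illustrative only).

-- | Dot product written with a parallel list comprehension.
dotpComp :: [Double] -> [Double] -> Double
dotpComp xs ys = sum [x * y | x <- xs | y <- ys]

-- | The same computation written with an explicit zip.
dotpZip :: [Double] -> [Double] -> Double
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]
```

Both functions draw elements from <hask>xs</hask> and <hask>ys</hask> in lock step, which is the behaviour the array comprehension above relies on.<br />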
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
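<br />
The idea behind flattening can be sketched on ordinary lists. The model below is a deliberate simplification with hypothetical names (<hask>Segmented</hask>, <hask>flatten</hask>, <hask>mapSeg</hask>, <hask>segmentedSum</hask>); the representation actually used by GHC's vectoriser is more elaborate. A nested array is stored as one flat data array plus the lengths of its segments, so that a nested map becomes a single map over the flat data, whatever the shape of the nesting.<br />

```haskell
-- Simplified list model of a flattened nested array (hypothetical names).

-- | A nested array modelled as segment lengths plus flat data.
data Segmented a = Segmented
  { segLengths :: [Int]  -- length of each inner array
  , flatData   :: [a]    -- all elements, concatenated
  } deriving (Eq, Show)

-- | Turn a nested list into the flat representation.
flatten :: [[a]] -> Segmented a
flatten xss = Segmented (map length xss) (concat xss)

-- | A nested map becomes one map over the flat data; the shape is unchanged.
mapSeg :: (a -> b) -> Segmented a -> Segmented b
mapSeg f (Segmented lens xs) = Segmented lens (map f xs)

-- | Sum every segment, consuming the flat data according to the lengths.
segmentedSum :: Num a => Segmented a -> [a]
segmentedSum (Segmented lens xs) = go lens xs
  where
    go []       _  = []
    go (n : ns) ys = let (seg, rest) = splitAt n ys
                     in  sum seg : go ns rest
```

Because <hask>mapSeg</hask> touches only the flat data, the work can be split evenly across cores even when the inner arrays differ greatly in length, which is the load-balancing benefit mentioned above.<br />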
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules to support one numeric type each: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS_GHC -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fdph-par DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code and <code>-fdph-par</code> selects the standard parallel DPH backend library. (This is currently the only relevant backend, but there may be others in the future.)<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = let v      = fromList [1..10]    -- convert lists...<br />
        w      = fromList [1,2..20]  -- ...to parallel arrays<br />
        result = dotp_wrapper v w    -- invoke vectorised code<br />
    in<br />
    print result                     -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime.<br />
<br />
==== Generating input data ====<br />
<br />
To see any benefit from parallel execution, a data-parallel program needs to operate on a sufficiently large data set. Hence, instead of two small constant vectors, we might want to generate some larger input data:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile and link the program as described above.<br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
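<br />
The general pattern of forcing inputs before starting the clock can be sketched with ordinary lists. This sketch does not use the DPH <hask>nf</hask> function; it assumes a GHC whose <code>base</code> library provides <hask>GHC.Clock.getMonotonicTime</hask>, and the names <hask>forceList</hask> and <hask>timeDot</hask> are hypothetical.<br />

```haskell
import Control.Exception (evaluate)
import GHC.Clock (getMonotonicTime)

-- Illustrative timing pattern on plain lists (hypothetical names, not DPH code).

-- | Force every element of a list to weak head normal form.
forceList :: [a] -> ()
forceList = foldr seq ()

-- | Compute a dot product on pre-forced inputs and return the result
-- together with the elapsed wall-clock time in seconds.
timeDot :: [Double] -> [Double] -> IO (Double, Double)
timeDot xs ys = do
  evaluate (forceList xs)   -- make sure input generation is not timed
  evaluate (forceList ys)
  start  <- getMonotonicTime
  result <- evaluate (sum (zipWith (*) xs ys))
  end    <- getMonotonicTime
  return (result, end - start)
```

Wall-clock time is used here because CPU time adds up the work of all threads and would hide any parallel speed-up.<br />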
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocation, we used the option <code>-fdph-par</code> to select the <code>dph-par</code> flavour, which generates multi-threaded code. We can instead compile with <code>-fdph-seq</code> to select the single-core <code>dph-seq</code> flavour. By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language for coding parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as whether a computation has sufficient parallelism to make exploiting it worthwhile) remains with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
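<br />
The traversal pattern can be sketched with a list-based stand-in for the rose tree. This is a sequential model with hypothetical function names (<hask>size</hask> and <hask>depth</hask>); in DPH code the children would be held in a parallel array and the list functions would be replaced by array operations such as <hask>mapP</hask> and <hask>sumP</hask>.<br />

```haskell
-- List-based stand-in for:  data RTree a = RNode [:RTree a:]
-- (sequential model; function names are hypothetical)
newtype RTree a = RNode [RTree a]

-- | Number of nodes: the recursion descends level by level,
-- while the children of one node are processed by a single map.
size :: RTree a -> Int
size (RNode kids) = 1 + sum (map size kids)

-- | Number of levels in the tree.
depth :: RTree a -> Int
depth (RNode kids) = 1 + maximum (0 : map depth kids)
```

The <hask>map</hask> over <hask>kids</hask> marks the point where a DPH version could process all children of a node in parallel.<br />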
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://www.cse.unsw.edu.au/~benl/ Ben Lippmeier]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2011-01-25T02:57:49Z<p>Chak: /* Compiling vectorised code */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus on utilising multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.0. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions).<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules to support one numeric type each: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS_GHC -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code.<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = let v      = fromList [1..10]    -- convert lists...<br />
        w      = fromList [1,2..20]  -- ...to parallel arrays<br />
        result = dotp_wrapper v w    -- invoke vectorised code<br />
    in<br />
    print result                     -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime.<br />
<br />
==== Generating input data ====<br />
<br />
To see any benefit from parallel execution, a data-parallel program needs to operate on a sufficiently large data set. Hence, instead of two small constant vectors, we might want to generate some larger input data:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile and link the program as described above.<br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
The flavour is chosen at compile time: the option <code>-fdph-seq</code> selects the <code>dph-seq</code> flavour, whereas <code>-fdph-par</code> generates multi-threaded code (note that the compiler invocations shown above include neither option). By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
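<br />
The independence of the result from the number of threads can be modelled in plain Haskell. The following is only a sketch under stated assumptions: ordinary lists stand in for parallel arrays, the "workers" are simulated by splitting the input into chunks, and the names <hask>chunksOf</hask> and <hask>dotpChunked</hask> are invented for this illustration (they are not DPH functions). <hask>Int</hask> elements are used so that the comparison between different chunkings is exact.<br />

```haskell
-- Plain-Haskell model only; this is NOT how dph-par is implemented.

-- Split a list into consecutive chunks of at most n elements (n >= 1).
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = let (h, t) = splitAt n xs in h : chunksOf n t

-- Dot product computed as if by a given number of independent workers:
-- each worker sums the products of one chunk, and the partial sums are
-- added up at the end. Worker counts below 1 are treated as 1.
dotpChunked :: Int -> [Int] -> [Int] -> Int
dotpChunked workers xs ys = sum (map partial (chunksOf size pairs))
  where
    pairs   = zip xs ys
    w       = max 1 workers
    size    = max 1 ((length pairs + w - 1) `div` w)
    partial = sum . map (uncurry (*))
```

For instance, <hask>dotpChunked k [1..10] [1..10]</hask> evaluates to 385 for every worker count <hask>k</hask>; only the grouping of the additions changes.<br />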
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language for coding parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as whether a computation has sufficient parallelism to make exploiting it worthwhile) still rests with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
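<br />
The level-by-level traversal can be sketched with a list-based stand-in for the rose tree. This is an illustration only: lists replace parallel arrays, a payload field is added to <hask>RNode</hask> (an assumption, the definition above has none), and <hask>levelSizes</hask> is a hypothetical helper. The recursion from one level to the next is sequential, while the work within a level is the part a parallel-array version could distribute.<br />

```haskell
-- List-based stand-in for the parallel rose tree: [] replaces [: :],
-- and the payload field of type a is an assumption of this sketch.
data RTree a = RNode a [RTree a]

-- Number of nodes on each level, root level first. Moving from one level
-- to the next is inherently sequential; collecting the children of all
-- nodes of one level (concatMap) is independent per node.
levelSizes :: RTree a -> [Int]
levelSizes t = go [t]
  where
    go []    = []
    go level = length level : go (concatMap children level)

    children (RNode _ cs) = cs
```

A root with two children, one of which has a single child of its own, yields <hask>[1, 2, 1]</hask>.<br />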
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of this is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://www.cse.unsw.edu.au/~benl/ Ben Lippmeier]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2010-12-12T09:59:30Z<p>Chak: /* Feedback */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus to utilise multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.0. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching work in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions).<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
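<br />
The logarithmic step count of an undirected fold can be illustrated with a small model in plain Haskell. This is only a sketch: ordinary lists stand in for parallel arrays, and the names <hask>reduceRound</hask> and <hask>treeFold</hask> are made up for this illustration (they are not part of the DPH library, and this is not how <hask>foldP</hask> is implemented). Each round combines neighbouring pairs, and all combinations within one round are independent of each other.<br />

```haskell
-- Plain-Haskell model of an undirected fold; NOT the DPH implementation.

-- One round: combine neighbouring pairs. In a parallel setting, all the
-- applications of f within a round could run at the same time.
reduceRound :: (a -> a -> a) -> [a] -> [a]
reduceRound f (x : y : rest) = f x y : reduceRound f rest
reduceRound _ xs             = xs   -- zero or one element left over

-- Repeat rounds until a single element remains. Returns the result and
-- the number of rounds; the seed z is only used for empty input.
treeFold :: (a -> a -> a) -> a -> [a] -> (a, Int)
treeFold _ z []  = (z, 0)
treeFold f _ xs0 = go 0 xs0
  where
    go n [x] = (x, n)
    go n xs  = go (n + 1) (reduceRound f xs)
```

For example, <hask>treeFold (+) 0 [1..8]</hask> evaluates to <hask>(36, 3)</hask>: eight elements need three rounds. With a non-associative function the grouping matters: <hask>treeFold (-) 0 [1..4]</hask> yields 0, whereas <hask>foldl1 (-) [1..4]</hask> yields -8, which is why an undirected fold requires an associative argument.<br />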
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
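<br />
For readers who want to experiment without DPH, the same two formulations can be written for ordinary lists. This is a sketch, not DPH code: lists replace parallel arrays, <hask>sum</hask> and <hask>zip</hask> replace <hask>sumP</hask> and <hask>zipP</hask>, and the function names are chosen for this illustration. It relies on GHC's <hask>ParallelListComp</hask> extension.<br />

```haskell
{-# LANGUAGE ParallelListComp #-}

-- List analogue of dotp using a parallel list comprehension: the two
-- generators separated by '|' are traversed in lock step.
dotpComp :: Num a => [a] -> [a] -> a
dotpComp xs ys = sum [x * y | x <- xs | y <- ys]

-- The same computation with an explicit zip.
dotpZip :: Num a => [a] -> [a] -> a
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]
```

Both functions agree on all inputs; for lists of different lengths, the surplus elements of the longer list are ignored, exactly as with <hask>zip</hask>.<br />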
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules to support one numeric type each: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types), and if a program needs to use more than one of them, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS_GHC -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code.<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = let v      = fromList [1..10]    -- convert lists...<br />
        w      = fromList [1,2..20]  -- ...to parallel arrays<br />
        result = dotp_wrapper v w    -- invoke vectorised code<br />
    in<br />
    print result                     -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime.<br />
<br />
==== Generating input data ====<br />
<br />
To see any benefit from parallel execution, a data-parallel program needs to operate on a sufficiently large data set. Hence, instead of two small constant vectors, we might want to generate some larger input data:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile and link the program as described above.<br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
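<br />
The idea of forcing the inputs before starting the clock can be sketched in plain Haskell using only the <code>base</code> library. This is an illustration under stated assumptions, not DPH code: lists stand in for <hask>PArray</hask>, a strict sum stands in for <hask>nf</hask>, and the names <hask>forceAll</hask>, <hask>timed</hask>, and <hask>benchDotp</hask> are hypothetical.<br />

```haskell
import Control.Exception (evaluate)
import Data.List (foldl')
import System.CPUTime (getCPUTime)

-- Force every element of a list of Doubles. The strict sum touches each
-- element; the sum itself is discarded.
forceAll :: [Double] -> IO ()
forceAll xs = evaluate (foldl' (+) 0 xs) >> return ()

-- Run an action and return its result together with the CPU time it
-- took, in picoseconds (the unit used by getCPUTime).
timed :: IO a -> IO (a, Integer)
timed act = do
  start <- getCPUTime
  r     <- act
  end   <- getCPUTime
  return (r, end - start)

-- Time only the dot product: both inputs are forced before the clock
-- starts, so generating them is not part of the measurement.
benchDotp :: [Double] -> [Double] -> IO (Double, Integer)
benchDotp v w = do
  forceAll v
  forceAll w
  timed (evaluate (sum (zipWith (*) v w)))
```

Without the two <hask>forceAll</hask> calls, lazily generated inputs would only be computed inside the timed region, and the measurement would mostly reflect input generation.<br />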
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
The flavour is chosen at compile time: the option <code>-fdph-seq</code> selects the <code>dph-seq</code> flavour, whereas <code>-fdph-par</code> generates multi-threaded code (note that the compiler invocations shown above include neither option). By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language for coding parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as whether a computation has sufficient parallelism to make exploiting it worthwhile) still rests with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of this is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://www.cse.unsw.edu.au/~benl/ Ben Lippmeier]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2010-12-12T07:23:32Z<p>Chak: /* Generating input data */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus to utilise multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.0. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that parallel arrays are (1) a strict data structure and (2) not inductively defined. Parallel arrays are strict in that demanding a single element demands all elements of the array. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions).<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
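<br />
To illustrate why an undirected fold needs an associative function, here is a minimal sketch in plain Haskell on ordinary lists. It is not the DPH implementation of <hask>foldP</hask>; the names <hask>pairUp</hask>, <hask>foldTree</hask>, and <hask>rounds</hask> are hypothetical helpers introduced only for this illustration. Each call to <hask>pairUp</hask> models one step in which all adjacent pairs could be combined at the same time.<br />

```haskell
-- Sequential model of a tree-shaped (undirected) fold on plain lists.
-- This is an illustration only, not the DPH library code.

-- | Combine adjacent pairs once: one "parallel step" of the reduction.
pairUp :: (a -> a -> a) -> [a] -> [a]
pairUp f (x : y : rest) = f x y : pairUp f rest
pairUp _ xs             = xs

-- | Balanced fold; 'z' is returned only for the empty list.
foldTree :: (a -> a -> a) -> a -> [a] -> a
foldTree _ z []  = z
foldTree _ _ [x] = x
foldTree f z xs  = foldTree f z (pairUp f xs)

-- | Number of pairing rounds needed: ceiling (log2 n) for n >= 1.
rounds :: [a] -> Int
rounds xs = go (length xs) 0
  where
    go n acc
      | n <= 1    = acc
      | otherwise = go ((n + 1) `div` 2) (acc + 1)
```

With an associative function such as <hask>(+)</hask>, <hask>foldTree</hask> agrees with a left-to-right fold. With a non-associative function such as <hask>(-)</hask>, the grouping changes the answer, which is why such functions are not suitable arguments for an undirected fold.<br />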
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
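<br />
For readers who want to experiment without a DPH-enabled compiler, the following is a sketch of the same computation on ordinary lists. The names <hask>dotpZip</hask> and <hask>dotpZipWith</hask> are our own and not part of DPH; the list functions <hask>zip</hask>, <hask>zipWith</hask>, and <hask>sum</hask> stand in for their array counterparts. Both definitions traverse the two inputs in lock-step, just as the parallel comprehension does.<br />

```haskell
-- List-based model of the dot product; the DPH version uses parallel arrays instead.

-- | Dot product with a comprehension over explicitly zipped inputs.
dotpZip :: Num a => [a] -> [a] -> a
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]

-- | The same computation expressed with 'zipWith'.
dotpZipWith :: Num a => [a] -> [a] -> a
dotpZipWith xs ys = sum (zipWith (*) xs ys)
```

As with <hask>zip</hask> on lists, the shorter input determines how many element pairs are multiplied in this list model.<br />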
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
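<br />
The idea behind turning nested into flat data parallelism can be sketched on ordinary lists as follows. This is a simplified model under our own assumptions, not the representation GHC actually generates; the type <hask>Segmented</hask> and the functions <hask>flatten</hask>, <hask>unflatten</hask>, <hask>mapSeg</hask>, and <hask>sumSeg</hask> are hypothetical names. A nested array is stored as one flat data vector plus the lengths of its inner arrays, so an element-wise operation only touches the flat data and can be divided evenly among threads, however irregular the nesting is.<br />

```haskell
-- Simplified model of a flattened nested array (illustration only).

-- | Segment lengths plus one flat data vector.
data Segmented a = Segmented
  { segLengths :: [Int]  -- length of each inner array
  , flatData   :: [a]    -- all elements, concatenated
  } deriving (Eq, Show)

-- | Flatten a nested list into the segmented representation.
flatten :: [[a]] -> Segmented a
flatten xss = Segmented (map length xss) (concat xss)

-- | Recover the nested structure from the segment lengths.
unflatten :: Segmented a -> [[a]]
unflatten (Segmented lens xs) = go lens xs
  where
    go []       _  = []
    go (n : ns) ys = let (seg, rest) = splitAt n ys in seg : go ns rest

-- | Element-wise map: only the flat data is traversed.
mapSeg :: (a -> b) -> Segmented a -> Segmented b
mapSeg f (Segmented lens xs) = Segmented lens (map f xs)

-- | Segmented sum: one result per inner array (reference version via 'unflatten').
sumSeg :: Num a => Segmented a -> [a]
sumSeg = map sum . unflatten
```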
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules to support one numeric type each: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code.<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = let v      = fromList [1..10]    -- convert lists...<br />
        w      = fromList [1,2..20]  -- ...to parallel arrays<br />
        result = dotp_wrapper v w    -- invoke vectorised code<br />
    in<br />
    print result                     -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime.<br />
<br />
==== Generating input data ====<br />
<br />
To see any benefit from parallel execution, a data-parallel program needs to operate on a sufficiently large data set. Hence, instead of two small constant vectors, we might want to generate some larger input data:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile and link the program as described above.<br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
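<br />
The principle of forcing the inputs before starting the clock can be sketched with ordinary lists and functions from <code>base</code> only. This sketch does not use the <hask>nf</hask> function of the DPH library; <hask>forceList</hask>, <hask>timed</hask>, and <hask>benchDotp</hask> are hypothetical names for this illustration, and the list-based dot product merely stands in for the vectorised one.<br />

```haskell
import Control.Exception (evaluate)
import System.CPUTime (getCPUTime)

-- | Force every element of a list to weak head normal form
--   (which is full evaluation for elements such as Double).
forceList :: [a] -> ()
forceList = foldr seq ()

-- | Run an action and return its result with the CPU time used, in picoseconds.
timed :: IO a -> IO (a, Integer)
timed act = do
  start  <- getCPUTime
  result <- act
  end    <- getCPUTime
  return (result, end - start)

-- | Time only the dot product: both inputs are forced before the clock starts.
benchDotp :: [Double] -> [Double] -> IO (Double, Integer)
benchDotp xs ys = do
  _ <- evaluate (forceList xs)   -- input generation happens here, untimed
  _ <- evaluate (forceList ys)
  timed (evaluate (sum (zipWith (*) xs ys)))
```

Without the two <hask>evaluate</hask> calls on the inputs, lazy evaluation would defer the generation of the list elements into the timed region.<br />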
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
The flavour is selected at compile time: the option <code>-fdph-seq</code> selects <code>dph-seq</code>, whereas <code>-fdph-par</code> generates multi-threaded code. By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
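<br />
The independence of the result from the number of threads can be modelled on ordinary lists: the input is split into one chunk per worker, each chunk is reduced on its own, and the partial results are combined. This is a sequential sketch under our own assumptions, not how <code>dph-par</code> distributes work; <hask>chunks</hask> and <hask>chunkedSum</hask> are hypothetical names, and we use <hask>Integer</hask> so that addition is exactly associative.<br />

```haskell
-- Sequential model of a chunked reduction (illustration only).

-- | Split a list into at most 'k' contiguous chunks of nearly equal size.
chunks :: Int -> [a] -> [[a]]
chunks k xs = go xs
  where
    size = max 1 ((length xs + k - 1) `div` max 1 k)
    go [] = []
    go ys = let (c, rest) = splitAt size ys in c : go rest

-- | Sum each chunk independently (one chunk per worker), then combine the partial sums.
chunkedSum :: Int -> [Integer] -> Integer
chunkedSum k = sum . map sum . chunks k
```

For every chunk count the combined sum is the same; only the amount of work per chunk changes.<br />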
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as whether a computation has sufficient parallelism to make exploiting it worthwhile) are still with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
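<br />
The traversal pattern described above can be sketched with a list-based stand-in for the rose tree. In this model the children are held in an ordinary list rather than a parallel array, and <hask>sizeT</hask>, <hask>depthT</hask>, and <hask>example</hask> are hypothetical names of our own. The recursion descends level by level, and each <hask>map</hask> over the children marks the place where DPH could process the children in parallel.<br />

```haskell
-- List-based stand-in for the parallel rose tree (illustration only).

-- | Children would be a parallel array in DPH; here they are a plain list.
data RTree a = RNode [RTree a]

-- | Total number of nodes in the tree.
sizeT :: RTree a -> Int
sizeT (RNode kids) = 1 + sum (map sizeT kids)

-- | Height of the tree; a node without children has height 1.
depthT :: RTree a -> Int
depthT (RNode kids) = 1 + maximum (0 : map depthT kids)

-- | A small irregular tree: the root has three children, the second of which has two.
example :: RTree ()
example = RNode [RNode [], RNode [RNode [], RNode []], RNode []]
```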
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know the implementation details and internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC user's mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2010-12-12T07:22:56Z<p>Chak: /* Generating input data */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus on utilising multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.0. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that parallel arrays are (1) a strict data structure and (2) not inductively defined. Parallel arrays are strict in that demanding a single element demands all elements of the array. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions).<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules to support one numeric type each: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code.<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = let v      = fromList [1..10]    -- convert lists...<br />
        w      = fromList [1,2..20]  -- ...to parallel arrays<br />
        result = dotp_wrapper v w    -- invoke vectorised code<br />
    in<br />
    print result                     -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime.<br />
<br />
==== Generating input data ====<br />
<br />
To see any benefit from parallel execution, a data-parallel program needs to operate on sufficiently large data sets. Hence, instead of two small constant vectors, we might want to generate some larger input data:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile and link the program as described above.<br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
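<br />
The idea of forcing the inputs before timing can be sketched with ordinary lists and functions from <code>base</code> only. This is a minimal illustration under stated assumptions, not DPH code: <code>dotpList</code> is a hypothetical sequential stand-in for the vectorised function, and summing the inputs plays the role that <hask>nf</hask> plays for parallel arrays:<br />
<haskell><br />
import Control.Exception (evaluate)<br />
import System.CPUTime    (getCPUTime)<br />
<br />
-- Hypothetical sequential stand-in for the vectorised dot product.<br />
dotpList :: [Double] -> [Double] -> Double<br />
dotpList xs ys = sum (zipWith (*) xs ys)<br />
<br />
main :: IO ()<br />
main = do<br />
  let n = 10000 :: Int<br />
      v = map fromIntegral [1 .. n]               :: [Double]<br />
      w = map (\i -> fromIntegral i / 2) [1 .. n] :: [Double]<br />
  -- Summing demands every element, so both inputs are fully<br />
  -- evaluated before the clock starts.<br />
  _ <- evaluate (sum v + sum w)<br />
  start  <- getCPUTime<br />
  result <- evaluate (dotpList v w)<br />
  end    <- getCPUTime<br />
  print result<br />
  putStrLn ("picoseconds: " ++ show (end - start))<br />
</haskell><br />
Only the call between the two <code>getCPUTime</code> readings is measured; input generation is paid for beforehand.<br />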
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocations, we used the option <code>-fdph-seq</code> to select the <code>dph-seq</code> flavour. We can equally well compile with <code>-fdph-par</code> to generate multi-threaded code. By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as when does a computation have sufficient parallelism to make it worthwhile to exploit that parallelism) are still with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
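<br />
To make the traversal pattern concrete, here is a minimal sketch that uses ordinary lists in place of parallel arrays so that it runs without DPH. The names <code>RTreeL</code>, <code>sizeT</code>, and <code>depthT</code> are hypothetical; in DPH code, the <hask>map</hask> over the children is where an array operation such as <hask>mapP</hask> would presumably appear:<br />
<haskell><br />
-- List-based stand-in for the parallel rose tree (hypothetical name).<br />
data RTreeL a = RNodeL [RTreeL a]<br />
<br />
-- Number of nodes: one for this node plus the sizes of all subtrees.<br />
sizeT :: RTreeL a -> Int<br />
sizeT (RNodeL ts) = 1 + sum (map sizeT ts)<br />
<br />
-- Depth: the recursion descends one level per step.<br />
depthT :: RTreeL a -> Int<br />
depthT (RNodeL ts) = 1 + maximum (0 : map depthT ts)<br />
<br />
main :: IO ()<br />
main = do<br />
  let leaf = RNodeL []<br />
      t    = RNodeL [leaf, RNodeL [leaf, leaf], leaf] :: RTreeL ()<br />
  print (sizeT t)<br />
  print (depthT t)<br />
</haskell><br />
The recursion over levels is inherently sequential, whereas the work on the children of a single node is independent and is the part that a parallel array could distribute.<br />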
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2010-12-12T06:08:42Z<p>Chak: /* Using vectorised code */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus to utilise multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.0. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions).<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
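<br />
The role of associativity can be illustrated with a pairwise reduction over ordinary lists. This is only a sketch of the idea and not how <hask>foldP</hask> is implemented; <code>treeFold</code> and <code>pairUp</code> are hypothetical helpers. Each pass combines neighbouring elements and halves the length, so about log ''n'' passes are needed, and all combinations within one pass are independent of each other:<br />
<haskell><br />
-- Pairwise (tree-shaped) reduction over an ordinary list.<br />
-- treeFold is a hypothetical helper, not part of DPH.<br />
treeFold :: (a -> a -> a) -> a -> [a] -> a<br />
treeFold _ z []  = z<br />
treeFold _ _ [x] = x<br />
treeFold f z xs  = treeFold f z (pairUp xs)<br />
  where<br />
    -- combine neighbours; each pass halves the list length<br />
    pairUp (a : b : rest) = f a b : pairUp rest<br />
    pairUp rest           = rest<br />
<br />
main :: IO ()<br />
main = do<br />
  let xs = [1 .. 10] :: [Int]<br />
  print (treeFold (+) 0 xs == foldl (+) 0 xs)  -- associative: results agree<br />
  print (treeFold (-) 0 xs == foldl (-) 0 xs)  -- not associative: results differ<br />
</haskell><br />
With a non-associative function such as subtraction, the grouping chosen by the reduction changes the result, which is why an undirected fold can only promise a well-defined answer for associative arguments.<br />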
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
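<br />
For readers unfamiliar with the list version of this notation, the following sketch shows the same equivalence on ordinary lists using GHC's <code>ParallelListComp</code> extension. It does not use DPH; the names <code>dotpPar</code> and <code>dotpZip</code> are chosen for illustration only:<br />
<haskell><br />
{-# LANGUAGE ParallelListComp #-}<br />
<br />
-- List analogue of dotp using a parallel list comprehension.<br />
dotpPar :: Num a => [a] -> [a] -> a<br />
dotpPar xs ys = sum [x * y | x <- xs | y <- ys]<br />
<br />
-- The same function written with an explicit zip.<br />
dotpZip :: Num a => [a] -> [a] -> a<br />
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]<br />
<br />
main :: IO ()<br />
main = do<br />
  let xs = [1, 2, 3] :: [Double]<br />
      ys = [4, 5, 6] :: [Double]<br />
  print (dotpPar xs ys)<br />
  print (dotpPar xs ys == dotpZip xs ys)<br />
</haskell><br />
Both generators advance in lockstep, so each <code>x</code> is paired with the <code>y</code> at the same position rather than with every <code>y</code>.<br />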
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules, each supporting one numeric type: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code.<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = let v      = fromList [1..10]    -- convert lists...<br />
        w      = fromList [1,2..20]  -- ...to parallel arrays<br />
        result = dotp_wrapper v w    -- invoke vectorised code<br />
    in<br />
    print result                     -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime.<br />
<br />
==== Generating input data ====<br />
<br />
In this simple example, the <hask>Main</hask> module just generates two random vectors and computes their dot product:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -O -fdph-seq Main.hs</code><br />
</blockquote><br />
and finally link with<br />
<blockquote><br />
<code>ghc -o dotp -fdph-seq -threaded DotP.o Main.o</code><br />
</blockquote><br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocations, we used the option <code>-fdph-seq</code> to select the <code>dph-seq</code> flavour. We can equally well compile with <code>-fdph-par</code> to generate multi-threaded code. By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as when does a computation have sufficient parallelism to make it worthwhile to exploit that parallelism) are still with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2010-12-12T06:07:40Z<p>Chak: /* Using vectorised code */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus to utilise multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.0. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions).<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
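<br />
The logarithmic step count of an undirected fold can be illustrated with ordinary lists. The following is a minimal sketch in plain Haskell, not the DPH implementation; <hask>pairUp</hask> and <hask>treeFold</hask> are hypothetical names introduced for this illustration. Each round combines neighbouring pairs, so an input of length ''n'' is reduced in O(log ''n'') rounds, and all combinations within one round are independent of each other. Regrouping the operands in this way only preserves the result if the function is associative, which is why subtraction gives a different answer than a left fold.<br />
<haskell><br />
-- Combine neighbouring elements pairwise; one call models one parallel step.<br />
pairUp :: (a -> a -> a) -> [a] -> [a]<br />
pairUp f (x : y : rest) = f x y : pairUp f rest<br />
pairUp _ xs             = xs<br />
<br />
-- Fold by repeated pairwise combination; 'z' is only returned for empty input.<br />
treeFold :: (a -> a -> a) -> a -> [a] -> a<br />
treeFold _ z []  = z<br />
treeFold _ _ [x] = x<br />
treeFold f z xs  = treeFold f z (pairUp f xs)<br />
<br />
main :: IO ()<br />
main = do<br />
  print (treeFold (+) 0 [1 .. 10 :: Int])  -- 55, associative: agrees with foldl1<br />
  print (foldl1 (+) [1 .. 10 :: Int])      -- 55<br />
  print (treeFold (-) 0 [1 .. 4 :: Int])   -- 0, not associative: (1-2)-(3-4)<br />
  print (foldl1 (-) [1 .. 4 :: Int])       -- -8, i.e. ((1-2)-3)-4<br />
</haskell><br />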
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
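<br />
The idea of turning nested into flat data parallelism can be sketched with ordinary lists. The following is a simplified model in plain Haskell, not GHC's actual vectorised representation; <hask>Segmented</hask>, <hask>flatten</hask>, and <hask>sumSegments</hask> are hypothetical names introduced for this illustration. A nested array is stored as one flat data array plus the lengths of its subarrays, so the total amount of work is visible in a single flat structure, however uneven the subarrays are.<br />
<haskell><br />
-- A nested array modelled as segment lengths plus one flat data array.<br />
data Segmented a = Segmented [Int] [a]<br />
  deriving Show<br />
<br />
-- Flatten a nested list into the segmented form.<br />
flatten :: [[a]] -> Segmented a<br />
flatten xss = Segmented (map length xss) (concat xss)<br />
<br />
-- Sum every segment, walking over the flat data once.<br />
sumSegments :: Num a => Segmented a -> [a]<br />
sumSegments (Segmented lens xs) = go lens xs<br />
  where<br />
    go []       _  = []<br />
    go (n : ns) ys = let (seg, rest) = splitAt n ys<br />
                     in  sum seg : go ns rest<br />
<br />
main :: IO ()<br />
main = do<br />
  let nested = [[1, 2, 3], [], [4, 5], [6]] :: [[Int]]<br />
  print (flatten nested)                -- Segmented [3,0,2,1] [1,2,3,4,5,6]<br />
  print (sumSegments (flatten nested))  -- [6,0,9,6]<br />
  print (map sum nested)                -- [6,0,9,6], same result from the nested form<br />
</haskell><br />
In this model, the flat data list could be split into equally sized pieces without regard to segment boundaries, which is the intuition behind the simpler load balancing.<br />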
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules, each supporting one numeric type: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code.<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a main module that calls the vectorised code, but is itself not vectorised, so that it may contain I/O. In this simple example, we convert two simple lists to parallel arrays, compute their dot product, and print the result:<br />
<haskell><br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = let v      = fromList [1..10]    -- convert lists...<br />
        w      = fromList [1,2..20]  -- ...to parallel arrays<br />
        result = dotp_wrapper v w    -- invoke vectorised code<br />
    in<br />
    print result                     -- print the result<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -Odph Main.hs</code><br />
</blockquote><br />
and finally link the two modules into an executable <code>dotp</code> with<br />
<blockquote><br />
<code>ghc -o dotp -threaded DotP.o Main.o</code><br />
</blockquote><br />
We need the <code>-threaded</code> option to link with GHC's multi-threaded runtime.<br />
<br />
==== Generating input data ====<br />
<br />
To work with larger inputs, we can instead use a <hask>Main</hask> module that generates two random vectors and computes their dot product:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -O -fdph-seq Main.hs</code><br />
</blockquote><br />
and finally link with<br />
<blockquote><br />
<code>ghc -o dotp -fdph-seq -threaded DotP.o Main.o</code><br />
</blockquote><br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
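<br />
The principle of forcing the inputs before starting the clock can be sketched with ordinary lists. This is a plain Haskell stand-in, not DPH code: it uses <hask>evaluate</hask> from <hask>Control.Exception</hask> and <hask>getCPUTime</hask> from <hask>System.CPUTime</hask> instead of the <hask>nf</hask> function mentioned above, and <hask>dotpList</hask> is a hypothetical list-based substitute for the vectorised dot product.<br />
<haskell><br />
import Control.Exception (evaluate)<br />
import System.CPUTime (getCPUTime)<br />
<br />
-- List-based stand-in for the vectorised dot product.<br />
dotpList :: [Double] -> [Double] -> Double<br />
dotpList xs ys = sum (zipWith (*) xs ys)<br />
<br />
main :: IO ()<br />
main = do<br />
  let n  = 100000 :: Int<br />
      xs = [fromIntegral i | i <- [1 .. n]]<br />
      ys = [fromIntegral (n - i) | i <- [1 .. n]]<br />
  -- Force both inputs completely before starting the clock;<br />
  -- summing a list of Doubles evaluates every element.<br />
  _ <- evaluate (sum xs)<br />
  _ <- evaluate (sum ys)<br />
  start  <- getCPUTime<br />
  result <- evaluate (dotpList xs ys)<br />
  end    <- getCPUTime<br />
  print result<br />
  putStrLn ("CPU time (ps): " ++ show (end - start))<br />
</haskell><br />
Without the two initial <hask>evaluate</hask> calls, the construction of the input lists would happen lazily inside the timed region and distort the measurement.<br />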
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocations, we used the option <code>-fdph-seq</code> to select the <code>dph-seq</code> flavour. We can just as well compile with <code>-fdph-par</code> to generate multi-threaded code. By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the dph package source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as whether a computation has sufficient parallelism to make exploiting it worthwhile) still rests with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
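<br />
The level-by-level traversal can be modelled with ordinary lists. This is a sketch in plain Haskell, not DPH code: the children are held in a list instead of a parallel array, the payload field of <hask>RNode</hask> is an addition made for this illustration, and <hask>levels</hask> is a hypothetical name. The levels are produced one after another, which corresponds to the sequential part, while the work within one level is a flat operation over all nodes at that depth.<br />
<haskell><br />
-- List-based model of the rose tree; the payload field is an addition for this sketch.<br />
data RTree a = RNode a [RTree a]<br />
<br />
-- Values grouped by depth, one inner list per level of the tree.<br />
levels :: RTree a -> [[a]]<br />
levels t = go [t]<br />
  where<br />
    go [] = []<br />
    go ts = [x | RNode x _ <- ts] : go (concat [cs | RNode _ cs <- ts])<br />
<br />
main :: IO ()<br />
main = do<br />
  let t = RNode 1 [RNode 2 [RNode 4 []], RNode 3 [RNode 5 [], RNode 6 []]] :: RTree Int<br />
  print (levels t)  -- [[1],[2,3],[4,5,6]]<br />
</haskell><br />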
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2010-12-12T06:03:22Z<p>Chak: /* Generating input data */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus on utilising multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.0. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions).<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
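<br />
The equivalence between the two notations can be checked on ordinary lists, where GHC's <hask>ParallelListComp</hask> extension provides the same comprehension syntax. The following is a list-based sketch in plain Haskell, not DPH code; <hask>dotpList</hask> and <hask>dotpZip</hask> are hypothetical names introduced for this illustration.<br />
<haskell><br />
{-# LANGUAGE ParallelListComp #-}<br />
<br />
-- Dot product with a parallel list comprehension: xs and ys are traversed in lockstep.<br />
dotpList :: Num a => [a] -> [a] -> a<br />
dotpList xs ys = sum [x * y | x <- xs | y <- ys]<br />
<br />
-- The same function written with an explicit zip.<br />
dotpZip :: Num a => [a] -> [a] -> a<br />
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]<br />
<br />
main :: IO ()<br />
main = do<br />
  print (dotpList [1, 2, 3] [4, 5, 6 :: Double])  -- 32.0<br />
  print (dotpZip  [1, 2, 3] [4, 5, 6 :: Double])  -- 32.0<br />
</haskell><br />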
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules, each supporting one numeric type: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code.<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a wrapper module that calls the vectorised code, but is itself not vectorised. In this simple example, this is just the <hask>Main</hask> module, which converts two lists to parallel arrays and computes their dot product:<br />
<haskell><br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = let v      = fromList [1..10]    -- convert lists...<br />
        w      = fromList [1,2..20]  -- ...to parallel arrays<br />
        result = dotp_wrapper v w    -- invoke vectorised code<br />
    in<br />
    print result                     -- print the result<br />
</haskell><br />
<br />
==== Generating input data ====<br />
<br />
To work with larger inputs, we can instead use a <hask>Main</hask> module that generates two random vectors and computes their dot product:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -O -fdph-seq Main.hs</code><br />
</blockquote><br />
and finally link with<br />
<blockquote><br />
<code>ghc -o dotp -fdph-seq -threaded DotP.o Main.o</code><br />
</blockquote><br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocations, we used the option <code>-fdph-seq</code> to select the <code>dph-seq</code> flavour. We can as well compile with <code>-fdph-par</code> to generate multi-threaded code. By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as whether a computation has sufficient parallelism to make exploiting it worthwhile) still rest with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
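<br />
The traversal pattern described above can be tried out without DPH by using ordinary lists in place of parallel arrays. The following is only a sketch under that assumption: the label field and the function <hask>sizeT</hask> are ours and do not appear in DPH, and with lists the <hask>map</hask> over the children runs sequentially. With <hask>[:RTree a:]</hask> children, that <hask>map</hask> is the step that could proceed in parallel, while the recursion from one level to the next stays sequential.<br />
<haskell><br />
-- List-based analogue of the parallel rose tree (a label is added for illustration).<br />
data RTree a = RNode a [RTree a]<br />
<br />
-- Count the nodes: one for the current node plus the sizes of all subtrees.<br />
-- The 'map sizeT ts' treats the children as an aggregate; with parallel<br />
-- arrays this is where the children could be processed in parallel.<br />
sizeT :: RTree a -> Int<br />
sizeT (RNode _ ts) = 1 + sum (map sizeT ts)<br />
<br />
main :: IO ()<br />
main = print (sizeT example)   -- prints 4<br />
  where<br />
    example = RNode 'a' [RNode 'b' [], RNode 'c' [RNode 'd' []]]<br />
</haskell><br />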
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2010-12-12T05:55:10Z<p>Chak: /* Using vectorised code */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus to utilise multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.0. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, only that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
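<br />
To see why an undirected fold needs an associative function and where the O(log ''n'') step count comes from, consider the following sketch of a tree-shaped reduction. It uses ordinary lists as a stand-in for parallel arrays, and <hask>reduceTree</hask> is a hypothetical name of ours, not the implementation of <hask>foldP</hask>. Each pass of <hask>pairUp</hask> combines neighbouring elements independently of each other, so it corresponds to one parallel step, and the number of passes grows logarithmically with the length of the input.<br />
<haskell><br />
-- Tree-shaped reduction over a list (a sequential model of a parallel fold).<br />
reduceTree :: (a -> a -> a) -> a -> [a] -> a<br />
reduceTree _ z []  = z<br />
reduceTree _ _ [x] = x<br />
reduceTree f z xs  = reduceTree f z (pairUp xs)<br />
  where<br />
    -- Combine neighbours; all applications of f in one pass are independent.<br />
    pairUp (a : b : rest) = f a b : pairUp rest<br />
    pairUp rest           = rest<br />
<br />
main :: IO ()<br />
main = do<br />
  -- (+) is associative: the tree-shaped result agrees with a left fold.<br />
  print (reduceTree (+) 0 [1 .. 8 :: Int], foldl (+) 0 [1 .. 8 :: Int])   -- prints (36,36)<br />
  -- (-) is not associative: the result depends on the evaluation order.<br />
  print (reduceTree (-) 0 [1 .. 8 :: Int], foldl (-) 0 [1 .. 8 :: Int])   -- prints (0,-36)<br />
</haskell><br />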
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
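<br />
The equivalence of the two comprehension forms can be checked on ordinary lists without a DPH installation. The following is a minimal sketch using GHC's <hask>ParallelListComp</hask> extension; the names <hask>dotpList</hask> and <hask>dotpZip</hask> are ours and are not part of DPH, and lists are, of course, processed sequentially:<br />
<haskell><br />
{-# LANGUAGE ParallelListComp #-}<br />
<br />
-- Dot product with a parallel list comprehension (list analogue of dotp).<br />
dotpList :: Num a => [a] -> [a] -> a<br />
dotpList xs ys = sum [x * y | x <- xs | y <- ys]<br />
<br />
-- The same computation with an explicit zip.<br />
dotpZip :: Num a => [a] -> [a] -> a<br />
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]<br />
<br />
main :: IO ()<br />
main = print (dotpList [1, 2, 3] [4, 5, 6 :: Int], dotpZip [1, 2, 3] [4, 5, 6 :: Int])   -- prints (32,32)<br />
</haskell><br />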
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules to support one numeric type each: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code.<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a wrapper module that calls the vectorised code, but is itself not vectorised. In this simple example, this is just the <hask>Main</hask> module that converts two lists to parallel arrays and computes their dot product:<br />
<haskell><br />
import Data.Array.Parallel.PArray (PArray, fromList)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
= let v = fromList [1..10] -- convert lists...<br />
w = fromList [1,2..20] -- ...to parallel arrays<br />
result = dotp_wrapper v w -- invoke vectorised code<br />
in<br />
print result -- print the result<br />
</haskell><br />
<br />
==== Generating input data ====<br />
<br />
To work with larger inputs, the <hask>Main</hask> module can instead generate two random vectors and compute their dot product:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
= do <br />
gen1 <- newStdGen<br />
gen2 <- newStdGen<br />
let v = randomRs n range gen1<br />
w = randomRs n range gen2<br />
print $ dotp_wrapper v w -- invoke vectorised code and print the result<br />
where<br />
n = 10000 -- vector length<br />
range = (-100, 100) -- range of vector elements<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -O -fdph-seq Main.hs</code><br />
</blockquote><br />
and finally link with<br />
<blockquote><br />
<code>ghc -o dotp -fdph-seq -threaded DotP.o Main.o</code><br />
</blockquote><br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocations, we used the option <code>-fdph-seq</code> to select the <code>dph-seq</code> flavour. We can as well compile with <code>-fdph-par</code> to generate multi-threaded code. By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as whether a computation has sufficient parallelism to make exploiting it worthwhile) still rest with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2010-12-10T08:09:47Z<p>Chak: /* Compiling vectorised code */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus to utilise multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.0. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, only that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
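<br />
The role of associativity in an undirected fold can be illustrated with ordinary lists. The following is a minimal sketch, not the DPH implementation: the names <hask>pairwise</hask> and <hask>treeFold</hask> are hypothetical, and each call to <hask>pairwise</hask> merely models one parallel step, in which all neighbouring pairs could be combined at the same time. For a list of length ''n'' the number of rounds grows with log ''n''.<br />

```haskell
module Main where

-- Hypothetical list-based model of an undirected fold.
-- This is NOT how DPH implements foldP; it only illustrates the idea.

-- | Combine neighbouring elements pairwise. One call models one parallel
-- step, because all pairs are independent of each other.
pairwise :: (a -> a -> a) -> [a] -> [a]
pairwise f (x : y : rest) = f x y : pairwise f rest
pairwise _ xs             = xs

-- | Reduce a list by repeated pairwise combination. Returns the result
-- together with the number of rounds that were needed.
treeFold :: (a -> a -> a) -> a -> [a] -> (a, Int)
treeFold _ z []  = (z, 0)
treeFold f _ xs0 = go 0 xs0
  where
    go rounds [x] = (x, rounds)
    go rounds xs  = go (rounds + 1) (pairwise f xs)

main :: IO ()
main = do
  -- Associative operator: the tree-shaped fold agrees with a left fold.
  print (treeFold (+) 0 [1 .. 8 :: Int])
  print (foldl (+) 0 [1 .. 8 :: Int])
  -- Non-associative operator: the two evaluation orders disagree.
  print (fst (treeFold (-) 0 [1 .. 8 :: Int]))
  print (foldl (-) 0 [1 .. 8 :: Int])
```

With eight elements the sketch needs three rounds. For subtraction, which is not associative, the tree-shaped result differs from the left fold, which is why an undirected fold is only well defined for associative functions.<br />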
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
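<br />
For readers who do not have a DPH-enabled compiler at hand, the equivalence of the two comprehension forms can be tried out on ordinary lists. This is only a list analogue under the <hask>ParallelListComp</hask> extension; it runs sequentially, and the names <hask>dotpComp</hask> and <hask>dotpZip</hask> are chosen here for illustration.<br />

```haskell
{-# LANGUAGE ParallelListComp #-}
module Main where

-- List analogue of the array comprehension (lists, not parallel arrays).
-- The two generators are drawn from in lockstep.
dotpComp :: [Double] -> [Double] -> Double
dotpComp xs ys = sum [x * y | x <- xs | y <- ys]

-- The same computation written with an explicit zip.
dotpZip :: [Double] -> [Double] -> Double
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]

main :: IO ()
main = do
  let xs = [1, 2, 3]
      ys = [4, 5, 6]
  print (dotpComp xs ys)
  print (dotpZip xs ys)
```

Both functions pair up elements position by position, so they agree on all inputs, including inputs of different lengths, where the longer list is truncated.<br />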
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
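<br />
The general idea of turning nested into flat data can be sketched with plain lists: a nested array is stored as one flat sequence of elements plus the lengths of the inner arrays. The <hask>Segmented</hask> type and the functions below are assumptions made for this sketch; the representation that DPH uses internally is more involved.<br />

```haskell
module Main where

-- Illustrative flat representation of a nested array. This type is an
-- assumption made for this sketch, not DPH's internal representation.
data Segmented a = Segmented
  { segLengths :: [Int]  -- length of each inner array
  , flatData   :: [a]    -- all elements, concatenated
  } deriving (Eq, Show)

-- | Store a nested list as segment lengths plus flat data.
flatten :: [[a]] -> Segmented a
flatten xss = Segmented (map length xss) (concat xss)

-- | Recover the nested list from the flat representation.
unflatten :: Segmented a -> [[a]]
unflatten (Segmented lens xs0) = go lens xs0
  where
    go []       _  = []
    go (n : ns) xs = let (seg, rest) = splitAt n xs in seg : go ns rest

-- | A nested map becomes a single map over the flat data, independent of
-- how unevenly the elements are distributed over the segments.
mapFlat :: (a -> b) -> Segmented a -> Segmented b
mapFlat f (Segmented lens xs) = Segmented lens (map f xs)

-- | One sum per segment.
segmentedSum :: Num a => Segmented a -> [a]
segmentedSum = map sum . unflatten

main :: IO ()
main = do
  let nested = [[1, 2, 3], [], [4], [5, 6]] :: [[Int]]
      flat   = flatten nested
  print flat
  print (unflatten (mapFlat (* 2) flat) == map (map (* 2)) nested)
  print (segmentedSum flat)
```

Because <hask>mapFlat</hask> works on the flat data only, the work can be split evenly even when the inner arrays have very different lengths, which is the reason flattening helps with load balancing.<br />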
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules to support one numeric type each: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The syntax for parallel arrays is an extension to Haskell 2010 that needs to be enabled with the language option <hask>ParallelArrays</hask>. Furthermore, we need to explicitly tell GHC if we want to vectorise a module by using the <hask>-fvectorise</hask> option.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE ParallelArrays #-}<br />
{-# OPTIONS -fvectorise #-}<br />
<br />
module DotP (dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that works best for DPH code.<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a wrapper module that calls the vectorised code, but is itself not vectorised. In this simple example, this is just the <hask>Main</hask> module that generates two random vectors and computes their dot product:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do <br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -O -fdph-seq Main.hs</code><br />
</blockquote><br />
and finally link with<br />
<blockquote><br />
<code>ghc -o dotp -fdph-seq -threaded DotP.o Main.o</code><br />
</blockquote><br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocations, we used the option <code>-fdph-seq</code> to select the <code>dph-seq</code> flavour. We can as well compile with <code>-fdph-par</code> to generate multi-threaded code. By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language for coding parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and for many performance considerations (such as whether a computation has enough parallelism to make exploiting it worthwhile) still lies with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
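<br />
The traversal pattern described above can be sketched with a list-backed stand-in for the rose tree. Two things in this sketch are assumptions made for illustration: the children are held in an ordinary list instead of a parallel array, and each node carries a label so that there is something to compute. The functions <hask>sumTree</hask> and <hask>depth</hask> are hypothetical names.<br />

```haskell
module Main where

-- List-backed stand-in for the parallel rose tree. The label field and the
-- use of ordinary lists are assumptions made for this sketch.
data RTree a = RNode a [RTree a]

-- | Sum of all labels. The recursion descends level by level; the map over
-- the children is where a parallel array would offer parallelism.
sumTree :: Num a => RTree a -> a
sumTree (RNode x kids) = x + sum (map sumTree kids)

-- | Number of levels, which bounds the sequential part of a traversal.
depth :: RTree a -> Int
depth (RNode _ kids) = 1 + maximum (0 : map depth kids)

main :: IO ()
main = do
  let example = RNode 1 [RNode 2 [RNode 4 []], RNode 3 []] :: RTree Int
  print (sumTree example)
  print (depth example)
```

In this model the amount of available parallelism at each step is the number of children processed by one <hask>map</hask>, while the depth of the tree determines how many such steps follow one another.<br />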
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2010-12-10T07:47:41Z<p>Chak: /* Impedance matching */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus on utilising multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.0. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, only that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
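<br />
To make the term ''nested'' data parallelism concrete, consider sparse matrix-vector multiplication, where every row holds a different number of non-zero entries. The following sketch uses ordinary lists instead of parallel arrays, so it runs sequentially; the type synonyms and the name <hask>smvm</hask> are chosen here for illustration. The outer comprehension ranges over rows and the inner one over the entries of a row, which is the kind of nesting that vectorisation is meant to flatten.<br />

```haskell
module Main where

-- A sparse row lists (column index, value) pairs for its non-zero entries.
-- Ordinary lists stand in for parallel arrays in this sketch.
type SparseRow    = [(Int, Double)]
type SparseMatrix = [SparseRow]

-- | Sparse matrix times dense vector. The outer comprehension iterates over
-- rows, the inner one over the non-zero entries of each row.
-- Column indices are assumed to be valid positions in the vector.
smvm :: SparseMatrix -> [Double] -> [Double]
smvm m v = [sum [x * (v !! i) | (i, x) <- row] | row <- m]

main :: IO ()
main = do
  let matrix = [[(0, 2), (2, 1)], [], [(1, 3)]]
      vector = [1, 2, 3]
  print (smvm matrix vector)
```

The rows here have two, zero, and one entries. Distributing whole rows over processors would balance poorly for such irregular data, which is the situation nested data parallelism is designed to handle.<br />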
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules to support one numeric type each: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays (which might be nested) '''cannot''' be passed. Instead, we need to pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The definition of <hask>dotp_double</hask> requires two language extensions, namely <hask>PArr</hask> to enable the syntax of parallel arrays and <hask>ParallelListComp</hask> for the parallel comprehension. Furthermore, we need to explicitly tell GHC which modules we want to vectorise.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
The compiler option to enable vectorisation is <code>-fvectorise</code>. Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE PArr, ParallelListComp #-}<br />
{-# OPTIONS -fvectorise #-}<br />
<br />
module DotP (dotp_double,dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fcpr-off -fdph-seq DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that is geared at optimising DPH code. Moreover, we use <code>-fcpr-off</code> as GHC's CPR phase doesn't play nice with type families at the moment, which in turn are heavily used in the DPH library. We shall discuss <code>-fdph-seq</code> below.<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a wrapper module that calls the vectorised code, but is itself not vectorised. In this simple example, this is just the <hask>Main</hask> module that generates two random vectors and computes their dot product:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do <br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -O -fdph-seq Main.hs</code><br />
</blockquote><br />
and finally link with<br />
<blockquote><br />
<code>ghc -o dotp -fdph-seq -threaded DotP.o Main.o</code><br />
</blockquote><br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocations, we used the option <code>-fdph-seq</code> to select the <code>dph-seq</code> flavour. We can as well compile with <code>-fdph-par</code> to generate multi-threaded code. By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language for coding parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and for many performance considerations (such as whether a computation has enough parallelism to make exploiting it worthwhile) still lies with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know the implementation details and internals of the Data Parallel Haskell project, much of this is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Comments and suggestions are also very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2010-12-10T07:43:36Z<p>Chak: /* Special Prelude */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus to utilise multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.0. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
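<br />
The strictness property can be illustrated with ordinary lists. The following is only a sketch of the semantics, not DPH code: <hask>forceAll</hask> is a hypothetical helper (it is not exported by any DPH module) that makes a list behave like a strict array by evaluating every element as soon as the list itself is demanded.<br />
<haskell><br />
-- Hypothetical helper, not a DPH function: evaluate every element of the<br />
-- list to weak head normal form, then return the list unchanged.<br />
forceAll :: [a] -> [a]<br />
forceAll xs = foldr seq xs xs<br />
<br />
-- A lazy list evaluates only the elements that are actually inspected,<br />
-- so the undefined second element is never touched here.<br />
lazyFirst :: [Int]<br />
lazyFirst = take 1 [1, undefined]<br />
<br />
-- With forceAll, demanding the first element demands all of them.<br />
strictFirst :: [Int] -> [Int]<br />
strictFirst = take 1 . forceAll<br />
</haskell><br />
Here <hask>lazyFirst</hask> evaluates to <hask>[1]</hask>, whereas <hask>strictFirst [1, undefined]</hask> raises an exception because the second element is evaluated as well. Parallel arrays behave like the second case.<br />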
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
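<br />
To see why an undirected fold needs an associative function and why it takes a logarithmic number of parallel steps, consider the following list-based model. It is a sketch under stated assumptions and not the DPH implementation: <hask>treeFold</hask> and <hask>steps</hask> are hypothetical names, and the argument order (function, unit value, collection) is chosen for illustration only.<br />
<haskell><br />
-- List-based model of an undirected fold: combine neighbouring elements<br />
-- pairwise until a single value remains. Each pass over the list models<br />
-- one parallel step, because all pairs in a pass are independent.<br />
treeFold :: (a -> a -> a) -> a -> [a] -> a<br />
treeFold _ z []  = z<br />
treeFold _ _ [x] = x<br />
treeFold f z xs  = treeFold f z (pairUp xs)<br />
  where<br />
    pairUp (a : b : rest) = f a b : pairUp rest<br />
    pairUp rest           = rest<br />
<br />
-- Number of pairwise passes needed for a list of the given length.<br />
steps :: Int -> Int<br />
steps n<br />
  | n <= 1    = 0<br />
  | otherwise = 1 + steps ((n + 1) `div` 2)<br />
</haskell><br />
For an associative function the grouping does not matter: <hask>treeFold (+) 0 [1 .. 8]</hask> yields <hask>36</hask>, the same as <hask>sum [1 .. 8]</hask>, and <hask>steps 8</hask> is <hask>3</hask>. For a non-associative function the result depends on the grouping: <hask>treeFold (-) 0 [1, 2, 3, 4]</hask> yields <hask>0</hask>, whereas <hask>foldl1 (-) [1, 2, 3, 4]</hask> yields <hask>-8</hask>.<br />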
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
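<br />
For readers without a DPH-enabled compiler, the same two formulations can be tried on ordinary lists. This is a list analogue for illustration only; <hask>dotpList</hask> and <hask>dotpZip</hask> are hypothetical names, and the code runs sequentially.<br />
<haskell><br />
{-# LANGUAGE ParallelListComp #-}<br />
<br />
-- List analogue of dotp, using a parallel list comprehension: the two<br />
-- generators are traversed in lockstep rather than nested.<br />
dotpList :: Num a => [a] -> [a] -> a<br />
dotpList xs ys = sum [x * y | x <- xs | y <- ys]<br />
<br />
-- The same computation written with an explicit zip.<br />
dotpZip :: Num a => [a] -> [a] -> a<br />
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]<br />
</haskell><br />
Both <hask>dotpList [1, 2, 3] [4, 5, 6]</hask> and <hask>dotpZip [1, 2, 3] [4, 5, 6]</hask> evaluate to <hask>32</hask>.<br />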
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude.hs Data.Array.Parallel.Prelude] plus four extra modules, each supporting one numeric type: [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Float.hs Data.Array.Parallel.Prelude.Float], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Double.hs Data.Array.Parallel.Prelude.Double], [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Int.hs Data.Array.Parallel.Prelude.Int], and [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Word8.hs Data.Array.Parallel.Prelude.Word8]. These four modules support the same functions (on different types), and if a program needs to use more than one of them, they must be imported qualified (as we cannot use type classes in vectorised code in the current version). Moreover, we have [http://darcs.haskell.org/packages/dph/dph-common/Data/Array/Parallel/Prelude/Bool.hs Data.Array.Parallel.Prelude.Bool]. If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays '''cannot''' be passed. Instead, we can pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The definition of <hask>dotp_double</hask> requires two language extensions, namely <hask>PArr</hask> to enable the syntax of parallel arrays and <hask>ParallelListComp</hask> for the parallel comprehension. Furthermore, we need to explicitly tell GHC which modules we want to vectorise.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
The compiler option to enable vectorisation is <code>-fvectorise</code>. Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE PArr, ParallelListComp #-}<br />
{-# OPTIONS -fvectorise #-}<br />
<br />
module DotP (dotp_double,dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fcpr-off -fdph-seq DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that is geared at optimising DPH code. Moreover, we use <code>-fcpr-off</code> as GHC's CPR phase doesn't play nice with type families at the moment, which in turn are heavily used in the DPH library. We shall discuss <code>-fdph-seq</code> below.<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a wrapper module that calls the vectorised code, but is itself not vectorised. In this simple example, this is just the <hask>Main</hask> module that generates two random vectors and computes their dot product:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -O -fdph-seq Main.hs</code><br />
</blockquote><br />
and finally link with<br />
<blockquote><br />
<code>ghc -o dotp -fdph-seq -threaded DotP.o Main.o</code><br />
</blockquote><br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
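<br />
The benchmarking pattern described in this note can be sketched with plain lists and functions from <code>base</code>. This is not DPH code: <hask>forceList</hask>, <hask>timed</hask>, and <hask>benchDot</hask> are hypothetical names, and <hask>forceList</hask> merely plays the role that <hask>nf</hask> plays for <hask>PArray</hask> values.<br />
<haskell><br />
import Control.Exception (evaluate)<br />
import Data.List (foldl')<br />
import System.CPUTime (getCPUTime)<br />
<br />
-- Evaluate every element of a list to weak head normal form<br />
-- before returning.<br />
forceList :: [a] -> IO ()<br />
forceList xs = evaluate (foldl' (\() x -> x `seq` ()) () xs)<br />
<br />
-- Run an action and report the CPU time it took, in picoseconds.<br />
timed :: IO a -> IO (a, Integer)<br />
timed action = do<br />
  start  <- getCPUTime<br />
  result <- action<br />
  end    <- getCPUTime<br />
  return (result, end - start)<br />
<br />
-- Force both inputs first, so that only the dot product itself is timed.<br />
benchDot :: [Double] -> [Double] -> IO (Double, Integer)<br />
benchDot v w = do<br />
  forceList v<br />
  forceList w<br />
  timed (evaluate (sum (zipWith (*) v w)))<br />
</haskell><br />
Forcing the inputs outside the timed region keeps the cost of generating them, such as the sequential random number generation above, out of the measurement.<br />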
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocations, we used the option <code>-fdph-seq</code> to select the <code>dph-seq</code> flavour. We can as well compile with <code>-fdph-par</code> to generate multi-threaded code. By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
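<br />
The claim that the number of threads affects only performance can be illustrated with a list-based model in which each chunk stands for the share of work given to one thread. This is a sketch, not the way DPH distributes arrays: <hask>chunks</hask> and <hask>chunkedSum</hask> are hypothetical names, and the sketch uses <hask>Int</hask> so that the comparison is exact.<br />
<haskell><br />
-- Split a list into contiguous chunks, aiming for k chunks of<br />
-- near-equal size (k is clamped to at least 1).<br />
chunks :: Int -> [a] -> [[a]]<br />
chunks k xs = go xs<br />
  where<br />
    size = max 1 ((length xs + k - 1) `div` max 1 k)<br />
    go [] = []<br />
    go ys = let (c, rest) = splitAt size ys in c : go rest<br />
<br />
-- Sum each chunk separately (as one worker per chunk would),<br />
-- then combine the partial sums.<br />
chunkedSum :: Int -> [Int] -> Int<br />
chunkedSum k = sum . map sum . chunks k<br />
</haskell><br />
For any <hask>k</hask>, <hask>chunkedSum k [1 .. 100]</hask> yields <hask>5050</hask>; only the amount of work per chunk changes.<br />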
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language for coding parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and for many performance considerations (such as deciding when a computation has enough parallelism to make exploiting it worthwhile) remains with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
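<br />
The traversal pattern can be tried out with an ordinary list standing in for the parallel array. This is a sequential sketch for illustration only: the functions <hask>size</hask> and <hask>depth</hask> and the value <hask>example</hask> are hypothetical and not taken from the paper.<br />
<haskell><br />
-- List-based stand-in for the parallel rose tree: the children are held<br />
-- in a list here instead of a parallel array.<br />
data RTree a = RNode [RTree a]<br />
<br />
-- Count the nodes. The recursion descends level by level; the map over<br />
-- the children is the part that a parallel array could run in parallel.<br />
size :: RTree a -> Int<br />
size (RNode cs) = 1 + sum (map size cs)<br />
<br />
-- Number of levels in the tree.<br />
depth :: RTree a -> Int<br />
depth (RNode cs) = 1 + maximum (0 : map depth cs)<br />
<br />
-- A root with two children, the second of which has two leaves.<br />
example :: RTree ()<br />
example = RNode [RNode [], RNode [RNode [], RNode []]]<br />
</haskell><br />
For this tree, <hask>size example</hask> is <hask>5</hask> and <hask>depth example</hask> is <hask>3</hask>.<br />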
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know the implementation details and internals of the Data Parallel Haskell project, much of this is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Comments and suggestions are also very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2010-12-10T04:26:07Z<p>Chak: /* No type classes */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus to utilise multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.0. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://haskell.org/ghc/docs/6.12-latest/html/libraries/dph-par-0.4.0/Data-Array-Parallel-Prelude.html Data.Array.Parallel.Prelude] plus three extra modules, each supporting one numeric type: [http://www.haskell.org/ghc/docs/6.12-latest/html/libraries/dph-par-0.4.0/Data-Array-Parallel-Prelude-Double.html Data.Array.Parallel.Prelude.Double], [http://www.haskell.org/ghc/docs/6.12-latest/html/libraries/dph-par-0.4.0/Data-Array-Parallel-Prelude-Int.html Data.Array.Parallel.Prelude.Int], and [http://www.haskell.org/ghc/docs/6.12-latest/html/libraries/dph-par-0.4.0/Data-Array-Parallel-Prelude-Word8.html Data.Array.Parallel.Prelude.Word8]. These three modules support the same functions (on different types), and if a program needs to use more than one of them, they must be imported qualified (as we cannot use type classes in vectorised code in the current version). If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays '''cannot''' be passed. Instead, we can pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The definition of <hask>dotp_double</hask> requires two language extensions, namely <hask>PArr</hask> to enable the syntax of parallel arrays and <hask>ParallelListComp</hask> for the parallel comprehension. Furthermore, we need to explicitly tell GHC which modules we want to vectorise.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
The compiler option to enable vectorisation is <code>-fvectorise</code>. Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE PArr, ParallelListComp #-}<br />
{-# OPTIONS -fvectorise #-}<br />
<br />
module DotP (dotp_double,dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fcpr-off -fdph-seq DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that is geared at optimising DPH code. Moreover, we use <code>-fcpr-off</code> as GHC's CPR phase doesn't play nice with type families at the moment, which in turn are heavily used in the DPH library. We shall discuss <code>-fdph-seq</code> below.<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a wrapper module that calls the vectorised code, but is itself not vectorised. In this simple example, this is just the <hask>Main</hask> module that generates two random vectors and computes their dot product:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
= do <br />
gen1 <- newStdGen<br />
gen2 <- newStdGen<br />
let v = randomRs n range gen1<br />
w = randomRs n range gen2<br />
print $ dotp_wrapper v w -- invoke vectorised code and print the result<br />
where<br />
n = 10000 -- vector length<br />
range = (-100, 100) -- range of vector elements<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -O -fdph-seq Main.hs</code><br />
</blockquote><br />
and finally link with<br />
<blockquote><br />
<code>ghc -o dotp -fdph-seq -threaded DotP.o Main.o</code><br />
</blockquote><br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocations, we used the option <code>-fdph-seq</code> to select the <code>dph-seq</code> flavour. We can equally compile with <code>-fdph-par</code> to generate multi-threaded code. By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
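The idea that the degree of parallelism changes only performance can be illustrated without DPH. The sketch below is a sequential model using plain lists: it splits the input into one chunk per (imaginary) worker, sums each chunk, and combines the partial sums. The names <hask>chunksOf</hask> and <hask>chunkedSum</hask> are hypothetical helpers, and <hask>Int</hask> is used so that regrouping the additions is exact.<br />

```haskell
-- Sequential model of a chunked reduction; no threads are created here.

-- Split a list into consecutive chunks of at most k elements (k > 0).
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf k xs = let (h, t) = splitAt k xs in h : chunksOf k t

-- Sum each chunk on its own (the work one worker would do),
-- then add up the partial results.
chunkedSum :: Int -> [Int] -> Int
chunkedSum workers xs = sum (map sum (chunksOf size xs))
  where
    w    = max 1 workers                      -- guard against zero workers
    size = max 1 ((length xs + w - 1) `div` w) -- ceiling of length / workers
```

For any worker count, <hask>chunkedSum</hask> agrees with <hask>sum</hask> because integer addition is associative.<br />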
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as whether a computation has enough parallelism to make exploiting it worthwhile) remain with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
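To experiment with the shape of such a structure without DPH, the following sketch uses ordinary lists for the children. It is an assumption-laden analogue: the type <hask>RoseTree</hask>, its payload field, and the functions <hask>sizeT</hask> and <hask>depthT</hask> are hypothetical and do not appear in the DPH libraries.<br />

```haskell
-- List-based analogue of the rose tree; in DPH the children would be a
-- parallel array. The payload of type 'a' is added here for illustration.
data RoseTree a = RoseNode a [RoseTree a]

-- Number of nodes: the recursion descends level by level, while the
-- 'map' over the children is the step that could run in parallel.
sizeT :: RoseTree a -> Int
sizeT (RoseNode _ kids) = 1 + sum (map sizeT kids)

-- Depth of the tree: the number of sequential levels of a traversal.
depthT :: RoseTree a -> Int
depthT (RoseNode _ kids) = 1 + maximum (0 : map depthT kids)

-- A small example tree with four nodes on three levels.
exampleTree :: RoseTree Int
exampleTree = RoseNode 1 [RoseNode 2 [], RoseNode 3 [RoseNode 4 []]]
```

In this model the depth bounds the number of sequential steps, whereas the work within one level is spread across the children.<br />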
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of this is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2010-12-10T04:25:15Z<p>Chak: /* Running DPH programs */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus on utilising multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.0. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, only that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions).<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
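The role of associativity and the logarithmic step count can be seen in a small sequential model. The sketch below combines neighbouring elements pairwise, round by round, which is the shape a parallel reduction might take; it is not how <hask>foldP</hask> is implemented, and <hask>treeFold</hask> and <hask>rounds</hask> are hypothetical names.<br />

```haskell
-- Sequential model of a tree-shaped (undirected) fold on plain lists.

-- One round combines neighbouring pairs; all pairs in a round are
-- independent of each other, so a parallel machine could do them at once.
treeFold :: (a -> a -> a) -> a -> [a] -> a
treeFold _ z []  = z
treeFold _ _ [x] = x
treeFold f z xs  = treeFold f z (pairUp xs)
  where
    pairUp (a : b : rest) = f a b : pairUp rest
    pairUp rest           = rest          -- zero or one element left over

-- Number of pairwise rounds needed for n elements: about log2 n.
rounds :: Int -> Int
rounds n
  | n <= 1    = 0
  | otherwise = 1 + rounds ((n + 1) `div` 2)
```

With an associative operator such as <hask>(+)</hask>, <hask>treeFold</hask> agrees with a left fold; with a non-associative one such as <hask>(-)</hask>, the grouping shows through and the results differ, which is why an undirected fold needs an associative argument.<br />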
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
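The equivalence of the two comprehension forms can be checked on ordinary lists, where GHC's <hask>ParallelListComp</hask> extension provides the same syntax. This is a list-based sketch rather than DPH code, and the names <hask>dotpPar</hask> and <hask>dotpZip</hask> are hypothetical.<br />

```haskell
{-# LANGUAGE ParallelListComp #-}

-- Parallel comprehension: the two generators advance in lock step.
dotpPar :: [Double] -> [Double] -> Double
dotpPar xs ys = sum [x * y | x <- xs | y <- ys]

-- The same computation written with an explicit zip.
dotpZip :: [Double] -> [Double] -> Double
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]
```

Both forms stop at the end of the shorter input, as <hask>zip</hask> does.<br />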
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code, called ''vectorisation'', that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but for parallel code it dramatically simplifies load balancing.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not properly handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://haskell.org/ghc/docs/6.12-latest/html/libraries/dph-par-0.4.0/Data-Array-Parallel-Prelude.html Data.Array.Parallel.Prelude] plus three extra modules to support one numeric type each: [http://www.haskell.org/ghc/docs/6.12-latest/html/libraries/dph-par-0.4.0/Data-Array-Parallel-Prelude-Double.html Data.Array.Parallel.Prelude.Double], [http://www.haskell.org/ghc/docs/6.12-latest/html/libraries/dph-par-0.4.0/Data-Array-Parallel-Prelude-Int.html Data.Array.Parallel.Prelude.Int], and [http://www.haskell.org/ghc/docs/6.12-latest/html/libraries/dph-par-0.4.0/Data-Array-Parallel-Prelude-Word8.html Data.Array.Parallel.Prelude.Word8]. These three modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays '''cannot''' be passed. Instead, we can pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The definition of <hask>dotp_double</hask> requires two language extensions, namely <hask>PArr</hask> to enable the syntax of parallel arrays and <hask>ParallelListComp</hask> for the parallel comprehension. Furthermore, we need to explicitly tell GHC which modules we want to vectorise.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
The compiler option to enable vectorisation is <code>-fvectorise</code>. Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE PArr, ParallelListComp #-}<br />
{-# OPTIONS -fvectorise #-}<br />
<br />
module DotP (dotp_double,dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fcpr-off -fdph-seq DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that is geared at optimising DPH code. Moreover, we use <code>-fcpr-off</code> as GHC's CPR phase doesn't play nice with type families at the moment, which in turn are heavily used in the DPH library. We shall discuss <code>-fdph-seq</code> below.<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a wrapper module that calls the vectorised code, but is itself not vectorised. In this simple example, this is just the <hask>Main</hask> module that generates two random vectors and computes their dot product:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
= do <br />
gen1 <- newStdGen<br />
gen2 <- newStdGen<br />
let v = randomRs n range gen1<br />
w = randomRs n range gen2<br />
print $ dotp_wrapper v w -- invoke vectorised code and print the result<br />
where<br />
n = 10000 -- vector length<br />
range = (-100, 100) -- range of vector elements<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -O -fdph-seq Main.hs</code><br />
</blockquote><br />
and finally link with<br />
<blockquote><br />
<code>ghc -o dotp -fdph-seq -threaded DotP.o Main.o</code><br />
</blockquote><br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
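The pattern of forcing inputs before starting the clock can be sketched with plain lists and functions from <code>base</code> only. This is an illustration under those assumptions: it does not use the DPH function <hask>nf</hask>, and <hask>forceAll</hask>, <hask>timed</hask>, and <hask>timedDot</hask> are hypothetical helpers.<br />

```haskell
import Control.Exception (evaluate)
import Data.List (foldl')
import System.CPUTime (getCPUTime)

-- Walk the whole list and force every element, so that no input
-- generation work is left to be done inside the timed region.
forceAll :: [Double] -> IO ()
forceAll xs = evaluate (foldl' (\() x -> x `seq` ()) () xs)

-- Run an action and return its result with the CPU time it took,
-- in picoseconds as reported by getCPUTime.
timed :: IO a -> IO (a, Integer)
timed act = do
  start <- getCPUTime
  r     <- act
  end   <- getCPUTime
  return (r, end - start)

-- Force both inputs first, then time only the dot product itself.
timedDot :: [Double] -> [Double] -> IO (Double, Integer)
timedDot v w = do
  forceAll v
  forceAll w
  timed (evaluate (sum (zipWith (*) v w)))
```

The call to <hask>evaluate</hask> inside the timed region ensures that the result is computed before the second clock reading rather than lazily afterwards.<br />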
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocations, we used the option <code>-fdph-seq</code> to select the <code>dph-seq</code> flavour. We can equally compile with <code>-fdph-par</code> to generate multi-threaded code. By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as whether a computation has enough parallelism to make exploiting it worthwhile) remain with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of this is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2010-12-10T04:19:45Z<p>Chak: /* Overview */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus on utilising multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.0. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines variants of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
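To see why an undirected fold needs an associative function, the following sketch models the reduction on ordinary Haskell lists. It is not the DPH implementation of <hask>foldP</hask>; the names <hask>pairUp</hask> and <hask>treeFold</hask> are hypothetical and chosen only for illustration. Each round combines neighbouring elements pairwise, so the list length halves per round and about log ''n'' rounds are needed; within one round, all combinations are independent and could in principle run in parallel.

```haskell
-- Combine neighbouring elements pairwise; an odd trailing element is kept as is.
pairUp :: (a -> a -> a) -> [a] -> [a]
pairUp f (x : y : rest) = f x y : pairUp f rest
pairUp _ xs             = xs

-- Reduce a list by repeated pairwise combination (a list-based model only).
-- The second argument is returned for the empty list.
treeFold :: (a -> a -> a) -> a -> [a] -> a
treeFold _ z []  = z
treeFold _ _ [x] = x
treeFold f z xs  = treeFold f z (pairUp f xs)
```

For an associative function such as <hask>(+)</hask>, the grouping chosen by <hask>treeFold</hask> does not matter and the result agrees with a sequential fold. For a non-associative function such as <hask>(-)</hask>, the result depends on the grouping, which is why such functions are not suitable arguments for an undirected fold.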
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
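For readers without a DPH-enabled compiler, the same computation can be tried on ordinary lists. The sketch below is a list-based analogue only, not DPH code: it uses the standard <hask>sum</hask>, <hask>zip</hask>, and <hask>zipWith</hask> in place of the parallel array operations, and the names <hask>dotpZip</hask> and <hask>dotpZipWith</hask> are hypothetical.

```haskell
-- Dot product on ordinary lists, mirroring the zip-based array comprehension.
dotpZip :: Num a => [a] -> [a] -> a
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]

-- The same computation expressed with zipWith instead of a comprehension.
dotpZipWith :: Num a => [a] -> [a] -> a
dotpZipWith xs ys = sum (zipWith (*) xs ys)
```

Both list versions are evaluated sequentially; the point of the parallel array version is that the element-wise products and the final sum can be distributed over several cores.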
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code called ''vectorisation'' that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but there it raises the level of expressiveness dramatically.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not properly handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://haskell.org/ghc/docs/6.12-latest/html/libraries/dph-par-0.4.0/Data-Array-Parallel-Prelude.html Data.Array.Parallel.Prelude] plus three extra modules to support one numeric type each: [http://www.haskell.org/ghc/docs/6.12-latest/html/libraries/dph-par-0.4.0/Data-Array-Parallel-Prelude-Double.html Data.Array.Parallel.Prelude.Double], [http://www.haskell.org/ghc/docs/6.12-latest/html/libraries/dph-par-0.4.0/Data-Array-Parallel-Prelude-Int.html Data.Array.Parallel.Prelude.Int], and [http://www.haskell.org/ghc/docs/6.12-latest/html/libraries/dph-par-0.4.0/Data-Array-Parallel-Prelude-Word8.html Data.Array.Parallel.Prelude.Word8]. These three modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays '''cannot''' be passed. Instead, we can pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The definition of <hask>dotp_double</hask> requires two language extensions, namely <hask>PArr</hask> to enable the syntax of parallel arrays and <hask>ParallelListComp</hask> for the parallel comprehension. Furthermore, we need to explicitly tell GHC which modules we want to vectorise.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
The compiler option to enable vectorisation is <code>-fvectorise</code>. Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE PArr, ParallelListComp #-}<br />
{-# OPTIONS -fvectorise #-}<br />
<br />
module DotP (dotp_double,dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fcpr-off -fdph-seq DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that is geared at optimising DPH code. Moreover, we use <code>-fcpr-off</code> as GHC's CPR phase doesn't play nice with type families at the moment, which in turn are heavily used in the DPH library. We shall discuss <code>-fdph-seq</code> below.<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a wrapper module that calls the vectorised code, but is itself not vectorised. In this simple example, this is just the <hask>Main</hask> module that generates two random vectors and computes their dot product:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -O -fdph-seq Main.hs</code><br />
</blockquote><br />
and finally link with<br />
<blockquote><br />
<code>ghc -o dotp -fdph-seq -threaded DotP.o Main.o</code><br />
</blockquote><br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
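The idea of forcing the inputs before starting the clock can be sketched with ordinary lists and functions from <code>base</code> only. This is not the DPH <hask>nf</hask> function and does not use <hask>PArray</hask>; the helpers <hask>forceList</hask> and <hask>timeDotp</hask> are hypothetical names, and CPU time is used here merely as a simple stand-in for a proper benchmarking setup.

```haskell
import Control.Exception (evaluate)
import System.CPUTime (getCPUTime)

-- Force every element of a list of doubles (a list stand-in for forcing an array).
forceList :: [Double] -> ()
forceList = foldr seq ()

-- Force both inputs first, then time only the dot product itself.
-- Returns the result together with the elapsed CPU time in picoseconds.
timeDotp :: [Double] -> [Double] -> IO (Double, Integer)
timeDotp xs ys = do
  _     <- evaluate (forceList xs)   -- input generation happens here, outside the timed region
  _     <- evaluate (forceList ys)
  start <- getCPUTime
  r     <- evaluate (sum (zipWith (*) xs ys))
  end   <- getCPUTime
  return (r, end - start)
```

Without the two forcing steps, lazy evaluation would defer the generation of the inputs into the timed region, and the measurement would mostly reflect the sequential input generation.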
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocations, we used the option <code>-fdph-seq</code> to select the <code>dph-seq</code> flavour. We can equally well compile with <code>-fdph-par</code> to generate multi-threaded code. By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as when does a computation have sufficient parallelism to make it worthwhile to exploit that parallelism) are still with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
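The traversal pattern can be illustrated with a list-based stand-in for the rose tree. The sketch below replaces the parallel array of children by an ordinary list, so it runs sequentially; in DPH the <hask>map</hask> and <hask>sum</hask> over the children would be the corresponding parallel array operations. The functions <hask>sizeT</hask> and <hask>depthT</hask> are hypothetical examples, not part of any DPH library.

```haskell
-- List-based stand-in for the parallel rose tree; in DPH the children
-- would be held in a parallel array instead of a list.
data RTree a = RNode [RTree a]

-- Number of nodes: recursion descends level by level, while the work on
-- the children of one node is a map followed by a sum.
sizeT :: RTree a -> Int
sizeT (RNode ts) = 1 + sum (map sizeT ts)

-- Depth of the tree, counting a node without children as depth 1.
depthT :: RTree a -> Int
depthT (RNode ts) = 1 + maximum (0 : map depthT ts)
```

The recursion over levels is inherently sequential, whereas the <hask>map</hask> over the children of a node is where a parallel array version would expose parallelism.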
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC user's mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2010-12-10T04:13:59Z<p>Chak: /* Where to get it */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus to utilise multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.0. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
To get DPH, you currently need to get the development version of GHC, which automatically includes DPH. We are in the process of preparing a DPH release for GHC 7.0, the current stable release of GHC.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines analogs of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code called ''vectorisation'' that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but there it raises the level of expressiveness dramatically.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not properly handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://haskell.org/ghc/docs/6.12-latest/html/libraries/dph-par-0.4.0/Data-Array-Parallel-Prelude.html Data.Array.Parallel.Prelude] plus three extra modules to support one numeric type each: [http://www.haskell.org/ghc/docs/6.12-latest/html/libraries/dph-par-0.4.0/Data-Array-Parallel-Prelude-Double.html Data.Array.Parallel.Prelude.Double], [http://www.haskell.org/ghc/docs/6.12-latest/html/libraries/dph-par-0.4.0/Data-Array-Parallel-Prelude-Int.html Data.Array.Parallel.Prelude.Int], and [http://www.haskell.org/ghc/docs/6.12-latest/html/libraries/dph-par-0.4.0/Data-Array-Parallel-Prelude-Word8.html Data.Array.Parallel.Prelude.Word8]. These three modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays '''cannot''' be passed. Instead, we can pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The definition of <hask>dotp_double</hask> requires two language extensions, namely <hask>PArr</hask> to enable the syntax of parallel arrays and <hask>ParallelListComp</hask> for the parallel comprehension. Furthermore, we need to explicitly tell GHC which modules we want to vectorise.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
The compiler option to enable vectorisation is <code>-fvectorise</code>. Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE PArr, ParallelListComp #-}<br />
{-# OPTIONS -fvectorise #-}<br />
<br />
module DotP (dotp_double,dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fcpr-off -fdph-seq DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that is geared at optimising DPH code. Moreover, we use <code>-fcpr-off</code> as GHC's CPR phase doesn't play nice with type families at the moment, which in turn are heavily used in the DPH library. We shall discuss <code>-fdph-seq</code> below.<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a wrapper module that calls the vectorised code, but is itself not vectorised. In this simple example, this is just the <hask>Main</hask> module that generates two random vectors and computes their dot product:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -O -fdph-seq Main.hs</code><br />
</blockquote><br />
and finally link with<br />
<blockquote><br />
<code>ghc -o dotp -fdph-seq -threaded DotP.o Main.o</code><br />
</blockquote><br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
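<br />
The idea of forcing the inputs before starting the clock can be sketched with ordinary lists and functions from <code>base</code>. This is only an illustration under stated assumptions: it does not use <hask>PArray</hask> or <hask>nf</hask>, the helper <hask>forceList</hask> is a hypothetical name, and the input vectors are generated deterministically rather than randomly.<br />
<haskell><br />
module Main where<br />
<br />
import Control.Exception (evaluate)<br />
import Data.List (foldl')<br />
import System.CPUTime (getCPUTime)<br />
<br />
-- Hypothetical helper: force every element of a list of doubles.<br />
-- For Double, weak head normal form already means fully evaluated.<br />
forceList :: [Double] -> IO ()<br />
forceList xs = evaluate (foldl' (\() x -> x `seq` ()) () xs)<br />
<br />
main :: IO ()<br />
main = do<br />
  let n = 10000 :: Int<br />
      v = [fromIntegral i * 0.5  | i <- [1 .. n]]<br />
      w = [fromIntegral i * 0.25 | i <- [1 .. n]]<br />
  forceList v                       -- input generation is paid for here...<br />
  forceList w<br />
  start  <- getCPUTime              -- ...and excluded from the measurement<br />
  result <- evaluate (sum (zipWith (*) v w))<br />
  end    <- getCPUTime<br />
  print result<br />
  putStrLn ("CPU time (picoseconds): " ++ show (end - start))<br />
</haskell><br />
Without the two <hask>forceList</hask> calls, laziness would defer the construction of both lists into the timed region.<br />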
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocations, we used the option <code>-fdph-seq</code> to select the <code>dph-seq</code> flavour. We can equally well compile with <code>-fdph-par</code> to generate multi-threaded code. By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
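<br />
The property that the thread count changes only the performance can be modelled sequentially with plain lists. The following is a rough sketch, not DPH code: <hask>chunksOf</hask> and <hask>chunkedSum</hask> are hypothetical names, each chunk stands in for the share of work given to one thread, and <hask>Int</hask> is used so that regrouping the additions is exact.<br />
<haskell><br />
module Main where<br />
<br />
-- Split a list into chunks of at most k elements.<br />
chunksOf :: Int -> [a] -> [[a]]<br />
chunksOf _ [] = []<br />
chunksOf k xs = let (h, t) = splitAt k xs in h : chunksOf k t<br />
<br />
-- Sum by chunks: each chunk models the share of one thread.<br />
chunkedSum :: Int -> [Int] -> Int<br />
chunkedSum threads xs = sum (map sum (chunksOf size xs))<br />
  where<br />
    size = max 1 ((length xs + threads - 1) `div` threads)<br />
<br />
main :: IO ()<br />
main = print [chunkedSum t [1 .. 100] | t <- [1, 2, 4, 8]]<br />
-- prints [5050,5050,5050,5050]<br />
</haskell><br />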
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as whether a computation has sufficient parallelism to make exploiting it worthwhile) are still with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
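<br />
The shape of such a traversal can be sketched with ordinary lists standing in for parallel arrays. This is an illustration only: it runs sequentially, each node additionally carries a value (an assumption made so that there is something to sum), and <hask>sumTree</hask> and <hask>depth</hask> are hypothetical names.<br />
<haskell><br />
module Main where<br />
<br />
-- List-based stand-in for: data RTree a = RNode [:RTree a:]<br />
-- Each node also carries a value here (an assumption for illustration).<br />
data RTree a = RNode a [RTree a]<br />
<br />
-- Sum all values. The recursion from level to level is sequential;<br />
-- the map over the children is the part that a parallel array<br />
-- could evaluate in parallel.<br />
sumTree :: Num a => RTree a -> a<br />
sumTree (RNode x kids) = x + sum (map sumTree kids)<br />
<br />
-- Depth of the tree, that is, the number of sequential levels.<br />
depth :: RTree a -> Int<br />
depth (RNode _ kids) = 1 + maximum (0 : map depth kids)<br />
<br />
main :: IO ()<br />
main = do<br />
  let t = RNode 1 [RNode 2 [RNode 4 []], RNode 3 []] :: RTree Int<br />
  print (sumTree t)  -- 10<br />
  print (depth t)    -- 3<br />
</haskell><br />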
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC user's mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2010-12-10T04:10:32Z<p>Chak: /* Project status */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus to utilise multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
<center><br />
http://17.media.tumblr.com/VtG26AnzIklk0sh6YkZSLYNPo1_400.png<br />
</center><br />
<br />
''This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. ''<br />
<br />
__TOC__<br />
<br />
<br />
<br />
<br />
=== Project status ===<br />
<br />
We are currently preparing for a release of Data Parallel Haskell (DPH) for GHC 7.0. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance in some cases).<br />
<br />
The current implementation should work well for code with nested parallelism, where the depth of nesting is statically fixed. It should also perform reasonably when nesting is recursive as long as no user-defined nested-parallel datatypes are used. Support for user-defined nested-parallel datatypes is still rather experimental and will likely result in inefficient code. For concrete examples of the various classes of parallelism, please refer to the [http://hackage.haskell.org/trac/ghc/wiki/DataParallel/BenchmarkStatus DPH benchmark status page].<br />
<br />
DPH focuses on irregular data parallelism. For regular data parallel code in Haskell, please consider using the companion library [http://trac.haskell.org/repa/ Repa], which builds on the parallel array infrastructure of DPH.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
DPH is available in the current stable release GHC 6.10.1, which is [http://haskell.org/ghc/download_ghc_6_10_1.html available in source and binary form] for many architectures. If you are compiling 6.10.1 ''from source,'' please ensure that you include the <code>ghc-6.10.1-src-extralibs.tar.bz2</code> archive as it supplies important libraries. GHC distribution binaries should include these libraries by default.<br />
<br />
'''Update [March 2009]:''' The 6.10.1 release has now fallen considerably behind the current development version in the HEAD repository, not only with respect to DPH support, but generally concerning support for multi-core parallelism in the GHC runtime system. Hence, if you are interested in performance and scalability, you need to use the development compiler – with the usual caveats. We are planning a more mature stable release for 6.12. (Due to the scale of the changes involved, we are not able to backport the latest changes to the 6.10.2 release.) To use the code in the HEAD repository, please follow [http://hackage.haskell.org/trac/ghc/wiki/Building/QuickStart the standard build instructions.] It is important that you download ''package dph'' before you build and install the system; you can achieve that with<br />
<br />
./darcs-all --dph get<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines analogs of most list operations from the Haskell prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) they are not inductively defined. Parallel arrays are strict in that demanding a single element demands all elements of the array. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions).<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
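<br />
The logarithmic step count comes from combining elements pairwise, as in a balanced tree. The following sequential sketch on ordinary lists illustrates the idea and shows why associativity matters. It is not how DPH implements <hask>foldP</hask>; <hask>pairUp</hask> and <hask>treeFold</hask> are hypothetical names.<br />
<haskell><br />
module Main where<br />
<br />
-- Combine neighbouring elements pairwise; one call models one parallel step.<br />
pairUp :: (a -> a -> a) -> [a] -> [a]<br />
pairUp f (x : y : rest) = f x y : pairUp f rest<br />
pairUp _ xs             = xs<br />
<br />
-- Repeat the pairwise step until a single element remains.<br />
-- For n elements this takes about log2 n rounds.<br />
treeFold :: (a -> a -> a) -> a -> [a] -> a<br />
treeFold _ z []  = z<br />
treeFold _ _ [x] = x<br />
treeFold f z xs  = treeFold f z (pairUp f xs)<br />
<br />
main :: IO ()<br />
main = do<br />
  print (treeFold (+) 0 [1 .. 8 :: Int])  -- 36, same as the left fold<br />
  print (foldl    (+) 0 [1 .. 8 :: Int])  -- 36<br />
  print (treeFold (-) 0 [1 .. 8 :: Int])  -- 0, subtraction is not associative<br />
  print (foldl    (-) 0 [1 .. 8 :: Int])  -- -36<br />
</haskell><br />
With an associative function such as <hask>(+)</hask>, the tree-shaped grouping and the left-to-right grouping agree; with <hask>(-)</hask> they do not, which is why an undirected fold needs an associative argument.<br />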
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
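<br />
For comparison, the same two formulations can be written for ordinary lists and run with a stock GHC. This sketch is a sequential list analogue only; <hask>dotpList</hask> and <hask>dotpZip</hask> are names chosen here for illustration.<br />
<haskell><br />
{-# LANGUAGE ParallelListComp #-}<br />
module Main where<br />
<br />
-- List analogue of dotp, written with a parallel list comprehension.<br />
dotpList :: Num a => [a] -> [a] -> a<br />
dotpList xs ys = sum [x * y | x <- xs | y <- ys]<br />
<br />
-- The same function using an explicit zip.<br />
dotpZip :: Num a => [a] -> [a] -> a<br />
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]<br />
<br />
main :: IO ()<br />
main = do<br />
  print (dotpList [1, 2, 3] [4, 5, 6 :: Int])  -- 32<br />
  print (dotpZip  [1, 2, 3] [4, 5, 6 :: Int])  -- 32<br />
</haskell><br />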
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code called ''vectorisation'' that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but there it raises the level of expressiveness dramatically.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not properly handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://haskell.org/ghc/docs/6.12-latest/html/libraries/dph-par-0.4.0/Data-Array-Parallel-Prelude.html Data.Array.Parallel.Prelude] plus three extra modules, each supporting one numeric type: [http://www.haskell.org/ghc/docs/6.12-latest/html/libraries/dph-par-0.4.0/Data-Array-Parallel-Prelude-Double.html Data.Array.Parallel.Prelude.Double], [http://www.haskell.org/ghc/docs/6.12-latest/html/libraries/dph-par-0.4.0/Data-Array-Parallel-Prelude-Int.html Data.Array.Parallel.Prelude.Int], and [http://www.haskell.org/ghc/docs/6.12-latest/html/libraries/dph-par-0.4.0/Data-Array-Parallel-Prelude-Word8.html Data.Array.Parallel.Prelude.Word8]. These three modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays '''cannot''' be passed. Instead, we can pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The definition of <hask>dotp_double</hask> requires two language extensions, namely <hask>PArr</hask> to enable the syntax of parallel arrays and <hask>ParallelListComp</hask> for the parallel comprehension. Furthermore, we need to explicitly tell GHC which modules we want to vectorise.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
The compiler option to enable vectorisation is <code>-fvectorise</code>. Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE PArr, ParallelListComp #-}<br />
{-# OPTIONS -fvectorise #-}<br />
<br />
module DotP (dotp_double,dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fcpr-off -fdph-seq DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that is geared at optimising DPH code. Moreover, we use <code>-fcpr-off</code> as GHC's CPR phase doesn't play nice with type families at the moment, which in turn are heavily used in the DPH library. We shall discuss <code>-fdph-seq</code> below.<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a wrapper module that calls the vectorised code, but is itself not vectorised. In this simple example, this is just the <hask>Main</hask> module that generates two random vectors and computes their dot product:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
= do <br />
gen1 <- newStdGen<br />
gen2 <- newStdGen<br />
let v = randomRs n range gen1<br />
w = randomRs n range gen2<br />
print $ dotp_wrapper v w -- invoke vectorised code and print the result<br />
where<br />
n = 10000 -- vector length<br />
range = (-100, 100) -- range of vector elements<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -O -fdph-seq Main.hs</code><br />
</blockquote><br />
and finally link with<br />
<blockquote><br />
<code>ghc -o dotp -fdph-seq -threaded DotP.o Main.o</code><br />
</blockquote><br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocations, we used the option <code>-fdph-seq</code> to select the <code>dph-seq</code> flavour. We can equally well compile with <code>-fdph-par</code> to generate multi-threaded code. By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as whether a computation has sufficient parallelism to make exploiting it worthwhile) are still with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC user's mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/User:ChakUser:Chak2010-03-22T04:10:03Z<p>Chak: </p>
<hr />
<div>Nothing to see here, really, but check out<br />
<br />
* my [http://www.cse.unsw.edu.au/~chak/ webpage] and<br />
* my [http://justtesting.org blog].<br />
<br />
On [[IRC channel|#haskell]] and #ghc, I am TacticalGrace.</div>Chakhttps://wiki.haskell.org/AusHac2010AusHac20102010-03-18T00:19:00Z<p>Chak: /* Possible Projects */</p>
<hr />
<div>If you've found this page, you use Haskell, ''and'' live in Australia (or are at the very least able and willing to travel here), then you're in the right place! We're looking into organising a Haskell [[Hackathon]] some time during the middle of 2010, and this is where it shall be organised.<br />
<br />
If you're interested in coming, '''please''' put your name down on the list below, along with your IRC nickname if you're on #haskell, and possibly your email (We'll use this to let you know of any progress we've made, but it's not mandatory). Also, if you've got something to discuss, feel free to add it to the bottom of the page in the Discussion section (just to keep the rest of the page clean and helpful).<br />
<br />
== What we've got so far ==<br />
<br />
===Why===<br />
<br />
Because we miss out on all the fun they have up north, and we've got something to offer. It's also a great chance to meet all the people you talk to on IRC or whose blogs you read, and just have a good time, while getting some (potentially) useful work done!<br />
<br />
===When===<br />
<br />
A few dates have been discussed, mainly taking into account when the university holidays are for various universities:<br />
<br />
* ANU: 7 June -> 18 July<br />
* UNSW: 29 June -> 18 July<br />
<br />
So far, we need a weekend between the 28th of June and the 18th of July.<br />
<br />
We're looking at organising it over a weekend, and I (Axman6) would quite like to have it start on a Friday, ending on Sunday. This does not at all mean that those who can’t make the Friday will miss out; the more people we have, the better. But I think that having more time will mean that we can get more done (which is the point, right?).<br />
<br />
===Where===<br />
<br />
Manuel Chakravarty and Ben Lippmeier have said there should be no problem finding a room at UNSW, with the only possible problem being Internet access for everyone, but hopefully something can be arranged by that time.<br />
<br />
===Who===<br />
<br />
If you're interested in coming, please show your interest by adding your details to the list below (if you don't have an account, please email me (Axman6) your details and I'll add you).<br />
<br />
<table border="1px"><br />
<tr><br />
<td>Name</td><br />
<td>IRC Nickname</td><br />
<td>Email</td><br />
<td>Availability</td><br />
<td>Preferred date</td><br />
<td>Comment</td><br />
</tr><br />
<br />
<tr><br />
<td>Alex Mason</td><br />
<td>Axman6</td><br />
<td>axman6@gmail.com</td><br />
<td>Probably any weekend during the ANU holidays</td><br />
<td>-</td><br />
<td>Organiser... sort of</td><br />
</tr><br />
<br />
<br />
<tr><br />
<td>[[:User:ivanm|Ivan Miljenovic]]</td><br />
<td>ivanm</td><br />
<td>Ivan <dot> Miljenovic <at> gmail <dot> com</td><br />
<td>*shrug* lazy PhD student, so whenever</td><br />
<td>&nbsp;&nbsp; <=== </td><br />
<td>ditto</td><br />
</tr><br />
<br />
<tr><br />
<td>Tony Morris</td><br />
<td>dibblego</td><br />
<td>code@tmorris.net</td><br />
<td>Nothing specific</td><br />
<td>-</td><br />
<td>Tentative, depending on health</td><br />
</tr><br />
<br />
<tr><br />
<td>Manuel Chakravarty</td><br />
<td>TacticalGrace</td><br />
<td>chak@justtesting.org</td><br />
<td>I'm away 4-11 July; will probably not be able to attend all of it regardless of date</td><br />
<td>Probably weekend of the 18th July</td><br />
<td>Will help getting a room at UNSW</td><br />
</tr><br />
</table><br />
<br />
== Discussion ==<br />
<br />
=== Possible Projects ===<br />
<br />
====Generic graph class====<br />
'''What:''' Last year, I (Ivan) floated the idea of replacing the current default array-based Graph data type with an extensible set of classes with default instances. There is some interest in this around and I've done some work on it, but if anyone else is coming, it would be better to bounce ideas around together about how to define such classes.<br />
<br />
'''Who:''' Ivan M<br />
<br />
====Gloss-based plots====<br />
'''What:''' Either an alternative graphing back end to Criterion that only relies on OpenGL (through the use of Gloss), or a library for plotting. At the moment Gloss looks like it may only be suitable for bar-type graphs, but we'll see. (We may look into writing some other library that's better suited than Gloss, as Gloss is aimed at students who are learning Haskell and just want to get something drawn.)<br />
<br />
'''Who:''' Ivan M, Alex M<br />
<br />
====GHC LLVM backend====<br />
'''What:''' The recent work done by David Terei on an LLVM backend for GHC has shown some fantastic results, and getting it to a point where it could become the default GHC backend is something a lot of people would really like to see.<br />
<br />
'''Who:''' Alex M, Manuel<br />
<br />
====Accelerate====<br />
'''What:''' [http://hackage.haskell.org/package/accelerate Accelerate] is a Haskell EDSL for regular array computations. The aim is to make it generate code so blindingly fast that the C folks start to cry. An LLVM backend is in the very early stages of development and a CUDA GPU backend is good enough to run some first small Accelerate programs.<br />
<br />
'''Who:''' Manuel<br />
<br />
=== Dates ===<br />
<br />
<br />
== Related Links ==<br />
<br />
* [[OzHaskell]]</div>Chakhttps://wiki.haskell.org/AusHac2010AusHac20102010-03-18T00:10:47Z<p>Chak: /* Who */</p>
<hr />
<div>If you've found this page, you use Haskell, ''and'' live in Australia (or are at the very least able and willing to travel here), then you're in the right place! We're looking into organising a Haskell [[Hackathon]] some time during the middle of 2010, and this is where it shall be organised.<br />
<br />
If you're interested in coming, '''please''' put your name down on the list below, along with your IRC nickname if you're on #haskell, and possibly your email (We'll use this to let you know of any progress we've made, but it's not mandatory). Also, if you've got something to discuss, feel free to add it to the bottom of the page in the Discussion section (just to keep the rest of the page clean and helpful).<br />
<br />
== What we've got so far ==<br />
<br />
===Why===<br />
<br />
Because we miss out on all the fun they have up north, and we've got something to offer. It's also a great chance to meet all the people you talk to on IRC or whose blogs you read, and just have a good time, while getting some (potentially) useful work done!<br />
<br />
===When===<br />
<br />
A few dates have been discussed, mainly taking into account when the university holidays are for various universities:<br />
<br />
* ANU: 7 June -> 18 July<br />
* UNSW: 29 June -> 18 July<br />
<br />
So far, we need a weekend between the 28th of June and the 18th of July.<br />
<br />
We're looking at organising it over a weekend, and I (Axman6) would quite like to have it start on a Friday and end on a Sunday. This does not at all mean that those who can’t make the Friday will miss out; the more people we have, the better. But I think that having more time will mean that we can get more done (which is the point, right?).<br />
<br />
===Where===<br />
<br />
Manuel Chakravarty and Ben Lippmeier have said there should be no problem finding a room at UNSW, with the only possible problem being Internet access for everyone, but hopefully something can be arranged by that time.<br />
<br />
===Who===<br />
<br />
If you're interested in coming, please show your interest by adding your details to the list below (if you don't have an account, please email me (Axman6) your details and I'll add you).<br />
<br />
<table border="1px"><br />
<tr><br />
<td>Name</td><br />
<td>IRC Nickname</td><br />
<td>Email</td><br />
<td>Availability</td><br />
<td>Preferred date</td><br />
<td>Comment</td><br />
</tr><br />
<br />
<tr><br />
<td>Alex Mason</td><br />
<td>Axman6</td><br />
<td>axman6@gmail.com</td><br />
<td>Probably any weekend during the ANU holidays</td><br />
<td>-</td><br />
<td>Organiser... sort of</td><br />
</tr><br />
<br />
<br />
<tr><br />
<td>[[:User:ivanm|Ivan Miljenovic]]</td><br />
<td>ivanm</td><br />
<td>Ivan <dot> Miljenovic <at> gmail <dot> com</td><br />
<td>*shrug* lazy PhD student, so whenever</td><br />
<td>&nbsp;&nbsp; <=== </td><br />
<td>ditto</td><br />
</tr><br />
<br />
<tr><br />
<td>Tony Morris</td><br />
<td>dibblego</td><br />
<td>code@tmorris.net</td><br />
<td>Nothing specific</td><br />
<td>-</td><br />
<td>Tentative, depending on health</td><br />
</tr><br />
<br />
<tr><br />
<td>Manuel Chakravarty</td><br />
<td>TacticalGrace</td><br />
<td>chak@justtesting.org</td><br />
<td>I'm away 4-11 July; will probably not be able to attend all of it regardless of date</td><br />
<td>Probably weekend of the 18th July</td><br />
<td>Will help getting a room at UNSW</td><br />
</tr><br />
</table><br />
<br />
== Discussion ==<br />
<br />
=== Possible Projects ===<br />
<br />
====Generic graph class====<br />
'''What:''' Last year, I (Ivan) floated the idea of replacing the current default array-based Graph data type with an extensible set of classes with default instances. There is some interest in this around and I've done some work on it, but if anyone else is coming, it would be better to bounce ideas around together about how to define such classes.<br />
<br />
'''Who:''' Ivan M<br />
<br />
====Gloss-based plots====<br />
'''What:''' Either an alternative graphing back end to Criterion that only relies on OpenGL (through the use of Gloss), or a library for plotting. At the moment Gloss looks like it may only be suitable for bar-type graphs, but we'll see. (We may look into writing some other library that's better suited than Gloss, as Gloss is aimed at students who are learning Haskell and just want to get something drawn.)<br />
<br />
'''Who:''' Ivan M, Alex M<br />
<br />
====GHC LLVM backend====<br />
'''What:''' The recent work done by David Terei on an LLVM backend for GHC has shown some fantastic results, and getting it to a point where it could become the default GHC backend is something a lot of people would really like to see.<br />
<br />
'''Who:''' Alex M<br />
<br />
=== Dates ===<br />
<br />
<br />
== Related Links ==<br />
<br />
* [[OzHaskell]]</div>Chakhttps://wiki.haskell.org/IPhoneIPhone2009-06-19T20:28:15Z<p>Chak: </p>
<hr />
<div>If you are working with Haskell and making iPhone apps, or if you intend to soon, please fill in your info below.<br />
By helping each other out, we can work more productively and have more fun.<br />
<br />
{|width="80%" border="1" cellpadding="2" cellspacing="0"<br />
|-<br />
!Name<br />
!Contact info<br />
!Haskell-fu (0-5)<br />
!iPhone-fu (0-5)<br />
!Have (to share)<br />
!Need<br />
!Intended iPhone apps<br />
|-<br />
| Conal Elliott<br />
| [http://conal.net Home], [http://conal.net/blog blog], [http://haskell.org/haskellwiki/User:Conal wiki user], [http://twitter.com/conal Twitter], [http://www.facebook.com/profile.php?id=685783314&ref=name Facebook], [http://www.linkedin.com/profile?&key=4476842 Linkedin], IRC: conal<br />
| 5<br />
| 0<br />
| Functional graphics & GUI, misc Haskell libs, design skills<br />
| iPhone basics, Haskell-to-iPhone compiler<br />
| Interactive graphics toys<br />
|-<br />
| Chris Eidhof<br />
| [http://eidhof.nl Home], [http://tupil.com Tupil], [http://haskell.org/haskellwiki/User:ChrisEidhof wiki user], [http://twitter.com/chriseidhof Twitter], [http://www.linkedin.com/pub/chris-eidhof/3/b6/2b6 Linkedin], IRC: chr1s<br />
| 4<br />
| 3<br />
| iPhone experience, web programming experience, dependent types experience<br />
| Haskell-to-iPhone compiler (either as DSL or GHC Core -> iPhone)<br />
| Navigation-based apps (think of things like iTunes, Facebook, etc.), Games (maybe using a combination of FRP and something like arrowlets)<br />
|-<br />
| Daniel Peebles<br />
| [http://pumpkinpat.ch Home], [http://twitter.com/copumpkin Twitter]<br />
| 3<br />
| 4<br />
| Extensive iPhone platform knowledge<br />
| GHC cross-compiling to ARM Mach-O<br />
| Nothing in particular yet<br />
|-<br />
| John Meacham<br />
| [http://repetae.net Home], [http://notanumber.net/ blog]<br />
| -<br />
| -<br />
| Working Haskell to iPhone compiler (jhc)<br />
| Testers and Feedback to make cross compilation smoother. HOC integration with jhc.<br />
| Symbolic Algebra Application, Equation Editor<br />
|-<br />
| Eelco Lempsink<br />
| [http://eelco.lempsink.nl Home], [http://tupil.com Tupil], [http://haskell.org/haskellwiki/User:eelco wiki user], [http://twitter.com/eelco Twitter], [http://www.linkedin.com/in/lempsink Linkedin], IRC: eelco<br />
| 4<br />
| 3<br />
| iPhone and web experience<br />
| Haskell-to-iPhone with (Cocoa Touch) API integration<br />
| Nothing in particular, looking for a good Haskell use-case :)<br />
|-<br />
| Bernd Brassel<br />
| [http://www-ps.informatik.uni-kiel.de/~bbr Home],[http://www.art2guide.com/index_en.html art2guide]<br />
| 5<br />
| 4<br />
| Haskell experience, iPhone developer<br />
| iPhone embedding into Haskell, good programmers<br />
| audio-visual guiding systems<br />
|-<br />
| Martin Kudlvasr<br />
| [http://trinpad.eu not exactly home],[http://www.linkedin.com/in/martinkudlvasr LinkedIn], irc: trin_cz, xmpp: trin@jabbim.cz<br />
| 3<br />
| 0<br />
| A year of Haskell experience with OpenGL and Project Euler<br />
| iPhone basics, Haskell-to-iPhone compiler<br />
| fascinated by reactive, game development<br />
|-<br />
| Sebastiaan Visser<br />
| [http://github.com/sebastiaanvisser Projects], [http://haskell.org/haskellwiki/User:Sebastiaan wiki user], [http://twitter.com/sfvisser Twitter]<br />
| 4<br />
| 0<br />
| Some experience/ideas about building EDSLs.<br />
| Deep EDSL Haskell-to-ObjectiveC, high-level to target GUI/animation. <br />
| Nothing in particular yet. Want to have objective C backend for [http://github.com/sebastiaanvisser/frp-js/tree/master this] EDSL.<br />
|-<br />
| Manuel Chakravarty<br />
| [http://www.cse.unsw.edu.au/~chak/ Home], [http://justtesting.org blog], [http://haskell.org/haskellwiki/User:chak wiki user], [http://twitter.com/TacticalGrace Twitter], [http://www.linkedin.com/in/manuelchakravarty LinkedIn], IRC: Chilli<br />
| 5<br />
| 2<br />
| Haskell EDSL & compiler know how; Objective-C and Cocoa Touch basics<br />
| Haskell tools for iphone dev<br />
| games & productivity apps<br />
|-<br />
|}<br />
<br />
There are at least two ways to use Haskell to make iPhone apps.<br />
One is having a Haskell-to-iPhone compiler, which would probably cross-compile from another host environment (probably Mac OS X).<br />
Another way is to write Haskell programs that ''generate'' iPhone-compatible code when run (rather than when compiled), based on an embedded DSL, similarly to [http://conal.net/papers/jfp-saig/ ''Compiling Embedded Languages''].<br />
<br />
Some helpful resources:<br />
<br />
* [http://iphoneideas.tumblr.com/ Free iPhone ideas] (blog by Chris Eidhof)<br />
* [http://hoc.sourceforge.net/ HOC Haskell to Objective-C binding]<br />
* [http://github.com/sebastiaanvisser/frp-js/tree/master Reactive DSL currently with JS backend]. We might be working on an Objective-C backend during Hack-ɸ.<br />
* Stanford course: [http://www.stanford.edu/class/cs193p/ iPhone Application Programming], with online notes, code, and lecture video.<br />
* [http://hackage.haskell.org/trac/ghc/wiki/ObjectiveC Haskell Objective-C FFI proposal] (work-in-progress)</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2009-03-20T00:15:39Z<p>Chak: /* Where to get it */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus to utilise multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
__TOC__<br />
<br />
=== Project status ===<br />
<br />
A first ''technology preview'' of Data Parallel Haskell (DPH) is included in the 6.10.1 release of GHC. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance).<br />
<br />
The purpose of this technology preview is twofold. Firstly, it gives interested early adopters the opportunity to see where the project is headed and enables them to experiment with simple DPH programs. Secondly, we hope to get user feedback that helps us to guide the project and prioritise those features that our users are most interested in.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
DPH is available in the current stable release GHC 6.10.1, which is [http://haskell.org/ghc/download_ghc_6_10_1.html available in source and binary form] for many architectures. If you are compiling 6.10.1 ''from source,'' please ensure that you include the <code>ghc-6.10.1-src-extralibs.tar.bz2</code> archive as it supplies important libraries. GHC distribution binaries should include these libraries by default.<br />
<br />
'''Update [March 2009]:''' The 6.10.1 release has now fallen considerably behind the current development version in the HEAD repository, not only with respect to DPH support, but generally concerning support for multi-core parallelism in the GHC runtime system. Hence, if you are interested in performance and scalability, you need to use the development compiler – with the usual caveats. We are planning a more mature stable release for 6.12. (Due to the scale of the changes involved, we are not able to backport the latest changes to the 6.10.2 release.) To use the code in the HEAD repository, please follow [http://hackage.haskell.org/trac/ghc/wiki/Building/QuickStart the standard build instructions.] It is important that you download ''package dph'' before you build and install the system; you can achieve that with<br />
<br />
./darcs-all --dph get<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines analogs of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
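<br />
The logarithmic step count can be illustrated with ordinary lists. The following is a minimal sketch, not DPH code: <hask>treeFold</hask> is a hypothetical name, and lists stand in for parallel arrays. Each pass combines neighbouring elements pairwise, so all combinations within one pass are independent of each other.<br />
<haskell><br />
-- Sketch only: ordinary lists stand in for parallel arrays, and<br />
-- treeFold is a hypothetical name, not part of the DPH library.<br />
<br />
-- | Combine neighbouring elements pairwise until one value is left.<br />
-- Every pass roughly halves the number of elements, so a list of<br />
-- length n is reduced in about log2 n passes. The result equals a<br />
-- sequential fold only if f is associative.<br />
treeFold :: (a -> a -> a) -> a -> [a] -> a<br />
treeFold _ z []  = z<br />
treeFold _ _ [x] = x<br />
treeFold f z xs  = treeFold f z (pairUp xs)<br />
  where<br />
    -- one pass: combine elements 0 and 1, 2 and 3, and so on<br />
    pairUp (a : b : rest) = f a b : pairUp rest<br />
    pairUp rest           = rest<br />
</haskell><br />
For an associative function the result agrees with a sequential fold, e.g., <hask>treeFold (+) 0 [1..10]</hask> yields <hask>55</hask>. For a non-associative function such as <hask>(-)</hask> the result depends on the grouping, which is why an undirected fold requires associativity.<br />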
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
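<br />
The equivalence of the two comprehension forms can be tried out on ordinary lists. The following is a minimal sketch that assumes GHC's <code>ParallelListComp</code> extension; <hask>dotpComp</hask> and <hask>dotpZip</hask> are hypothetical names, and lists stand in for parallel arrays:<br />
<haskell><br />
{-# LANGUAGE ParallelListComp #-}<br />
-- Sketch only: ordinary lists stand in for parallel arrays.<br />
<br />
-- | Dot product with a parallel list comprehension: xs and ys are<br />
-- traversed in lockstep, not as a nested loop.<br />
dotpComp :: Num a => [a] -> [a] -> a<br />
dotpComp xs ys = sum [x * y | x <- xs | y <- ys]<br />
<br />
-- | The same computation with an explicit zip.<br />
dotpZip :: Num a => [a] -> [a] -> a<br />
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]<br />
</haskell><br />
Both functions return <hask>32</hask> for the arguments <hask>[1, 2, 3]</hask> and <hask>[4, 5, 6]</hask>.<br />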
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code called ''vectorisation'' that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but there it raises the level of expressiveness dramatically.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not properly handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-Prelude.html Data.Array.Parallel.Prelude] plus three extra modules to support one numeric type each: [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-Prelude-Double.html Data.Array.Parallel.Prelude.Double], [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-Prelude-Int.html Data.Array.Parallel.Prelude.Int], and [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-Prelude-Word8.html Data.Array.Parallel.Prelude.Word8]. These three modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays '''cannot''' be passed. Instead, we can pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The definition of <hask>dotp_double</hask> requires two language extensions, namely <hask>PArr</hask> to enable the syntax of parallel arrays and <hask>ParallelListComp</hask> for the parallel comprehension. Furthermore, we need to explicitly tell GHC which modules we want to vectorise.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
The compiler option to enable vectorisation is <code>-fvectorise</code>. Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE PArr, ParallelListComp #-}<br />
{-# OPTIONS -fvectorise #-}<br />
<br />
module DotP (dotp_double,dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fcpr-off -fdph-seq DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that is geared at optimising DPH code. Moreover, we use <code>-fcpr-off</code> as GHC's CPR phase doesn't play nice with type families at the moment, which in turn are heavily used in the DPH library. We shall discuss <code>-fdph-seq</code> below.<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a wrapper module that calls the vectorised code, but is itself not vectorised. In this simple example, this is just the <hask>Main</hask> module that generates two random vectors and computes their dot product:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
= do <br />
gen1 <- newStdGen<br />
gen2 <- newStdGen<br />
let v = randomRs n range gen1<br />
w = randomRs n range gen2<br />
print $ dotp_wrapper v w -- invoke vectorised code and print the result<br />
where<br />
n = 10000 -- vector length<br />
range = (-100, 100) -- range of vector elements<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -O -fdph-seq Main.hs</code><br />
</blockquote><br />
and finally link with<br />
<blockquote><br />
<code>ghc -o dotp -fdph-seq -threaded DotP.o Main.o</code><br />
</blockquote><br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
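<br />
The idea of forcing the inputs before starting the clock can be sketched with ordinary lists and functions from <code>base</code> only. This is an illustration under stated assumptions, not the DPH API: lists of <hask>Double</hask> stand in for <hask>PArray Double</hask>, and <hask>forceList</hask> and <hask>timeDotp</hask> are hypothetical helpers.<br />
<haskell><br />
import Control.Exception (evaluate)<br />
import Data.List (foldl')<br />
import System.CPUTime (getCPUTime)<br />
<br />
-- Sketch only: lists of Double stand in for PArray Double; forceList<br />
-- and timeDotp are hypothetical helpers, not DPH functions.<br />
<br />
-- | Evaluate every element of the list. For Double, weak head normal<br />
-- form is already full evaluation.<br />
forceList :: [Double] -> IO ()<br />
forceList xs = evaluate (foldl' (\() x -> x `seq` ()) () xs)<br />
<br />
-- | Force both inputs first, then time only the dot product.<br />
-- Returns the result and the elapsed CPU time in picoseconds.<br />
timeDotp :: [Double] -> [Double] -> IO (Double, Integer)<br />
timeDotp v w = do<br />
  forceList v<br />
  forceList w<br />
  start  <- getCPUTime<br />
  result <- evaluate (sum (zipWith (*) v w))<br />
  end    <- getCPUTime<br />
  return (result, end - start)<br />
</haskell><br />
Without the two calls to <hask>forceList</hask>, the cost of generating the lazily constructed inputs would be charged to the measured interval.<br />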
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocations, we used the option <code>-fdph-seq</code> to select the <code>dph-seq</code> flavour. We can as well compile with <code>-fdph-par</code> to generate multi-threaded code. By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as when does a computation have sufficient parallelism to make it worthwhile to exploit that parallelism) are still with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
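<br />
The traversal pattern can be sketched in plain Haskell. The following is an illustration only: a list of children stands in for the parallel array <hask>[:RTree a:]</hask>, the node label is an added assumption (the definition above stores no data in a node), and <hask>sumTree</hask> and <hask>depth</hask> are hypothetical names.<br />
<haskell><br />
-- Sketch only: a list of children stands in for [:RTree a:], and the<br />
-- label of type a in each node is an added assumption.<br />
data RTree a = RNode a [RTree a]<br />
<br />
-- | Sum all labels. The recursion follows the inductive structure<br />
-- level by level; the map over the children is the step that would<br />
-- operate on a parallel array in DPH.<br />
sumTree :: Num a => RTree a -> a<br />
sumTree (RNode x kids) = x + sum (map sumTree kids)<br />
<br />
-- | Number of levels in the tree; a single node has depth 1.<br />
depth :: RTree a -> Int<br />
depth (RNode _ kids) = 1 + maximum (0 : map depth kids)<br />
</haskell><br />
For the tree <hask>RNode 1 [RNode 2 [], RNode 3 [RNode 4 []]]</hask>, <hask>sumTree</hask> yields <hask>10</hask> and <hask>depth</hask> yields <hask>3</hask>.<br />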
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>
Chak, https://wiki.haskell.org/GHC/Data_Parallel_Haskell, GHC/Data Parallel Haskell, 2009-03-09T10:14:43Z<p>Chak: /* Where to get it */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus on utilising multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
__TOC__<br />
<br />
=== Project status ===<br />
<br />
A first ''technology preview'' of Data Parallel Haskell (DPH) is included in the 6.10.1 release of GHC. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance).<br />
<br />
The purpose of this technology preview is twofold. Firstly, it gives interested early adopters the opportunity to see where the project is headed and enables them to experiment with simple DPH programs. Secondly, we hope to get user feedback that helps us to guide the project and prioritise those features that our users are most interested in.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
Currently, we recommend using the implementation in GHC 6.10.1. It is [http://haskell.org/ghc/download_ghc_6_10_1.html available in source and binary form] for many architectures. (Please use the version in the HEAD repository of GHC only if you are a GHC developer or a very experienced GHC user and if you know the current status of the DPH code – intermediate versions may well be broken while we implement major changes.)<br />
<br />
Note that in addition to the compiler you will need the <code>dph</code> package; this is included in the <code>ghc-XXX-src-extralibs.tar.bz2</code> archive, but can also be retrieved by invoking:<br />
<br />
./sync-all --dph get<br />
<br />
from within the compiler source tree.<br />
<br />
'''Note on versions:''' The 6.10.1 release has now fallen considerably behind the current development version in the HEAD repository, not only with respect to DPH support, but generally concerning support for multi-core parallelism in the GHC runtime system. Hence, if you are interested in performance and scalability, you need to use the development compiler – with the usual caveats. We are planning a more mature stable release for 6.12. (Due to the scale of the changes involved, we are not able to backport the latest changes to the 6.10.2 release.)<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines analogues of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions).<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
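<br />
To see where the O(log ''n'') step count and the associativity requirement come from, here is a minimal sketch of a tree-shaped reduction on ordinary lists. It is not how <hask>foldP</hask> is implemented; the names <hask>pairUp</hask> and <hask>treeFold</hask> are hypothetical, and each call to <hask>pairUp</hask> merely models one round in which all pairs could be combined at the same time.<br />
<br />
```haskell
module Main where

-- | Combine neighbouring elements pairwise. One call models one parallel
-- step: every pair could be combined independently of the others.
pairUp :: (a -> a -> a) -> [a] -> [a]
pairUp f (x:y:rest) = f x y : pairUp f rest
pairUp _ xs         = xs   -- zero or one element left over

-- | Tree-shaped reduction (hypothetical helper, not part of DPH).
-- Returns the result and the number of pairwise rounds that were needed.
-- The neutral element is only used for the empty list.
treeFold :: (a -> a -> a) -> a -> [a] -> (a, Int)
treeFold f z = go 0
  where
    go rounds []  = (z, rounds)
    go rounds [x] = (x, rounds)
    go rounds xs  = go (rounds + 1) (pairUp f xs)

main :: IO ()
main = do
  -- Associative operator: same result as a left fold, in 3 rounds for 8 elements.
  print (treeFold (+) 0 [1 .. 8 :: Int])              -- (36,3)
  -- Non-associative operator: the grouping changes the result.
  print ( foldl (-) 0 [1 .. 4 :: Int]
        , fst (treeFold (-) 0 [1 .. 4 :: Int]) )      -- (-10,0)
```
<br />
With an associative operator the grouping does not matter, so the implementation is free to pick whatever grouping suits the available cores; with subtraction the two groupings disagree, which is why an undirected fold cannot accept it.<br />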
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
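<br />
For readers without a DPH installation, the two formulations mentioned above can be tried on ordinary lists. The following sketch only assumes GHC's <hask>ParallelListComp</hask> extension; <hask>dotpList</hask> and <hask>dotpZip</hask> are illustrative names, and the code runs sequentially rather than in parallel.<br />
<br />
```haskell
{-# LANGUAGE ParallelListComp #-}
module Main where

-- | List analogue of dotp, using a parallel list comprehension:
-- the two generators are drawn from in lock-step, not nested.
dotpList :: Num a => [a] -> [a] -> a
dotpList xs ys = sum [x * y | x <- xs | y <- ys]

-- | The same function written with an explicit zip.
dotpZip :: Num a => [a] -> [a] -> a
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]

main :: IO ()
main = do
  let xs = [1, 2, 3] :: [Double]
      ys = [4, 5, 6] :: [Double]
  print (dotpList xs ys)  -- 32.0
  print (dotpZip xs ys)   -- 32.0
```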
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code called ''vectorisation'' that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but there it raises the level of expressiveness dramatically.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not properly handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-Prelude.html Data.Array.Parallel.Prelude] plus three extra modules to support one numeric type each: [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-Prelude-Double.html Data.Array.Parallel.Prelude.Double], [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-Prelude-Int.html Data.Array.Parallel.Prelude.Int], and [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-Prelude-Word8.html Data.Array.Parallel.Prelude.Word8]. These three modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays '''cannot''' be passed. Instead, we can pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The definition of <hask>dotp_double</hask> requires two language extensions, namely <hask>PArr</hask> to enable the syntax of parallel arrays and <hask>ParallelListComp</hask> for the parallel comprehension. Furthermore, we need to explicitly tell GHC which modules we want to vectorise.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
The compiler option to enable vectorisation is <code>-fvectorise</code>. Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE PArr, ParallelListComp #-}<br />
{-# OPTIONS -fvectorise #-}<br />
<br />
module DotP (dotp_double,dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fcpr-off -fdph-seq DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that is geared at optimising DPH code. Moreover, we use <code>-fcpr-off</code> as GHC's CPR phase doesn't play nice with type families at the moment, which in turn are heavily used in the DPH library. We shall discuss <code>-fdph-seq</code> below.<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a wrapper module that calls the vectorised code, but is itself not vectorised. In this simple example, this is just the <hask>Main</hask> module that generates two random vectors and computes their dot product:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
  = do<br />
      gen1 <- newStdGen<br />
      gen2 <- newStdGen<br />
      let v = randomRs n range gen1<br />
          w = randomRs n range gen2<br />
      print $ dotp_wrapper v w   -- invoke vectorised code and print the result<br />
  where<br />
    n     = 10000        -- vector length<br />
    range = (-100, 100)  -- range of vector elements<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -O -fdph-seq Main.hs</code><br />
</blockquote><br />
and finally link with<br />
<blockquote><br />
<code>ghc -o dotp -fdph-seq -threaded DotP.o Main.o</code><br />
</blockquote><br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
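<br />
The idea of forcing the inputs before starting the clock can be sketched with ordinary lists. This is not the DPH <hask>nf</hask> function; it assumes the <code>deepseq</code> package that ships with GHC, uses deterministic inputs (a hypothetical stand-in for the random vectors), and measures CPU time with <hask>getCPUTime</hask> from <code>base</code>.<br />
<br />
```haskell
module Main where

import Control.DeepSeq   (force)
import Control.Exception (evaluate)
import System.CPUTime    (getCPUTime)

-- | Sequential list stand-in for the vectorised dot product.
dotpList :: [Double] -> [Double] -> Double
dotpList xs ys = sum (zipWith (*) xs ys)

main :: IO ()
main = do
  let n = 10000 :: Int
  -- Fully evaluate both inputs *before* the timed region, so that the
  -- cost of generating them is not attributed to the dot product.
  xs <- evaluate (force [fromIntegral i       | i <- [1 .. n]])
  ys <- evaluate (force [fromIntegral (n - i) | i <- [1 .. n]])
  start  <- getCPUTime
  result <- evaluate (dotpList xs ys)   -- a Double, so WHNF is fully evaluated
  end    <- getCPUTime
  putStrLn ("result: " ++ show result)
  putStrLn ("CPU time (picoseconds): " ++ show (end - start))
```
<br />
Without the two <hask>force</hask> calls, laziness would defer input generation into the timed region and distort the measurement.<br />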
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocations, we used the option <code>-fdph-seq</code> to select the <code>dph-seq</code> flavour. We can as well compile with <code>-fdph-par</code> to generate multi-threaded code. By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as when does a computation have sufficient parallelism to make it worthwhile to exploit that parallelism) are still with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
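<br />
The split between sequential levels and parallel children can be illustrated with a list-based stand-in. In this sketch the children are held in an ordinary list, and the payload field of <hask>RNode</hask> is an assumption added for illustration (the definition above carries none); <hask>sumTree</hask>, <hask>depth</hask>, and <hask>example</hask> are hypothetical names.<br />
<br />
```haskell
module Main where

-- | List-based stand-in for the parallel rose tree. The payload of type 'a'
-- is an assumption for this example; DPH would hold the children in [: :].
data RTree a = RNode a [RTree a]

-- | Sum of all payloads. The recursion descends level by level; the 'map'
-- over the children is the part a parallel array could evaluate in parallel.
sumTree :: Num a => RTree a -> a
sumTree (RNode x kids) = x + sum (map sumTree kids)

-- | Number of levels, i.e. the length of the sequential part of a traversal.
depth :: RTree a -> Int
depth (RNode _ kids) = 1 + maximum (0 : map depth kids)

example :: RTree Int
example = RNode 1 [RNode 2 [RNode 4 []], RNode 3 []]

main :: IO ()
main = print (sumTree example, depth example)  -- (10,3)
```
<br />
A wide, shallow tree therefore exposes much parallelism per level, whereas a deep, narrow tree degenerates towards sequential execution.<br />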
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are curious about the implementation details and internals of the Data Parallel Haskell project, many of them are described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>
Chak, https://wiki.haskell.org/GHC/Data_Parallel_Haskell, GHC/Data Parallel Haskell, 2009-03-09T10:13:14Z<p>Chak: /* Where to get it */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus on utilising multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
__TOC__<br />
<br />
=== Project status ===<br />
<br />
A first ''technology preview'' of Data Parallel Haskell (DPH) is included in the 6.10.1 release of GHC. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance).<br />
<br />
The purpose of this technology preview is twofold. Firstly, it gives interested early adopters the opportunity to see where the project is headed and enables them to experiment with simple DPH programs. Secondly, we hope to get user feedback that helps us to guide the project and prioritise those features that our users are most interested in.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
Currently, we recommend using the implementation in GHC 6.10.1. It is [http://haskell.org/ghc/download_ghc_6_10_1.html available in source and binary form] for many architectures. (Please use the version in the HEAD repository of GHC only if you are a GHC developer or a very experienced GHC user and if you know the current status of the DPH code – intermediate versions may well be broken while we implement major changes.)<br />
<br />
Note that in addition to the compiler you will need the <code>dph</code> package; this is included in the <code>ghc-XXX-src-extralibs.tar.bz2</code> archive, but can also be retrieved by invoking:<br />
<br />
./sync-all --dph get<br />
<br />
from within the compiler source tree.<br />
<br />
'''Note on versions:''' The 6.10.1 release has now fallen considerably behind the current development version in the HEAD repository, not only with respect to DPH support, but generally concerning support for multi-core parallelism in the GHC runtime system. Hence, if you are interested in performance and scalability, you need to use the development compiler – with the usual caveats. We are planning a more mature stable release for 6.12.<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three-element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines analogues of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions).<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code called ''vectorisation'' that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but there it raises the level of expressiveness dramatically.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not properly handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-Prelude.html Data.Array.Parallel.Prelude] plus three extra modules to support one numeric type each: [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-Prelude-Double.html Data.Array.Parallel.Prelude.Double], [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-Prelude-Int.html Data.Array.Parallel.Prelude.Int], and [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-Prelude-Word8.html Data.Array.Parallel.Prelude.Word8]. These three modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays '''cannot''' be passed. Instead, we can pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The definition of <hask>dotp_double</hask> requires two language extensions, namely <hask>PArr</hask> to enable the syntax of parallel arrays and <hask>ParallelListComp</hask> for the parallel comprehension. Furthermore, we need to explicitly tell GHC which modules we want to vectorise.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
The compiler option to enable vectorisation is <code>-fvectorise</code>. Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE PArr, ParallelListComp #-}<br />
{-# OPTIONS -fvectorise #-}<br />
<br />
module DotP (dotp_double,dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fcpr-off -fdph-seq DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that is geared at optimising DPH code. Moreover, we use <code>-fcpr-off</code> as GHC's CPR phase doesn't play nice with type families at the moment, which in turn are heavily used in the DPH library. We shall discuss <code>-fdph-seq</code> below.<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a wrapper module that calls the vectorised code, but is itself not vectorised. In this simple example, this is just the <hask>Main</hask> module that generates two random vectors and computes their dot product:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
= do <br />
gen1 <- newStdGen<br />
gen2 <- newStdGen<br />
let v = randomRs n range gen1<br />
w = randomRs n range gen2<br />
print $ dotp_wrapper v w -- invoke vectorised code and print the result<br />
where<br />
n = 10000 -- vector length<br />
range = (-100, 100) -- range of vector elements<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -O -fdph-seq Main.hs</code><br />
</blockquote><br />
and finally link with<br />
<blockquote><br />
<code>ghc -o dotp -fdph-seq -threaded DotP.o Main.o</code><br />
</blockquote><br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
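The idea of forcing the input before starting the clock can be sketched with ordinary lists and functions from <code>base</code> only. This is a minimal illustration, not DPH code: the generator <code>lcg</code>, the helpers <code>randomVector</code> and <code>forceList</code>, and the list-based <code>dotp</code> are hypothetical stand-ins for <hask>randomRs</hask>, <hask>nf</hask>, and the vectorised dot product.<br />

```haskell
module Main where

import Control.Exception (evaluate)
import Data.List (foldl')
import System.CPUTime (getCPUTime)

-- Hypothetical stand-in for a random source: a linear congruential generator.
lcg :: Int -> Int
lcg s = (s * 1103515245 + 12345) `mod` 2147483648

-- n pseudo-random doubles in the range [-100, 100).
randomVector :: Int -> Int -> [Double]
randomVector n seed = take n (map scale (tail (iterate lcg seed)))
  where
    scale s = fromIntegral (s `mod` 200000) / 1000 - 100

-- Force every element of the list (assumed analogue of what nf is used for).
forceList :: [Double] -> IO ()
forceList xs = evaluate (foldl' (\() x -> x `seq` ()) () xs)

-- Plain list dot product standing in for the vectorised code.
dotp :: [Double] -> [Double] -> Double
dotp xs ys = sum (zipWith (*) xs ys)

main :: IO ()
main = do
  let v = randomVector 10000 1
      w = randomVector 10000 2
  -- Generate both inputs completely before the clock starts.
  forceList v
  forceList w
  start  <- getCPUTime
  result <- evaluate (dotp v w)
  end    <- getCPUTime
  putStrLn ("result: " ++ show result)
  putStrLn ("CPU time (picoseconds): " ++ show (end - start))
```

Without the two <code>forceList</code> calls, lazy evaluation would generate the vectors inside the timed region, so the measurement would mostly reflect the sequential generator.<br />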
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocations, we used the option <code>-fdph-seq</code> to select the <code>dph-seq</code> flavour. We can as well compile with <code>-fdph-par</code> to generate multi-threaded code. By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as whether a computation has sufficient parallelism to make exploiting it worthwhile) still lies with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
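The shape of such traversals can be sketched with ordinary lists standing in for parallel arrays. In this minimal sketch the payload field of <code>RNode</code> and the functions <code>sizeT</code>, <code>depthT</code>, and <code>sumT</code> are assumptions added for illustration; in DPH, each <hask>map</hask> over the children would correspond to an operation on a parallel array.<br />

```haskell
module Main where

-- List-based stand-in for the parallel rose tree. The payload field is an
-- assumption added so that there is something to compute with.
data RTree a = RNode a [RTree a]

-- Number of nodes: the recursion descends level by level, while the
-- children of one node are processed by a single map.
sizeT :: RTree a -> Int
sizeT (RNode _ kids) = 1 + sum (map sizeT kids)

-- Number of levels in the tree.
depthT :: RTree a -> Int
depthT (RNode _ kids) = 1 + maximum (0 : map depthT kids)

-- Sum of all payloads.
sumT :: Num a => RTree a -> a
sumT (RNode x kids) = x + sum (map sumT kids)

example :: RTree Int
example = RNode 1 [RNode 2 [RNode 4 []], RNode 3 []]

main :: IO ()
main = print (sizeT example, depthT example, sumT example)
```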
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/Gtk2HsGtk2Hs2009-03-04T02:45:54Z<p>Chak: /* Using the GTK+ OS X Framework */</p>
<hr />
<div>[[Category:User interfaces]]<br />
== What is it? ==<br />
<br />
Gtk2Hs is a Haskell binding to Gtk+ 2.x.<br />
Using it, one can write Gtk+ based applications with GHC.<br />
<br />
== Homepage ==<br />
<br />
http://haskell.org/gtk2hs/<br />
<br />
== Status ==<br />
<br />
It currently works with Gtk+ 2.0 through to 2.8 on Unix, Win32 and MacOS X.<br />
The widget function coverage is almost complete, only a few minor bits and pieces are missing.<br />
<br />
It currently builds with ghc 5.04.3 through to 6.8.2<br />
<br />
== Installation Notes ==<br />
=== Mac OS X ===<br />
<br />
==== Using the GTK+ OS X Framework ====<br />
<br />
This explains how to install Gtk2Hs on Macs using the native [http://www.gtk-osx.org/ GTK+ OS X Framework], a port of GTK+ to the Mac that does '''not''' depend on X11, and hence, is better integrated into the Mac desktop - i.e., menus actually appear in the menu bar, where they belong. It also avoids the often tedious installation of GTK+ via MacPorts. However, it lacks the optional Gtk2Hs packages that the [http://www.gtk-osx.org/ GTK+ OS X Framework] does not currently support, most notably Glade. It does include support for Cairo, though.<br />
<br />
Here is how to install the library:<br />
# Download and install [http://www.gtk-osx.org/ GTK+ OS X Framework] (this uses the standard Mac package installer).<br />
# Install [http://pkg-config.freedesktop.org/ pkg-config], either by compiling it from source or via MacPorts.<br />
# Download and unpack the Gtk2Hs tar ball from the [http://www.haskell.org/gtk2hs/downloads/ Gtk2Hs download page] (I tested 0.10.0).<br />
# Configure with (you may want to remove the two backslashes and put everything on one line)<br />
env PKG_CONFIG_PATH=/Library/Frameworks/Cairo.framework/Resources/dev/lib/pkgconfig:\ <br />
/Library/Frameworks/GLib.framework/Resources/dev/lib/pkgconfig:\ <br />
/Library/Frameworks/Gtk.framework/Resources/dev/lib/pkgconfig ./configure --disable-gio<br />
# Build with<br />
make<br />
# Install (to <tt>/usr/local/</tt> unless a <tt>--prefix</tt> option was passed to <tt>configure</tt>) with<br />
sudo make install<br />
<br />
The library is now registered with the package database of the GHC you used for compiling.<br />
<br />
NB: Thanks to Ross Mellgren for his post on the gtk2hs-users list that outlined the use of <tt>PKG_CONFIG_PATH</tt>.<br />
<br />
==== Article as of Mid 2008 ====<br />
Installing Gtk2Hs on Mac requires some finesse, at least until the Haskell Library Platform is built or ghc-6.8.3 is <br />
available in macports. (These are planned for late 2008.)<br />
<br />
* Install [http://macports.org MacPorts]<br />
* Install dependencies:<br />
 sudo port install glade3 libglade2 gstreamer gst-plugins-base gtksourceview cairo librsvg gtkglext firefox<br />
* Update PKG_CONFIG_PATH (for libraries)<br />
export PKG_CONFIG_PATH=/usr/lib/pkgconfig:/usr/local/lib/pkgconfig:/opt/local/lib/pkgconfig<br />
* Update ghc to use macports libs: Edit your main <tt>ghc</tt> driver program and change the last line to:<br />
exec $GHCBIN $TOPDIROPT ${1+"$@"} -L/opt/local/lib -I/opt/local/include<br />
* Download Gtk2Hs following instructions at [http://www.haskell.org/gtk2hs/downloads/ Gtk2Hs Download page]<br />
* Check configuration:<br />
./configure --enable-docs --enable-profiling<br />
<br />
...<br />
<br />
**************************************************<br />
* Configuration completed successfully. <br />
* <br />
* The following packages will be built: <br />
* <br />
* glib : yes <br />
* gtk : yes <br />
* glade : yes <br />
* cairo : yes <br />
* svgcairo : yes <br />
* gtkglext : yes <br />
* gconf : yes <br />
* sourceview : yes <br />
* mozembed : yes <br />
* soegtk : yes <br />
* gnomevfs : yes <br />
* gstreamer : yes <br />
* documentation : yes <br />
* <br />
* Now do "(g)make" followed by "(g)make install"<br />
**************************************************<br />
* Build and Install:<br />
make <br />
sudo make install<br />
==== Recent experiences ====<br />
I successfully installed the latest version on Mac OS 10.5 by:<br />
* Installing Macports.<br />
* <tt>sudo port install ghc</tt><br />
* <tt>sudo port install gtk2hs</tt> - which does not complete successfully. It does, however, install the appropriate dependencies. Note that there are so many that you may need to run the install a couple of times due to timeouts etc. The build of Gtk2Hs will fail, but that is ok - continue as below.<br />
* Remove the build directory under <tt>/opt/.../build/gtk2hs</tt><br />
* Download Gtk2Hs via darcs as per [http://haskell.org/gtk2hs/development/#darcs the gtk2hs download instructions]<br />
* do a <tt>sudo port install automake</tt><br />
* do a <tt>sudo port install alex</tt><br />
* do a <tt>sudo port install happy</tt> (Note this also fails and must be built from source. See the [[Happy]] page for details.)<br />
* Follow the build instructions on [http://haskell.org/gtk2hs/development/#darcs the gtk2hs download page]. I would suggest using <tt>./configure --prefix=/opt/local</tt> to get it in the same place as ports - personal preference though.<br />
Good luck - as usual, your mileage may vary.<br />
<br />
== Demos ==<br />
<br />
=== OpenGL and Gtk2Hs ===<br />
<br />
[[Gtk2Hs/Demos/GtkGLext/hello.hs]]<br />
<br />
[[Gtk2Hs/Demos/GtkGLext/terrain.hs]] requires [[Gtk2Hs/Demos/GtkGLext/terrain.xpm]]<br />
<br />
==FAQs==<br />
These are links to FAQs on the main site.<br />
*[http://haskell.org/gtk2hs/archives/2005/06/23/hiding-the-console-on-windows/#more-26 Hiding the console on windows]<br />
*[http://haskell.org/gtk2hs/archives/2005/07/24/writing-multi-threaded-guis/#more-38 Writing multi-threaded GUIs]<br />
*[http://haskell.org/gtk2hs/archives/2005/06/24/building-from-source-on-windows/#more-15 Building on Windows]<br />
*[http://haskell.org/gtk2hs/development/#darcs Checkout instructions]. Also see [[Darcs]]<br />
<br />
[[Category:Applications]]</div>Chakhttps://wiki.haskell.org/Gtk2HsGtk2Hs2009-03-04T02:45:19Z<p>Chak: Use of the GTK+ OS X Framework</p>
<hr />
<div>[[Category:User interfaces]]<br />
== What is it? ==<br />
<br />
Gtk2Hs is a Haskell binding to Gtk+ 2.x.<br />
Using it, one can write Gtk+ based applications with GHC.<br />
<br />
== Homepage ==<br />
<br />
http://haskell.org/gtk2hs/<br />
<br />
== Status ==<br />
<br />
It currently works with Gtk+ 2.0 through to 2.8 on Unix, Win32 and MacOS X.<br />
The widget function coverage is almost complete, only a few minor bits and pieces are missing.<br />
<br />
It currently builds with ghc 5.04.3 through to 6.8.2<br />
<br />
== Installation Notes ==<br />
=== Mac OS X ===<br />
<br />
==== Using the GTK+ OS X Framework ====<br />
<br />
This explains how to install Gtk2Hs on Macs using the native [http://www.gtk-osx.org/ GTK+ OS X Framework], a port of GTK+ to the Mac that does '''not''' depend on X11, and hence, is better integrated into the Mac desktop - i.e., menus actually appear in the menu bar, where they belong. It also avoids the often tedious installation of GTK+ via MacPorts. However, it lacks the optional Gtk2Hs packages that the [http://www.gtk-osx.org/ GTK+ OS X Framework] does not currently support, most notably Glade. It does include support for Cairo, though.<br />
<br />
Here is how to install the library:<br />
# Download and install [http://www.gtk-osx.org/ GTK+ OS X Framework] (this uses the standard Mac package installer).<br />
# Install [http://pkg-config.freedesktop.org/ pkg-config], either by compiling it from source or via MacPorts.<br />
# Download and unpack the Gtk2Hs tar ball from the [http://www.haskell.org/gtk2hs/downloads/ Gtk2Hs download page] (I tested 0.10.0).<br />
# Configure with (you may want to remove the two backslashes and put everything on one line)<br />
env PKG_CONFIG_PATH=/Library/Frameworks/Cairo.framework/Resources/dev/lib/pkgconfig:\ <br />
/Library/Frameworks/GLib.framework/Resources/dev/lib/pkgconfig:\ <br />
/Library/Frameworks/Gtk.framework/Resources/dev/lib/pkgconfig ./configure --disable-gio<br />
# Build with<br />
make<br />
# Install (to <tt>/usr/local/</tt> unless a <tt>--prefix</tt> option was passed to <tt>configure</tt>) with<br />
sudo make install<br />
<br />
The library is now registered with the package database of the GHC you used for compiling.<br />
<br />
NB: Thanks to Ross Mellgren for his post on the gtk2hs-users list that outlined the use of <tt>PKG_CONFIG_PATH</tt>.<br />
<br />
==== Article as of Mid 2008 ====<br />
Installing Gtk2Hs on Mac requires some finesse, at least until the Haskell Library Platform is built or ghc-6.8.3 is <br />
available in macports. (These are planned for late 2008.)<br />
<br />
* Install [http://macports.org MacPorts]<br />
* Install dependencies:<br />
 sudo port install glade3 libglade2 gstreamer gst-plugins-base gtksourceview cairo librsvg gtkglext firefox<br />
* Update PKG_CONFIG_PATH (for libraries)<br />
export PKG_CONFIG_PATH=/usr/lib/pkgconfig:/usr/local/lib/pkgconfig:/opt/local/lib/pkgconfig<br />
* Update ghc to use macports libs: Edit your main <tt>ghc</tt> driver program and change the last line to:<br />
exec $GHCBIN $TOPDIROPT ${1+"$@"} -L/opt/local/lib -I/opt/local/include<br />
* Download Gtk2Hs following instructions at [http://www.haskell.org/gtk2hs/downloads/ Gtk2Hs Download page]<br />
* Check configuration:<br />
./configure --enable-docs --enable-profiling<br />
<br />
...<br />
<br />
**************************************************<br />
* Configuration completed successfully. <br />
* <br />
* The following packages will be built: <br />
* <br />
* glib : yes <br />
* gtk : yes <br />
* glade : yes <br />
* cairo : yes <br />
* svgcairo : yes <br />
* gtkglext : yes <br />
* gconf : yes <br />
* sourceview : yes <br />
* mozembed : yes <br />
* soegtk : yes <br />
* gnomevfs : yes <br />
* gstreamer : yes <br />
* documentation : yes <br />
* <br />
* Now do "(g)make" followed by "(g)make install"<br />
**************************************************<br />
* Build and Install:<br />
make <br />
sudo make install<br />
==== Recent experiences ====<br />
I successfully installed the latest version on Mac OS 10.5 by:<br />
* Installing Macports.<br />
* <tt>sudo port install ghc</tt><br />
* <tt>sudo port install gtk2hs</tt> - which does not complete successfully. It does, however, install the appropriate dependencies. Note that there are so many that you may need to run the install a couple of times due to timeouts etc. The build of Gtk2Hs will fail, but that is ok - continue as below.<br />
* Remove the build directory under <tt>/opt/.../build/gtk2hs</tt><br />
* Download Gtk2Hs via darcs as per [http://haskell.org/gtk2hs/development/#darcs the gtk2hs download instructions]<br />
* do a <tt>sudo port install automake</tt><br />
* do a <tt>sudo port install alex</tt><br />
* do a <tt>sudo port install happy</tt> (Note this also fails and must be built from source. See the [[Happy]] page for details.)<br />
* Follow the build instructions on [http://haskell.org/gtk2hs/development/#darcs the gtk2hs download page]. I would suggest using <tt>./configure --prefix=/opt/local</tt> to get it in the same place as ports - personal preference though.<br />
Good luck - as usual, your mileage may vary.<br />
<br />
== Demos ==<br />
<br />
=== OpenGL and Gtk2Hs ===<br />
<br />
[[Gtk2Hs/Demos/GtkGLext/hello.hs]]<br />
<br />
[[Gtk2Hs/Demos/GtkGLext/terrain.hs]] requires [[Gtk2Hs/Demos/GtkGLext/terrain.xpm]]<br />
<br />
==FAQs==<br />
These are links to FAQs on the main site.<br />
*[http://haskell.org/gtk2hs/archives/2005/06/23/hiding-the-console-on-windows/#more-26 Hiding the console on windows]<br />
*[http://haskell.org/gtk2hs/archives/2005/07/24/writing-multi-threaded-guis/#more-38 Writing multi-threaded GUIs]<br />
*[http://haskell.org/gtk2hs/archives/2005/06/24/building-from-source-on-windows/#more-15 Building on Windows]<br />
*[http://haskell.org/gtk2hs/development/#darcs Checkout instructions]. Also see [[Darcs]]<br />
<br />
[[Category:Applications]]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2008-12-08T04:54:49Z<p>Chak: /* Feedback */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus to utilise multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
__TOC__<br />
<br />
=== Project status ===<br />
<br />
A first ''technology preview'' of Data Parallel Haskell (DPH) is included in the 6.10.1 release of GHC. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance).<br />
<br />
The purpose of this technology preview is twofold. Firstly, it gives interested early adopters the opportunity to see where the project is headed and enables them to experiment with simple DPH programs. Secondly, we hope to get user feedback that helps us to guide the project and prioritise those features that our users are most interested in.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
Currently, we recommend using the implementation in GHC 6.10.1. It is [http://haskell.org/ghc/download_ghc_6_10_1.html available in source and binary form] for many architectures. (Please use the version in the HEAD repository of GHC only if you are a GHC developer or a very experienced GHC user and if you know the current status of the DPH code – intermediate versions may well be broken while we implement major changes.)<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, only that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines analogs of most list operations from the Haskell prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
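Why associativity matters for an undirected fold can be sketched with ordinary lists. The following is a minimal sequential model, not the DPH implementation: <code>pairUp</code>, <code>treeFold</code>, and <code>rounds</code> are hypothetical names. Each call to <code>pairUp</code> models one parallel step in which all neighbouring pairs could be combined at the same time.<br />

```haskell
module Main where

-- Combine neighbouring elements pairwise; this models one parallel step.
pairUp :: (a -> a -> a) -> [a] -> [a]
pairUp f (x : y : rest) = f x y : pairUp f rest
pairUp _ xs             = xs

-- Repeat the pairing step until a single element is left.
-- The second argument is only returned for an empty input.
treeFold :: (a -> a -> a) -> a -> [a] -> a
treeFold _ z []  = z
treeFold _ _ [x] = x
treeFold f z xs  = treeFold f z (pairUp f xs)

-- Number of pairing steps needed for n elements (about log2 n).
rounds :: Int -> Int
rounds n
  | n <= 1    = 0
  | otherwise = 1 + rounds ((n + 1) `div` 2)

main :: IO ()
main = do
  -- Associative operator: the grouping does not change the result.
  print (treeFold (+) 0 [1 .. 10 :: Int], foldl (+) 0 [1 .. 10 :: Int])
  -- Non-associative operator: tree-shaped and left-to-right folds disagree.
  print (treeFold (-) 0 [1 .. 4 :: Int], foldl (-) 0 [1 .. 4 :: Int])
  -- 1024 elements need 10 pairing steps.
  print (rounds 1024)
```

With subtraction the two folds give different answers, which is why a fold whose grouping is left unspecified is only well defined for associative functions.<br />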
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
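For readers without a DPH installation, the same two formulations can be tried on ordinary lists. This is a list-based analogue only, so it runs sequentially; the names <code>dotpList</code> and <code>dotpZip</code> are made up for this sketch.<br />

```haskell
{-# LANGUAGE ParallelListComp #-}
module Main where

-- List analogue of dotp using a parallel list comprehension.
dotpList :: Num a => [a] -> [a] -> a
dotpList xs ys = sum [x * y | x <- xs | y <- ys]

-- The same function written with an explicit zip.
dotpZip :: Num a => [a] -> [a] -> a
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]

main :: IO ()
main = do
  let xs = [1, 2, 3] :: [Double]
      ys = [4, 5, 6] :: [Double]
  print (dotpList xs ys)                  -- 1*4 + 2*5 + 3*6
  print (dotpList xs ys == dotpZip xs ys) -- both formulations agree
```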
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code called ''vectorisation'' that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but there it raises the level of expressiveness dramatically.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not properly handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-Prelude.html Data.Array.Parallel.Prelude] plus three extra modules to support one numeric type each: [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-Prelude-Double.html Data.Array.Parallel.Prelude.Double], [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-Prelude-Int.html Data.Array.Parallel.Prelude.Int], and [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-Prelude-Word8.html Data.Array.Parallel.Prelude.Word8]. These three modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays '''cannot''' be passed. Instead, we can pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The definition of <hask>dotp_double</hask> requires two language extensions, namely <hask>PArr</hask> to enable the syntax of parallel arrays and <hask>ParallelListComp</hask> for the parallel comprehension. Furthermore, we need to explicitly tell GHC which modules we want to vectorise.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
The compiler option to enable vectorisation is <code>-fvectorise</code>. Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE PArr, ParallelListComp #-}<br />
{-# OPTIONS -fvectorise #-}<br />
<br />
module DotP (dotp_double,dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fcpr-off -fdph-seq DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that is geared at optimising DPH code. Moreover, we use <code>-fcpr-off</code> as GHC's CPR phase doesn't play nice with type families at the moment, which in turn are heavily used in the DPH library. We shall discuss <code>-fdph-seq</code> below.<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a wrapper module that calls the vectorised code, but is itself not vectorised. In this simple example, this is just the <hask>Main</hask> module that generates two random vectors and computes their dot product:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
= do <br />
gen1 <- newStdGen<br />
gen2 <- newStdGen<br />
let v = randomRs n range gen1<br />
w = randomRs n range gen2<br />
print $ dotp_wrapper v w -- invoke vectorised code and print the result<br />
where<br />
n = 10000 -- vector length<br />
range = (-100, 100) -- range of vector elements<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -O -fdph-seq Main.hs</code><br />
</blockquote><br />
and finally link with<br />
<blockquote><br />
<code>ghc -o dotp -fdph-seq -threaded DotP.o Main.o</code><br />
</blockquote><br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
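<br />
The idea of forcing the inputs before starting the clock can be illustrated without DPH. The following is only a sketch under stated assumptions: it uses plain lists instead of <hask>PArray</hask>, a hand-written pseudo-random generator instead of <hask>randomRs</hask>, and <hask>evaluate</hask> from <hask>Control.Exception</hask> in place of the <hask>nf</hask> function mentioned above. The names <hask>pseudoRandoms</hask> and <hask>forceList</hask> are hypothetical helpers, not part of any DPH API.<br />
<haskell><br />
module Main (main) where<br />
<br />
import Control.Exception (evaluate)<br />
import System.CPUTime (getCPUTime)<br />
<br />
-- Deterministic pseudo-random sequence (linear congruential generator),<br />
-- used here only to avoid a dependency on the random package.<br />
pseudoRandoms :: Int -> Int -> [Double]<br />
pseudoRandoms n seed = map toUnit (take n (iterate step (step seed)))<br />
  where<br />
    step s   = (s * 1103515245 + 12345) `mod` 2147483648<br />
    toUnit s = fromIntegral s / 2147483648<br />
<br />
-- Force every element of the list (hypothetical stand-in for nf).<br />
forceList :: [Double] -> IO ()<br />
forceList xs = evaluate (foldr seq () xs)<br />
<br />
-- Sequential list version of the dot product.<br />
dotp :: [Double] -> [Double] -> Double<br />
dotp xs ys = sum (zipWith (*) xs ys)<br />
<br />
main :: IO ()<br />
main = do<br />
  let v = pseudoRandoms 10000 1<br />
      w = pseudoRandoms 10000 2<br />
  forceList v                    -- generate the inputs before the clock starts<br />
  forceList w<br />
  start  <- getCPUTime<br />
  result <- evaluate (dotp v w)  -- only this computation is timed<br />
  end    <- getCPUTime<br />
  print result<br />
  putStrLn ("picoseconds: " ++ show (end - start))<br />
</haskell><br />
Without the two <hask>forceList</hask> calls, laziness would defer the generation of both vectors into the timed region, so the measurement would mostly reflect the sequential generator.<br />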
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocations, we used the option <code>-fdph-seq</code> to select the <code>dph-seq</code> flavour. We can as well compile with <code>-fdph-par</code> to generate multi-threaded code. By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as whether a computation has enough parallelism to make exploiting it worthwhile) are still with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
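<br />
The division of labour between the sequential and the parallel part of such a structure can be sketched with ordinary lists standing in for parallel arrays. This is an illustration only: in DPH the child collection would be a parallel array and the list functions <hask>map</hask> and <hask>sum</hask> would be replaced by <hask>mapP</hask> and <hask>sumP</hask>; the functions <hask>size</hask> and <hask>depth</hask> are hypothetical examples and not part of the DPH library.<br />
<haskell><br />
module Main (main) where<br />
<br />
-- List-based stand-in for the rose tree from the text.<br />
data RTree a = RNode [RTree a]<br />
<br />
-- Number of nodes. The recursion descends level by level (sequential),<br />
-- while the map over the children is the part that could run in parallel.<br />
size :: RTree a -> Int<br />
size (RNode ts) = 1 + sum (map size ts)<br />
<br />
-- Number of levels, i.e. the length of the sequential dependency chain.<br />
depth :: RTree a -> Int<br />
depth (RNode ts) = 1 + maximum (0 : map depth ts)<br />
<br />
example :: RTree ()<br />
example = RNode [RNode [], RNode [RNode [], RNode []], RNode []]<br />
<br />
main :: IO ()<br />
main = do<br />
  print (size example)<br />
  print (depth example)<br />
</haskell><br />
Roughly speaking, <hask>depth</hask> bounds the number of sequential steps of a traversal, whereas the width of each level determines how much parallelism is available.<br />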
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_HaskellGHC/Data Parallel Haskell2008-12-02T13:33:20Z<p>Chak: /* Parallel execution */</p>
<hr />
<div>[[Category:GHC|Data Parallel Haskell]]<br />
== Data Parallel Haskell ==<br />
<br />
''Data Parallel Haskell'' is the codename for an extension to the Glasgow Haskell Compiler and its libraries to support [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html nested data parallelism] with a focus to utilise multicore CPUs. Nested data parallelism extends the programming model of flat data parallelism, as known from parallel Fortran dialects, to irregular parallel computations (such as divide-and-conquer algorithms) and irregular data structures (such as sparse matrices and tree structures). An introduction to nested data parallelism in Haskell, including some examples, can be found in the paper [http://www.cse.unsw.edu.au/~chak/papers/papers.html#ndp-haskell Nepal – Nested Data-Parallelism in Haskell]. <br />
<br />
=== Project status ===<br />
<br />
A first ''technology preview'' of Data Parallel Haskell (DPH) is included in the 6.10.1 release of GHC. All major components of DPH are implemented, including code vectorisation and parallel execution on multicore systems. However, the implementation has many limitations and probably also many bugs. Major limitations include the inability to mix vectorised and non-vectorised code in a single Haskell module, the need to use a feature-deprived, special-purpose Prelude in vectorised code, and a lack of optimisations (leading to poor performance).<br />
<br />
The purpose of this technology preview is twofold. Firstly, it gives interested early adopters the opportunity to see where the project is headed and enables them to experiment with simple DPH programs. Secondly, we hope to get user feedback that helps us to guide the project and prioritise those features that our users are most interested in.<br />
<br />
'''Disclaimer:''' Data Parallel Haskell is very much '''work in progress.''' Some components are already usable, and we explain here how to use them. However, please be aware that APIs are still in flux and functionality may change during development.<br />
<br />
=== Where to get it ===<br />
<br />
Currently, we recommend using the implementation in GHC 6.10.1. It is [http://haskell.org/ghc/download_ghc_6_10_1.html available in source and binary form] for many architectures. (Please use the version in the HEAD repository of GHC only if you are a GHC developer or a very experienced GHC user and if you know the current status of the DPH code – intermediate versions may well be broken while we implement major changes.)<br />
<br />
=== Overview ===<br />
<br />
From a user's point of view, Data Parallel Haskell adds a new data type to Haskell –namely, ''parallel arrays''– as well as operations on parallel arrays. Syntactically, parallel arrays are like lists, except that instead of square brackets <hask>[</hask> and <hask>]</hask>, parallel arrays use square brackets with a colon <hask>[:</hask> and <hask>:]</hask>. In particular, <hask>[:e:]</hask> is the type of parallel arrays with elements of type <hask>e</hask>; the expression <hask>[:x, y, z:]</hask> denotes a three element parallel array with elements <hask>x</hask>, <hask>y</hask>, and <hask>z</hask>; and <hask>[:x + 1 | x <- xs:]</hask> represents a simple array comprehension. More sophisticated array comprehensions (including the equivalent of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions]) as well as enumerations and pattern matching proceed in an analogous manner. Moreover, the array library of DPH defines analogs of most list operations from the Haskell Prelude and the standard <hask>List</hask> library (e.g., we have <hask>lengthP</hask>, <hask>sumP</hask>, <hask>mapP</hask>, and so on).<br />
<br />
The two main differences between lists and parallel arrays are that (1) parallel arrays are a strict data structure and (2) that they are not inductively defined. Parallel arrays are strict in that by demanding a single element, all elements of an array are demanded. Hence, all elements of a parallel array might be evaluated in parallel. To facilitate such parallel evaluation, operations on parallel arrays should treat arrays as aggregate structures that are manipulated in their entirety (instead of the inductive, element-wise processing that is the foundation of all Haskell list functions.)<br />
<br />
As a consequence, parallel arrays are always finite, and standard functions that yield infinite lists, such as <hask>enumFrom</hask> and <hask>repeat</hask>, have no corresponding array operation. Moreover, parallel arrays only have an undirected fold function <hask>foldP</hask> that requires an associative function as an argument – such a fold function has a parallel step complexity of O(log ''n'') for arrays of length ''n''. Parallel arrays also come with some aggregate operations that are absent from the standard list library, such as <hask>permuteP</hask>.<br />
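<br />
Why an undirected fold needs an associative function, and where the logarithmic step count comes from, can be seen in a sequential model of a tree-shaped reduction. The sketch below uses plain lists and is not how <hask>foldP</hask> is implemented; <hask>pairUp</hask> and <hask>treeFold</hask> are hypothetical names introduced for this illustration. Each call of <hask>pairUp</hask> models one parallel step in which all neighbouring pairs are combined at once.<br />
<haskell><br />
module Main (main) where<br />
<br />
-- Combine neighbouring elements pairwise; one call models one parallel step.<br />
pairUp :: (a -> a -> a) -> [a] -> [a]<br />
pairUp f (x : y : rest) = f x y : pairUp f rest<br />
pairUp _ xs             = xs<br />
<br />
-- Reduce a list by repeated pairwise combination and count the rounds.<br />
-- Returns Nothing for the empty list.<br />
treeFold :: (a -> a -> a) -> [a] -> Maybe (a, Int)<br />
treeFold _ [] = Nothing<br />
treeFold f xs = Just (go 0 xs)<br />
  where<br />
    go n [x] = (x, n)<br />
    go n ys  = go (n + 1) (pairUp f ys)<br />
<br />
main :: IO ()<br />
main = do<br />
  let xs = [1 .. 16] :: [Int]<br />
  print (treeFold (+) xs)   -- associative: agrees with the left fold below<br />
  print (foldl (+) 0 xs)<br />
  print (treeFold (-) xs)   -- not associative: the grouping changes the result<br />
  print (foldl1 (-) xs)<br />
</haskell><br />
For the 16 elements used here, the reduction finishes after four rounds, since each round halves the number of remaining elements. With subtraction, the tree-shaped and the left-to-right grouping disagree, which is why only associative functions are suitable.<br />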
<br />
=== A simple example ===<br />
<br />
As a simple example of a DPH program, consider the following code that computes the dot product of two vectors given as parallel arrays:<br />
<haskell><br />
dotp :: Num a => [:a:] -> [:a:] -> a<br />
dotp xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
This code uses an array variant of [http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#parallel-list-comprehensions parallel list comprehensions], which could alternatively be written as <hask>[:x * y | (x, y) <- zipP xs ys:]</hask>, but should otherwise be self-explanatory to any Haskell programmer.<br />
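<br />
The two formulations can be compared directly on ordinary lists, which support the same comprehension syntax via the <hask>ParallelListComp</hask> extension. The following sketch uses lists as a sequential stand-in for parallel arrays; the names <hask>dotpList</hask> and <hask>dotpZip</hask> are chosen for this illustration and are not part of DPH.<br />
<haskell><br />
{-# LANGUAGE ParallelListComp #-}<br />
module Main (main) where<br />
<br />
-- Parallel list comprehension: xs and ys are traversed in lockstep.<br />
dotpList :: Num a => [a] -> [a] -> a<br />
dotpList xs ys = sum [x * y | x <- xs | y <- ys]<br />
<br />
-- The same computation written with an explicit zip.<br />
dotpZip :: Num a => [a] -> [a] -> a<br />
dotpZip xs ys = sum [x * y | (x, y) <- zip xs ys]<br />
<br />
main :: IO ()<br />
main = do<br />
  let xs = [1, 2, 3] :: [Double]<br />
      ys = [4, 5, 6] :: [Double]<br />
  print (dotpList xs ys)                   -- 1*4 + 2*5 + 3*6<br />
  print (dotpList xs ys == dotpZip xs ys)  -- both formulations agree<br />
</haskell><br />
Note that the second <hask>|</hask> in the comprehension separates two generators that advance together; a comma in its place would instead produce all combinations of <hask>x</hask> and <hask>y</hask>.<br />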
<br />
=== Running DPH programs ===<br />
<br />
Unfortunately, we cannot use the above implementation of <hask>dotp</hask> directly in the current preliminary implementation of DPH. In the following, we will discuss how the code needs to be modified and how it needs to be compiled and run for parallel execution. GHC applies an elaborate transformation to DPH code called ''vectorisation'' that turns nested into flat data parallelism. This transformation is only useful for code that is executed in parallel (i.e., code that manipulates parallel arrays), but there it raises the level of expressiveness dramatically.<br />
<br />
==== No type classes ====<br />
<br />
Unfortunately, vectorisation does not properly handle type classes at the moment. Hence, we currently need to avoid overloaded operations in parallel code. To account for that limitation, we specialise <hask>dotp</hask> on doubles.<br />
<haskell><br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
</haskell><br />
<br />
==== Special Prelude ====<br />
<br />
As the current implementation of vectorisation cannot handle some language constructs, we cannot use it to vectorise those parts of the standard Prelude that might be used in parallel code (such as arithmetic operations). Instead, DPH comes with its own (rather limited) Prelude in [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-Prelude.html Data.Array.Parallel.Prelude] plus three extra modules to support one numeric type each: [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-Prelude-Double.html Data.Array.Parallel.Prelude.Double], [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-Prelude-Int.html Data.Array.Parallel.Prelude.Int], and [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-Prelude-Word8.html Data.Array.Parallel.Prelude.Word8]. These three modules support the same functions (on different types) and if a program needs to use more than one, they need to be imported qualified (as we cannot use type classes in vectorised code in the current version). If your code needs any other numeric types or functions that are not implemented in these Prelude modules, you currently need to implement and vectorise that functionality yourself.<br />
<br />
To compile <hask>dotp_double</hask>, we add the following three import statements:<br />
<haskell><br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
</haskell><br />
<br />
==== Impedance matching ====<br />
<br />
Special care is needed at the interface between vectorised and non-vectorised code. Currently, only simple types can be passed between these different kinds of code. In particular, parallel arrays '''cannot''' be passed. Instead, we can pass flat arrays of type <hask>PArray</hask>. This type is exported by our special-purpose Prelude together with a conversion function <hask>fromPArrayP</hask> (which is specific to the element type due to the lack of type classes in vectorised code). <br />
<br />
Using this conversion function, we define a wrapper function for <hask>dotp_double</hask> that we export and use from non-vectorised code.<br />
<haskell><br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
It is important to mark this function as <hask>NOINLINE</hask> as we don't want it to be inlined into non-vectorised code.<br />
<br />
==== Compiling vectorised code ====<br />
<br />
The definition of <hask>dotp_double</hask> requires two language extensions, namely <hask>PArr</hask> to enable the syntax of parallel arrays and <hask>ParallelListComp</hask> for the parallel comprehension. Furthermore, we need to explicitly tell GHC which modules we want to vectorise.<br />
<br />
Currently, GHC either vectorises all code in a module or none. This can be inconvenient as some parts of a program cannot be vectorised – for example, code in the <hask>IO</hask> monad (the radical re-ordering of computations performed by the vectorisation transformation is only valid for pure code). As a consequence, the programmer currently needs to partition vectorised and non-vectorised code carefully over different modules.<br />
<br />
The compiler option to enable vectorisation is <code>-fvectorise</code>. Overall, we get the following complete module definition for the dot-product code:<br />
<haskell><br />
{-# LANGUAGE PArr, ParallelListComp #-}<br />
{-# OPTIONS -fvectorise #-}<br />
<br />
module DotP (dotp_double, dotp_wrapper)<br />
where<br />
<br />
import qualified Prelude<br />
import Data.Array.Parallel.Prelude<br />
import Data.Array.Parallel.Prelude.Double<br />
<br />
dotp_double :: [:Double:] -> [:Double:] -> Double<br />
dotp_double xs ys = sumP [:x * y | x <- xs | y <- ys:]<br />
<br />
dotp_wrapper :: PArray Double -> PArray Double -> Double<br />
{-# NOINLINE dotp_wrapper #-}<br />
dotp_wrapper v w = dotp_double (fromPArrayP v) (fromPArrayP w)<br />
</haskell><br />
Assuming the module is in a file <hask>DotP.hs</hask>, we compile it as follows:<br />
<blockquote><br />
<code>ghc -c -Odph -fcpr-off -fdph-seq DotP.hs</code><br />
</blockquote><br />
The option <code>-Odph</code> enables a predefined set of GHC optimisation options that is geared at optimising DPH code. Moreover, we use <code>-fcpr-off</code> as GHC's CPR phase doesn't play nice with type families at the moment, which in turn are heavily used in the DPH library. We shall discuss <code>-fdph-seq</code> below.<br />
<br />
==== Using vectorised code ====<br />
<br />
Finally, we need a wrapper module that calls the vectorised code, but is itself not vectorised. In this simple example, this is just the <hask>Main</hask> module that generates two random vectors and computes their dot product:<br />
<haskell><br />
import System.Random (newStdGen)<br />
import Data.Array.Parallel.PArray (PArray, randomRs)<br />
<br />
import DotP (dotp_wrapper) -- import vectorised code<br />
<br />
main :: IO ()<br />
main<br />
= do <br />
gen1 <- newStdGen<br />
gen2 <- newStdGen<br />
let v = randomRs n range gen1<br />
w = randomRs n range gen2<br />
print $ dotp_wrapper v w -- invoke vectorised code and print the result<br />
where<br />
n = 10000 -- vector length<br />
range = (-100, 100) -- range of vector elements<br />
</haskell><br />
We compile this module with<br />
<blockquote><br />
<code>ghc -c -O -fdph-seq Main.hs</code><br />
</blockquote><br />
and finally link with<br />
<blockquote><br />
<code>ghc -o dotp -fdph-seq -threaded DotP.o Main.o</code><br />
</blockquote><br />
<br />
'''NOTE:''' The code as presented is unsuitable for benchmarking as we wouldn't want to measure the purely sequential random number generation (that dominates this simple program). For benchmarking, we would want to guarantee that the generated vectors are fully evaluated before taking the time. The module [http://www.haskell.org/ghc/docs/latest/html/libraries/dph-par/Data-Array-Parallel-PArray.html Data.Array.Parallel.PArray] exports the function <hask>nf</hask> for this purpose.<br />
<br />
==== Parallel execution ====<br />
<br />
The array library of DPH comes in two flavours: <code>dph-seq</code> and <code>dph-par</code>. The former supports the whole DPH stack, but only executes on a single core. In contrast, <code>dph-par</code> implements multi-threaded code. <br />
<br />
In the above compiler invocations, we used the option <code>-fdph-seq</code> to select the <code>dph-seq</code> flavour. We can as well compile with <code>-fdph-par</code> to generate multi-threaded code. By invoking <code>./dotp +RTS -N2</code>, we use two OS threads to execute the program. A beautiful property of DPH is that the number of threads used to execute a program only affects its performance, but not the result. So, it is fine to do all debugging concerning correctness with <code>dph-seq</code> and to switch to <code>dph-par</code> only for performance debugging.<br />
<br />
Data Parallel Haskell –and more generally, GHC's multi-threading support– currently only aims at multicore processors or uniform memory access (UMA) multi-processors. Performance on non-uniform memory access (NUMA) machines is generally bad as GHC's runtime makes no effort at optimising placement. Some people have reported that the parallel garbage collector (as included in GHC 6.10.1) should not be used with parallel programs; i.e., it is advisable to start parallel programs with <code>my_program +RTS -N2 -g1</code> to run on two cores (and different arguments to <code>-N</code> for other core counts). This problem has been addressed in the development version of GHC.<br />
<br />
=== Further examples ===<br />
<br />
Further examples are available in the [http://darcs.haskell.org/ghc-6.10/packages/dph/examples/ examples directory of the package dph source]. In addition to code using vectorisation (as described above), these examples also contain code that directly targets the two array libraries contained in <code>-package dph-seq</code> and <code>-package dph-par</code>, respectively. For more complex programs, targeting the DPH array libraries directly can lead to much faster code than using vectorisation, as GHC currently doesn't optimise vectorised code very well. However, code targeting the DPH libraries directly can only use flat data parallelism.<br />
<br />
The interfaces of the various components of the DPH library are specified in GHC's [http://www.haskell.org/ghc/docs/latest/html/libraries/index.html hierarchical libraries documentation].<br />
<br />
=== Designing parallel programs ===<br />
<br />
Data Parallel Haskell is a high-level language to code parallel algorithms. Like plain Haskell, DPH frees the programmer from many low-level operational considerations (such as thread creation, thread synchronisation, critical sections, and deadlock avoidance). Nevertheless, the full responsibility for parallel algorithm design and many performance considerations (such as whether a computation has enough parallelism to make exploiting it worthwhile) are still with the programmer.<br />
<br />
DPH encourages a data-driven style of parallel programming and, in good Haskell tradition, puts the choice of data types first. Specifically, the choice between using lists or parallel arrays for a data structure determines whether operations on the structure will be executed sequentially or in parallel. In addition to suitably combining standard lists and parallel arrays, it is often also useful to embed parallel arrays in a user-defined inductive structure, such as the following definition of parallel rose trees:<br />
<haskell><br />
data RTree a = RNode [:RTree a:]<br />
</haskell><br />
The tree is inductively defined; hence, tree traversals will proceed sequentially, level by level. However, the children of each node are held in parallel arrays, and hence, may be traversed in parallel. This structure is, for example, useful in parallel adaptive algorithms based on a hierarchical decomposition, such as the Barnes-Hut algorithm for solving the ''N''-body problem as discussed in more detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.]<br />
<br />
For a general introduction to nested data parallelism and its cost model, see Blelloch's [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.]<br />
<br />
=== Further reading and information on the implementation ===<br />
<br />
DPH has two major components: (1) the ''vectorisation transformation'' and (2) the ''generic DPH library for flat parallel arrays''. The vectorisation transformation turns nested into flat data-parallelism and is described in detail in the paper [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] The generic array library maps flat data-parallelism to GHC's multi-threaded multicore support and is described in the paper [http://www.cse.unsw.edu.au/~chak/papers/CLPKM06.html Data Parallel Haskell: a status report]. The same topics are also covered in the slides for the two talks [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf Nested data parallelism in Haskell] and [http://dataparallel.googlegroups.com/web/UNSW%20CGO%20DP%202007.pdf Compiling nested data parallelism by program transformation].<br />
<br />
For further reading, consult this [[GHC/Data Parallel Haskell/References|collection of background papers, and pointers to other people's work]]. If you are really curious and would like to know implementation details and the internals of the Data Parallel Haskell project, much of it is described on the GHC developer wiki on the pages covering [http://hackage.haskell.org/trac/ghc/wiki/DataParallel data parallelism] and [http://hackage.haskell.org/trac/ghc/wiki/TypeFunctions type families].<br />
<br />
=== Feedback ===<br />
<br />
Please file bug reports at [http://hackage.haskell.org/trac/ghc/ GHC's bug tracker]. Moreover, comments and suggestions are very welcome. Please post them to the [mailto:glasgow-haskell-users@haskell.org GHC users' mailing list], or contact the DPH developers directly:<br />
* [http://www.cse.unsw.edu.au/~chak/ Manuel Chakravarty]<br />
* [http://www.cse.unsw.edu.au/~keller/ Gabriele Keller]<br />
* [http://www.cse.unsw.edu.au/~rl/ Roman Leshchinskiy]<br />
* [http://research.microsoft.com/~simonpj/ Simon Peyton Jones]</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_Haskell/ReferencesGHC/Data Parallel Haskell/References2008-12-02T13:18:48Z<p>Chak: /* References related to Data Parallel Haskell */</p>
<hr />
<div>== References related to Data Parallel Haskell ==<br />
<br />
Data Parallel Haskell:<br />
* [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] Simon Peyton Jones, Roman Leshchinskiy, Gabriele Keller, and Manuel M. T. Chakravarty. In ''IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2008),'' IBFI, Schloss Dagstuhl, 2008. '''''Summary:''''' ''This paper gives a comprehensive account of the vectorisation of Haskell programs and briefly outlines how vectorisation fits together with the other components of Data Parallel Haskell.''<br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/CLPKM07.html Data Parallel Haskell: a status report.] Manuel M. T. Chakravarty, Roman Leshchinskiy, Simon Peyton Jones, Gabriele Keller, and Simon Marlow. In ''DAMP 2007: Workshop on Declarative Aspects of Multicore Programming,'' ACM Press, 2007. '''''Summary:''''' ''Illustrates our approach to implementing nested data parallelism by way of the example of multiplying a sparse matrix with a vector and gives first performance figures. It also includes an overview over the implementation and references to our previous work in the area.'' Here are the [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf slides of a talk] about the paper.<br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/CKLP01.html Nepal -- Nested Data-Parallelism in Haskell.] Manuel M. T. Chakravarty, Gabriele Keller, Roman Lechtchinsky, and Wolf Pfannenstiel. In ''Euro-Par 2001: Parallel Processing, 7th International Euro-Par Conference,'' Springer-Verlag, LNCS 2150, pages 524-534, 2001. '''''Summary:''''' ''Illustrates the language design of integrating support for nested data parallelism into Haskell; in particular, the semantics of parallel arrays and the idea of distinguishing between the parallel and sequential components of a data structure and algorithm by type are introduced. These concepts are illustrated by a parallel version of quicksort, the Barnes-Hut algorithm for solving the n-body problem, and Wang's algorithm for solving tridiagonal systems of linear equations.''<br />
<br />
<br />
Implementing nested data parallelism by program transformation and generic programming:<br />
* [http://www.cse.unsw.edu.au/~chak/papers/SPCS08.html Type Checking with Open Type Functions.] Tom Schrijvers, Simon Peyton-Jones, Manuel M. T. Chakravarty, and Martin Sulzmann. In ''Proceedings of ICFP 2008 : The 13th ACM SIGPLAN International Conference on Functional Programming'', pages 51-62, ACM Press, 2008. '''''Summary:''''' ''This paper describes type checking for type synonym families.'' <br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/CLPK07.html Partial Vectorisation of Haskell Programs.] Manuel M. T. Chakravarty, Roman Leshchinskiy, Simon Peyton Jones, and Gabriele Keller. In ''DAMP 2008: Workshop on Declarative Aspects of Multicore Programming,'' 2008. '''''Summary:''''' ''It addresses the problem that not all code in a program can and should be vectorised – e.g., we do not want to vectorise code involving side effects, such as I/O. To enable mixing vectorised and non-vectorised code, the paper introduces a notion of partial vectorisation of program code.''<br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/LCK06.html Higher Order Flattening.] Roman Leshchinskiy, Manuel M. T. Chakravarty, and Gabriele Keller. In ''Third International Workshop on Practical Aspects of High-level Parallel Programming (PAPP 2006)'', Springer-Verlag, LNCS 3992, 2006. '''''Summary:''''' ''This paper explains how the flattening transformation can be extended to higher-order functions by way of closure conversion and closure inspection. This method was one of the central contributions of Roman Leshchinskiy's PhD thesis.''<br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/CKPM05.html Associated Types with Class.] Manuel M. T. Chakravarty, Gabriele Keller, Simon Peyton Jones, and Simon Marlow. In ''Proceedings of The 32nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL'05)'', pages 1-13, ACM Press, 2005. '''''Summary:''''' ''Introduces the idea and type theory of type-indexed data types as type members of Haskell type classes. These associated data types are an essential element of our optimising, non-parametric array implementation.''<br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/CK00.html More Types for Nested Data Parallel Programming.] Manuel M. T. Chakravarty and Gabriele Keller. In ''Proceedings of the Fifth ACM SIGPLAN International Conference on Functional Programming'', pages 94-105, ACM Press, 2000. '''''Summary:''''' ''Extends Blelloch's flattening transformation for nested data parallelism to languages supporting full algebraic data types, including sum types and recursive types. This paper extends flattening for recursive types as introduced in Gabriele Keller's PhD thesis.'' <br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/KC99.html On the Distributed Implementation of Aggregate Data Structures by Program Transformation.] Gabriele Keller and Manuel M. T. Chakravarty. In ''Fourth International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS'99)'', pages 108-122, Springer-Verlag, LNCS 1586, 1999. '''''Summary:''''' ''Presents the idea of supporting transformation-based optimisations, and in particular array and communication fusion, by distinguishing between distributed and local data by type. This method was one of the main contributions of Gabriele Keller's PhD thesis.''<br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/CK03.html An approach to fast arrays in Haskell], Manuel M. T. Chakravarty and Gabriele Keller. In Johan Jeuring and Simon Peyton Jones, editors, lecture notes for The Summer School and Workshop on Advanced Functional Programming 2002. LNCS 2638, Springer-Verlag, pages 27-58, 2003. '''''Summary:''''' ''This tutorial paper illustrates the main challenges in implementing sequential high-performance arrays in a lazy functional language. It includes a step-by-step illustration of first-order flattening, discusses implementing non-parametric arrays without associated types, and illustrates a simple approach to equational array fusion. (Data Parallel Haskell uses a more powerful fusion framework based on stream fusion.)''<br />
<br />
* [http://www.cse.unsw.edu.au/~keller/publications/diss_main.ps.gz Transformation-based implementation of nested data parallelism for distributed-memory machines], PhD Thesis, Gabriele Keller, 1999.<br />
<br />
* [http://opus.kobv.de/tuberlin/volltexte/2006/1286/pdf/leshchinskiy_roman.pdf Higher-order nested data parallelism: semantics and implementation], PhD Thesis, Roman Leshchinskiy. This deals in detail with the higher-order aspects of NDP.<br />
<br />
<br />
Other languages with nested data parallelism:<br />
* [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.] Guy E. Blelloch. In ''Communications of the ACM'', 39(3), March, 1996. '''''Summary:''''' ''This seminal article illustrates the flexibility and high level of abstraction of nested data parallelism. It also describes the model's language-based cost model.''<br />
* [http://www.cs.cmu.edu/~scandal/nesl.html NESL: A Parallel Programming Language.] '''''Summary:''''' ''This is the main NESL page with many links to programming examples and implementation techniques. The work on NESL laid the foundations for the programming model of nested data parallelism and is one of the most influential precursors of our work.''<br />
* [http://manticore.cs.uchicago.edu/ The Manticore Project.] '''''Summary:''''' ''This is the main page of the Manticore project with many further links. Manticore is a recent effort to develop a heterogeneous parallel programming language targeting multi-core processors, which also includes nested data parallelism in the style of NESL and Data Parallel Haskell.''<br />
* [http://www.cs.unc.edu/Research/proteus/proteus-publications.html Publications of the Proteus project.] '''''Summary:''''' ''Proteus was an effort to develop a heterogeneous parallel language during the high-performance computing era. Most of the actual work on Proteus was concerned with its nested data parallel sub-language.''</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_Haskell/ReferencesGHC/Data Parallel Haskell/References2008-12-02T13:18:10Z<p>Chak: /* References related to Data Parallel Haskell */</p>
<hr />
<div>== References related to Data Parallel Haskell ==<br />
<br />
Data Parallel Haskell:<br />
* [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] Simon Peyton Jones, Roman Leshchinskiy, Gabriele Keller, and Manuel M. T. Chakravarty. In ''IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2008),'' IBFI, Schloss Dagstuhl, 2008. '''''Summary:''''' ''This paper gives a comprehensive account of the vectorisation of Haskell programs and briefly outlines how vectorisation fits together with the other components of Data Parallel Haskell.''<br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/CLPKM07.html Data Parallel Haskell: a status report.] Manuel M. T. Chakravarty, Roman Leshchinskiy, Simon Peyton Jones, Gabriele Keller, and Simon Marlow. In ''DAMP 2007: Workshop on Declarative Aspects of Multicore Programming,'' ACM Press, 2007. '''''Summary:''''' ''Illustrates our approach to implementing nested data parallelism by way of the example of multiplying a sparse matrix with a vector and gives first performance figures. It also includes an overview of the implementation and references to our previous work in the area.'' Here are the [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf slides of a talk] about the paper.<br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/CKLP01.html Nepal -- Nested Data-Parallelism in Haskell.] Manuel M. T. Chakravarty, Gabriele Keller, Roman Lechtchinsky, and Wolf Pfannenstiel. In ''Euro-Par 2001: Parallel Processing, 7th International Euro-Par Conference,'' Springer-Verlag, LNCS 2150, pages 524-534, 2001. '''''Summary:''''' ''Illustrates the language design of integrating support for nested data parallelism into Haskell; in particular, it introduces the semantics of parallel arrays and the idea of distinguishing between the parallel and sequential components of a data structure and algorithm by type. These concepts are illustrated by a parallel version of quicksort, the Barnes-Hut algorithm for solving the n-body problem, and Wang's algorithm for solving tridiagonal systems of linear equations.''<br />
<br />
<br />
Implementing nested data parallelism by program transformation:<br />
* [http://www.cse.unsw.edu.au/~chak/papers/SPCS08.html Type Checking with Open Type Functions.] Tom Schrijvers, Simon Peyton Jones, Manuel M. T. Chakravarty, and Martin Sulzmann. In ''Proceedings of ICFP 2008: The 13th ACM SIGPLAN International Conference on Functional Programming'', pages 51-62, ACM Press, 2008. '''''Summary:''''' ''This paper describes type checking for type synonym families.''<br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/CLPK07.html Partial Vectorisation of Haskell Programs.] Manuel M. T. Chakravarty, Roman Leshchinskiy, Simon Peyton Jones, and Gabriele Keller. In ''DAMP 2008: Workshop on Declarative Aspects of Multicore Programming,'' 2008. '''''Summary:''''' ''It addresses the problem that not all code in a program can and should be vectorised – e.g., we do not want to vectorise code involving side effects, such as I/O. To enable mixing vectorised and non-vectorised code, the paper introduces a notion of partial vectorisation of program code.''<br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/LCK06.html Higher Order Flattening.] Roman Leshchinskiy, Manuel M. T. Chakravarty, and Gabriele Keller. In ''Third International Workshop on Practical Aspects of High-level Parallel Programming (PAPP 2006)'', Springer-Verlag, LNCS 3992, 2006. '''''Summary:''''' ''This paper explains how the flattening transformation can be extended to higher-order functions by way of closure conversion and closure inspection. This method was one of the central contributions of Roman Leshchinskiy's PhD thesis.''<br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/CKPM05.html Associated Types with Class.] Manuel M. T. Chakravarty, Gabriele Keller, Simon Peyton Jones, and Simon Marlow. In ''Proceedings of The 32nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL'05)'', pages 1-13, ACM Press, 2005. '''''Summary:''''' ''Introduces the idea and type theory of type-indexed data types as type members of Haskell type classes. These associated data types are an essential element of our optimising, non-parametric array implementation.''<br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/CK00.html More Types for Nested Data Parallel Programming.] Manuel M. T. Chakravarty and Gabriele Keller. In ''Proceedings of the Fifth ACM SIGPLAN International Conference on Functional Programming'', pages 94-105, ACM Press, 2000. '''''Summary:''''' ''Extends Blelloch's flattening transformation for nested data parallelism to languages supporting full algebraic data types, including sum types and recursive types. This paper extends flattening for recursive types as introduced in Gabriele Keller's PhD thesis.'' <br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/KC99.html On the Distributed Implementation of Aggregate Data Structures by Program Transformation.] Gabriele Keller and Manuel M. T. Chakravarty. In ''Fourth International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS'99)'', pages 108-122, Springer-Verlag, LNCS 1586, 1999. '''''Summary:''''' ''Presents the idea of supporting transformation-based optimisations, and in particular array and communication fusion, by distinguishing between distributed and local data by type. This method was one of the main contributions of Gabriele Keller's PhD thesis.''<br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/CK03.html An approach to fast arrays in Haskell], Manuel M. T. Chakravarty and Gabriele Keller. In Johan Jeuring and Simon Peyton Jones, editors, lecture notes for The Summer School and Workshop on Advanced Functional Programming 2002. LNCS 2638, Springer-Verlag, pages 27-58, 2003. '''''Summary:''''' ''This tutorial paper illustrates the main challenges in implementing sequential high-performance arrays in a lazy functional language. It includes a step-by-step illustration of first-order flattening, discusses implementing non-parametric arrays without associated types, and illustrates a simple approach to equational array fusion. (Data Parallel Haskell uses a more powerful fusion framework based on stream fusion.)''<br />
<br />
* [http://www.cse.unsw.edu.au/~keller/publications/diss_main.ps.gz Transformation-based implementation of nested data parallelism for distributed-memory machines], PhD Thesis, Gabriele Keller, 1999.<br />
<br />
* [http://opus.kobv.de/tuberlin/volltexte/2006/1286/pdf/leshchinskiy_roman.pdf Higher-order nested data parallelism: semantics and implementation], PhD Thesis, Roman Leshchinskiy. This deals in detail with the higher-order aspects of NDP.<br />
<br />
<br />
Other languages with nested data parallelism:<br />
* [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.] Guy E. Blelloch. In ''Communications of the ACM'', 39(3), March, 1996. '''''Summary:''''' ''This seminal article illustrates the flexibility and high level of abstraction of nested data parallelism. It also describes the model's language-based cost model.''<br />
* [http://www.cs.cmu.edu/~scandal/nesl.html NESL: A Parallel Programming Language.] '''''Summary:''''' ''This is the main NESL page with many links to programming examples and implementation techniques. The work on NESL laid the foundations for the programming model of nested data parallelism and is one of the most influential precursors of our work.''<br />
* [http://manticore.cs.uchicago.edu/ The Manticore Project.] '''''Summary:''''' ''This is the main page of the Manticore project with many further links. Manticore is a recent effort to develop a heterogeneous parallel programming language targeting multi-core processors, which also includes nested data parallelism in the style of NESL and Data Parallel Haskell.''<br />
* [http://www.cs.unc.edu/Research/proteus/proteus-publications.html Publications of the Proteus project.] '''''Summary:''''' ''Proteus was an effort to develop a heterogeneous parallel language during the high-performance computing era. Most of the actual work on Proteus was concerned with its nested data parallel sub-language.''</div>Chakhttps://wiki.haskell.org/GHC/Data_Parallel_Haskell/ReferencesGHC/Data Parallel Haskell/References2008-12-02T13:14:01Z<p>Chak: /* References related to Data Parallel Haskell */</p>
<hr />
<div>== References related to Data Parallel Haskell ==<br />
<br />
Data Parallel Haskell:<br />
* [http://www.cse.unsw.edu.au/~chak/papers/PLKC08.html Harnessing the Multicores: Nested Data Parallelism in Haskell.] Simon Peyton Jones, Roman Leshchinskiy, Gabriele Keller, and Manuel M. T. Chakravarty. In ''IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2008),'' IBFI, Schloss Dagstuhl, 2008. '''''Summary:''''' ''This paper gives a comprehensive account of the vectorisation of Haskell programs and briefly outlines how vectorisation fits together with the other components of Data Parallel Haskell.''<br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/CLPKM07.html Data Parallel Haskell: a status report.] Manuel M. T. Chakravarty, Roman Leshchinskiy, Simon Peyton Jones, Gabriele Keller, and Simon Marlow. In ''DAMP 2007: Workshop on Declarative Aspects of Multicore Programming,'' ACM Press, 2007. '''''Summary:''''' ''Illustrates our approach to implementing nested data parallelism by way of the example of multiplying a sparse matrix with a vector and gives first performance figures. It also includes an overview of the implementation and references to our previous work in the area.'' Here are the [http://research.microsoft.com/~simonpj/papers/ndp/NdpSlides.pdf slides of a talk] about the paper.<br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/CKLP01.html Nepal -- Nested Data-Parallelism in Haskell.] Manuel M. T. Chakravarty, Gabriele Keller, Roman Lechtchinsky, and Wolf Pfannenstiel. In ''Euro-Par 2001: Parallel Processing, 7th International Euro-Par Conference,'' Springer-Verlag, LNCS 2150, pages 524-534, 2001. '''''Summary:''''' ''Illustrates the language design of integrating support for nested data parallelism into Haskell; in particular, it introduces the semantics of parallel arrays and the idea of distinguishing between the parallel and sequential components of a data structure and algorithm by type. These concepts are illustrated by a parallel version of quicksort, the Barnes-Hut algorithm for solving the n-body problem, and Wang's algorithm for solving tridiagonal systems of linear equations.''<br />
<br />
<br />
Implementing nested data parallelism by program transformation:<br />
* [http://www.cse.unsw.edu.au/~chak/papers/CLPK07.html Partial Vectorisation of Haskell Programs.] Manuel M. T. Chakravarty, Roman Leshchinskiy, Simon Peyton Jones, and Gabriele Keller. In ''DAMP 2008: Workshop on Declarative Aspects of Multicore Programming,'' 2008. '''''Summary:''''' ''It addresses the problem that not all code in a program can and should be vectorised – e.g., we do not want to vectorise code involving side effects, such as I/O. To enable mixing vectorised and non-vectorised code, the paper introduces a notion of partial vectorisation of program code.''<br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/LCK06.html Higher Order Flattening.] Roman Leshchinskiy, Manuel M. T. Chakravarty, and Gabriele Keller. In ''Third International Workshop on Practical Aspects of High-level Parallel Programming (PAPP 2006)'', Springer-Verlag, LNCS 3992, 2006. '''''Summary:''''' ''This paper explains how the flattening transformation can be extended to higher-order functions by way of closure conversion and closure inspection. This method was one of the central contributions of Roman Leshchinskiy's PhD thesis.''<br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/CKPM05.html Associated Types with Class.] Manuel M. T. Chakravarty, Gabriele Keller, Simon Peyton Jones, and Simon Marlow. In ''Proceedings of The 32nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL'05)'', pages 1-13, ACM Press, 2005. '''''Summary:''''' ''Introduces the idea and type theory of type-indexed data types as type members of Haskell type classes. These associated data types are an essential element of our optimising, non-parametric array implementation.''<br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/CK00.html More Types for Nested Data Parallel Programming.] Manuel M. T. Chakravarty and Gabriele Keller. In ''Proceedings of the Fifth ACM SIGPLAN International Conference on Functional Programming'', pages 94-105, ACM Press, 2000. '''''Summary:''''' ''Extends Blelloch's flattening transformation for nested data parallelism to languages supporting full algebraic data types, including sum types and recursive types. This paper extends flattening for recursive types as introduced in Gabriele Keller's PhD thesis.'' <br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/KC99.html On the Distributed Implementation of Aggregate Data Structures by Program Transformation.] Gabriele Keller and Manuel M. T. Chakravarty. In ''Fourth International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS'99)'', pages 108-122, Springer-Verlag, LNCS 1586, 1999. '''''Summary:''''' ''Presents the idea of supporting transformation-based optimisations, and in particular array and communication fusion, by distinguishing between distributed and local data by type. This method was one of the main contributions of Gabriele Keller's PhD thesis.''<br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/CK03.html An approach to fast arrays in Haskell], Manuel M. T. Chakravarty and Gabriele Keller. In Johan Jeuring and Simon Peyton Jones, editors, lecture notes for The Summer School and Workshop on Advanced Functional Programming 2002. LNCS 2638, Springer-Verlag, pages 27-58, 2003. '''''Summary:''''' ''This tutorial paper illustrates the main challenges in implementing sequential high-performance arrays in a lazy functional language. It includes a step-by-step illustration of first-order flattening, discusses implementing non-parametric arrays without associated types, and illustrates a simple approach to equational array fusion. (Data Parallel Haskell uses a more powerful fusion framework based on stream fusion.)''<br />
<br />
* [http://www.cse.unsw.edu.au/~keller/publications/diss_main.ps.gz Transformation-based implementation of nested data parallelism for distributed-memory machines], PhD Thesis, Gabriele Keller, 1999.<br />
<br />
* [http://opus.kobv.de/tuberlin/volltexte/2006/1286/pdf/leshchinskiy_roman.pdf Higher-order nested data parallelism: semantics and implementation], PhD Thesis, Roman Leshchinskiy. This deals in detail with the higher-order aspects of NDP.<br />
<br />
<br />
Other languages with nested data parallelism:<br />
* [http://www.cs.cmu.edu/~scandal/cacm/cacm2.html Programming Parallel Algorithms.] Guy E. Blelloch. In ''Communications of the ACM'', 39(3), March, 1996. '''''Summary:''''' ''This seminal article illustrates the flexibility and high level of abstraction of nested data parallelism. It also describes the model's language-based cost model.''<br />
* [http://www.cs.cmu.edu/~scandal/nesl.html NESL: A Parallel Programming Language.] '''''Summary:''''' ''This is the main NESL page with many links to programming examples and implementation techniques. The work on NESL laid the foundations for the programming model of nested data parallelism and is one of the most influential precursors of our work.''<br />
* [http://manticore.cs.uchicago.edu/ The Manticore Project.] '''''Summary:''''' ''This is the main page of the Manticore project with many further links. Manticore is a recent effort to develop a heterogeneous parallel programming language targeting multi-core processors, which also includes nested data parallelism in the style of NESL and Data Parallel Haskell.''<br />
* [http://www.cs.unc.edu/Research/proteus/proteus-publications.html Publications of the Proteus project.] '''''Summary:''''' ''Proteus was an effort to develop a heterogeneous parallel language during the high-performance computing era. Most of the actual work on Proteus was concerned with its nested data parallel sub-language.''</div>Chak