Benchmarks Game/Parallel/BinaryTreesDPH

From HaskellWiki

Binary Trees

Data Parallel Haskell

  • Not submitted, requires GHC 6.10 beta or above

Info:

  • 2.2gHz core2duo, Macbook Pro
  • The Glorious Glasgow Haskell Compilation System, version 6.10.0.20080921

Times

With a 350mb heap:

DPH - Single vs. dual core

  • Single core (no -N3): 18.945s
  • Dual core (with -N3): 14.338s (see below)

Parallel Strategies - Single vs. dual core

  • Single core (no -N3): 17.191s
  • Dual core (with -N3): 11.936s

Without a 350mb heap:

DPH - Single vs. dual core

  • Single core (no -N3): 44.503s
  • Dual core (with -N3): 39.395s

Parallel Strategies - Single vs. dual core

  • Single core (no -N3): 37.282s
  • Dual core (with -N3): 34.665s

Parallel Strategies vs. Data Parallel Haskell

DPH:

$ /Users/austinseipp/ghc-6.10/bin/ghc --make -fcpr-off -threaded -fdph-par -package dph-base -Odph  binarytrees.hs
$ time ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS
./binarytrees 20 +RTS -N3 -sstderr -A350M 
stretch tree of depth 21	 check: -1
2097152	 trees of depth 4	 check: -2097152
524288	 trees of depth 6	 check: -524288
131072	 trees of depth 8	 check: -131072
32768	 trees of depth 10	 check: -32768
8192	 trees of depth 12	 check: -8192
2048	 trees of depth 14	 check: -2048
512	 trees of depth 16	 check: -512
128	 trees of depth 18	 check: -128
32	 trees of depth 20	 check: -32
long lived tree of depth 20	 check: -1
 14,584,449,064 bytes allocated in the heap
    215,093,024 bytes copied during GC
     34,674,344 bytes maximum residency (2 sample(s))
         51,688 bytes maximum slop
           1083 MB total memory in use (9 MB lost due to fragmentation)
 Generation 0:    16 collections,    16 parallel,  1.97s,  1.13s elapsed
 Generation 1:     2 collections,     2 parallel,  0.83s,  0.47s elapsed
 Parallel GC work balance: 2.34 (53773189 / 22969253, ideal 3)
 Task  0 (worker) :  MUT time:  23.18s  ( 12.54s elapsed)
                     GC  time:   0.37s  (  0.21s elapsed)
 Task  1 (worker) :  MUT time:  21.46s  ( 12.54s elapsed)
                     GC  time:   2.09s  (  1.19s elapsed)
 Task  2 (worker) :  MUT time:  23.53s  ( 12.54s elapsed)
                     GC  time:   0.02s  (  0.02s elapsed)
 Task  3 (worker) :  MUT time:  23.56s  ( 12.57s elapsed)
                     GC  time:   0.00s  (  0.00s elapsed)
 Task  4 (worker) :  MUT time:  23.24s  ( 12.57s elapsed)
                     GC  time:   0.32s  (  0.18s elapsed)
 INIT  time    0.02s  (  0.03s elapsed)
 MUT   time   20.75s  ( 12.54s elapsed)
 GC    time    2.80s  (  1.60s elapsed)
 EXIT  time    0.00s  (  0.02s elapsed)
 Total time   23.57s  ( 14.17s elapsed)
 %GC time      11.9%  (11.3% elapsed)
 Alloc rate    702,222,726 bytes per MUT second
 Productivity  88.1% of total user, 146.4% of total elapsed
recordMutableGen_sync: 0
gc_alloc_block_sync: 14318885
whitehole_spin: 0
gen[0].steps[0].sync_todo: 0
gen[0].steps[0].sync_large_objects: 0
gen[0].steps[1].sync_todo: 354
gen[0].steps[1].sync_large_objects: 0
gen[1].steps[0].sync_todo: 10112
gen[1].steps[0].sync_large_objects: 0
./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS  23.57s user 2.13s system 179% cpu 14.338 total


Parallel Strategies:

$ /Users/austinseipp/ghc-6.10/bin/ghc --make -threaded -O2 binarytrees.hs
time ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS
./binarytrees 20 +RTS -N3 -sstderr -A350M 
stretch tree of depth 21	 check: -1
2097152	 trees of depth 4	 check: -2097152
524288	 trees of depth 6	 check: -524288
131072	 trees of depth 8	 check: -131072
32768	 trees of depth 10	 check: -32768
8192	 trees of depth 12	 check: -8192
2048	 trees of depth 14	 check: -2048
512	 trees of depth 16	 check: -512
128	 trees of depth 18	 check: -128
32	 trees of depth 20	 check: -32
long lived tree of depth 20	 check: -1
  9,719,683,964 bytes allocated in the heap
    142,379,464 bytes copied during GC
     33,572,136 bytes maximum residency (2 sample(s))
         98,928 bytes maximum slop
           1075 MB total memory in use (9 MB lost due to fragmentation)
 Generation 0:     9 collections,     9 parallel,  0.89s,  0.52s elapsed
 Generation 1:     2 collections,     2 parallel,  0.79s,  0.44s elapsed
 Parallel GC work balance: 2.40 (35594829 / 14831034, ideal 3)
 Task  0 (worker) :  MUT time:  19.50s  ( 10.53s elapsed)
                     GC  time:   0.03s  (  0.02s elapsed)
 Task  1 (worker) :  MUT time:  19.50s  ( 10.53s elapsed)
                     GC  time:   0.03s  (  0.02s elapsed)
 Task  2 (worker) :  MUT time:  18.11s  ( 10.53s elapsed)
                     GC  time:   1.42s  (  0.81s elapsed)
 Task  3 (worker) :  MUT time:  19.54s  ( 10.56s elapsed)
                     GC  time:   0.00s  (  0.00s elapsed)
 Task  4 (worker) :  MUT time:  19.34s  ( 10.56s elapsed)
                     GC  time:   0.20s  (  0.11s elapsed)
 INIT  time    0.01s  (  0.03s elapsed)
 MUT   time   17.85s  ( 10.53s elapsed)
 GC    time    1.68s  (  0.96s elapsed)
 EXIT  time    0.00s  (  0.01s elapsed)
 Total time   19.55s  ( 11.53s elapsed)
 %GC time       8.6%  (8.4% elapsed)
 Alloc rate    543,958,294 bytes per MUT second
 Productivity  91.3% of total user, 154.8% of total elapsed
recordMutableGen_sync: 283
gc_alloc_block_sync: 12815922
whitehole_spin: 0
gen[0].steps[0].sync_todo: 0
gen[0].steps[0].sync_large_objects: 0
gen[0].steps[1].sync_todo: 3027
gen[0].steps[1].sync_large_objects: 0
gen[1].steps[0].sync_todo: 234266
gen[1].steps[0].sync_large_objects: 0
./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS  19.55s user 2.13s system 185% cpu 11.690 total

Code

{-# LANGUAGE BangPatterns, TypeOperators #-}
{-# OPTIONS -funbox-strict-fields #-}
--
-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
--
-- Contributed by Don Stewart and Thomas Davie, modified for DPH
-- usage by Austin Seipp
--
-- This uses the Data Parallel Haskell beta available in ghc 6.10, to
-- exploit multi-core machines.
-- 
 
import System
import Data.Bits
import Text.Printf
import Data.Array.Parallel.Base 
import Data.Array.Parallel.Unlifted as U
import Data.Array.Parallel.Unlifted.Parallel as UP
--
-- an artificially strict tree.
--
-- normally you would ensure the branches are lazy, but this benchmark
-- requires strict allocation.
--
data Tree = Nil | Node !Int !Tree !Tree
 
minN = 4
 
io s n t = printf "%s of depth %d\t check: %d\n" s n t
 
main = do
    n <- getArgs >>= readIO . head
    let maxN     = max (minN + 2) n
        stretchN = maxN + 1
 
    -- stretch memory tree
    let c = check' (make 0 stretchN)
    io "stretch tree" stretchN c
 
    -- allocate a long lived tree
    let !long    = make 0 maxN
 
    -- allocate, walk, and deallocate many bottom-up binary trees
    let vs = UP.mapUP (depth' maxN) $ UP.enumFromThenToUP minN (minN+2) maxN
    mapM_ (\((m :*:d :*: i)) -> 
               io (show m ++ "\t trees") d i) (U.toList vs)
 
    -- confirm the the long-lived binary tree still exists
    io "long lived tree" maxN (check' long)
 
-- generate many trees
depth' :: Int -> Int -> (Int :*: Int :*: Int)
depth' m d =
  (2*n) :*: d :*: (sumT d n 0)
  where
    n = 1 `shiftL` (m - d + minN)
 
-- allocate and check lots of trees
sumT :: Int -> Int -> Int -> Int
sumT d 0 t = t
sumT  d i t = sumT d (i-1) (t + a + b)
  where a = check' (make i    d)
        b = check' (make (-i) d)
 
-- traverse the tree, counting up the nodes
check' :: Tree -> Int
check' Nil          = 0
check' (Node i l r) = i + check' l - check' r
 
-- build a tree
make :: Int -> Int -> Tree
make i 0 = Node i Nil Nil
make i d = Node i (make (i2-1) d2) (make i2 d2)
  where i2 = 2*i; d2 = d-1

Notes

I'm taking a guess (core is tremendous) that an issue potentially lies here:

    mapM_ (\((m :*:d :*: i)) -> 
               io (show m ++ "\t trees") d i) (U.toList vs)

Not sure if U.toList fuses properly - but how can we traverse it otherwise?