Benchmarks Game/Parallel/BinaryTreesDPH
< Benchmarks Game | Parallel
Binary Trees
Data Parallel Haskell
- Not submitted, requires GHC 6.10 beta or above
Info:
- 2.2gHz core2duo, Macbook Pro
- The Glorious Glasgow Haskell Compilation System, version 6.10.0.20080921
Times
With a 350mb heap:
DPH - Single vs. dual core
- Single core (no -N3): 18.945s
- Dual core (with -N3): 14.338s (see below)
Parallel Strategies - Single vs. dual core
- Single core (no -N3): 17.191s
- Dual core (with -N3): 11.936s
Without a 350mb heap:
DPH - Single vs. dual core
- Single core (no -N3): 44.503s
- Dual core (with -N3): 39.395s
Parallel Strategies - Single vs. dual core
- Single core (no -N3): 37.282s
- Dual core (with -N3): 34.665s
Parallel Strategies vs. Data Parallel Haskell
DPH:
$ /Users/austinseipp/ghc-6.10/bin/ghc --make -fcpr-off -threaded -fdph-par -package dph-base -Odph binarytrees.hs $ time ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS ./binarytrees 20 +RTS -N3 -sstderr -A350M stretch tree of depth 21 check: -1 2097152 trees of depth 4 check: -2097152 524288 trees of depth 6 check: -524288 131072 trees of depth 8 check: -131072 32768 trees of depth 10 check: -32768 8192 trees of depth 12 check: -8192 2048 trees of depth 14 check: -2048 512 trees of depth 16 check: -512 128 trees of depth 18 check: -128 32 trees of depth 20 check: -32 long lived tree of depth 20 check: -1 14,584,449,064 bytes allocated in the heap 215,093,024 bytes copied during GC 34,674,344 bytes maximum residency (2 sample(s)) 51,688 bytes maximum slop 1083 MB total memory in use (9 MB lost due to fragmentation)
Generation 0: 16 collections, 16 parallel, 1.97s, 1.13s elapsed Generation 1: 2 collections, 2 parallel, 0.83s, 0.47s elapsed
Parallel GC work balance: 2.34 (53773189 / 22969253, ideal 3)
Task 0 (worker) : MUT time: 23.18s ( 12.54s elapsed) GC time: 0.37s ( 0.21s elapsed)
Task 1 (worker) : MUT time: 21.46s ( 12.54s elapsed) GC time: 2.09s ( 1.19s elapsed)
Task 2 (worker) : MUT time: 23.53s ( 12.54s elapsed) GC time: 0.02s ( 0.02s elapsed)
Task 3 (worker) : MUT time: 23.56s ( 12.57s elapsed) GC time: 0.00s ( 0.00s elapsed)
Task 4 (worker) : MUT time: 23.24s ( 12.57s elapsed) GC time: 0.32s ( 0.18s elapsed)
INIT time 0.02s ( 0.03s elapsed) MUT time 20.75s ( 12.54s elapsed) GC time 2.80s ( 1.60s elapsed) EXIT time 0.00s ( 0.02s elapsed) Total time 23.57s ( 14.17s elapsed)
%GC time 11.9% (11.3% elapsed)
Alloc rate 702,222,726 bytes per MUT second
Productivity 88.1% of total user, 146.4% of total elapsed
recordMutableGen_sync: 0 gc_alloc_block_sync: 14318885 whitehole_spin: 0 gen[0].steps[0].sync_todo: 0 gen[0].steps[0].sync_large_objects: 0 gen[0].steps[1].sync_todo: 354 gen[0].steps[1].sync_large_objects: 0 gen[1].steps[0].sync_todo: 10112 gen[1].steps[0].sync_large_objects: 0 ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS 23.57s user 2.13s system 179% cpu 14.338 total
Parallel Strategies:
$ /Users/austinseipp/ghc-6.10/bin/ghc --make -threaded -O2 binarytrees.hs time ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS ./binarytrees 20 +RTS -N3 -sstderr -A350M stretch tree of depth 21 check: -1 2097152 trees of depth 4 check: -2097152 524288 trees of depth 6 check: -524288 131072 trees of depth 8 check: -131072 32768 trees of depth 10 check: -32768 8192 trees of depth 12 check: -8192 2048 trees of depth 14 check: -2048 512 trees of depth 16 check: -512 128 trees of depth 18 check: -128 32 trees of depth 20 check: -32 long lived tree of depth 20 check: -1 9,719,683,964 bytes allocated in the heap 142,379,464 bytes copied during GC 33,572,136 bytes maximum residency (2 sample(s)) 98,928 bytes maximum slop 1075 MB total memory in use (9 MB lost due to fragmentation)
Generation 0: 9 collections, 9 parallel, 0.89s, 0.52s elapsed Generation 1: 2 collections, 2 parallel, 0.79s, 0.44s elapsed
Parallel GC work balance: 2.40 (35594829 / 14831034, ideal 3)
Task 0 (worker) : MUT time: 19.50s ( 10.53s elapsed) GC time: 0.03s ( 0.02s elapsed)
Task 1 (worker) : MUT time: 19.50s ( 10.53s elapsed) GC time: 0.03s ( 0.02s elapsed)
Task 2 (worker) : MUT time: 18.11s ( 10.53s elapsed) GC time: 1.42s ( 0.81s elapsed)
Task 3 (worker) : MUT time: 19.54s ( 10.56s elapsed) GC time: 0.00s ( 0.00s elapsed)
Task 4 (worker) : MUT time: 19.34s ( 10.56s elapsed) GC time: 0.20s ( 0.11s elapsed)
INIT time 0.01s ( 0.03s elapsed) MUT time 17.85s ( 10.53s elapsed) GC time 1.68s ( 0.96s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 19.55s ( 11.53s elapsed)
%GC time 8.6% (8.4% elapsed)
Alloc rate 543,958,294 bytes per MUT second
Productivity 91.3% of total user, 154.8% of total elapsed
recordMutableGen_sync: 283 gc_alloc_block_sync: 12815922 whitehole_spin: 0 gen[0].steps[0].sync_todo: 0 gen[0].steps[0].sync_large_objects: 0 gen[0].steps[1].sync_todo: 3027 gen[0].steps[1].sync_large_objects: 0 gen[1].steps[0].sync_todo: 234266 gen[1].steps[0].sync_large_objects: 0 ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS 19.55s user 2.13s system 185% cpu 11.690 total
Code
{-# LANGUAGE BangPatterns, TypeOperators #-}
{-# OPTIONS -funbox-strict-fields #-}
--
-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
--
-- Contributed by Don Stewart and Thomas Davie, modified for DPH
-- usage by Austin Seipp
--
-- This uses the Data Parallel Haskell beta available in ghc 6.10, to
-- exploit multi-core machines.
--
import System
import Data.Bits
import Text.Printf
import Data.Array.Parallel.Base
import Data.Array.Parallel.Unlifted as U
import Data.Array.Parallel.Unlifted.Parallel as UP
--
-- an artificially strict tree.
--
-- normally you would ensure the branches are lazy, but this benchmark
-- requires strict allocation.
--
data Tree = Nil | Node !Int !Tree !Tree
minN = 4
io s n t = printf "%s of depth %d\t check: %d\n" s n t
main = do
n <- getArgs >>= readIO . head
let maxN = max (minN + 2) n
stretchN = maxN + 1
-- stretch memory tree
let c = check' (make 0 stretchN)
io "stretch tree" stretchN c
-- allocate a long lived tree
let !long = make 0 maxN
-- allocate, walk, and deallocate many bottom-up binary trees
let vs = UP.mapUP (depth' maxN) $ UP.enumFromThenToUP minN (minN+2) maxN
mapM_ (\((m :*:d :*: i)) ->
io (show m ++ "\t trees") d i) (U.toList vs)
-- confirm the the long-lived binary tree still exists
io "long lived tree" maxN (check' long)
-- generate many trees
depth' :: Int -> Int -> (Int :*: Int :*: Int)
depth' m d =
(2*n) :*: d :*: (sumT d n 0)
where
n = 1 `shiftL` (m - d + minN)
-- allocate and check lots of trees
sumT :: Int -> Int -> Int -> Int
sumT d 0 t = t
sumT d i t = sumT d (i-1) (t + a + b)
where a = check' (make i d)
b = check' (make (-i) d)
-- traverse the tree, counting up the nodes
check' :: Tree -> Int
check' Nil = 0
check' (Node i l r) = i + check' l - check' r
-- build a tree
make :: Int -> Int -> Tree
make i 0 = Node i Nil Nil
make i d = Node i (make (i2-1) d2) (make i2 d2)
where i2 = 2*i; d2 = d-1
Notes
I'm taking a guess (core is tremendous) that an issue potentially lies here:
mapM_ (\((m :*:d :*: i)) ->
io (show m ++ "\t trees") d i) (U.toList vs)
Not sure if U.toList fuses properly - but how can we traverse it otherwise?