Difference between revisions of "Benchmarks Game/Parallel/BinaryTreesDPH"
< Benchmarks Game | Parallel
Jump to navigation
Jump to search
m (Shootout/Parallel/BinaryTreesDPH moved to Benchmarks Game/Parallel/BinaryTreesDPH: The name of the benchmarks site has changed) |
|||
(5 intermediate revisions by one other user not shown) | |||
Line 7: | Line 7: | ||
* Not submitted, requires GHC 6.10 beta or above |
* Not submitted, requires GHC 6.10 beta or above |
||
+ | Info: |
||
− | Hardware: |
||
* 2.2gHz core2duo, Macbook Pro |
* 2.2gHz core2duo, Macbook Pro |
||
+ | * The Glorious Glasgow Haskell Compilation System, version 6.10.0.20080921 |
||
+ | ====Times==== |
||
− | Build and run: |
||
+ | <b>With a 350mb heap:</b> |
||
− | $ ~/ghc-6.10/bin/ghc --make -fcpr-off -threaded -fdph-par -package dph-base -Odph binarytrees.hs |
||
− | $ ./binarytrees 20 +RTS -N3 -sstderr -A350M |
||
+ | DPH - Single vs. dual core |
||
− | ====Times==== |
||
+ | * Single core (no -N3): 18.945s |
||
− | <b>Note, these times have a 350mb heap (+RTS -A350M)</b> |
||
+ | * Dual core (with -N3): 14.338s (see below) |
||
− | + | Parallel Strategies - Single vs. dual core |
|
− | * Single core (no -N3): |
+ | * Single core (no -N3): 17.191s |
− | * Dual core (with -N3): |
+ | * Dual core (with -N3): 11.936s |
+ | <b>Without a 350mb heap:</b> |
||
− | DPH vs. parallel strategies |
||
− | + | DPH - Single vs. dual core |
|
− | * DPH Dual core: 14.338 (see below) |
||
+ | * Single core (no -N3): 44.503s |
||
− | Comparison with parallel strategies version of parallel binary trees: |
||
+ | * Dual core (with -N3): 39.395s |
||
+ | Parallel Strategies - Single vs. dual core |
||
− | ====Data Parallel Haskell==== |
||
+ | * Single core (no -N3): 37.282s |
||
− | time ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS |
||
+ | * Dual core (with -N3): 34.665s |
||
+ | |||
+ | ====Parallel Strategies vs. Data Parallel Haskell==== |
||
+ | |||
+ | <b>DPH</b>: |
||
+ | |||
+ | $ /Users/austinseipp/ghc-6.10/bin/ghc --make -fcpr-off -threaded -fdph-par -package dph-base -Odph binarytrees.hs |
||
+ | $ time ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS |
||
./binarytrees 20 +RTS -N3 -sstderr -A350M |
./binarytrees 20 +RTS -N3 -sstderr -A350M |
||
stretch tree of depth 21 check: -1 |
stretch tree of depth 21 check: -1 |
||
Line 96: | Line 105: | ||
./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS 23.57s user 2.13s system 179% cpu 14.338 total |
./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS 23.57s user 2.13s system 179% cpu 14.338 total |
||
− | ====Parallel Strategies==== |
||
+ | <b>Parallel Strategies</b>: |
||
+ | |||
+ | $ /Users/austinseipp/ghc-6.10/bin/ghc --make -threaded -O2 binarytrees.hs |
||
time ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS |
time ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS |
||
./binarytrees 20 +RTS -N3 -sstderr -A350M |
./binarytrees 20 +RTS -N3 -sstderr -A350M |
||
Line 111: | Line 122: | ||
32 trees of depth 20 check: -32 |
32 trees of depth 20 check: -32 |
||
long lived tree of depth 20 check: -1 |
long lived tree of depth 20 check: -1 |
||
− | + | 9,719,683,964 bytes allocated in the heap |
|
− | + | 142,379,464 bytes copied during GC |
|
− | + | 33,572,136 bytes maximum residency (2 sample(s)) |
|
− | + | 98,928 bytes maximum slop |
|
+ | 1075 MB total memory in use (9 MB lost due to fragmentation) |
||
− | + | Generation 0: 9 collections, 9 parallel, 0.89s, 0.52s elapsed |
|
− | + | Generation 1: 2 collections, 2 parallel, 0.79s, 0.44s elapsed |
|
+ | Parallel GC work balance: 2.40 (35594829 / 14831034, ideal 3) |
||
− | 1094 Mb total memory in use |
||
− | Task 0 (worker) : MUT time: |
+ | Task 0 (worker) : MUT time: 19.50s ( 10.53s elapsed) |
− | GC time: 0. |
+ | GC time: 0.03s ( 0.02s elapsed) |
− | Task 1 (worker) : MUT time: |
+ | Task 1 (worker) : MUT time: 19.50s ( 10.53s elapsed) |
− | GC time: 0. |
+ | GC time: 0.03s ( 0.02s elapsed) |
− | Task 2 (worker) : MUT time: 18. |
+ | Task 2 (worker) : MUT time: 18.11s ( 10.53s elapsed) |
− | GC time: |
+ | GC time: 1.42s ( 0.81s elapsed) |
− | Task 3 (worker) : MUT time: |
+ | Task 3 (worker) : MUT time: 19.54s ( 10.56s elapsed) |
− | GC time: 0.01s ( 0.01s elapsed) |
||
− | |||
− | Task 4 (worker) : MUT time: 18.61s ( 10.68s elapsed) |
||
GC time: 0.00s ( 0.00s elapsed) |
GC time: 0.00s ( 0.00s elapsed) |
||
+ | |||
+ | Task 4 (worker) : MUT time: 19.34s ( 10.56s elapsed) |
||
+ | GC time: 0.20s ( 0.11s elapsed) |
||
INIT time 0.01s ( 0.03s elapsed) |
INIT time 0.01s ( 0.03s elapsed) |
||
− | MUT time |
+ | MUT time 17.85s ( 10.53s elapsed) |
− | GC time |
+ | GC time 1.68s ( 0.96s elapsed) |
EXIT time 0.00s ( 0.01s elapsed) |
EXIT time 0.00s ( 0.01s elapsed) |
||
− | Total time |
+ | Total time 19.55s ( 11.53s elapsed) |
− | %GC time |
+ | %GC time 8.6% (8.4% elapsed) |
+ | |||
+ | Alloc rate 543,958,294 bytes per MUT second |
||
+ | |||
+ | Productivity 91.3% of total user, 154.8% of total elapsed |
||
+ | |||
+ | recordMutableGen_sync: 283 |
||
+ | gc_alloc_block_sync: 12815922 |
||
+ | whitehole_spin: 0 |
||
+ | gen[0].steps[0].sync_todo: 0 |
||
+ | gen[0].steps[0].sync_large_objects: 0 |
||
+ | gen[0].steps[1].sync_todo: 3027 |
||
+ | gen[0].steps[1].sync_large_objects: 0 |
||
+ | gen[1].steps[0].sync_todo: 234266 |
||
+ | gen[1].steps[0].sync_large_objects: 0 |
||
+ | ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS 19.55s user 2.13s system 185% cpu 11.690 total |
||
+ | ===Code=== |
||
− | Alloc rate 537,291,716 bytes per MUT second |
||
+ | <haskell> |
||
+ | {-# LANGUAGE BangPatterns, TypeOperators #-} |
||
+ | {-# OPTIONS -funbox-strict-fields #-} |
||
+ | -- |
||
+ | -- The Computer Language Shootout |
||
+ | -- http://shootout.alioth.debian.org/ |
||
+ | -- |
||
+ | -- Contributed by Don Stewart and Thomas Davie, modified for DPH |
||
+ | -- usage by Austin Seipp |
||
+ | -- |
||
+ | -- This uses the Data Parallel Haskell beta available in ghc 6.10, to |
||
+ | -- exploit multi-core machines. |
||
+ | -- |
||
+ | |||
+ | import System |
||
+ | import Data.Bits |
||
+ | import Text.Printf |
||
+ | import Data.Array.Parallel.Base |
||
+ | import Data.Array.Parallel.Unlifted as U |
||
+ | import Data.Array.Parallel.Unlifted.Parallel as UP |
||
+ | -- |
||
+ | -- an artificially strict tree. |
||
+ | -- |
||
+ | -- normally you would ensure the branches are lazy, but this benchmark |
||
+ | -- requires strict allocation. |
||
+ | -- |
||
+ | data Tree = Nil | Node !Int !Tree !Tree |
||
+ | |||
+ | minN = 4 |
||
+ | |||
+ | io s n t = printf "%s of depth %d\t check: %d\n" s n t |
||
+ | |||
+ | main = do |
||
+ | n <- getArgs >>= readIO . head |
||
+ | let maxN = max (minN + 2) n |
||
+ | stretchN = maxN + 1 |
||
+ | |||
+ | -- stretch memory tree |
||
+ | let c = check' (make 0 stretchN) |
||
+ | io "stretch tree" stretchN c |
||
+ | |||
+ | -- allocate a long lived tree |
||
+ | let !long = make 0 maxN |
||
+ | |||
+ | -- allocate, walk, and deallocate many bottom-up binary trees |
||
+ | let vs = UP.mapUP (depth' maxN) $ UP.enumFromThenToUP minN (minN+2) maxN |
||
+ | mapM_ (\((m :*:d :*: i)) -> |
||
+ | io (show m ++ "\t trees") d i) (U.toList vs) |
||
+ | |||
+ | -- confirm the the long-lived binary tree still exists |
||
+ | io "long lived tree" maxN (check' long) |
||
+ | |||
+ | -- generate many trees |
||
+ | depth' :: Int -> Int -> (Int :*: Int :*: Int) |
||
+ | depth' m d = |
||
+ | (2*n) :*: d :*: (sumT d n 0) |
||
+ | where |
||
+ | n = 1 `shiftL` (m - d + minN) |
||
+ | |||
+ | -- allocate and check lots of trees |
||
+ | sumT :: Int -> Int -> Int -> Int |
||
+ | sumT d 0 t = t |
||
+ | sumT d i t = sumT d (i-1) (t + a + b) |
||
+ | where a = check' (make i d) |
||
+ | b = check' (make (-i) d) |
||
+ | |||
+ | -- traverse the tree, counting up the nodes |
||
+ | check' :: Tree -> Int |
||
+ | check' Nil = 0 |
||
+ | check' (Node i l r) = i + check' l - check' r |
||
+ | |||
+ | -- build a tree |
||
+ | make :: Int -> Int -> Tree |
||
+ | make i 0 = Node i Nil Nil |
||
+ | make i d = Node i (make (i2-1) d2) (make i2 d2) |
||
+ | where i2 = 2*i; d2 = d-1 |
||
+ | </haskell> |
||
+ | ===Notes=== |
||
− | Productivity 97.1% of total user, 159.8% of total elapsed |
||
+ | I'm taking a guess (core is tremendous) that an issue potentially lies here: |
||
− | ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS 18.62s user 2.15s system 181% cpu 11.471 total |
||
+ | <haskell> mapM_ (\((m :*:d :*: i)) -> |
||
+ | io (show m ++ "\t trees") d i) (U.toList vs)</haskell> |
||
+ | Not sure if U.toList fuses properly - but how can we traverse it otherwise? |
Latest revision as of 22:27, 22 January 2012
Binary Trees
Data Parallel Haskell
- Not submitted, requires GHC 6.10 beta or above
Info:
- 2.2gHz core2duo, Macbook Pro
- The Glorious Glasgow Haskell Compilation System, version 6.10.0.20080921
Times
With a 350mb heap:
DPH - Single vs. dual core
- Single core (no -N3): 18.945s
- Dual core (with -N3): 14.338s (see below)
Parallel Strategies - Single vs. dual core
- Single core (no -N3): 17.191s
- Dual core (with -N3): 11.936s
Without a 350mb heap:
DPH - Single vs. dual core
- Single core (no -N3): 44.503s
- Dual core (with -N3): 39.395s
Parallel Strategies - Single vs. dual core
- Single core (no -N3): 37.282s
- Dual core (with -N3): 34.665s
Parallel Strategies vs. Data Parallel Haskell
DPH:
$ /Users/austinseipp/ghc-6.10/bin/ghc --make -fcpr-off -threaded -fdph-par -package dph-base -Odph binarytrees.hs $ time ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS ./binarytrees 20 +RTS -N3 -sstderr -A350M stretch tree of depth 21 check: -1 2097152 trees of depth 4 check: -2097152 524288 trees of depth 6 check: -524288 131072 trees of depth 8 check: -131072 32768 trees of depth 10 check: -32768 8192 trees of depth 12 check: -8192 2048 trees of depth 14 check: -2048 512 trees of depth 16 check: -512 128 trees of depth 18 check: -128 32 trees of depth 20 check: -32 long lived tree of depth 20 check: -1 14,584,449,064 bytes allocated in the heap 215,093,024 bytes copied during GC 34,674,344 bytes maximum residency (2 sample(s)) 51,688 bytes maximum slop 1083 MB total memory in use (9 MB lost due to fragmentation)
Generation 0: 16 collections, 16 parallel, 1.97s, 1.13s elapsed Generation 1: 2 collections, 2 parallel, 0.83s, 0.47s elapsed
Parallel GC work balance: 2.34 (53773189 / 22969253, ideal 3)
Task 0 (worker) : MUT time: 23.18s ( 12.54s elapsed) GC time: 0.37s ( 0.21s elapsed)
Task 1 (worker) : MUT time: 21.46s ( 12.54s elapsed) GC time: 2.09s ( 1.19s elapsed)
Task 2 (worker) : MUT time: 23.53s ( 12.54s elapsed) GC time: 0.02s ( 0.02s elapsed)
Task 3 (worker) : MUT time: 23.56s ( 12.57s elapsed) GC time: 0.00s ( 0.00s elapsed)
Task 4 (worker) : MUT time: 23.24s ( 12.57s elapsed) GC time: 0.32s ( 0.18s elapsed)
INIT time 0.02s ( 0.03s elapsed) MUT time 20.75s ( 12.54s elapsed) GC time 2.80s ( 1.60s elapsed) EXIT time 0.00s ( 0.02s elapsed) Total time 23.57s ( 14.17s elapsed)
%GC time 11.9% (11.3% elapsed)
Alloc rate 702,222,726 bytes per MUT second
Productivity 88.1% of total user, 146.4% of total elapsed
recordMutableGen_sync: 0 gc_alloc_block_sync: 14318885 whitehole_spin: 0 gen[0].steps[0].sync_todo: 0 gen[0].steps[0].sync_large_objects: 0 gen[0].steps[1].sync_todo: 354 gen[0].steps[1].sync_large_objects: 0 gen[1].steps[0].sync_todo: 10112 gen[1].steps[0].sync_large_objects: 0 ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS 23.57s user 2.13s system 179% cpu 14.338 total
Parallel Strategies:
$ /Users/austinseipp/ghc-6.10/bin/ghc --make -threaded -O2 binarytrees.hs time ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS ./binarytrees 20 +RTS -N3 -sstderr -A350M stretch tree of depth 21 check: -1 2097152 trees of depth 4 check: -2097152 524288 trees of depth 6 check: -524288 131072 trees of depth 8 check: -131072 32768 trees of depth 10 check: -32768 8192 trees of depth 12 check: -8192 2048 trees of depth 14 check: -2048 512 trees of depth 16 check: -512 128 trees of depth 18 check: -128 32 trees of depth 20 check: -32 long lived tree of depth 20 check: -1 9,719,683,964 bytes allocated in the heap 142,379,464 bytes copied during GC 33,572,136 bytes maximum residency (2 sample(s)) 98,928 bytes maximum slop 1075 MB total memory in use (9 MB lost due to fragmentation)
Generation 0: 9 collections, 9 parallel, 0.89s, 0.52s elapsed Generation 1: 2 collections, 2 parallel, 0.79s, 0.44s elapsed
Parallel GC work balance: 2.40 (35594829 / 14831034, ideal 3)
Task 0 (worker) : MUT time: 19.50s ( 10.53s elapsed) GC time: 0.03s ( 0.02s elapsed)
Task 1 (worker) : MUT time: 19.50s ( 10.53s elapsed) GC time: 0.03s ( 0.02s elapsed)
Task 2 (worker) : MUT time: 18.11s ( 10.53s elapsed) GC time: 1.42s ( 0.81s elapsed)
Task 3 (worker) : MUT time: 19.54s ( 10.56s elapsed) GC time: 0.00s ( 0.00s elapsed)
Task 4 (worker) : MUT time: 19.34s ( 10.56s elapsed) GC time: 0.20s ( 0.11s elapsed)
INIT time 0.01s ( 0.03s elapsed) MUT time 17.85s ( 10.53s elapsed) GC time 1.68s ( 0.96s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 19.55s ( 11.53s elapsed)
%GC time 8.6% (8.4% elapsed)
Alloc rate 543,958,294 bytes per MUT second
Productivity 91.3% of total user, 154.8% of total elapsed
recordMutableGen_sync: 283 gc_alloc_block_sync: 12815922 whitehole_spin: 0 gen[0].steps[0].sync_todo: 0 gen[0].steps[0].sync_large_objects: 0 gen[0].steps[1].sync_todo: 3027 gen[0].steps[1].sync_large_objects: 0 gen[1].steps[0].sync_todo: 234266 gen[1].steps[0].sync_large_objects: 0 ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS 19.55s user 2.13s system 185% cpu 11.690 total
Code
{-# LANGUAGE BangPatterns, TypeOperators #-}
{-# OPTIONS -funbox-strict-fields #-}
--
-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
--
-- Contributed by Don Stewart and Thomas Davie, modified for DPH
-- usage by Austin Seipp
--
-- This uses the Data Parallel Haskell beta available in ghc 6.10, to
-- exploit multi-core machines.
--
import System
import Data.Bits
import Text.Printf
import Data.Array.Parallel.Base
import Data.Array.Parallel.Unlifted as U
import Data.Array.Parallel.Unlifted.Parallel as UP
--
-- an artificially strict tree.
--
-- normally you would ensure the branches are lazy, but this benchmark
-- requires strict allocation.
--
data Tree = Nil | Node !Int !Tree !Tree
minN = 4
io s n t = printf "%s of depth %d\t check: %d\n" s n t
main = do
n <- getArgs >>= readIO . head
let maxN = max (minN + 2) n
stretchN = maxN + 1
-- stretch memory tree
let c = check' (make 0 stretchN)
io "stretch tree" stretchN c
-- allocate a long lived tree
let !long = make 0 maxN
-- allocate, walk, and deallocate many bottom-up binary trees
let vs = UP.mapUP (depth' maxN) $ UP.enumFromThenToUP minN (minN+2) maxN
mapM_ (\((m :*:d :*: i)) ->
io (show m ++ "\t trees") d i) (U.toList vs)
-- confirm the the long-lived binary tree still exists
io "long lived tree" maxN (check' long)
-- generate many trees
depth' :: Int -> Int -> (Int :*: Int :*: Int)
depth' m d =
(2*n) :*: d :*: (sumT d n 0)
where
n = 1 `shiftL` (m - d + minN)
-- allocate and check lots of trees
sumT :: Int -> Int -> Int -> Int
sumT d 0 t = t
sumT d i t = sumT d (i-1) (t + a + b)
where a = check' (make i d)
b = check' (make (-i) d)
-- traverse the tree, counting up the nodes
check' :: Tree -> Int
check' Nil = 0
check' (Node i l r) = i + check' l - check' r
-- build a tree
make :: Int -> Int -> Tree
make i 0 = Node i Nil Nil
make i d = Node i (make (i2-1) d2) (make i2 d2)
where i2 = 2*i; d2 = d-1
Notes
I'm taking a guess (core is tremendous) that an issue potentially lies here:
mapM_ (\((m :*:d :*: i)) ->
io (show m ++ "\t trees") d i) (U.toList vs)
Not sure if U.toList fuses properly - but how can we traverse it otherwise?