HaskellWiki - User contributions [en] (feed https://wiki.haskell.org/api.php?action=feedcontributions&user=Libby&feedformat=atom, updated 2020-01-20T00:35:32Z, MediaWiki 1.27.4)
Performance/GHC (revision of 2017-03-23T00:39:26Z, https://wiki.haskell.org/index.php?title=Performance/GHC&diff=61651)<p>Libby: /* Measuring performance */</p>
<hr />
<div>{{Performance infobox}}<br />
[[Category:Performance|GHC]] [[Category:GHC]]<br />
Please report any overly slow GHC-compiled programs. Since [[GHC]] doesn't have any credible competition in the performance department these days, it's hard to say what "overly slow" means, so just use your judgement! Of course, if a GHC-compiled program runs slower than the same program compiled with another Haskell compiler, then it's definitely a bug. Furthermore, if an equivalent OCaml, SML or Clean program is faster, this ''might'' be a bug.<br />
<br />
== Use optimisation ==<br />
<br />
Optimise, using <tt>-O</tt> or <tt>-O2</tt>: this is the most basic way to make your program go faster. Compilation time will be slower, especially with <tt>-O2</tt>.<br />
<br />
At present, <tt>-O2</tt> is nearly indistinguishable from <tt>-O</tt>.<br />
<br />
GHCi cannot optimise interpreted code, so when using GHCi, compile critical modules using <tt>-O</tt> or <tt>-O2</tt>, then load them into GHCi.<br />
<br />
Here is a short summary of useful compile time flags:<br />
* <tt>-O</tt>: enable optimisation.<br />
* <tt>-O2</tt>: enable more optimisation, at the cost of longer compilation times.<br />
* <tt>-funfolding-use-threshold=16</tt>: demand more inlining.<br />
* <tt>-fexcess-precision</tt>: see [[Performance/Floating_point]]<br />
* <tt>-optc-O3</tt>: a C-compiler option that enables a suite of optimisations in GCC. See the [http://www.openbsd.org/cgi-bin/man.cgi?query=gcc&sektion=1 gcc(1) man-page] for details.<br />
* <tt>-optc-ffast-math</tt>: a C-compiler option that lets GCC be less strict about the IEEE 754 standard when compiling floating-point arithmetic. Math operations will not trap if something goes wrong, and they will assume that NaN and +/- Infinity do not occur in arguments or results. For most practical floating-point processing this is a non-issue, and enabling the flag can speed up FP arithmetic considerably. See also the gcc(1) man-page.<br />
<br />
Other useful flags:<br />
* <tt>-ddump-simpl > core.txt</tt>: generate core.txt file (see below).<br />
<br />
<br />
== Measuring performance ==<br />
<br />
The first thing to do is measure the performance of your program, and find out whether all the time is being spent in the garbage collector or not. Run your program with the <tt>+RTS -sstderr</tt> option:<br />
<br />
$ ./clausify 20 +RTS -sstderr<br />
42,764,972 bytes allocated in the heap<br />
6,915,348 bytes copied during GC (scavenged)<br />
360,448 bytes copied during GC (not scavenged)<br />
36,616 bytes maximum residency (7 sample(s))<br />
<br />
81 collections in generation 0 ( 0.07s)<br />
7 collections in generation 1 ( 0.00s)<br />
<br />
2 Mb total memory in use<br />
<br />
INIT time 0.00s ( 0.00s elapsed)<br />
MUT time 0.65s ( 0.94s elapsed)<br />
GC time 0.07s ( 0.06s elapsed)<br />
EXIT time 0.00s ( 0.00s elapsed)<br />
Total time 0.72s ( 1.00s elapsed)<br />
<br />
%GC time 9.7% (6.0% elapsed)<br />
<br />
Alloc rate 65,792,264 bytes per MUT second<br />
<br />
Productivity 90.3% of total user, 65.1% of total elapsed<br />
<br />
{{Note|Hint: You can use [[ThreadScope]] to visualize GHC's output.}}<br />
<br />
This tells you how much time is being spent running the program itself (MUT time), and how much time spent in the garbage collector (GC time). <br />
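<br />
The summary lines at the bottom of the report are simple ratios of the raw figures. The following is a minimal sketch of that arithmetic; the function names are made up for this illustration and are not part of GHC or its runtime system.<br />
<br />
```haskell
-- Recompute the summary lines of +RTS -sstderr from the raw figures.
-- All names here are illustrative; none of them is a GHC API.

-- | Share of total CPU time spent in the garbage collector, in percent.
gcPercent :: Double -> Double -> Double
gcPercent gcTime totalTime = 100 * gcTime / totalTime

-- | Share of total CPU time spent running the program itself (the mutator).
productivity :: Double -> Double -> Double
productivity mutTime totalTime = 100 * mutTime / totalTime

-- | Bytes allocated per second of mutator time.
allocRate :: Double -> Double -> Double
allocRate bytesAllocated mutTime = bytesAllocated / mutTime
```
<br />
With the sample run above, <tt>gcPercent 0.07 0.72</tt> is about 9.7 and <tt>productivity 0.65 0.72</tt> is about 90.3, which matches the <tt>%GC time</tt> and <tt>Productivity</tt> lines of the report.<br />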
<br />
If your program is doing a lot of GC, then your first priority should be to check for [[Memory leak|Space Leaks]] using [https://downloads.haskell.org/~ghc/7.0.3/docs/html/users_guide/prof-heap.html heap profiling], and then to try to reduce allocations by [https://downloads.haskell.org/~ghc/7.0.3/docs/html/users_guide/prof-time-options.html time and allocation profiling]. <br />
<br />
If you can't reduce the GC cost any further, then using more memory by tweaking the [http://www.haskell.org/ghc/docs/latest/html/users_guide/runtime-control.html#rts-options-gc GC options] will probably help. For example, increasing the default heap size with <tt>+RTS -H128m</tt> will reduce the number of GCs.<br />
<br />
If your program isn't doing too much GC, then you should proceed to [http://www.haskell.org/ghc/docs/latest/html/users_guide/prof-time-options.html time and allocation profiling] to see where the big hitters are.<br />
<br />
== Modules and separate compilation ==<br />
<br />
In general, splitting code across modules should not make programs less efficient. GHC does quite aggressive cross-module inlining: when you import a function f from another module M, GHC consults the "interface file" M.hi to get f's definition.<br />
<br />
For best results, ''use an explicit export list''. If you do, GHC can inline any non-exported functions that are only called once, even if they are very big. Without an explicit export list, GHC must assume that every function is exported, and hence (to avoid code bloat) is more conservative about inlining.<br />
<br />
There is one exception to the general rule that splitting code across modules does not harm performance. As mentioned above, if a non-exported non-recursive function is called exactly once, then it is inlined ''regardless of size'', because doing so does not cause code duplication. But if it's exported and is large, then its inlining is not exposed -- and even if it were it might not be inlined, because doing so duplicates its code an unknown number of times. You can change the threshold for (a) exposing and (b) using an inlining, with flags <tt>-funfolding-creation-threshold</tt> and <tt>-funfolding-use-threshold</tt> respectively.<br />
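<br />
As a small illustration of the advice about export lists, here is a sketch of a module that exports only <tt>main</tt>. The helper <tt>sumSquares</tt> is a made-up example; whether GHC actually inlines it depends on the compiler version and flags, so check the Core if it matters.<br />
<br />
```haskell
-- Only 'main' is exported, so GHC can see every call site of the helper.
module Main (main) where

-- Not exported and called exactly once: a candidate for inlining
-- regardless of its size.
sumSquares :: [Int] -> Int
sumSquares = foldr (\x acc -> x * x + acc) 0

main :: IO ()
main = print (sumSquares [1 .. 10])
```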
<br />
== Unboxed types ==<br />
<br />
When you are ''really'' desperate for speed and want to get right down to the &ldquo;raw bits&rdquo;, see [http://www.haskell.org/ghc/docs/latest/html/users_guide/primitives.html GHC Primitives] for some information about using unboxed types.<br />
<br />
This should be a last resort, however, since unboxed types and primitives are non-portable. Fortunately, it is usually not necessary to resort to using explicit unboxed types and primitives, because GHC's optimiser can do the work for you by inlining operations it knows about, and unboxing strict function arguments (see [[Performance/Strictness]]). Strict and unpacked constructor fields can also help a lot (see [[Performance/Data Types]]). Sometimes GHC needs a little help to generate the right code, so you might have to look at the Core output to see whether your tweaks are actually having the desired effect.<br />
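<br />
Strict and unpacked constructor fields are the portable way to get much of this effect. A minimal sketch, with the type <tt>Vec2</tt> and the function <tt>norm2</tt> invented for the illustration:<br />
<br />
```haskell
-- Strict, unpacked constructor fields. 'Vec2' and 'norm2' are
-- illustrative names, not taken from any library.
data Vec2 = Vec2 {-# UNPACK #-} !Double {-# UNPACK #-} !Double

-- | Squared length. Both fields are evaluated when the value is built,
-- and UNPACK asks GHC to store them inline rather than behind pointers.
norm2 :: Vec2 -> Double
norm2 (Vec2 x y) = x * x + y * y
```
<br />
No unboxed types appear in the source, so the code stays ordinary Haskell while GHC is free to work on raw doubles internally.<br />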
<br />
One thing that can be said for using unboxed types and primitives is that you ''know'' you're writing efficient code, rather than relying on GHC's optimiser to do the right thing, and being at the mercy of changes in GHC's optimiser down the line. This may well be important to you, in which case go for it.<br />
<br />
=== An example ===<br />
<br />
Usually unboxing is not explicitly required (see the Core tutorial below); however, there<br />
are circumstances where you require precise control over how your code is<br />
unboxed. The following program was at one point an entry in the<br />
[http://shootout.alioth.debian.org/gp4/benchmark.php?test=all&lang=ghc&lang2=ghc Great Language Shootout]. <br />
GHC did a good job of unboxing the loop, but did not generate the best possible code for it. The<br />
solution was to unbox the loop function by hand, resulting in better code.<br />
<br />
The original code:<br />
<br />
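loop :: Int -> Double -> Double<br />
loop d s = if d == 0 then s<br />
                     else loop (d-1) (s + 1/fromIntegral d)<br />
<br />
The hand-unboxed code (note that it is uglier and harder to read; on GHC 7.8 and later the comparison has to be wrapped in <tt>isTrue#</tt>, because <tt>==#</tt> returns an <tt>Int#</tt> there):<br />
<br />
{-# LANGUAGE MagicHash #-}<br />
import GHC.Exts  -- Int#, Double#, (==#), (-#), (+##), (/##), int2Double#, isTrue#<br />
<br />
loop :: Int# -> Double# -> Double#<br />
loop d s = if isTrue# (d ==# 0#) then s<br />
           else loop (d -# 1#) (s +## (1.0## /## int2Double# d))<br />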
<br />
GHC 6.4.1 compiles the first loop to:<br />
<br />
$wloop :: Int# -> Double# -> Double#<br />
$wloop = \ (ww_s2Ga :: Int#) (ww1_s2Ge :: Double#) -><br />
case Double# ww_s2Ga of wild_XC {<br />
__DEFAULT -><br />
case /## 1.0 (int2Double# wild_XC) of y_a2Cd { <br />
__DEFAULT -> $wloop (-# wild_XC 1) (+## ww1_s2Ge y_a2Cd)<br />
};<br />
0 -> ww1_s2Ge<br />
}<br />
<br />
And the second, unboxed loop is translated to<br />
<br />
loop1 :: Int# -> Double# -> Double#<br />
loop1 = \ (d_a1as :: Int#) (s_a1at :: Double#) -><br />
case Double# d_a1as of wild_B1 {<br />
__DEFAULT -> loop1 (-# wild_B1 1) (+## s_a1at (/## 1.0 (int2Double# wild_B1)));<br />
0 -> s_a1at<br />
}<br />
<br />
which contains one fewer case expression. The second version runs as fast as C, the<br />
first a bit slower. A similar problem was also solved with explicit unboxing in the [http://shootout.alioth.debian.org/gp4/benchmark.php?test=recursive&lang=all recursive benchmark entry].<br />
<br />
== Primops ==<br />
<br />
If you really, really need the speed, and other techniques don't seem to<br />
be helping, programming your code in raw GHC primops can sometimes do<br />
the job. As with unboxed types, you get some guarantee that your code's<br />
performance isn't subject to changes in GHC's optimisations, at the<br />
cost of less readable code.<br />
<br />
For example, in an imperative benchmark program a bottleneck was<br />
swapping two values. Raw primops solved the problem:<br />
<br />
-- Needs MagicHash and UnboxedTuples; isTrue# is required on GHC 7.8 and later.<br />
swap i j a s =<br />
  if isTrue# (i <# j)<br />
    then case readIntOffAddr# a i s of { (# s, x #) -><br />
         case readIntOffAddr# a j s of { (# s, y #) -><br />
         case writeIntOffAddr# a j x s of { s -><br />
         case writeIntOffAddr# a i y s of { s -><br />
         swap (i +# 1#) (j -# 1#) a s<br />
         }}}}<br />
    else (# s, () #)<br />
{-# INLINE swap #-}<br />
<br />
== Inlining ==<br />
<br />
GHC does a lot of inlining, which has a dramatic effect on performance.<br />
<br />
Without -O, GHC does inlining ''within'' a module, but no ''cross-module'' inlining. <br />
<br />
With -O, it does a lot of cross-module inlining. Indeed, generally<br />
speaking GHC will inline ''across'' modules just as much as it does<br />
''within'' modules, with a single large exception. If GHC sees that a<br />
function 'f' is called just once, it inlines it regardless of how big<br />
'f' is. But once 'f' is exported, GHC can never see that it's called<br />
exactly once, even if that later turns out to be the case. This<br />
inline-once optimisation is pretty important in practice. <br />
<br />
So: if you care about performance, do not export functions that are not used outside the module (i.e. use an explicit export list, and keep it as small as possible).<br />
<br />
Sometimes ''explicitly'' inlining critical chunks of code can help.<br />
The INLINE pragma can be used for this purpose; but not for recursive functions, since inlining them forever would obviously be a bad idea.<br />
<br />
If a function you want inlined contains a slow path, it can help a<br />
good deal to separate the slow path into its own function and NOINLINE<br />
it. <br />
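<br />
The fast-path/slow-path split can be sketched as follows. The functions <tt>safeDiv</tt> and <tt>divFailure</tt> are hypothetical names chosen for the example; the pragmas are only requests, so the effect should be confirmed in the Core.<br />
<br />
```haskell
-- The fast path is small and marked INLINE; the rare slow path is kept
-- out of line. 'safeDiv' and 'divFailure' are illustrative names.

safeDiv :: Int -> Int -> Int
safeDiv x y
  | y /= 0    = x `div` y      -- common case, cheap to inline
  | otherwise = divFailure x   -- rare case, stays an ordinary call
{-# INLINE safeDiv #-}

-- The slow path builds an error message. NOINLINE keeps that code from
-- being copied into every call site of 'safeDiv'.
divFailure :: Int -> Int
divFailure x = error ("safeDiv: division of " ++ show x ++ " by zero")
{-# NOINLINE divFailure #-}
```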
<br />
== Looking at the Core ==<br />
<br />
GHC's intermediate language, Core, can be very useful for improving<br />
the performance of your code. Core is a functional language much like a very<br />
stripped-down Haskell (by design), so it's still readable, and still purely<br />
functional. The general technique is to iteratively inspect how the critical<br />
functions of your program are compiled to Core, checking that they're compiled<br />
as efficiently as possible. Sometimes GHC doesn't quite manage to unbox your<br />
function arguments, float out common subexpressions, or unfold loops ideally --<br />
but you'll only know if you read the Core.<br />
<br />
References:<br />
* [http://haskell.org/ghc/docs/papers/core.ps.gz An External Representation for the GHC Core Language], Andrew Tolmach<br />
* [http://research.microsoft.com/Users/simonpj/Papers/comp-by-trans-scp.ps.gz A transformation-based optimiser for Haskell], Simon L. Peyton Jones and Andre Santos<br />
* [http://research.microsoft.com/Users/simonpj/Papers/inlining/index.htm Secrets of the Glasgow Haskell Compiler Inliner], Simon L. Peyton Jones and Simon Marlow<br />
* [http://research.microsoft.com/users/simonpj/papers/spineless-tagless-gmachine.ps.gz Implementing lazy functional languages on stock hardware: the Spineless Tagless G-machine], Simon L. Peyton Jones<br />
<br />
== Core by example ==<br />
<br />
Here's a step-by-step guide to optimising a particular program, <br />
the [http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&lang=ghc&id=2 partial-sums problem] from the [http://shootout.alioth.debian.org Great Language Shootout]. We developed a number<br />
of examples on the [http://haskell.org/haskellwiki/Shootout/Partial_sums Haskell shootout entry] page.<br />
<br />
Begin with the naive translation of the Clean entry (which was fairly quick).<br />
It does lots of math in a tight loop:<br />
<br />
import System.Environment (getArgs)<br />
import Numeric (showFFloat)<br />
<br />
main = do n <- getArgs >>= readIO . head<br />
          let sums     = loop 1 n 1 0 0 0 0 0 0 0 0 0<br />
              fn (s,t) = putStrLn $ (showFFloat (Just 9) s []) ++ "\t" ++ t<br />
          mapM_ (fn :: (Double, String) -> IO ()) (zip sums names)<br />
<br />
names = ["(2/3)^k", "k^-0.5", "1/k(k+1)", "Flint Hills", "Cookson Hills"<br />
        , "Harmonic", "Riemann Zeta", "Alternating Harmonic", "Gregory"]<br />
<br />
loop k n alt a1 a2 a3 a4 a5 a6 a7 a8 a9<br />
    | k > n     = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ]<br />
    | otherwise = loop (k+1) n (-alt)<br />
                       (a1 + (2/3) ** (k-1))<br />
                       (a2 + k ** (-0.5))<br />
                       (a3 + 1 / (k * (k + 1)))<br />
                       (a4 + 1 / (k*k*k * sin k * sin k))<br />
                       (a5 + 1 / (k*k*k * cos k * cos k))<br />
                       (a6 + 1 / k)<br />
                       (a7 + 1 / (k*k))<br />
                       (a8 + alt / k)<br />
                       (a9 + alt / (2 * k - 1))<br />
<br />
Compiled with '''-O2''' it runs. However, the performance is ''really'' bad:<br />
it needs more than 128M of heap, and in fact eventually runs out of<br />
memory. A classic space leak. So look at the generated Core. <br />
<br />
=== Inspect the Core ===<br />
<br />
The best way to check the Core that GHC generates is with the<br />
'''-ddump-simpl''' flag (dump the results after code simplification, and<br />
after all optimisations are run). The result can be verbose, so pipe it into a pager.<br />
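<br />
The flag can also be set for a single module with an <tt>OPTIONS_GHC</tt> pragma, which is convenient when only one module is under scrutiny. The following is a minimal sketch, assuming a reasonably recent GHC (<tt>-dsuppress-all</tt> and <tt>-ddump-to-file</tt> are not available in very old releases); the function <tt>harmonic</tt> is made up for the illustration.<br />
<br />
```haskell
{-# OPTIONS_GHC -ddump-simpl -dsuppress-all -ddump-to-file #-}
-- Compiling this file (ideally with -O or -O2 on the command line) writes
-- its Core to a .dump-simpl file. Drop -ddump-to-file to get it on stdout.

-- | Partial sum of the harmonic series: a small loop worth inspecting.
harmonic :: Int -> Double
harmonic n = go 1 0
  where
    go :: Int -> Double -> Double
    go k acc
      | k > n     = acc
      | otherwise = go (k + 1) (acc + 1 / fromIntegral k)
```
<br />
In the dump, look for the worker generated for <tt>go</tt> and check whether its arguments have unboxed types such as <tt>Int#</tt> and <tt>Double#</tt>.<br />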
<br />
Looking for the 'loop', we find that it has been compiled to a function with<br />
the following type:<br />
<br />
$sloop_r2U6 :: GHC.Prim.Double#<br />
-> GHC.Float.Double<br />
-> GHC.Float.Double<br />
-> GHC.Float.Double<br />
-> GHC.Float.Double<br />
-> GHC.Float.Double<br />
-> GHC.Float.Double<br />
-> GHC.Float.Double<br />
-> GHC.Float.Double<br />
-> GHC.Float.Double<br />
-> GHC.Float.Double<br />
-> GHC.Prim.Double#<br />
-> [GHC.Float.Double]<br />
<br />
Hmm, I certainly don't want boxed doubles in such a tight loop (boxed values<br />
are represented as pointers to closures on the heap, unboxed values are raw<br />
machine values).<br />
<br />
=== Strictify ===<br />
<br />
The next step then is to encourage GHC to unbox this loop, by providing some<br />
strictness annotations. So rewrite the loop like this:<br />
<br />
loop k n alt a1 a2 a3 a4 a5 a6 a7 a8 a9<br />
    | () ! k ! n ! alt ! a1 ! a2 ! a3 ! a4 ! a5 ! a6 ! a7 ! a8 ! a9 ! False = undefined<br />
    | k > n     = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ]<br />
    | otherwise = loop (k+1) n (-alt)<br />
                       (a1 + (2/3) ** (k-1))<br />
                       (a2 + k ** (-0.5))<br />
                       (a3 + 1 / (k * (k + 1)))<br />
                       (a4 + 1 / (k*k*k * sin k * sin k))<br />
                       (a5 + 1 / (k*k*k * cos k * cos k))<br />
                       (a6 + 1 / k)<br />
                       (a7 + 1 / (k*k))<br />
                       (a8 + alt / k)<br />
                       (a9 + alt / (2 * k - 1)) where x ! y = x `seq` y<br />
<br />
Here the first guard is purely a syntactic trick to inform GHC that the<br />
arguments should be strictly evaluated: it forces every argument with<br />
'''`seq`''' and then evaluates to False, so control falls through to the real guards.<br />
The local operator '''!''' is just another name for '''`seq`''', chosen because it is reminiscent of the new bang-pattern proposal for<br />
strictness. Let's see how this compiles. With all arguments made strict, GHC produces an<br />
inner loop of:<br />
<br />
$sloop_r2WS :: GHC.Prim.Double#<br />
-> GHC.Prim.Double#<br />
-> GHC.Prim.Double#<br />
-> GHC.Prim.Double#<br />
-> GHC.Prim.Double#<br />
-> GHC.Prim.Double#<br />
-> GHC.Prim.Double#<br />
-> GHC.Prim.Double#<br />
-> GHC.Prim.Double#<br />
-> GHC.Prim.Double#<br />
-> GHC.Prim.Double#<br />
-> GHC.Prim.Double#<br />
-> [GHC.Float.Double]<br />
<br />
Ah! perfect. Let's see how that runs:<br />
<br />
$ ghc Naive.hs -O2 -no-recomp<br />
$ time ./a.out 2500000<br />
3.000000000 (2/3)^k<br />
3160.817621887 k^-0.5<br />
0.999999600 1/k(k+1)<br />
30.314541510 Flint Hills<br />
42.995233998 Cookson Hills<br />
15.309017155 Harmonic<br />
1.644933667 Riemann Zeta<br />
0.693146981 Alternating Harmonic<br />
0.785398063 Gregory<br />
./a.out 2500000 4.45s user 0.02s system 99% cpu 4.482 total<br />
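<br />
The bang-pattern proposal mentioned above later became the <tt>BangPatterns</tt> extension, which lets the same strictness be written directly in the argument patterns. The following is a sketch on a cut-down loop with two accumulators rather than the full shootout entry; the name <tt>sumRecip</tt> is invented for the example.<br />
<br />
```haskell
{-# LANGUAGE BangPatterns #-}
-- The same strictness expressed with bang patterns instead of a seq guard.
-- 'sumRecip' is an illustrative two-accumulator loop, not the shootout entry.

sumRecip :: Double -> (Double, Double)
sumRecip n = go 1 0 0
  where
    -- Each bang forces its argument before the body runs, so no chain of
    -- unevaluated additions builds up in the accumulators.
    go :: Double -> Double -> Double -> (Double, Double)
    go !k !a6 !a7
      | k > n     = (a6, a7)
      | otherwise = go (k + 1) (a6 + 1 / k) (a7 + 1 / (k * k))
```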
<br />
=== Crank up the gcc flags ===<br />
<br />
Not too bad. No space leak and quite zippy. But let's see what more can be<br />
done. First, double arithmetic usually (always?) benefits from<br />
-fexcess-precision, and cranking up the flags to gcc:<br />
<br />
paprika$ ghc Naive.hs -O2 -fexcess-precision -optc-O3 -optc-ffast-math -no-recomp<br />
paprika$ time ./a.out 2500000<br />
3.000000000 (2/3)^k<br />
3160.817621887 k^-0.5<br />
0.999999600 1/k(k+1)<br />
30.314541510 Flint Hills<br />
42.995233998 Cookson Hills<br />
15.309017155 Harmonic<br />
1.644933667 Riemann Zeta<br />
0.693146981 Alternating Harmonic<br />
0.785398063 Gregory<br />
./a.out 2500000 3.71s user 0.01s system 99% cpu 3.726 total<br />
<br />
Even better! Now, let's dive into the Core to see if there are any optimisation<br />
opportunities that GHC missed. So add '''-ddump-simpl''' and peruse the output.<br />
<br />
=== Common subexpressions ===<br />
<br />
Looking at the Core, I see firstly that some of the common subexpressions<br />
haven't been factored out:<br />
<br />
case [GHC.Float.Double] GHC.Prim./## 1.0<br />
(GHC.Prim.*## (GHC.Prim.*##<br />
(GHC.Prim.*## (GHC.Prim.*## sc10_s2VS sc10_s2VS) sc10_s2VS)<br />
(GHC.Prim.sinDouble# sc10_s2VS))<br />
(GHC.Prim.sinDouble# sc10_s2VS))<br />
<br />
Multiple calls to '''sin'''. Hmm... And similar for '''cos''' and '''k*k'''. <br />
Simon Peyton-Jones says:<br />
<br />
GHC doesn't do full CSE. It'd be a relatively easy pass for someone to<br />
add, but it can cause space leaks. And it can replace two<br />
strictly-evaluated calls with one lazy thunk:<br />
let { x = case e of ....; y = case e of .... } in ...<br />
==><br />
let { v = e; x = case v of ...; y = case v of ... } in ...<br />
<br />
Instead GHC does "opportunistic CSE". If you have<br />
let x = e in .... let y = e in ....<br />
then it'll discard the duplicate binding. But that's very weak.<br />
<br />
So it looks like we might have to float out the common subexpressions by hand.<br />
The inner loop now looks like:<br />
<br />
loop k n alt a1 a2 a3 a4 a5 a6 a7 a8 a9<br />
    | () ! k ! n ! alt ! a1 ! a2 ! a3 ! a4 ! a5 ! a6 ! a7 ! a8 ! a9 ! False = undefined<br />
    | k > n     = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ]<br />
    | otherwise = loop (k+1) n (-alt)<br />
                       (a1 + (2/3) ** (k-1))<br />
                       (a2 + k ** (-0.5))<br />
                       (a3 + 1 / (k * (k + 1)))<br />
                       (a4 + 1 / (k3 * sk * sk))<br />
                       (a5 + 1 / (k3 * ck * ck))<br />
                       (a6 + 1 / k)<br />
                       (a7 + 1 / k2)<br />
                       (a8 + alt / k)<br />
                       (a9 + alt / (2 * k - 1))<br />
    where sk = sin k<br />
          ck = cos k<br />
          k2 = k * k<br />
          k3 = k2 * k<br />
          x ! y = x `seq` y<br />
<br />
Looking at the Core shows that the result of <tt>sin</tt> is now bound once and shared:<br />
<br />
let a9_s2MI :: GHC.Prim.Double#<br />
a9_s2MI = GHC.Prim.sinDouble# sc10_s2Xa<br />
<br />
So the common expressions are floated out, and it now runs:<br />
<br />
paprika$ time ./a.out 2500000 <br />
3160.817621887 k^-0.5<br />
0.999999600 1/k(k+1)<br />
30.314541510 Flint Hills<br />
42.995233998 Cookson Hills<br />
15.309017155 Harmonic<br />
1.644933667 Riemann Zeta<br />
0.693146981 Alternating Harmonic<br />
0.785398063 Gregory<br />
./a.out 2500000 3.29s user 0.00s system 99% cpu 3.290 total<br />
<br />
Faster. So we gained 12% by floating out those common expressions.<br />
<br />
See also the [[GCD inlining strictness and CSE]] for another example of<br />
where CSE should be performed to improve performance.<br />
<br />
=== Strength reduction ===<br />
<br />
Finally, another trick -- manual <br />
[http://en.wikipedia.org/wiki/Strength_reduction strength reduction]. When I checked the C<br />
entry, it used an integer for the k parameter to the loop, and cast it<br />
to a double for the math each time around, so perhaps we can make it an<br />
Int parameter. Secondly, the alt parameter only has its sign flipped<br />
each time, so perhaps we can factor out the alt / k arg (it's either 1 /<br />
k or -1 / k), saving a division. Thirdly, '''(k ** (-0.5))''' is just a<br />
slow way of computing '''1 / sqrt k'''.<br />
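<br />
The third point can be checked numerically before committing to the rewrite. This is a small sketch with made-up names; it only confirms that the two formulations agree to within rounding, and says nothing about their relative speed, which has to be measured.<br />
<br />
```haskell
-- Two ways of computing k^(-0.5). The second avoids the general power
-- function. All names are illustrative.

viaPower :: Double -> Double
viaPower k = k ** (-0.5)

viaSqrt :: Double -> Double
viaSqrt k = 1 / sqrt k

-- | Largest absolute difference between the two over 1..n (n must be >= 1).
maxDiff :: Int -> Double
maxDiff n =
  maximum [ abs (viaPower k - viaSqrt k) | i <- [1 .. n], let k = fromIntegral i ]
```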
<br />
The final loop looks like:<br />
<br />
loop i n alt a1 a2 a3 a4 a5 a6 a7 a8 a9<br />
    | i ! n ! alt ! a1 ! a2 ! a3 ! a4 ! a5 ! a6 ! a7 ! a8 ! a9 ! False = undefined -- strict<br />
    | k > n     = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ]<br />
    | otherwise = loop (i+1) n (-alt)<br />
                       (a1 + (2/3) ** (k-1))<br />
                       (a2 + 1 / sqrt k)<br />
                       (a3 + 1 / (k * (k + 1)))<br />
                       (a4 + 1 / (k3 * sk * sk))<br />
                       (a5 + 1 / (k3 * ck * ck))<br />
                       (a6 + dk)<br />
                       (a7 + 1 / k2)<br />
                       (a8 + alt * dk)<br />
                       (a9 + alt / (2 * k - 1))<br />
    where k3 = k2*k; k2 = k*k; dk = 1/k; k = fromIntegral i :: Double<br />
          sk = sin k; ck = cos k; x!y = x`seq`y<br />
<br />
Checking the generated C code (for another tutorial, perhaps) shows that the<br />
same C operations are generated as the C entry uses.<br />
<br />
And it runs:<br />
$ time ./i 2500000<br />
3.000000200 (2/3)^k<br />
3186.765000000 k^-0.5<br />
0.999852700 1/k(k+1)<br />
30.314493000 Flint Hills<br />
42.995068000 Cookson Hills<br />
15.403683000 Harmonic<br />
1.644725300 Riemann Zeta<br />
0.693137470 Alternating Harmonic<br />
0.785399100 Gregory<br />
./i 2500000 2.37s user 0.01s system 99% cpu 2.389 total<br />
<br />
A big speedup!<br />
<br />
This entry in fact <br />
[http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&lang=all runs] <br />
faster than the hand-optimised (and vectorised) GCC entry! It is slower only than<br />
optimised Fortran. Lesson: Haskell can be very, very fast.<br />
<br />
So, by carefully tweaking things, we first squished a space leak, and then<br />
gained another 45%.<br />
<br />
=== Summary ===<br />
<br />
* Manually inspect the Core that is generated<br />
* Use strictness annotations to ensure loops are unboxed<br />
* Watch out for optimisations such as CSE and strength reduction that are missed<br />
* Read the generated C for really tight loops.<br />
* Use -fexcess-precision and -optc-ffast-math for doubles<br />
<br />
== Parameters ==<br />
<br />
On x86 (possibly others), adding parameters to a loop is rather<br />
expensive, and it can be a large win to "hide" your parameters in a<br />
mutable array. (Note that this is the kind of thing quite likely to<br />
change between GHC versions, so measure before using this trick!)<br />
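<br />
One way to read this advice is to keep the accumulators in an unboxed mutable array instead of passing them as loop arguments. The following is a sketch under that assumption, using <tt>STUArray</tt> from the <tt>array</tt> package that ships with GHC; the function <tt>partialSums</tt> is invented for the example, and whether it beats the plain argument-passing loop must be measured.<br />
<br />
```haskell
import Control.Monad.ST (ST, runST)
import Data.Array.ST (STUArray, newArray, readArray, writeArray)

-- Accumulators live in an unboxed mutable array rather than being passed
-- as loop arguments. 'partialSums' is an illustrative name.
partialSums :: Int -> (Double, Double)
partialSums n = runST $ do
  acc <- newArray (0, 1) 0 :: ST s (STUArray s Int Double)
  let go k
        | k > n     = return ()
        | otherwise = do
            let x = fromIntegral k
            h <- readArray acc 0
            writeArray acc 0 (h + 1 / x)        -- harmonic sum
            z <- readArray acc 1
            writeArray acc 1 (z + 1 / (x * x))  -- sum of 1/k^2
            go (k + 1)
  go 1
  h <- readArray acc 0
  z <- readArray acc 1
  return (h, z)
```
<br />
The loop itself now carries a single parameter, the counter, while the two running sums are read from and written back to the array on each iteration.<br />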
<br />
== Pattern matching ==<br />
<br />
On rare occasions pattern matching can give improvements in code that<br />
needs to repeatedly take apart data structures. This code:<br />
<br />
flop :: Int -> [Int] -> [Int]<br />
flop n xs = rs<br />
    where (rs, ys)       = fl n xs ys<br />
          fl 0 xs     ys = (ys, xs)<br />
          fl n (x:xs) ys = fl (n-1) xs (x:ys)<br />
<br />
It can be rewritten to be faster (and uglier) as:<br />
<br />
flop :: Int -> [Int] -> [Int]<br />
flop 2 (x1:x2:xs) = x2:x1:xs<br />
flop 3 (x1:x2:x3:xs) = x3:x2:x1:xs<br />
flop 4 (x1:x2:x3:x4:xs) = x4:x3:x2:x1:xs<br />
flop 5 (x1:x2:x3:x4:x5:xs) = x5:x4:x3:x2:x1:xs<br />
flop 6 (x1:x2:x3:x4:x5:x6:xs) = x6:x5:x4:x3:x2:x1:xs<br />
flop 7 (x1:x2:x3:x4:x5:x6:x7:xs) = x7:x6:x5:x4:x3:x2:x1:xs<br />
flop 8 (x1:x2:x3:x4:x5:x6:x7:x8:xs) = x8:x7:x6:x5:x4:x3:x2:x1:xs<br />
flop 9 (x1:x2:x3:x4:x5:x6:x7:x8:x9:xs) = x9:x8:x7:x6:x5:x4:x3:x2:x1:xs<br />
flop 10 (x1:x2:x3:x4:x5:x6:x7:x8:x9:x10:xs) = x10:x9:x8:x7:x6:x5:x4:x3:x2:x1:xs<br />
flop n xs = rs<br />
    where (rs, ys)       = fl n xs ys<br />
          fl 0 xs     ys = (ys, xs)<br />
          fl n (x:xs) ys = fl (n-1) xs (x:ys)<br />
<br />
== Arrays ==<br />
<br />
If you are using array access and GHC primops, do not be too eager to<br />
use raw Addr#esses; MutableByteArray# is just as fast and frees you<br />
from memory management.<br />
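<br />
To give an idea of what working with <tt>MutableByteArray#</tt> looks like, here is a small sketch that fills an array and sums it back using primops from <tt>GHC.Exts</tt>. The function <tt>sumViaByteArray</tt> is a made-up example, and the code assumes 8-byte <tt>Int</tt>s when sizing the array.<br />
<br />
```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts
import GHC.IO (IO (..))

-- | Fill a fresh MutableByteArray# with 0..n-1, then sum it back.
-- The array is garbage collected like any other heap object, so there
-- is nothing to free by hand. 'sumViaByteArray' is an illustrative name.
sumViaByteArray :: Int -> IO Int
sumViaByteArray (I# n) = IO (\s0 ->
  -- 8 bytes per Int is an assumption; it over-allocates on 32-bit platforms.
  case newByteArray# (n *# 8#) s0 of
    (# s1, arr #) ->
      let fill i s
            | isTrue# (i >=# n) = s
            | otherwise         = fill (i +# 1#) (writeIntArray# arr i i s)
          total i acc s
            | isTrue# (i >=# n) = (# s, I# acc #)
            | otherwise =
                case readIntArray# arr i s of
                  (# s', x #) -> total (i +# 1#) (acc +# x) s'
      in total 0# 0# (fill 0# s1))
```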
<br />
== Memory allocation and arrays ==<br />
<br />
When you are allocating arrays, it may help to know a little about GHC's memory allocator. There are lots of details in [http://hackage.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage The GHC Commentary], but here are some useful facts:<br />
<br />
* For larger objects GHC has an allocation granularity of 4k. That is, it always uses a multiple of 4k bytes, which can lead to wastage of up to 4k per array. Furthermore, a byte array has some overhead: it needs one word for the heap cell header and another for the length. So if you allocate a 4k byte array then it uses 8k. The trick is to allocate 4k minus the overhead, which is what the Data.ByteString library does.<br />
<br />
* GHC allocates memory from the OS in units of a "megablock", currently 1Mbyte. So if you allocate a 1Mb array, the storage manager has to allocate 1Mb + overhead, which will cause it to allocate a 2Mb megablock. The surplus will be returned to the system in the form of free blocks, but if all you do is allocate lots of 1Mb arrays, you'll waste about half the space because there's never enough contiguous free space to contain another 1Mb array. Similar problem for 512k arrays: the storage manager allocates a 1Mb block, and returns slightly less than half of it as free blocks, so each 512k allocation takes a whole new 1Mb block.<br />
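<br />
The 4k arithmetic from the first point can be sketched as follows. This is a simplified model built only from the figures quoted above (4k granularity, two words of overhead); the names are invented, and an 8-byte word is assumed.<br />
<br />
```haskell
-- Rough sizing helper for large byte arrays, based on the figures above.
-- All names are illustrative; this is a model, not a GHC API.

wordSize :: Int
wordSize = 8        -- bytes; assumes a 64-bit platform

blockSize :: Int
blockSize = 4096    -- the 4k allocation granularity

-- | Largest payload whose byte array still fits in the given number of blocks.
maxPayload :: Int -> Int
maxPayload blocks = blocks * blockSize - 2 * wordSize

-- | Bytes consumed by a byte array with the given payload, rounded up to
-- whole blocks after adding the two words of overhead.
footprint :: Int -> Int
footprint payload =
  blockSize * ((payload + 2 * wordSize + blockSize - 1) `div` blockSize)
```
<br />
Under this model <tt>footprint 4096</tt> is 8192, while <tt>footprint (maxPayload 1)</tt> is 4096, which is the saving described above.<br />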
<br />
== Rewrite rules ==<br />
<br />
Algebraic properties in your code might be missed by the GHC optimiser.<br />
You can use [[Playing by the rules|user-supplied rewrite rules]] to<br />
teach the compiler to optimise your code using domain-specific<br />
optimisations.<br />
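<br />
As a small illustration, the sketch below teaches GHC that two consecutive traversals can be fused into one. The function <tt>myMap</tt> is a stand-in defined for the example, and the rule only has a chance to fire when compiling with optimisation; the program gives the same result either way.<br />
<br />
```haskell
-- A user-supplied rewrite rule fusing two traversals into one.
-- 'myMap' is an illustrative stand-in for a library function.

myMap :: (a -> b) -> [a] -> [b]
myMap _ []       = []
myMap f (x : xs) = f x : myMap f xs
{-# NOINLINE myMap #-}   -- keep calls visible so that the rule can match

{-# RULES
"myMap/myMap" forall f g xs. myMap f (myMap g xs) = myMap (f . g) xs
  #-}

-- | Written as two passes; with -O the rule may rewrite it into one.
incrementThenDouble :: [Int] -> [Int]
incrementThenDouble xs = myMap (* 2) (myMap (+ 1) xs)
```
<br />
GHC does not check that a rule preserves meaning, so the equation must hold for the functions involved; here it is the usual map fusion law.<br />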
<br />
<br />
== See also == <br />
<br />
* [http://www.haskell.org/ghc/docs/latest/html/users_guide/faster.html Faster: producing a program that runs quicker] (part of the GHC User's Guide)<br />
<br />
* [http://stackoverflow.com/questions/12653787/what-optimizations-can-ghc-be-expected-to-perform-reliably What optimizations can GHC be expected to perform reliably?] (stackoverflow.com)</div>Libby
Hac φ/Projects (revision of 2016-10-16T22:09:29Z, https://wiki.haskell.org/index.php?title=Hac_%CF%86/Projects&diff=61205)<p>Libby: /* Projects */</p>
<hr />
<div>== Sharing your code ==<br />
<br />
If you need a place to host a project so that others can help with it, we suggest using [http://github.com GitHub] with [https://git-scm.com/ git]. You can also apply for an account on [http://community.haskell.org/admin/ the community server].<br />
<br />
== Projects ==<br />
<br />
If you have a project that you want to work on at the Hackathon, please describe it here.<br />
<br />
Since Hackathons are great for teamwork, consider joining one of the projects mentioned below. If you're interested in one of these projects, add your name to the list of hackers under that project.<br />
<br />
<!-- Copy this template<br />
=== Project name ===<br />
<br />
I am a project. Love me.<br />
<br />
* Hacker 1<br />
* Hacker 2<br />
--><br />
<br />
=== GHC ===<br />
<br />
Richard will likely be doing some GHC hacking, as he is wont to do. Others: please join in the fun, and suggestions for projects can be provided!<br />
<br />
* Richard Eisenberg<br />
<br />
=== CodeWorld ===<br />
<br />
Chris Smith will likely be hacking on CodeWorld, a web-based K-12 education environment using Haskell. Possible projects include editor improvements, mobile app export, a constructive geometry API, debugging and collaborative editing, or bug fixes and UI tweaks.<br />
<br />
* Chris Smith<br />
* Libby Horacek<br />
<br />
=== Legion ===<br />
<br />
[https://github.com/owensmurray/legion Legion] is a framework for writing stateful, elastically scalable, AP microservices. I'm probably going to be working on the cluster rebalancing system. Also, help from anyone who knows a lot about gossip protocols would be appreciated.<br />
<br />
* Rick Owens<br />
<br />
=== FLTKHS ===<br />
<br />
[https://github.com/deech/fltkhs FLTKHS] is a Haskell GUI library for getting native cross platform GUI up and running quickly and easily. I'm probably going to work on patching the missing Linux High DPI support. Any feedback or suggestions for future enhancements would be appreciated.<br />
<br />
* Aditya Siram<br />
<br />
=== Opaleye ===<br />
<br />
I'd like to learn more about [https://github.com/tomjaguarpaw/haskell-opaleye Opaleye] (SQL generation) probably by writing some documentation or fixing bugs.<br />
<br />
* Libby Horacek<br />
<br />
== Experience ==<br />
<br />
Please list projects with which you are familiar. This way, people know whom to contact for more information or guidance on a particular project.<br />
<br />
{| class="wikitable"<br />
! Name<br />
! Projects<br />
|-<br />
| Chris Smith<br />
| CodeWorld, Snap, xmlhtml<br />
|-<br />
|}<br />
<br />
[[Category:Community]]</div>Libby