Difference between revisions of "Performance/GHC"

Revision as of 19:58, 22 July 2012

Haskell Performance Resource Constructs: Data Types - Functions Overloading - FFI - Arrays Strings - Integers - I/O Floating point - Concurrency Modules - Monads Techniques: Strictness - Laziness Avoiding space leaks Accumulating parameter Implementation-Specific: GHC - nhc98 - Hugs Yhc - JHC

Please report any overly-slow GHC-compiled programs. Since GHC doesn't have any credible competition in the performance department these days it's hard to say what overly-slow means, so just use your judgement! Of course, if a GHC compiled program runs slower than the same program compiled with another Haskell compiler, then it's definitely a bug. Furthermore, if an equivalent OCaml, SML or Clean program is faster, this might be a bug.

Use optimisation

Optimise, using -O or -O2: this is the most basic way to make your program go faster. Compilation time will be slower, especially with -O2.

At present, -O2 is nearly indistinguishable from -O.

GHCi cannot optimise interpreted code, so when using GHCi, compile critical modules using -O or -O2, then load them into GHCi.

Here is a short summary of useful compile time flags:

-O:
-O2:
-funfolding-use-threshold=16: demand more inlining.
-fexcess-precision: see Performance/Floating_point
-optc-O3: Enables a suite of optimizations in the GCC compiler. See the gcc(1) man-page for details. (a C-compiler option).
-optc-ffast-math: A C-compiler option which allows it to be less strict with respect to the standard when compiling IEEE 754 floating point arithmetic. Math operations will not trap if something goes wrong and math operations will assume that NaN and +- Infinity are not in arguments or results. For most practical floating point processing, this is a non-issue and enabling the flag can speed up FP arithmetic by a considerable amount. Also see the gcc(1) man-page. (a C-compiler option).

Other useful flags:

-ddump-simpl > core.txt: generate core.txt file (see below).

Measuring performance

The first thing to do is measure the performance of your program, and find out whether all the time is being spent in the garbage collector or not. Run your program with the +RTS -sstderr option:

$ ./clausify 20 +RTS -sstderr
42,764,972 bytes allocated in the heap
 6,915,348 bytes copied during GC (scavenged)
   360,448 bytes copied during GC (not scavenged)
    36,616 bytes maximum residency (7 sample(s))

        81 collections in generation 0 (  0.07s)
         7 collections in generation 1 (  0.00s)

         2 Mb total memory in use

 INIT  time    0.00s  (  0.00s elapsed)
 MUT   time    0.65s  (  0.94s elapsed)
 GC    time    0.07s  (  0.06s elapsed)
 EXIT  time    0.00s  (  0.00s elapsed)
 Total time    0.72s  (  1.00s elapsed)

 %GC time       9.7%  (6.0% elapsed)

 Alloc rate    65,792,264 bytes per MUT second

 Productivity  90.3% of total user, 65.1% of total elapsed

This tells you how much time is being spent running the program itself (MUT time), and how much time spent in the garbage collector (GC time).

If your program is doing a lot of GC, then your first priority should be to check for Space Leaks using heap profiling, and then to try to reduce allocations by time and allocation profiling.

If you can't reduce the GC cost any further, then using more memory by tweaking the GC options will probably help. For example, increasing the default heap size with +RTS -H128m will reduce the number of GCs.

If your program isn't doing too much GC, then you should proceed to time and allocation profiling to see where the big hitters are.

Modules and separate compilation

In general, splitting code across modules should not make programs less efficient. GHC does quite aggressive cross-module inlining: when you import a function f from another module M, GHC consults the "interface file" M.hi to get f's definition.

For best results, use an explicit export list. If you do, GHC can inline any non-exported functions that are only called once, even if they are very big. Without an explicit export list, GHC must assume that every function is exported, and hence (to avoid code bloat) is more conservative about inlining.

There is one exception to the general rule that splitting code across modules does not harm performance. As mentioned above, if a non-exported non-recursive function is called exactly once, then it is inlined regardless of size, because doing so does not cause code duplication. But if it's exported and is large, then its inlining is not exposed -- and even if it were it might not be inlined, because doing so duplicates its code an unknown number of times. You can change the threshold for (a) exposing and (b) using an inlining, with flags -funfolding-creation-threshold and -funfolding-use-threshold respectively.

Unboxed types

When you are really desperate for speed, and you want to get right down to the “raw bits.” Please see GHC Primitives for some information about using unboxed types.

This should be a last resort, however, since unboxed types and primitives are non-portable. Fortunately, it is usually not necessary to resort to using explicit unboxed types and primitives, because GHC's optimiser can do the work for you by inlining operations it knows about, and unboxing strict function arguments (see Performance/Strictness). Strict and unpacked constructor fields can also help a lot (see Performance/Data Types). Sometimes GHC needs a little help to generate the right code, so you might have to look at the Core output to see whether your tweaks are actually having the desired effect.

One thing that can be said for using unboxed types and primitives is that you know you're writing efficient code, rather than relying on GHC's optimiser to do the right thing, and being at the mercy of changes in GHC's optimiser down the line. This may well be important to you, in which case go for it.

An example

Usually unboxing is not explicitly required (see the Core tutorial below), however there are circumstances where you require precise control over how your code is unboxed. The following program was at one point an entry in the Great Language Shootout. GHC did a good job unboxing the loop, but wouldn't generate the best loop. The solution was to unbox the loop function by hand, resulting in better code.

The original code:

loop :: Int -> Double -> Double
loop d s = if d == 0 then s
                     else loop (d-1) (s + 1/fromIntegral d)

The hand-unboxed code (note that it is uglier, and harder to read):

import GHC.Base
import GHC.Float

loop :: Int# -> Double# -> Double#
loop d s = if d ==# 0# then s 
                       else loop (d -# 1#) (s +## (1.0## /## int2Double# d))

GHC 6.4.1 compiles the first loop to:

$wloop :: Int# -> Double# -> Double#
$wloop = \ (ww_s2Ga :: Int#) (ww1_s2Ge :: Double#) ->
   case Double# ww_s2Ga of wild_XC {
     __DEFAULT ->
       case /## 1.0 (int2Double# wild_XC) of y_a2Cd { 
           __DEFAULT -> $wloop (-# wild_XC 1) (+## ww1_s2Ge y_a2Cd)
       };
     0 -> ww1_s2Ge
   }

And the second, unboxed loop is translated to

loop1 :: Int# -> Double# -> Double#
loop1 = \ (d_a1as :: Int#) (s_a1at :: Double#) ->
   case Double# d_a1as of wild_B1 {
     __DEFAULT -> loop1 (-# wild_B1 1) (+## s_a1at (/## 1.0 (int2Double# wild_B1)));
     0 -> s_a1at
  }

which contains 1 less case statement. The second version runs as fast as C, the first a bit slower. A similar problem was also solved with explicit unboxing in the recursive benchmark entry.

Primops

If you really, really need the speed, and other techniques don't seem to be helping, programming your code in raw GHC primops can sometimes do the job. As for unboxed types, you get some guarantees that your code's performance isn't subject to changes to the GHC optimisations, at the cost of more unreadable code.

For example, in an imperative benchmark program a bottleneck was swapping two values. Raw primops solved the problem:

swap i j a s =
   if i <# j then case readIntOffAddr#  a i s   of { (# s, x #) ->
                  case readIntOffAddr#  a j s   of { (# s, y #) ->
                  case writeIntOffAddr# a j x s of { s          ->
                  case writeIntOffAddr# a i y s of { s          ->
                  swap (i +# 1#) (j -# 1#) a s
                  }}}}
             else (# s, () #)
   {-# INLINE swap #-}

Inlining

GHC does a lot of inlining, which has a dramatic effect on performance.

Without -O, GHC does inlining within a module, but no cross-module inlining.

With -O, it does a lot of cross-module inlining. Indeed, generally speaking GHC will inline across modules just as much as it does within modules, with a single large exception. If GHC sees that a function 'f' is called just once, it inlines it regardless of how big 'f' is. But once 'f' is exported, GHC can never see that it's called exactly once, even if that later turns out to be the case. This inline-once optimisation is pretty important in practice.

So: if you care about performance, do not export functions that are not used outside the module (i.e. use an explicit export list, and keep it as small as possible).

Sometimes explicitly inlining critical chunks of code can help. The INLINE pragma can be used for this purpose; but not for recursive functions, since inlining them forever would obviously be a bad idea.

If a function you want inlined contains a slow path, it can help a good deal to separate the slow path into its own function and NOINLINE it.

Looking at the Core

GHC's compiler intermediate language can be very useful for improving the performance of your code. Core is a functional language much like a very stripped down Haskell (by design), so it's still readable, and still purely functional. The general technique is to iteratively inspect how the critical functions of your program are compiled to Core, checking that they're compiled in the most optimal manner. Sometimes GHC doesn't quite manage to unbox your function arguments, float out common subexpressions, or unfold loops ideally -- but you'll only know if you read the Core.

References:

An External Representation for the GHC Core Language, Andrew Tolmach
A transformation-based optimiser for Haskell, Simon L. Peyton Jones and Andre Santos
Secrets of the Glasgow Haskell Compiler Inliner, Simon L. Peyton Jones and Simon Marlow
Implementing lazy functional languages on stock hardware: the Spineless Tagless G-machine, Simon L. Peyton Jones

Core by example

Here's a step-by-step guide to optimising a particular program, the partial-sums problem from the Great Language Shootout. We developed a number of examples on Haskell shootout entry page.

Begin with the naive translation of the Clean entry (which was fairly quick): Lots of math in a tight loop.

import System
import Numeric

main = do n <- getArgs >>= readIO . head
          let sums     = loop 1 n 1 0 0 0 0 0 0 0 0 0
              fn (s,t) = putStrLn $ (showFFloat (Just 9) s []) ++ "\t" ++ t
          mapM_ (fn :: (Double, String) - IO ()) (zip sums names)

names = ["(2/3)^k", "k^-0.5", "1/k(k+1)", "Flint Hills", "Cookson Hills"
        , "Harmonic", "Riemann Zeta", "Alternating Harmonic", "Gregory"]

loop k n alt a1 a2 a3 a4 a5 a6 a7 a8 a9
    | k > n     = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ]
    | otherwise = loop (k+1) n (-alt)
                       (a1 + (2/3) ** (k-1))
                       (a2 + k ** (-0.5))
                       (a3 + 1 / (k * (k + 1)))
                       (a4 + 1 / (k*k*k * sin k * sin k))
                       (a5 + 1 / (k*k*k * cos k * cos k))
                       (a6 + 1 / k)
                       (a7 + 1 / (k*k))
                       (a8 + alt / k)
                       (a9 + alt / (2 * k - 1))

Compiled with -O2 it runs. However, the performance is really bad. Somewhere greater than 128M heap -- in fact eventually running out of memory. A classic space leak. So look at the generated Core.

Inspect the Core

The best way to check the Core that GHC generates is with the -ddump-simpl flag (dump the results after code simplification, and after all optimisations are run). The result can be verbose, so pipe it into a pager.

Looking for the 'loop', we find that it has been compiled to a function with the following type:

$sloop_r2U6 :: GHC.Prim.Double#
               -> GHC.Float.Double
               -> GHC.Float.Double
               -> GHC.Float.Double
               -> GHC.Float.Double
               -> GHC.Float.Double
               -> GHC.Float.Double
               -> GHC.Float.Double
               -> GHC.Float.Double
               -> GHC.Float.Double
               -> GHC.Float.Double
               -> GHC.Prim.Double#
               -> [GHC.Float.Double]

Hmm, I certainly don't want boxed doubles in such a tight loop (boxed values are represented as pointers to closures on the heap, unboxed values are raw machine values).

Strictify

The next step then is to encourage GHC to unbox this loop, by providing some strictness annotations. So rewrite the loop like this:

loop k n alt a1 a2 a3 a4 a5 a6 a7 a8 a9
    | () !k !n !alt !a1 !a2 !a3 !a4 !a5 !a6 !a7 !a8 !a9 !False = undefined
    | k > n     = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ]
    | otherwise = loop (k+1) n (-alt)
                       (a1 + (2/3) ** (k-1))
                       (a2 + k ** (-0.5))
                       (a3 + 1 / (k * (k + 1)))
                       (a4 + 1 / (k*k*k * sin k * sin k))
                       (a5 + 1 / (k*k*k * cos k * cos k))
                       (a6 + 1 / k)
                       (a7 + 1 / (k*k))
                       (a8 + alt / k)
                       (a9 + alt / (2 * k - 1)) where x ! y = x `seq` y

Here the first guard is purely a syntactic trick to inform ghc that the arguments should be strictly evaluated. I've played a little game here, using ! for `seq` is reminiscent of the new bang-pattern proposal for strictness. Let's see how this compiles. Strictifying all args GHC produces an inner loop of:

$sloop_r2WS :: GHC.Prim.Double#
               -> GHC.Prim.Double#
               -> GHC.Prim.Double#
               -> GHC.Prim.Double#
               -> GHC.Prim.Double#
               -> GHC.Prim.Double#
               -> GHC.Prim.Double#
               -> GHC.Prim.Double#
               -> GHC.Prim.Double#
               -> GHC.Prim.Double#
               -> GHC.Prim.Double#
               -> GHC.Prim.Double#
               -> [GHC.Float.Double]

Ah! perfect. Let's see how that runs:

$ ghc Naive.hs -O2 -no-recomp
$ time ./a.out 2500000
3.000000000     (2/3)^k
3160.817621887  k^-0.5
0.999999600     1/k(k+1)
30.314541510    Flint Hills
42.995233998    Cookson Hills
15.309017155    Harmonic
1.644933667     Riemann Zeta
0.693146981     Alternating Harmonic
0.785398063     Gregory
./a.out 2500000  4.45s user 0.02s system 99% cpu 4.482 total

Crank up the gcc flags

Not too bad. No space leak and quite zippy. But let's see what more can be done. First, double arithmetic usually (always?) benefits from -fexcess-precision, and cranking up the flags to gcc:

paprika$ ghc Naive.hs -O2 -fexcess-precision -optc-O3 -optc-ffast-math -no-recomp
paprika$ time ./a.out 2500000
3.000000000     (2/3)^k
3160.817621887  k^-0.5
0.999999600     1/k(k+1)
30.314541510    Flint Hills
42.995233998    Cookson Hills
15.309017155    Harmonic
1.644933667     Riemann Zeta
0.693146981     Alternating Harmonic
0.785398063     Gregory
./a.out 2500000  3.71s user 0.01s system 99% cpu 3.726 total

Even better! Now, let's dive into the Core to see if there are any optimisation opportunites that GHC missed. So add -ddump-simpl and peruse the output.

Common subexpressions

Looking at the Core, I see firstly that some of the common subexpressions haven't been factored out:

case [GHC.Float.Double] GHC.Prim./## 1.0
  (GHC.Prim.*## (GHC.Prim.*##
                    (GHC.Prim.*## (GHC.Prim.*## sc10_s2VS sc10_s2VS) sc10_s2VS)
                    (GHC.Prim.sinDouble# sc10_s2VS))
                 (GHC.Prim.sinDouble# sc10_s2VS))

Multiple calls to sin. Hmm... And similar for cos and k*k. Simon Peyton-Jones says:

GHC doesn't do full CSE.  It'd be a relatively easy pass for someone to
add, but it can cause space leaks.  And it can replace two
strictly-evaluated calls with one lazy thunk:
        let { x = case e of ....;  y = case e of .... } in ...
  ==>
        let { v = e; x = case v of ...; y = case v of ... } in ...

Instead GHC does "opportunistic CSE".  If you have
        let x = e in .... let y = e in ....
then it'll discard the duplicate binding. But that's very weak.

So it looks like we might have to float out the commmon subexpressions by hand. The inner loop now looks like:

loop k n alt a1 a2 a3 a4 a5 a6 a7 a8 a9
    | () !k !n !alt !a1 !a2 !a3 !a4 !a5 !a6 !a7 !a8 !a9 !False = undefined
    | k > n     = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ]
    | otherwise = loop (k+1) n (-alt)
                       (a1 + (2/3) ** (k-1))
                       (a2 + k ** (-0.5))
                       (a3 + 1 / (k * (k + 1)))
                       (a4 + 1 / (k3 * sk * sk))
                       (a5 + 1 / (k3 * ck * ck))
                       (a6 + 1 / k)
                       (a7 + 1 / k2)
                       (a8 + alt / k)
                       (a9 + alt / (2 * k - 1))
    where sk = sin k
          ck = cos k
          k2 = k * k
          k3 = k2 * k
          x ! y = x `seq` y

looking at the Core shows the sins are now allocated and shared:

let a9_s2MI :: GHC.Prim.Double#
    a9_s2MI = GHC.Prim.sinDouble# sc10_s2Xa

So the common expressions are floated out, and it now runs:

paprika$ time ./a.out 2500000                          
3160.817621887  k^-0.5
0.999999600     1/k(k+1)
30.314541510    Flint Hills
42.995233998    Cookson Hills
15.309017155    Harmonic
1.644933667     Riemann Zeta
0.693146981     Alternating Harmonic
0.785398063     Gregory
./a.out 2500000  3.29s user 0.00s system 99% cpu 3.290 total

Faster. So we gained 12% by floating out those common expressions.

See also the GCD inlining strictness and CSE for another example of where CSE should be performed to improve performance.

Strength reduction

Finally, another trick -- manual strength reduction. When I checked the C entry, it used an integer for the k parameter to the loop, and cast it to a double for the math each time around, so perhaps we can make it an Int parameter. Secondly, the alt parameter only has it's sign flipped each time, so perhaps we can factor out the alt / k arg (it's either 1 / k or -1 on k), saving a division. Thirdly, (k ** (-0.5)) is just a slow way of doing a sqrt.

The final loop looks like:

loop i n alt a1 a2 a3 a4 a5 a6 a7 a8 a9
    | i !n !alt !a1 !a2 !a3 !a4 !a5 !a6 !a7 !a8 !a9 !False = undefined -- strict
    | k > n     = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ]
    | otherwise = loop (i+1) n (-alt)
                       (a1 + (2/3) ** (k-1))
                       (a2 + 1 / sqrt k)
                       (a3 + 1 / (k * (k + 1)))
                       (a4 + 1 / (k3 * sk * sk))
                       (a5 + 1 / (k3 * ck * ck))
                       (a6 + dk)
                       (a7 + 1 / k2)
                       (a8 + alt * dk)
                       (a9 + alt / (2 * k - 1))
    where k3 = k2*k; k2 = k*k; dk = 1/k; k = fromIntegral i :: Double
          sk = sin k; ck = cos k; x!y = x`seq`y

Checking the generated C code (for another tutorial, perhaps) shows that the same C operations are generated as the C entry uses.

And it runs:

$ time ./i 2500000
3.000000200     (2/3)^k
3186.765000000  k^-0.5
0.999852700     1/k(k+1)
30.314493000    Flint Hills
42.995068000    Cookson Hills
15.403683000    Harmonic
1.644725300     Riemann Zeta
0.693137470     Alternating Harmonic
0.785399100     Gregory
./i 2500000  2.37s user 0.01s system 99% cpu 2.389 total

A big speedup!

This entry in fact runs faster than hand optimised (and vectorised) GCC! And is only slower than optimised Fortran. Lesson: Haskell can be very, very fast.

So, by carefully tweaking things, we first squished a space leak, and then gained another 45%.

Summary

Manually inspect the Core that is generated
Use strictness annotations to ensure loops are unboxed
Watch out for optimisations such as CSE and strength reduction that are missed
Read the generated C for really tight loops.
Use -fexcess-precision and -optc-ffast-math for doubles

Parameters

On x86 (possibly others), adding parameters to a loop is rather expensive, and it can be a large win to "hide" your parameters in a mutable array. (Note that this is the kind of thing quite likely to change between GHC versions, so measure before using this trick!)

Pattern matching

On rare occasions pattern matching can give improvements in code that needs to repeatedly take apart data structures. This code:

flop :: Int -> [Int] -> [Int]
flop n xs = rs
    where (rs, ys)       = fl n xs ys
          fl 0 xs     ys = (ys, xs)
          fl n (x:xs) ys = fl (n-1) xs (x:ys)

Can be rewritten to be faster (and more ugly) as:

flop :: Int -> [Int] -> [Int]
flop 2 (x1:x2:xs) = x2:x1:xs
flop 3 (x1:x2:x3:xs) = x3:x2:x1:xs
flop 4 (x1:x2:x3:x4:xs) = x4:x3:x2:x1:xs
flop 5 (x1:x2:x3:x4:x5:xs) = x5:x4:x3:x2:x1:xs
flop 6 (x1:x2:x3:x4:x5:x6:xs) = x6:x5:x4:x3:x2:x1:xs
flop 7 (x1:x2:x3:x4:x5:x6:x7:xs) = x7:x6:x5:x4:x3:x2:x1:xs
flop 8 (x1:x2:x3:x4:x5:x6:x7:x8:xs) = x8:x7:x6:x5:x4:x3:x2:x1:xs
flop 9 (x1:x2:x3:x4:x5:x6:x7:x8:x9:xs) = x9:x8:x7:x6:x5:x4:x3:x2:x1:xs
flop 10 (x1:x2:x3:x4:x5:x6:x7:x8:x9:x10:xs) = x10:x9:x8:x7:x6:x5:x4:x3:x2:x1:xs
flop n xs = rs
  where (rs, ys)       = fl n xs ys
        fl 0 xs     ys = (ys, xs)
        fl n (x:xs) ys = fl (n-1) xs (x:ys)

Arrays

If you are using array access and GHC primops, do not be too eager to use raw Addr#esses; MutableByteArray# is just as fast and frees you from memory management.

Memory allocation and arrays

When you are allocating arrays, it may help to know a little about GHC's memory allocator. There are lots of deatils in The GHC Commentary), but here are some useful facts:

For larger objects ghc has an allocation granularity of 4k. That is it always uses a multiple of 4k bytes, which can lead to wasteage of up to 4k per array. Furthermore, a byte array has some overhead: it needs one word for the heap cell header and another for the length. So if you allocate a 4k byte array then it uses 8k. So the trick is to allocate 4k - overhead. This is what the Data.ByteString library does

GHC allocates memory from the OS in units of a "megablock", currently 1Mbyte. So if you allocate a 1Mb array, the storage manager has to allocate 1Mb + overhead, which will cause it to allocate a 2Mb megablock. The surplus will be returned to the system in the form of free blocks, but if all you do is allocate lots of 1Mb arrays, you'll waste about half the space because there's never enough contiguous free space to contain another 1Mb array. Similar problem for 512k arrays: the storage manager allocates a 1Mb block, and returns slightly less than half of it as free blocks, so each 512k allocation takes a whole new 1Mb block.

Rewrite rules

Algebraic properties in your code might be missed by the GHC optimiser. You can use user-supplied rewrite rules to teach the compiler to optimise your code using domain-specific optimisations.