Difference between revisions of "Performance/GHC"
(→Inlining) |
(Note badness of -O3) |
||
Line 14: | Line 14: | ||
* <tt>-O</tt>: |
* <tt>-O</tt>: |
||
* <tt>-O2</tt>: |
* <tt>-O2</tt>: |
||
− | * <tt>-O3</tt>: (how these three differ from each other?) |
||
+ | * Do NOT use <tt>-O3</tt>, it actually gives less optimization than <tt>-O2</tt>, [[http://hackage.haskell.org/trac/ghc/ticket/1261]] |
||
* <tt>-funfolding-use-threshold=16</tt>: demand more inlining. |
* <tt>-funfolding-use-threshold=16</tt>: demand more inlining. |
||
* <tt>-fexcess-precision</tt>: see [[Performance/Floating_point]] |
* <tt>-fexcess-precision</tt>: see [[Performance/Floating_point]] |
Revision as of 14:40, 5 April 2007
Haskell Performance Resource
Constructs: Techniques: |
Please report any overly-slow GHC-compiled programs. Since GHC doesn't have any credible competition in the performance department these days it's hard to say what overly-slow means, so just use your judgement! Of course, if a GHC compiled program runs slower than the same program compiled with another Haskell compiler, then it's definitely a bug. Furthermore, if an equivalent OCaml, SML or Clean program is faster, this might be a bug.
Contents
Use Optimisation
[Optimise, using -O or -O2: this is the most basic way to make your program go faster. Compilation time will be slower, especially with -O2.
At present, -O2 is nearly indistinguishable from -O.
GHCi cannot optimise interpreted code, so when using GHCi, compile critical modules using -O or -O2, then load them into GHCi.
Here is a short summary of useful compile time flags:
- -O:
- -O2:
- Do NOT use -O3, it actually gives less optimization than -O2, [[1]]
- -funfolding-use-threshold=16: demand more inlining.
- -fexcess-precision: see Performance/Floating_point
- -optc-O3: Enables a suite of optimizations in the GCC compiler. See the gcc(1) man-page for details. (a C-compiler option).
- -optc-ffast-math: A C-compiler option which allows it to be less strict with respect to the standard when compiling IEEE 754 floating point arithmetic. Math operations will not trap if something goes wrong and math operations will assume that NaN and +- Infinity are not in arguments or results. For most practical floating point processing, this is a non-issue and enabling the flag can speed up FP arithmetic by a considerable amount. Also see the gcc(1) man-page. (a C-compiler option).
Other useful flags:
- -ddump-simpl > core.txt: generate core.txt file (see below).
Measuring Performance
The first thing to do is measure the performance of your program, and find out whether all the time is being spent in the garbage collector or not. Run your program with the +RTS -sstderr option:
$ ./clausify 20 +RTS -sstderr 42,764,972 bytes allocated in the heap 6,915,348 bytes copied during GC (scavenged) 360,448 bytes copied during GC (not scavenged) 36,616 bytes maximum residency (7 sample(s))
81 collections in generation 0 ( 0.07s) 7 collections in generation 1 ( 0.00s)
2 Mb total memory in use
INIT time 0.00s ( 0.00s elapsed) MUT time 0.65s ( 0.94s elapsed) GC time 0.07s ( 0.06s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.72s ( 1.00s elapsed)
%GC time 9.7% (6.0% elapsed)
Alloc rate 65,792,264 bytes per MUT second
Productivity 90.3% of total user, 65.1% of total elapsed
This tells you how much time is being spent running the program itself (MUT time), and how much time spent in the garbage collector (GC time).
If your program is doing a lot of GC, then your first priority should be to check for Space Leaks using heap profiling, and then to try to reduce allocations by time and allocation profiling.
If you can't reduce the GC cost any further, then using more memory by tweaking the GC options will probably help. For example, increasing the default heap size with +RTS -H128m will reduce the number of GCs.
If your program isn't doing too much GC, then you should proceed to time and allocation profiling to see where the big hitters are.
Unboxed types
When you are really desperate for speed, and you want to get right down to the “raw bits.” Please see GHC Primitives for some information about using unboxed types.
This should be a last resort, however, since unboxed types and primitives are non-portable. Fortunately, it is usually not necessary to resort to using explicit unboxed types and primitives, because GHC's optimiser can do the work for you by inlining operations it knows about, and unboxing strict function arguments (see Performance:Strictness). Strict and unpacked constructor fields can also help a lot (see Performance:Data Types). Sometimes GHC needs a little help to generate the right code, so you might have to look at the Core output to see whether your tweaks are actually having the desired effect.
One thing that can be said for using unboxed types and primitives is that you know you're writing efficient code, rather than relying on GHC's optimiser to do the right thing, and being at the mercy of changes in GHC's optimiser down the line. This may well be important to you, in which case go for it.
An Example
Usually unboxing is not explicitly required (see the Core tutorial below), however there are circumstances where you require precise control over how your code is unboxed. The following program was at one point an entry in the Great Language Shootout. GHC did a good job unboxing the loop, but wouldn't generate the best loop. The solution was to unbox the loop function by hand, resulting in better code.
The original code:
loop :: Int -> Double -> Double loop d s = if d == 0 then s else loop (d-1) (s + 1/fromIntegral d)
The hand-unboxed code (note that it is uglier, and harder to read):
import GHC.Base import GHC.Float
loop :: Int# -> Double# -> Double# loop d s = if d ==# 0# then s else loop (d -# 1#) (s +## (1.0## /## int2Double# d))
GHC 6.4.1 compiles the first loop to:
$wloop :: Int# -> Double# -> Double# $wloop = \ (ww_s2Ga :: Int#) (ww1_s2Ge :: Double#) -> case Double# ww_s2Ga of wild_XC { __DEFAULT -> case /## 1.0 (int2Double# wild_XC) of y_a2Cd { __DEFAULT -> $wloop (-# wild_XC 1) (+## ww1_s2Ge y_a2Cd) }; 0 -> ww1_s2Ge }
And the second, unboxed loop is translated to
loop1 :: Int# -> Double# -> Double# loop1 = \ (d_a1as :: Int#) (s_a1at :: Double#) -> case Double# d_a1as of wild_B1 { __DEFAULT -> loop1 (-# wild_B1 1) (+## s_a1at (/## 1.0 (int2Double# wild_B1))); 0 -> s_a1at }
which contains 1 less case statement. The second version runs as fast as C, the first a bit slower. A similar problem was also solved with explicit unboxing in the recursive benchmark entry.
Primops
If you really, really need the speed, and other techniques don't seem to be helping, programming your code in raw GHC primops can sometimes do the job. As for unboxed types, you get some guarantees that your code's performance isn't subject to changes to the GHC optimisations, at the cost of more unreadable code.
For example, in an imperative benchmark program a bottleneck was swapping two values. Raw primops solved the problem:
swap i j a s = if i <# j then case readIntOffAddr# a i s of { (# s, x #) -> case readIntOffAddr# a j s of { (# s, y #) -> case writeIntOffAddr# a j x s of { s -> case writeIntOffAddr# a i y s of { s -> swap (i +# 1#) (j -# 1#) a s }}}} else (# s, () #) {-# INLINE swap #-}
Inlining
GHC does a lot of inlining, which has a dramatic effect on performance.
Without -O, GHC does inlining within a module, but no cross-module inlining.
With -O, it does a lot of cross-module inlining. Indeed, generally speaking GHC will inline across modules just as much as it does within modules, with a single large exception. If GHC sees that a function 'f' is called just once, it inlines it regardless of how big 'f' is. But once 'f' is exported, GHC can never see that it's called exactly once, even if that later turns out to be the case. This inline-once optimisation is pretty important in practice.
So: if you care about performance, do not export functions that are not used outside the module (i.e. use an explicit export list, and keep it as small as possible).
Sometimes explicitly inlining critical chunks of code can help. The INLINE pragma can be used for this purpose; but not for recursive functions, since inlining them forever would obviously be a bad idea.
Looking at the Core
GHC's compiler intermediate language can be very useful for improving the performance of your code. Core is a functional language much like a very stripped down Haskell (by design), so it's still readable, and still purely functional. The general technique is to iteratively inspect how the critical functions of your program are compiled to Core, checking that they're compiled in the most optimal manner. Sometimes GHC doesn't quite manage to unbox your function arguments, float out common subexpressions, or unfold loops ideally -- but you'll only know if you read the Core.
References:
- An External Representation for the GHC Core Language, Andrew Tolmach
- A transformation-based optimiser for Haskell, Simon L. Peyton Jones and Andre Santos
- Secrets of the Glasgow Haskell Compiler Inliner, Simon L. Peyton Jones and Simon Marlow
- Implementing lazy functional languages on stock hardware: the Spineless Tagless G-machine, Simon L. Peyton Jones
Core by Example
Here's a step-by-step guide to optimising a particular program, the partial-sums problem from the Great Language Shootout. We developed a number of examples on Haskell shootout entry page.
Begin with the naive translation of the Clean entry (which was fairly quick): Lots of math in a tight loop.
import System import Numeric
main = do n <- getArgs = readIO . head let sums = loop 1 n 1 0 0 0 0 0 0 0 0 0 fn (s,t) = putStrLn $ (showFFloat (Just 9) s []) ++ "\t" ++ t mapM_ (fn :: (Double, String) - IO ()) (zip sums names)
names = ["(2/3)^k", "k^-0.5", "1/k(k+1)", "Flint Hills", "Cookson Hills" , "Harmonic", "Riemann Zeta", "Alternating Harmonic", "Gregory"]
loop k n alt a1 a2 a3 a4 a5 a6 a7 a8 a9 | k n = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ] | otherwise = loop (k+1) n (-alt) (a1 + (2/3) ** (k-1)) (a2 + k ** (-0.5)) (a3 + 1 / (k * (k + 1))) (a4 + 1 / (k*k*k * sin k * sin k)) (a5 + 1 / (k*k*k * cos k * cos k)) (a6 + 1 / k) (a7 + 1 / (k*k)) (a8 + alt / k) (a9 + alt / (2 * k - 1))
Compiled with -O2 it runs. However, the performance is really bad. Somewhere greater than 128M heap -- in fact eventually running out of memory. A classic space leak. So look at the generated Core.
Inspect the Core
The best way to check the Core that GHC generates is with the -ddump-simpl flag (dump the results after code simplification, and after all optimisations are run). The result can be verbose, so pipe it into a pager.
Looking for the 'loop', we find that it has been compiled to a function with the following type:
$sloop_r2U6 :: GHC.Prim.Double# -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Prim.Double# -> [GHC.Float.Double]
Hmm, I certainly don't want boxed doubles in such a tight loop (boxed values are represented as pointers to closures on the heap, unboxed values are raw machine values).
Strictify
The next step then is to encourage GHC to unbox this loop, by providing some strictness annotations. So rewrite the loop like this:
loop k n alt a1 a2 a3 a4 a5 a6 a7 a8 a9 | () !k !n !alt !a1 !a2 !a3 !a4 !a5 !a6 !a7 !a8 !a9 !False = undefined | k > n = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ] | otherwise = loop (k+1) n (-alt) (a1 + (2/3) ** (k-1)) (a2 + k ** (-0.5)) (a3 + 1 / (k * (k + 1))) (a4 + 1 / (k*k*k * sin k * sin k)) (a5 + 1 / (k*k*k * cos k * cos k)) (a6 + 1 / k) (a7 + 1 / (k*k)) (a8 + alt / k) (a9 + alt / (2 * k - 1)) where x ! y = x `seq` y
Here the first guard is purely a syntactic trick to inform ghc that the arguments should be strictly evaluated. I've played a little game here, using ! for `seq` is reminiscent of the new bang-pattern proposal for strictness. Let's see how this compiles. Strictifying all args GHC produces an inner loop of:
$sloop_r2WS :: GHC.Prim.Double# -> GHC.Prim.Double# -> GHC.Prim.Double# -> GHC.Prim.Double# -> GHC.Prim.Double# -> GHC.Prim.Double# -> GHC.Prim.Double# -> GHC.Prim.Double# -> GHC.Prim.Double# -> GHC.Prim.Double# -> GHC.Prim.Double# -> GHC.Prim.Double# -> [GHC.Float.Double]
Ah! perfect. Let's see how that runs:
$ ghc Naive.hs -O2 -no-recomp $ time ./a.out 2500000 3.000000000 (2/3)^k 3160.817621887 k^-0.5 0.999999600 1/k(k+1) 30.314541510 Flint Hills 42.995233998 Cookson Hills 15.309017155 Harmonic 1.644933667 Riemann Zeta 0.693146981 Alternating Harmonic 0.785398063 Gregory ./a.out 2500000 4.45s user 0.02s system 99% cpu 4.482 total
Crank up the gcc flags
Not too bad. No space leak and quite zippy. But let's see what more can be done. First, double arithmetic usually (always?) benefits from -fexcess-precision, and cranking up the flags to gcc:
paprika$ ghc Naive.hs -O2 -fexcess-precision -optc-O3 -optc-ffast-math -no-recomp paprika$ time ./a.out 2500000 3.000000000 (2/3)^k 3160.817621887 k^-0.5 0.999999600 1/k(k+1) 30.314541510 Flint Hills 42.995233998 Cookson Hills 15.309017155 Harmonic 1.644933667 Riemann Zeta 0.693146981 Alternating Harmonic 0.785398063 Gregory ./a.out 2500000 3.71s user 0.01s system 99% cpu 3.726 total
Even better! Now, let's dive into the Core to see if there are any optimisation opportunites that GHC missed. So add -ddump-simpl and peruse the output.
Common subexpressions
Looking at the Core, I see firstly that some of the common subexpressions haven't been factored out:
case [GHC.Float.Double] GHC.Prim./## 1.0 (GHC.Prim.*## (GHC.Prim.*## (GHC.Prim.*## (GHC.Prim.*## sc10_s2VS sc10_s2VS) sc10_s2VS) (GHC.Prim.sinDouble# sc10_s2VS)) (GHC.Prim.sinDouble# sc10_s2VS))
Multiple calls to sin. Hmm... And similar for cos and k*k. Simon Peyton-Jones says:
GHC doesn't do full CSE. It'd be a relatively easy pass for someone to add, but it can cause space leaks. And it can replace two strictly-evaluated calls with one lazy thunk: let { x = case e of ....; y = case e of .... } in ... ==> let { v = e; x = case v of ...; y = case v of ... } in ...
Instead GHC does "opportunistic CSE". If you have let x = e in .... let y = e in .... then it'll discard the duplicate binding. But that's very weak.
So it looks like we might have to float out the commmon subexpressions by hand. The inner loop now looks like:
loop k n alt a1 a2 a3 a4 a5 a6 a7 a8 a9 | () !k !n !alt !a1 !a2 !a3 !a4 !a5 !a6 !a7 !a8 !a9 !False = undefined | k > n = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ] | otherwise = loop (k+1) n (-alt) (a1 + (2/3) ** (k-1)) (a2 + k ** (-0.5)) (a3 + 1 / (k * (k + 1))) (a4 + 1 / (k3 * sk * sk)) (a5 + 1 / (k3 * ck * ck)) (a6 + 1 / k) (a7 + 1 / k2) (a8 + alt / k) (a9 + alt / (2 * k - 1)) where sk = sin k ck = cos k k2 = k * k k3 = k2 * k x ! y = x `seq` y
looking at the Core shows the sins are now allocated and shared:
let a9_s2MI :: GHC.Prim.Double# a9_s2MI = GHC.Prim.sinDouble# sc10_s2Xa
So the common expressions are floated out, and it now runs:
paprika$ time ./a.out 2500000 3160.817621887 k^-0.5 0.999999600 1/k(k+1) 30.314541510 Flint Hills 42.995233998 Cookson Hills 15.309017155 Harmonic 1.644933667 Riemann Zeta 0.693146981 Alternating Harmonic 0.785398063 Gregory ./a.out 2500000 3.29s user 0.00s system 99% cpu 3.290 total
Faster. So we gained 12% by floating out those common expressions.
Strength reduction
Finally, another trick -- manual strength reduction. When I checked the C entry, it used an integer for the k parameter to the loop, and cast it to a double for the math each time around, so perhaps we can make it an Int parameter. Secondly, the alt parameter only has it's sign flipped each time, so perhaps we can factor out the alt / k arg (it's either 1 / k or -1 on k), saving a division. Thirdly, (k ** (-0.5)) is just a slow way of doing a sqrt.
The final loop looks like:
loop i n alt a1 a2 a3 a4 a5 a6 a7 a8 a9 | i !n !alt !a1 !a2 !a3 !a4 !a5 !a6 !a7 !a8 !a9 !False = undefined -- strict | k > n = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ] | otherwise = loop (i+1) n (-alt) (a1 + (2/3) ** (k-1)) (a2 + 1 / sqrt k) (a3 + 1 / (k * (k + 1))) (a4 + 1 / (k3 * sk * sk)) (a5 + 1 / (k3 * ck * ck)) (a6 + dk) (a7 + 1 / k2) (a8 + alt * dk) (a9 + alt / (2 * k - 1)) where k3 = k2*k; k2 = k*k; dk = 1/k; k = fromIntegral i :: Double sk = sin k; ck = cos k; x!y = x`seq`y
Checking the generated C code (for another tutorial, perhaps) shows that the same C operations are generated as the C entry uses.
And it runs:
$ time ./i 2500000 3.000000200 (2/3)^k 3186.765000000 k^-0.5 0.999852700 1/k(k+1) 30.314493000 Flint Hills 42.995068000 Cookson Hills 15.403683000 Harmonic 1.644725300 Riemann Zeta 0.693137470 Alternating Harmonic 0.785399100 Gregory ./i 2500000 2.37s user 0.01s system 99% cpu 2.389 total
A big speedup!
This entry in fact runs faster than hand optimised (and vectorised) GCC! And is only slower than optimised Fortran. Lesson: Haskell can be very, very fast.
So, by carefully tweaking things, we first squished a space leak, and then gained another 45%.
Summary
- Manually inspect the Core that is generated
- Use strictness annotations to ensure loops are unboxed
- Watch out for optimisations such as CSE and strength reduction that are missed
- Read the generated C for really tight loops.
- Use -fexcess-precision and -optc-ffast-math for doubles
Pattern matching
On rare occasions pattern matching can give improvements in code that needs to repeatedly take apart data structures. This code:
flop :: Int -> [Int] -> [Int] flop n xs = rs where (rs, ys) = fl n xs ys fl 0 xs ys = (ys, xs) fl n (x:xs) ys = fl (n-1) xs (x:ys)
Can be rewritten to be faster (and more ugly) as:
flop :: Int -> [Int] -> [Int] flop 2 (x1:x2:xs) = x2:x1:xs flop 3 (x1:x2:x3:xs) = x3:x2:x1:xs flop 4 (x1:x2:x3:x4:xs) = x4:x3:x2:x1:xs flop 5 (x1:x2:x3:x4:x5:xs) = x5:x4:x3:x2:x1:xs flop 6 (x1:x2:x3:x4:x5:x6:xs) = x6:x5:x4:x3:x2:x1:xs flop 7 (x1:x2:x3:x4:x5:x6:x7:xs) = x7:x6:x5:x4:x3:x2:x1:xs flop 8 (x1:x2:x3:x4:x5:x6:x7:x8:xs) = x8:x7:x6:x5:x4:x3:x2:x1:xs flop 9 (x1:x2:x3:x4:x5:x6:x7:x8:x9:xs) = x9:x8:x7:x6:x5:x4:x3:x2:x1:xs flop 10 (x1:x2:x3:x4:x5:x6:x7:x8:x9:x10:xs) = x10:x9:x8:x7:x6:x5:x4:x3:x2:x1:xs flop n xs = rs where (rs, ys) = fl n xs ys fl 0 xs ys = (ys, xs) fl n (x:xs) ys = fl (n-1) xs (x:ys)
Memory allocation and arrays
When you are allocating arrays, it may help to know a little about GHC's memory allocator. There are lots of deatils in The GHC Commentary), but here are some useful facts:
- For larger objects ghc has an allocation granularity of 4k. That is it always uses a multiple of 4k bytes, which can lead to wasteage of up to 4k per array. Furthermore, a byte array has some overhead: it needs one word for the heap cell header and another for the length. So if you allocate a 4k byte array then it uses 8k. So the trick is to allocate 4k - overhead. This is what the Data.ByteString library does
- GHC allocates memory from the OS in units of a "megablock", currently 1Mbyte. So if you allocate a 1Mb array, the storage manager has to allocate 1Mb + overhead, which will cause it to allocate a 2Mb megablock. The surplus will be returned to the system in the form of free blocks, but if all you do is allocate lots of 1Mb arrays, you'll waste about half the space because there's never enough contiguous free space to contain another 1Mb array. Similar problem for 512k arrays: the storage manager allocates a 1Mb block, and returns slightly less than half of it as free blocks, so each 512k allocation takes a whole new 1Mb block.
Rewrite rules
Algebraic properties in your code might be missed by the GHC optimiser. You can use user-supplied rewrite rules to teach the compiler to optimise your code using domain-specific optimisations.