GHC optimisations
Introduction
This page collects together information about the optimisations that GHC does and does not perform.
- GHC experts: Please check that the info in this page is correct.
- Everybody else: Feel free to add questions!
General optimisations
Dead code elimination
Does GHC remove code that you're not actually using?
Yes and no. If there is something in a module that isn't exported and isn't used by anything that is exported, it gets ignored. (This makes your compiled program smaller.) So at the module level, yes, GHC does dead code elimination.
On the other hand, if you import a module and use just one function from it, all of the code for all of the functions in that module gets linked in. So in this sense, no, GHC doesn't do dead code elimination.
(There is a switch, -split-objs, to make GHC spit out a separate object file for each individual function in a module. If you use this, only the functions that are actually used will get linked into your executable. But this tends to freak out the linker program...)
If you want to be warned about unused code (why do you have it there if it's unused? did you forget to type something?) you can use the -fwarn-unused-binds option (or just -Wall).
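As a tiny example (the module and names here are made up), bar below is neither exported nor used by anything exported, so GHC drops it from the compiled module and -fwarn-unused-binds reports it:
module Thing (foo) where

foo :: Int -> Int
foo = (+ 1)

-- Not exported and not used by 'foo': dead code, and the source of a
-- "Defined but not used" warning under -fwarn-unused-binds / -Wall.
bar :: Int -> Int
bar = (* 2)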
Common subexpression elimination
First of all, common subexpression elimination (CSE) means that if an expression appears in several places, the code is rearranged so that the value of that expression is computed only once. For example:
foo x = (bar x) * (bar x)
might be transformed into
foo x = let x' = bar x in x' * x'
thus, the bar function is only called once. (And if bar is a particularly expensive function, this might save quite a lot of work.)
GHC doesn't actually perform CSE as often as you might expect. The trouble is, performing CSE can change the sharing, and hence the laziness and space behaviour, of the program. So GHC does do CSE, but only in specific circumstances --- see the GHC manual. (Section??)
Long story short: "If you care about CSE, do it by hand."
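For example, a hand-applied version of the transformation above (bar is again just a stand-in for something expensive):
-- Share the expensive call explicitly instead of hoping GHC will do it.
foo :: Int -> Int
foo x = y * y
  where
    y = bar x

-- Stand-in for some expensive function.
bar :: Int -> Int
bar n = sum [1 .. n]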
Inlining
Inlining is where a function call is replaced by that function's definition. For example, the standard map function can be defined as
map :: (a -> b) -> [a] -> [b]
map f [] = []
map f (x:xs) = f x : map f xs
Now if you write something like
foo = map bar
it's possible that the compiler might inline the definition of map, yielding something like
foo [] = []
foo (x:xs) = bar x : foo xs
which is (hopefully!) faster, because it doesn't involve a call to the map function any more; it just does the work directly. (This might also expose new optimisation opportunities; map works for any type, whereas foo probably works for only one type.)
Inlining can be forced with the INLINE pragma:
{-# INLINE key_function #-}
Inlining can be prevented with the NOINLINE pragma:
{-# NOINLINE key_function #-}
The compiler option -funfolding-use-threshold<n> sets the size threshold below which a function's definition will be inlined at a call site.
So, that's what inlining is. By default, GHC will inline things if they are 'small enough'. Every time you inline a function, you are in a sense making a (customised) copy of that function. Do too much of this and the compiled program will be enormous. So it's only worth it for 'small' functions.
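For example, a tiny combinator like the (hypothetical) applyTwice below is a natural candidate for inlining: each call site gets its own copy, which costs almost nothing here because the definition is so small.
{-# INLINE applyTwice #-}
applyTwice :: (a -> a) -> a -> a
applyTwice f x = f (f x)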
Specialisation
Flexibility is the enemy of performance. Take (+) for example. As you know, it adds two numbers together. But are they two integers? Two floating-point numbers? Two complex numbers? Two vectors? The generated machine code is very, very different in each case!
It's easy enough to make a function such as sum, which will work for any type of number. However, in the interests of performance, if the compiler can determine exactly which type of number we're going to be working on, it can generate exactly the right machine code, without having to do lots of runtime lookups.
GHC tries to do this where possible. However (as I understand it?) this tends to work less well across module boundaries. For example, suppose you write
module Physics where
data Force = ...
instance Num Force where ...
resultant_force :: [Force] -> Force
resultant_force = sum
One might hope that resultant_force would get compiled using a special version of sum tailored to adding up only Force objects. This may or may not happen.
Generally GHC won't just take an existing function and recompile it with a new type signature. What might happen is that the function gets inlined, and specialised from there. (Can someone say something more concrete here?)
Specialisation of the function resultant_force can be forced with the SPECIALIZE pragma. Note that the pragma only applies to overloaded functions, so resultant_force would need the more general type Num a => [a] -> a for this to work:
{-# SPECIALIZE resultant_force :: [Force] -> Force #-}
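Putting the pieces together, here is a minimal, self-contained sketch; the Force representation and its Num instance are made up for illustration:
-- A minimal sketch; 'Force' and its Num instance are hypothetical.
newtype Force = Force Double
  deriving (Eq, Show)

instance Num Force where
  Force a + Force b = Force (a + b)
  Force a - Force b = Force (a - b)
  Force a * Force b = Force (a * b)
  abs (Force a)     = Force (abs a)
  signum (Force a)  = Force (signum a)
  fromInteger n     = Force (fromInteger n)

-- Overloaded, so it can be specialised; GHC compiles an extra copy with
-- the Num dictionary for Force resolved at compile time.
resultant_force :: Num a => [a] -> a
resultant_force = sum
{-# SPECIALIZE resultant_force :: [Force] -> Force #-}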
Strictness analysis
Haskell is a lazy language. Calculations are notionally not performed until their results are 'needed'. However, if the result definitely will be needed, it's a waste of time and effort to save up the expression and execute it later; more efficient to just execute it right now.
Strictness analysis is a process by which GHC attempts to determine, at compile-time, which data definitely will 'always be needed'. GHC can then build code to just calculate such data, rather than the normal (higher overhead) process for storing up the calculation and executing it later.
Unfortunately, looking at a program and saying "will this data be needed?" is a bit like looking at a program and saying "this program will never halt" --- see The Halting Problem. But GHC does its best, and can give big speedups in some cases.
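As a hedged illustration (the function is made up): in the loop below the accumulator acc is certain to be demanded once the loop finishes, so with -O GHC's strictness analysis can evaluate it eagerly (and typically unbox it to an Int#) instead of building a chain of (acc + i) thunks.
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    -- 'acc' is always needed in the end, so strictness analysis lets GHC
    -- keep it evaluated as the loop runs rather than thunking it up.
    go acc i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)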
Fusion
In Haskell, it is common to write expressions such as
foo = length . filter p . map g . map f . concat
This style of writing makes it very clear what the function does (it takes a list of lists, concatenates them all, applies f to every element, applies g to every element, throws away all elements that fail p, and then calculates the length of the result). However, if executed literally, it's very inefficient.
When executed, concat takes a list of lists and constructs a flat list. Then map constructs another list. Then the second map creates yet another list...
Since Haskell is a lazy language, these intermediate lists never exist in memory in their entirety. One element is generated by one function and immediately consumed by the next function in the chain, so as each element is generated, it instantly becomes garbage. The peak memory usage therefore isn't high, but the GC load is. (Not to mention all the time wasted on creating thunks, evaluating thunks, and allocating/deallocating RAM.) So we really want to avoid all this!
The term fusion refers to program transformations aimed at removing intermediate data structures. (Deforestation refers specifically to lists, but in general fusion is applicable to operations on any structure.)
The standard libraries provide a function concatMap such that
concatMap f = concat . map f
As you can see, we don't 'need' this function --- we can define it in terms of other, simpler functions. However, it's more efficient to run because it doesn't generate an intermediate list of lists. (It's also used to define the list monad, which is probably why it's there.)
Having concatMap is nice. But we really don't want to define new functions for every possible combination of list operators. (Do you fancy implementing a lengthFilterMapMapConcat function?) So one of the optimisations that GHC performs is to attempt to perform fusion automatically.
One way that we could try to do this is by inlining all the function definitions. But list processing functions are generally recursive, which makes matters rather complicated. (I.e., this doesn't really work.)
Currently (GHC 6.6.1) we have foldr/build fusion. That is, where one function builds a list and passes the result to another function that consumes it, GHC can (usually) eliminate the intermediate list itself. There are also other transformations that can be applied. For example, map fusion. Map fusion simply says that
map g . map f
is equivalent to
map (g . f)
(but the latter is more efficient).
All of this is implemented using GHC's transformation rules facility. This functionality is only turned on with -O or -O2.
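As a sketch of what such a rule looks like (the real rules in the base library are more careful about how and when they fire), the map fusion law above can be written as:
{-# RULES
"map/map"  forall f g xs.  map g (map f xs) = map (g . f) xs
  #-}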
In the future (GHC 6.7?) we will have stream fusion. In layman's terms, this increases the number of functions that can be fused, which means big speedups.
To be more technical, a stream represents a traversal of a list (or, indeed, some other structure such as an array). All the list functions become stream functions --- but, crucially, stream operations are non-recursive, meaning they can all be glued together. Taking our example above:
foo = length . filter p . map g . map f . concat
becomes something like
foo =
length .
fromStream . streamFilter p . toStream .
fromStream . streamMap g . toStream .
fromStream . streamMap f . toStream .
fromStream . streamConcat . toStream
which, obviously, is massively less efficient than the original. However, since
toStream . fromStream = id
we can simplify that down to
foo =
length .
fromStream . streamFilter p .
streamMap g .
streamMap f .
streamConcat . toStream
In other words, we have a toStream at one end and a fromStream at the other end, with a bunch of stream operations in the middle. These are all non-recursive; only fromStream actually performs a recursive loop, so once GHC does all its inlining we'll end up with something like
foreach x in xs do
... concat ...
... map f ...
... map g ...
... filter p ...
... length ...
which is what we want.
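To make this a little more concrete, here is a hedged sketch of what such a stream library might look like. This is not GHC's actual implementation; in particular, real stream fusion adds a Skip constructor so that functions like filter also stay non-recursive.
{-# LANGUAGE ExistentialQuantification #-}

-- A stream is a seed plus a non-recursive step function.
data Step s a = Done | Yield a s
data Stream a = forall s. Stream (s -> Step s a) s

toStream :: [a] -> Stream a
toStream xs0 = Stream step xs0
  where
    step []     = Done
    step (x:xs) = Yield x xs

-- The only recursive loop: unfold the stream back into a list.
fromStream :: Stream a -> [a]
fromStream (Stream step s0) = go s0
  where
    go s = case step s of
             Done       -> []
             Yield x s' -> x : go s'

-- Non-recursive, so it can be inlined and glued to its neighbours.
streamMap :: (a -> b) -> Stream a -> Stream b
streamMap f (Stream step s0) = Stream step' s0
  where
    step' s = case step s of
                Done       -> Done
                Yield x s' -> Yield (f x) s'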
Execution Model
In order to understand how to write efficient code, and what GHC does with your code to optimise it, it helps to know a bit about what your compiled code looks like and how it works.
Graph reduction
To a first approximation, at any moment your program is a 'graph' of objects in memory. ('Graph' in the graph theory sense --- nodes connected by arcs.) Some of the objects are data --- booleans, integers, strings, lists, etc. Some are functions (because Haskell lets you pass functions around like data). And some are thunks --- unevaluated expressions (because Haskell only evaluates expressions 'as needed').
The program starts off with a single node representing the unevaluated call to main, and proceeds to execute from there. Each time a thunk is evaluated, the result (whatever it is) overwrites the thunk. (It's possible that the result of evaluating a thunk is a new thunk, of course.)
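One hedged way to see this update-in-place behaviour is with Debug.Trace (the numbers are arbitrary): shared is a single thunk node, so the trace message is printed only once even though the value is used twice.
import Debug.Trace (trace)

main :: IO ()
main = do
  -- After the first demand, the thunk node is overwritten with the result,
  -- so the second use finds an already-evaluated value.
  let shared = trace "evaluating" (sum [1 .. 1000 :: Int])
  print (shared + shared)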
About STG
GHC compiles to the spineless tagless G-machine (STG). This is a notional graph reduction machine (i.e., a virtual machine that performs graph reductions as described above). 'G-machine' because it does graph reduction. 'Spineless' because it can't stand up to bullies. 'Tagless' because the graph nodes don't have 'tags' on them to say what they are.
Instead of tags, the nodes have access pointers. If the node is a thunk, its pointer points to the code to evaluate the thunk and return the real result. Otherwise the pointer points to some 'do-nothing' code. So to access any type of node, you just do an indirect jump on this pointer; no case analysis is necessary.
(Gosh I hope I got that lot right!)
Internally, GHC uses a kind of 'machine code' that runs on this non-existent G-machine. It does a number of optimisations on that representation, before finally compiling it into real machine code (possibly via C using GCC).
STG optimisations
There are a number of optimisations done at the STG level. These mainly involve trying to avoid unnecessary steps. For example, avoid creating a thunk which immediately creates another thunk when executed; make it evaluate all the way down to a final result in one go. (If we 'need' the thunk's value, we're going to evaluate all the way down anyway, so let's leave out the overhead...)
Primitive data types
Haskell 98 provides some standard types such as Int. GHC defines these as 'boxed' versions of GHC-specific 'unboxed' types:
-- From GHC.Exts:
data Int = I# Int#
data Word = W# Word#
data Double = D# Double#
-- etc.
Here Int# is a GHC-specific internal type representing, literally, a plain ordinary bundle of 32 or 64 bits inside the computer somewhere. (Depending on whether it's a 32- or 64-bit architecture.)
In particular, an Int# is strict, whereas an Int isn't.
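For the curious, you can peek inside the box with GHC's MagicHash extension; a small sketch (addInts is a made-up name):
{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int (I#), (+#))

-- Pattern matching on I# exposes the raw Int#; (+#) then adds the unboxed
-- values directly, with no thunks and no heap allocation.
addInts :: Int -> Int -> Int
addInts (I# x) (I# y) = I# (x +# y)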
Algebraic data types
(I'm not sure about the basic memory layout. Somebody fill in the general case?)
There are a few special cases:
Types with 1 constructor
If a function returns a tuple of values, and the caller immediately takes the tuple apart again, GHC will attempt to eliminate the tuple completely at the machine code level. Actually, this works for all types having exactly 1 constructor.
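A hedged illustration (the names are made up): the pair returned by divMod' below is immediately taken apart by its caller, so with -O GHC's worker/wrapper transformation can often return the two fields directly, without ever allocating the tuple. Whether the allocation really disappears depends on the optimisation level and on how things get inlined.
divMod' :: Int -> Int -> (Int, Int)
divMod' n d = (n `div` d, n `mod` d)

-- The caller deconstructs the pair straight away, so the tuple itself
-- need never be built on the heap.
useIt :: Int -> Int
useIt n = let (q, r) = divMod' n 10 in q + r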
Constructors with no fields
Booleans are a good example:
data Bool = False | True
GHC will construct a single object in memory representing False, and another representing True. All Bool values are thus pointers to one or the other of these objects (and hence consume either 32 or 64 bits).