Haskell Performance Resource
Welcome to the Haskell Performance Resource, the collected wisdom on how to make your Haskell programs go faster.
In most cases it is possible to write a Haskell program that performs as well as, or better than, the same program written in [insert language here]. There's a big caveat though: you may have to modify your code significantly in order to improve its performance. Compilers such as GHC are good at eliminating layers of abstraction, but they aren't perfect, and often need some help.
There are many non-invasive techniques: compiler options, for example. Then there are techniques that require adding some small amounts of performance cruft to your program: strictness annotations, for example. If you still don't get the best performance, though, it might be necessary to resort to larger refactorings.
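As a rough illustration of the "small amounts of performance cruft" mentioned above, the sketch below shows two common kinds of strictness annotation: strict constructor fields and bang patterns. The names `Stats`, `mean` and `sumStrict` are made up for this example, and `BangPatterns` is a GHC extension, so the second function is not portable to every compiler. The non-invasive route would simply be a compiler option such as `ghc -O2`.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- | Running count and total. The bangs make both fields strict, so they
-- are evaluated whenever a Stats value is built and no chain of
-- unevaluated additions accumulates. (Hypothetical type for illustration.)
data Stats = Stats !Int !Double

-- | Mean of a list in a single pass, using a strict left fold.
-- For an empty list this evaluates 0/0, which is NaN.
mean :: [Double] -> Double
mean xs = total / fromIntegral count
  where
    Stats count total = foldl' step (Stats 0 0) xs
    step (Stats n s) x = Stats (n + 1) (s + x)

-- | Sum with an explicit accumulator. The bang pattern forces the
-- accumulator on every iteration (GHC-specific syntax).
sumStrict :: [Int] -> Int
sumStrict = go 0
  where
    go !acc []       = acc
    go !acc (y : ys) = go (acc + y) ys
```

Loading this in GHCi and evaluating `mean [1, 2, 3, 4]` should give `2.5`. Whether annotations like these actually help depends on the program, which is why profiling comes first.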
Sometimes the code tweaks required to get the best performance are non-portable, perhaps because they require language extensions that aren't implemented in all compilers (e.g. unboxing), or because they require using platform-specific features or libraries. This might not be acceptable in your setting.
If the worst comes to the worst, you can always write your critical code in C and use the FFI to call it. Beware of the boundaries though - marshaling data across the FFI can be expensive, and multi-language memory management can be complex and error-prone. It's usually better to stick to Haskell if possible.
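For a sense of what calling C through the FFI looks like, here is a minimal sketch that binds `cos` from the C standard library. The wrapper name `cosViaC` is invented for this example, and real critical code would of course be your own C function rather than `cos`. Note the `realToFrac` conversions at the boundary: even in this tiny case the data has to be converted on the way in and out.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign.C.Types (CDouble (..))

-- Bind C's cos from math.h. 'unsafe' makes the call cheaper, and is
-- appropriate here because cos is short-running and never calls back
-- into Haskell.
foreign import ccall unsafe "math.h cos"
  c_cos :: CDouble -> CDouble

-- | Haskell-friendly wrapper (hypothetical name). The conversions
-- between Double and CDouble are the marshaling cost in miniature.
cosViaC :: Double -> Double
cosViaC = realToFrac . c_cos . realToFrac
```

For scalar arguments the overhead is small. It grows once strings, arrays or structures have to be copied across the boundary, which is where the warning above really applies.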
The key tool to use in making your Haskell program run faster is profiling. Profiling is provided by GHC and nhc98. There is no substitute for finding where your program's time/space is really going, as opposed to where you imagine it is going.
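With GHC, one way to get started is to mark the expressions you care about with cost-centre (`SCC`) pragmas, build with `-prof`, and run with `+RTS -p`. The sketch below assumes a made-up workload, `meanOfSquares`, and made-up cost-centre names. Without `-prof` the pragmas are simply ignored, so they are safe to leave in place.

```haskell
import Data.List (foldl')

-- | Hypothetical workload: the mean of the squares of 1..n.
-- Each SCC pragma names a cost centre, so the time and allocation
-- of that expression show up as a separate row in the profile.
meanOfSquares :: Int -> Double
meanOfSquares n = fromIntegral total / fromIntegral n
  where
    squares = {-# SCC "build_squares" #-} map (\x -> x * x) [1 .. n]
    total   = {-# SCC "sum_squares" #-} foldl' (+) 0 squares
```

If this is called from a program's `main` and built with something like `ghc -O2 -prof -fprof-auto Main.hs`, running `./Main +RTS -p` should write a `Main.prof` report listing `build_squares` and `sum_squares` alongside the automatically inserted cost centres.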
Another point to bear in mind: By far the best way to improve a program's performance dramatically is to use better algorithms. Once profiling has thrown the spotlight on the guilty time-consumer(s), it may be better to re-think your program than to try all the tweaks listed below.
Another extremely efficient way to make your program snappy is to use library code that has been Seriously Tuned By Someone Else. You might be able to write a better sorting function than the one in Data.List, but it will take you much longer than typing import Data.List.
We have chosen to organise the rest of this resource first by Haskell construct (data types, pattern matching, integers), and then within each category to describe techniques that apply across implementations, and also techniques that are specific to a certain Haskell implementation (e.g. GHC). There are some implementation-specific techniques that apply in general - those are linked from the General Implementation-Specific Techniques section below.
- Data Types
- Floating Point
Compiler specific techniques
- There are plenty of good examples of Haskell code written for performance in The Computer Language Shootout Benchmarks
- And many alternatives, with discussion, on the old Haskell wiki
Specific comparisons of data structures
Data.Sequence vs. lists
Data.Sequence has complexity O(log(min(i,n-i))) for access, insertion and update to position i of a sequence of length n.
List has complexity O(i).
List is a non-trivial constant-factor faster for operations at the head (cons and head), making it a more efficient choice for stack-like and stream-like access patterns. Data.Sequence is faster for every other access pattern, such as queue and random access.
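The two access patterns can be sketched as follows: a stack on a plain list, where both operations touch only the head, and a FIFO queue on Data.Sequence, which adds at one end and removes from the other. The function names (`push`, `pop`, `enqueue`, `dequeue`, `emptyQueue`) are illustrative only and are not part of either library.

```haskell
import Data.Sequence (Seq, ViewL (..), empty, viewl, (|>))

-- Stack on a plain list: push and pop both work at the head.
push :: a -> [a] -> [a]
push = (:)

pop :: [a] -> Maybe (a, [a])
pop []       = Nothing
pop (x : xs) = Just (x, xs)

-- FIFO queue on Data.Sequence: add at the right end, remove from the left.
emptyQueue :: Seq a
emptyQueue = empty

enqueue :: a -> Seq a -> Seq a
enqueue x q = q |> x

dequeue :: Seq a -> Maybe (a, Seq a)
dequeue q = case viewl q of
  EmptyL  -> Nothing
  x :< q' -> Just (x, q')
```

A list-based queue would have to append at the far end, which costs O(n) per element, or keep two lists and reverse one occasionally. The sequence version avoids both.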
The following micro-benchmark appends a million elements one at a time, first with Data.Sequence:
    import Data.Sequence

    insert_million 0 sequence = sequence
    insert_million n sequence = insert_million (n - 1) (sequence |> n)

    main = putStrLn (show (Data.Sequence.length (insert_million 1000000 empty)))
    $ ghc -O2 --make InsertMillionElements.hs && time ./InsertMillionElements +RTS -K100M
    1000000
    real 0m7.238s
    user 0m6.804s
    sys 0m0.228s
The list version conses each element onto the front and reverses once at the end:

    insert_million 0 list = reverse list
    insert_million n list = insert_million (n - 1) (n : list)

    main = putStrLn (show (length (insert_million 1000000 [])))
    $ ghc -O2 --make InsertMillionElements.hs && time ./InsertMillionElementsList +RTS -K100M
    1000000
    real 0m0.588s
    user 0m0.528s
    sys 0m0.052s
Lists are substantially faster on this micro-benchmark.
A sequence uses between 5/6 and 4/3 times as much space as the equivalent list (assuming an overhead of one word per node, as in GHC). If only deque operations are used, the space usage will be near the lower end of the range, because all internal nodes will be ternary. Heavy use of split and append will result in sequences using approximately the same space as lists. In detail:
- a list of length n consists of n cons nodes, each occupying 3 words.
- a sequence of length n has approximately n/(k-1) nodes, where k is the average arity of the internal nodes (each 2 or 3). There is a pointer, a size and overhead for each node, plus a pointer for each element, i.e. n(3/(k-1) + 1) words.
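The 5/6 and 4/3 bounds follow directly from the two word counts above. As a quick check, the sketch below encodes both formulas and takes their ratio. The function names are invented for this example, and the cost model is only the approximation given in the text.

```haskell
-- | Words used by a list of n elements: 3 words per cons node.
listWords :: Double -> Double
listWords n = 3 * n

-- | Approximate words used by a sequence of n elements whose internal
-- nodes have average arity k (between 2 and 3): n(3/(k-1) + 1).
seqWords :: Double -> Double -> Double
seqWords k n = n * (3 / (k - 1) + 1)

-- | Space used by a sequence relative to the equivalent list.
-- The n cancels, so the ratio depends only on k.
spaceRatio :: Double -> Double
spaceRatio k = seqWords k 1 / listWords 1
```

With all-ternary nodes, `spaceRatio 3` is 2.5/3 = 5/6. With all-binary nodes, `spaceRatio 2` is 4/3.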
- Use strict returns (return $! ...) unless you absolutely need them to be lazy.
- Profile, profile, profile - understand who is hanging on to the memory (+RTS -hc) and how it's being used (+RTS -hb).
- Use +RTS -p to understand who's doing all the allocations and where your time is being spent.
- Approach profiling like a science experiment - make one change, observe if anything is different, roll back and make another change, then observe that change. Keep notes!
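The first tip, strict returns, can be sketched with a monadic fold. Both function names below are made up for this example. The two functions compute the same result. The difference is in when the addition is evaluated, which is the kind of thing the heap profile would reveal.

```haskell
import Control.Monad (foldM)

-- | Lazy return: each step returns the unevaluated thunk (acc + x),
-- so a long list can build a long chain of pending additions.
sumLazyM :: [Int] -> IO Int
sumLazyM = foldM (\acc x -> return (acc + x)) 0

-- | Strict return: ($!) evaluates acc + x before returning it,
-- so the accumulator is always a plain number.
sumStrictM :: [Int] -> IO Int
sumStrictM = foldM (\acc x -> return $! acc + x) 0
```

With optimisation on, GHC's strictness analysis may remove the difference in a case this simple, so measure before and after rather than assuming the annotation helped.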