How to profile a Haskell program
- Note: I gave up halfway and moved on to other things. Feel free to take over and steer this article into something useful
The case study
I have a script that converts from an XML format to some pickled data structures via Data.Binary. The XML part is generated by HaXml's DtdToHaskell. On a 54M XML file, the thing swaps like crazy and takes several hours. I would like to improve the situation.
Preliminaries
Enable profiling on libraries
For example, my script uses HaXml, which uses a library called polyparse:
cd polyparse
runhaskell Setup.hs configure --enable-library-profiling
runhaskell Setup.hs build
sudo runhaskell Setup.hs install
cd ..

cd HaXml
runhaskell Setup.hs configure --enable-library-profiling
runhaskell Setup.hs build
sudo runhaskell Setup.hs install
When they are done building, you should notice output like:
ar: creating archive dist/build/libHSpolyparse-1.0.a
ar: creating archive dist/build/libHSpolyparse-1.0_p.a
The _p file is the library built with profiling support. Note that the non-profiling one is also created and installed, so you don't have to worry about this slowing down your regular code.
You'll need to do this for every library that you use.
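If you depend on several libraries, the repeated steps above can be wrapped in a small loop. This is only a sketch: the function name build_profiled is made up for illustration, and it assumes each library sits in its own directory with a Setup.hs, listed in dependency order (polyparse before HaXml).

```shell
#!/bin/sh
# build_profiled DIR...: configure, build and install each library
# directory with profiling enabled. The function name is hypothetical.
# List the directories in dependency order, dependencies first.
build_profiled() {
    for lib in "$@"; do
        (
            # The subshell keeps the cd from leaking into the caller.
            cd "$lib" || exit 1
            runhaskell Setup.hs configure --enable-library-profiling &&
            runhaskell Setup.hs build &&
            sudo runhaskell Setup.hs install
        ) || { echo "profiling build failed in $lib" >&2; return 1; }
    done
}

# Example call (not run here):
# build_profiled polyparse HaXml
```

The loop stops at the first library that fails, so a broken dependency does not get papered over by later builds.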
Enable profiling on your stuff
Note that I assume you are using Cabal. If not, see How to write a Haskell program. It's super easy, and you'll be happy you did it.
cd yourProgram
runhaskell Setup.hs configure --enable-executable-profiling
runhaskell Setup.hs build
No need to install it. We'll be making changes aplenty.
Get toy data
My script takes hours to convert 50M of XML. Running it on such data every time I tweak something would clearly not be a good idea. You want something which is small enough for your program to come back relatively quickly, but large enough to study.
I use something like
sed -f makeToy.sed reallyBigFile.xml > toy.xml
where makeToy.sed is a bit of text-hacking to chop off the rest of my data at the arbitrarily chosen item #6621. It replaces the first line that mentions 6621 with the closing tag and then quits:
/6621/{
s/.*/<\/grammar>/
q
}
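It is worth trying the truncation on a tiny stand-in file before pointing it at the real data. The sketch below does that with sed's s and q commands. The function name make_toy and the sample contents are invented for illustration; only the marker 6621 and the closing tag come from the case study.

```shell
#!/bin/sh
# make_toy MARKER CLOSING_TAG: copy stdin to stdout up to (but not
# including) the first line matching MARKER, then print CLOSING_TAG
# in its place and stop reading. The function name is hypothetical.
make_toy() {
    marker=$1
    closing=$2
    # "|" is the s delimiter so that the "/" in a closing tag is harmless.
    sed "/$marker/{
s|.*|$closing|
q
}"
}

# Tiny stand-in for the real input (contents invented for illustration).
sample='<grammar>
<entry id="6620"/>
<entry id="6621"/>
<entry id="6622"/>
</grammar>'

toy=$(printf '%s\n' "$sample" | make_toy 6621 '</grammar>')
printf '%s\n' "$toy"
```

Because q stops sed at the marker line, the rest of a very large input is never read, which keeps this step fast even on the full file.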
Test harness
Make things easy on yourself! I find that it's very helpful to automate my way out of my clumsiness. Ideally, each tweak you make to your software should be accompanied by a single, simple run and not some long sequence of actions, half of which you might forget. Note: you might also consider using a Makefile instead of a bunch of scripts.
We'll be working with a stable and an unstable repository. You will probably be making a lot of small modifications to your program, so it would be nice to be able to save some of them along the way. Darcs is very handy for this.
Create a profiling directory
mkdir profiling
mv toy.xml profiling
Create a script profiling/setup
#!/bin/sh
chmod u+x profiling/setup
chmod u+x profiling/run
chmod u+x profiling/compare
chmod u+x profiling/save
runhaskell Setup.lhs configure --enable-executable-profiling
Create a script profiling/run
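This script compiles your code and runs it on the toy data, writing the profiling output next to it.
#!/bin/sh
PROG=geniconvert
# Viewer for the PostScript heap graph; "open" is the Mac OS X command
VIEW=open
# Quoted so that the flags and the input file end up in one variable
FLAGS="--yourflags profiling/toy.xml"
runhaskell Setup.lhs build || exit 1
# -p writes ${PROG}.prof, -hc writes ${PROG}.hp, -s writes the summary file
dist/build/${PROG}/${PROG} ${FLAGS} +RTS -p -hc -s${PROG}.summary
hp2ps ${PROG}.hp
${VIEW} ${PROG}.ps
cat ${PROG}.summary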
Create a script profiling/compare
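The article does not say what this script contained. One plausible starting point, offered purely as an assumption, is to diff the RTS summary from the stable checkout against the one from your working copy. The function name compare_summaries is hypothetical, and the layout in the commented call assumes the stable checkout is ../perfStable and that profiling/run has been run in both places.

```shell
#!/bin/sh
# compare_summaries STABLE UNSTABLE: show a unified diff of two RTS
# summary files. The function name is hypothetical. Exit status:
# 0 = identical, 1 = different (the usual case), 2 = a file is missing.
compare_summaries() {
    stable=$1
    unstable=$2
    for f in "$stable" "$unstable"; do
        [ -r "$f" ] || {
            echo "missing summary: $f (run profiling/run first)" >&2
            return 2
        }
    done
    diff -u "$stable" "$unstable"
}

# Assumed layout (not run here): stable checkout in ../perfStable,
# summary files named after PROG as in profiling/run.
# PROG=geniconvert
# compare_summaries "../perfStable/${PROG}.summary" "${PROG}.summary"
```

A plain diff is crude, since timings vary from run to run, but it is enough to spot large shifts in allocation or GC time between the two versions.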
Create a script profiling/save
#!/bin/sh
darcs push --no-set-default ../perfStable
cd ../perfStable
profiling/run
Create a stable branch
darcs get yourRepository perfStable
cd perfStable
sh profiling/setup
cd ..

cd yourRepository
sh profiling/setup
You should work in the unstable branch (yourRepository). From time to time, you'll want to record your changes and push them into the stable branch. More on this later.
Profiling
- Generate the data, advice on how to scrutinise it (help especially wanted)
Generate the data
This should just be:
profiling/run
Determine what is wrong
Fix your code
See Performance for ideas, especially Performance/GHC if relevant
Run it again
profiling/run
Save the results?
Happy with the direction things are taking?
profiling/save
Go profile again!