Difference between revisions of "Performance/Data types"

From HaskellWiki
Jump to navigation Jump to search
 
m (Perforamnce:Data Types moved to Performance:Data Types)
(No difference)

Revision as of 13:31, 10 January 2006

Newtypes

If your datatype has a single constructor with a single field, use a newtype declaration instead of a data declaration. The newtype constructor will be optimised away in most cases.

In fact, newtypes usually cost nothing, so they can be used to document your code for free.

Strict fields

The strictness annotation on constructor fields is used mainly to avoid space leaks. eg. in:

 data T = T !Int !Int

we know that neither component of the T constructor can harbour a space leak, because both components must be fully evaluated to Ints when the constructor is built.

However, bear in mind that strictness annotations can make performance worse. A strictness annotation forces the compiler to ensure that the field is fully evaluated before building the constructor, and if it turns out that the field was already evaluated then this is just wasted work. Clever compilers attempt to detect when this is the case and optimise away the wasted evaluation, but it is hard to do this well (GHC does a poor job, currently).

GHC-specific techniques

Single-constructor datatypes

GHC loves single-constructor datatypes, such as tuples. A single-constructor datatype can be unpacked when it is passed to a strict function. For example, given this function:

 f (x,y) = ...

GHC's strictness analyser will detect that f is strict in its argument, and compile the function like this:

 f z = case z of (x,y) -> f' x y
 f' x y = ...

where f is called the wrapper, and f' is called the worker. The wrapper is inlined everywhere, so for example if you had a call to f like this:

 ... f (3,4) ...

this will end up being compiled to

 ... f' 3 4 ...

and the tuple has been completely optimised away.

This only happens when the function argument is a single-constructor type. Note that basic types like Int and Char count as single-constructor types, but not Integer and Bool.

Unpacking strict fields

This is one of the most powerful techniques you can use to optimise data structures. When a constructor field is marked strict, and it is a single-constructor type, then it is possible to ask GHC to unpack the contents of the field directly in its parent. For example, given this:

 data T = T {-# UNPACK #-} !(Int,Float)

GHC will represent the type T like this:

 data T = T Int Float

eliminating the tuple. This is commonly used to put unboxed Ints directly in a constructor:

 data T = T {-# UNPACK #-} !Int

will be represented as

 data T = T Int#

where Int# is the unboxed integer type. You don't have to mention Int# in your program - just putting the {-# UNAPCK #-} directive on the field is enough to tell GHC that this is the representation you want, and GHC will eliminate the boxing.

Note that {-# UNPACK #-} isn't the default, for the reason that it isn't always a good idea. If there is a pattern match on a constructor with an unpacked field, and the value of that field is passed to a non-strict function, GHC has to re-box the value before passing it on. If this re-boxing is common, then unpacking can be slower than not unpacking. The effect can be more acute if the type being unpacked has a lot of components (eg. a 17-tuple).

For more information, see The UNPACK pragma in GHC's User Guide.

Enumerations

Enumerations don't count as single-constructor types as far as GHC is concerned, so they don't benefit from unpacking when used as strict constructor fields, or strict function arguments. This is a defficiency in GHC, but it can be worked around.

One way to work around it is to declare your enumeration as an int-like type, instead of the usual enumeration type. For example, if you have

 data Color = Red | Blue | Black

you could write

 newtype Color = Color Int deriving (Eq,Ord,Enum)
 (red:blue:black:_) = [Color 1 ..]

and now Color is a single-constructor unpackable datatype. However, we lose pattern matching; instead of

 f Red = ...

we have to write

 f color | color == red = ...

There's no performance penalty for this, though.

Also, you have to hide the representation of Color appropriately so that invalid values can't be constructed (Color 99).