Difference between revisions of "Maintaining laziness"

Revision as of 23:55, 28 December 2008

One of Haskell's main features is non-strict semantics, which in is implemented by lazy evaluation in all popular Haskell compilers. However many Haskell libraries found on Hackage are implemented just as if Haskell would be a strict language. This leads to unnecessary inefficiencies, memory leaks and, we suspect, unintended semantics. In this article we want to go through some techniques on how to check lazy behaviour on functions, examples of typical constructs which break laziness without need, and finally we want to link to techniques that may yield the same effect without laziness.

Checking laziness

If you want to check whether a function is lazy enough, you may feed it with undefined values. An undefined value can be undefined, error "reason", or an infinite loop. The latter one has the advantage that it cannot be hidden by some hacks like "catching" the error in the IO monad.

Examples: Check whether filter is lazy:

filter even [0..]
filter even ([0..5] ++ undefined)

If the filter function is lazy then it keeps generating elements in the first case and it outputs a prefix of the output list, before breaking because of the undefined, in the second case.

An automated unit test can check whether infinite or corrupted input data produces correct prefixes. Those tests usually do not fail by returning False but by leading to undefined results, either explicit undefined or an infinite loop.

testFilter0 = filter even [0..100] `isPrefixOf` filter even [0..]
testFilter1 = filter even [0..100] `isPrefixOf` filter even ([0..102]++undefined)
testFilter2 = let x = filter even [0..] !! 100 in x==x
testFilter3 = let x = filter even ([0..102]++undefined) !! 50 in x==x

Laziness breakers

Maybe, Either, Exceptions

Some laziness breakers are visible in type signatures:

decodeUTF8 :: [Word8] -> Either Message String

The Either type signals that the function marks decoding failure by using the Left constructor of Either. This function cannot be lazy, because when you access the first character of the result, it must already be computed, whether the result is Left or Right. For this decision, the complete input must be decoded. A better type signature is

decodeUTF8 :: [Word8] -> (Maybe Message, String)

where the String contains as much characters as could be decoded and Maybe Message gives the reason for the stop of the decoding. Nothing means the input was completely read, Just msg means the decoding was aborted for the reason described in msg. If you touch the first element of the pair, the complete decodings is triggered, thus laziness is broken. This means you should first process the String and look at Maybe Message afterwards.

Instead of the unspecific pair type you should use the special type for asynchronous exceptions as found in the explicit exception package.

Especially in parsers you may find a function, called Wadler's force function. It works as follows:

force y =
   let Just x = y
   in  Just x

It looks like a complicated expression for y with an added danger of failing unrecoverably when y is not Just. It's purpose is to use the lazy pattern matching of let and to show to the runtime system, that we expect that y is always a Just. Then the runtime system need not to wait until it can determine the right constructor but it can proceed immediately. This way a function can be made lazy, also if it returns Maybe. It can however fail, if later it turns out, that y is actually Nothing.

Using force like functions is sometimes necessary, but should be avoided for data types with more than one constructor. It is better to use an interim data type with one constructor and lift to the multi-constructor datatype when needed. Consider parsers of type StateT [Word8] Maybe a.

Now consider the parser combinator

many :: StateT [Word8] Maybe a -> StateT [Word8] Maybe [a]

which parses as many elements of type a as possible. It shall be lazy and thus must be infallible and must not use the Maybe. It shall just return an empty list, if parsing of one element fails. A quick hack would be to define many using a force function. It would be better to show by the type, that many cannot fail:

many :: StateT [Word8] Maybe a -> StateT [Word8] Identity [a]

.

Early decision

Be aware that the following two expression are not equivalent.

-- less lazy
if b then f x else f y
-- more lazy
f (if b then x else y)

It is if undefined then f x else f y is undefined, whereas f (if b then x else y) if f undefined, which is a difference in non-strict semantics. Consider e.g. if b then 'a':x else 'a':y.

It is common source of too much strictness to make decisions too early and thus duplicate code in the decision branches. Intuitively spoken, the bad thing about code duplication (stylistic questions put aside) is, that the run-time system cannot see that in the branches some things are equal and do it in common before the critical decision. Actually, the compiler and run-time system could be "improved" to do so, but in order to keep things predictable, they do not do so. Even more, this behaviour is required by theory, since by pushing decisions to the inner of an expression you change the semantics of the expression. So we return to the question, what the programmer actually wants.

Now, do you think this expression

if b
  then [x]
  else y:ys

is maximally lazy? It seems so, but actually it is not. In both branches we create non-empty lists, but the run-time system cannot see this. It is null (if undefined then [x] else y:ys) again undefined, but we like to have it evaluated to False. Here we need lazy pattern matching as provided by let.

let z:zs =
      if b
        then [x]
        else y:ys
in  z:zs

This expression always returns the constructor (:) and thus null knows that the list is not empty. However, this is a little bit unsafe, because the let z:zs may fail if in the branches of if there is an empty list. This error can only catched at run-time which is bad. We can avoid it using the single constructor pair type.

let (z,zs) =
      if b
        then (x,[])
        else (y,ys)
in  z:zs

which can be abbreviated to

uncurry (:) (if b then (x,[]) else (y,ys))

Strict pattern matching in a recursion

Consider the partition function which sorts elements, that match a predicate, into one list and the non-matching elements into another list. This function should also work on infinite lists, but the implementation shipped with GHC up to 6.2 http://www.haskell.org/pipermail/libraries/2004-October/002645.html failed on infinite lists]. What happened? The reason was a too strict pattern matching.

Let's first consider the following correct implementation:

partition :: (a -> Bool) -> [a] -> ([a], [a])
partition p =
   foldr
      (\x ~(y,z) ->
         if p x
           then (x : y, z)
           else (y, x : z))
      ([],[])

The usage of foldr seems to be reserved for advanced programmers. Formally foldr runs from the end to the start of the list. However, how can this work if there is a list without an end? That can be seen when applying the definition of foldr.

foldr :: (a -> b -> b) -> b -> [a] -> b
foldr _ b [] = b
foldr f b (a:as) = f a (foldr f b as)

Now we expand this once for an infinite input list, we get

partition p (a:as) =
   (\ ~(y,z) -> if p a then (a:y, z) else (y, a:z)) (foldr ... ([],[]) as)

We see that the whether a is prepended to the first or the second list, does only depend on p a, and neither on y nor on z. The laziness annotation ~ is crucial, since it tells, intuitively spoken, that we can rely on the recursive call of foldr to return a pair and not undefined. Omitting it, would require the evaluation of the whole input list before the first output element can be determined. This fails for infinite lists and is inefficient for finite lists, and that was the bug in former implementations of partition. Btw. by the expansion you also see, that it would not help to omit the tilde and apply the above 'force' trick to the 'if-then-else' expression.

List reversal

Any use of the list function reverse should alert you, since when you access the first element of a reversed list, then all nodes of the input list must be evaluated and stored in memory. Think twice whether it is really needed. The article Infinity and efficiency shows how to avoid list reversal.

Alternatives

From the above issues you see that it laziness is a fragile thing. Only one moment where you do not pay attention and a function, carefully developed with laziness in mind, is no longer lazy, when you call it. The type system can almost not help you hunting laziness breakers and there is little support by debuggers. Thus detection of laziness breakers, often requires understanding of a large portion of code, which is against the idea of modularity. Maybe for your case you might prefer a different idiom, that achieves the same goals in a safer way. See e.g. the Enumerator and iteratee pattern.

@@ Line 53: / Line 53: @@
 <hask>Nothing</hask> means the input was completely read,
 <hask>Just msg</hask> means the decoding was aborted for the reason described in <hask>msg</hask>.
+If you touch the first element of the pair, the complete decodings is triggered, thus laziness is broken.
+This means you should first process the <hask>String</hask> and look at <hask>Maybe Message</hask> afterwards.
+Instead of the unspecific pair type you should use the special type for asynchronous exceptions as found in the [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/explicit-exception explicit exception] package.
-Wadler's force function
-The following looks cumbersome:
+Especially in parsers you may find a function, called Wadler's force function.
+It works as follows:
 <haskell>
+force y =
-let (Just x) = y
-in  Just x
+   let Just x = y
+   in  Just x
 </haskell>
-It looks like a complicated expression for <hask>y</hask>,
+It looks like a complicated expression for <hask>y</hask>
 with an added danger of failing unrecoverably when <hask>y</hask> is not <hask>Just</hask>.
+It's purpose is to use the lazy pattern matching of <hask>let</hask>
+and to show to the runtime system, that we expect that <hask>y</hask> is always a <hask>Just</hask>.
+Then the runtime system need not to wait until it can determine the right constructor but it can proceed immediately.
+This way a function can be made lazy, also if it returns <hask>Maybe</hask>.
+It can however fail, if later it turns out, that <hask>y</hask> is actually <hask>Nothing</hask>.
+Using force like functions is sometimes necessary,
-...
+but should be avoided for data types with more than one constructor.
+It is better to use an interim data type with one constructor and lift to the multi-constructor datatype when needed.
-parsers - leave Maybe where no Maybe is required
+Consider parsers of type <hask>StateT [Word8] Maybe a</hask>.
+Now consider the parser combinator <haskell>many :: StateT [Word8] Maybe a -> StateT [Word8] Maybe [a]</haskell>
+which parses as many elements of type <hask>a</hask> as possible.
+It shall be lazy and thus must be infallible and must not use the <hask>Maybe</hask>.
+It shall just return an empty list, if parsing of one element fails.
+A quick hack would be to define <hask>many</hask> using a <hask>force</hask> function.
+It would be better to show by the type, that <hask>many</hask> cannot fail:
+<haskell>many :: StateT [Word8] Maybe a -> StateT [Word8] Identity [a]</haskell>.
 === Early decision ===