Editing Maintaining laziness (section)

== Laziness breakers ==

=== Maybe, Either, Exceptions ===

Some laziness breakers are visible in type signatures:
<haskell>
decodeUTF8 :: [Word8] -> Either Message String
</haskell>
The <hask>Either</hask> type signals that the function marks decoding-failure by using the <hask>Left</hask> constructor of <hask>Either</hask>.
This function cannot be lazy, because when you access the first character of the result, it must already be computed, whether the result is <hask>Left</hask> or <hask>Right</hask>.
For this decision, the complete input must be decoded.
A better type signature is
<haskell>
decodeUTF8 :: [Word8] -> (Maybe Message, String)
</haskell> 
where the <hask>String</hask> contains as much characters as could be decoded and <hask>Maybe Message</hask> gives the reason for the stop of the decoding.
<hask>Nothing</hask> means the input was completely read, <hask>Just msg</hask> means the decoding was aborted for the reason described in <hask>msg</hask>.
If you touch the first element of the pair, the complete decodings is triggered, thus laziness is broken.
This means you should first process the <hask>String</hask> and look at <hask>Maybe Message</hask> afterwards.

Instead of the unspecific pair type you should use the special type for asynchronous exceptions as found in the [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/explicit-exception explicit exception] package.

Especially in parsers you may find a function, called Wadler's force function.
It works as follows:
<haskell>
force y =
   let Just x = y
   in  Just x
</haskell>
It looks like a complicated expression for <hask>y</hask> with an added danger of failing unrecoverably when <hask>y</hask> is not <hask>Just</hask>.
Its purpose is to use the lazy pattern matching of <hask>let</hask> and to show to the runtime system, that we expect that <hask>y</hask> is always a <hask>Just</hask>.
Then the runtime system does not need to wait until it can determine the right constructor but it can proceed immediately.
This way, a function can be made lazy, also if it returns <hask>Maybe</hask>.
It can however fail, if later it turns out, that <hask>y</hask> is actually <hask>Nothing</hask>. <!-- fail how? To be lazy? Or it is some hideous failure like 'head []'? -->

Using force-like functions is sometimes necessary, but should be avoided for data types with more than one constructor.
It is better to use an interim data type with one constructor and lift to the multi-constructor datatype when needed.
Consider parsers of type <hask>StateT [Word8] Maybe a</hask>.
Now consider the parser combinator <haskell>many :: StateT [Word8] Maybe a -> StateT [Word8] Maybe [a]</haskell>
which parses as many elements of type <hask>a</hask> as possible.
It shall be lazy and thus must be infallible and must not use the <hask>Maybe</hask>.
It shall just return an empty list, if parsing of one element fails.
A quick hack would be to define <hask>many</hask> using a <hask>force</hask> function.
It would be better to show by the type, that <hask>many</hask> cannot fail:
<haskell>many :: StateT [Word8] Maybe a -> StateT [Word8] Identity [a]</haskell>.

=== Early decision ===

==== List construction ====

Be aware that the following two expressions are not equivalent.
<haskell>
-- less lazy
if b then f x else f y
-- more lazy
f (if b then x else y)
</haskell>
It is <hask>if undefined then f x else f y</hask> is <hask>undefined</hask>, whereas <hask>f (if b then x else y)</hask> is <hask>f undefined</hask>, which is a difference in [[non-strict semantics]].
Consider e.g. <hask>if b then 'a':x else 'a':y</hask>.

It is common source of too much strictness to make decisions too early and thus duplicate code in the decision branches.
Intuitively spoken, the bad thing about [[code duplication]] (stylistic questions put aside) is, that the run-time system cannot see that in the branches, some things are equal and do it in common before the critical decision.
Actually, the compiler and run-time system could be "improved" to do so, but in order to keep things predictable, they do not do so.
Even more, this behaviour is required by theory, since by pushing decisions to the inner of an expression you change the semantics of the expression.
So we return to the question, what the programmer actually wants.

Now, do you think this expression
<haskell>
if b
  then [x]
  else y:ys
</haskell>
is maximally lazy?
It seems so, but actually it is not. In both branches we create non-empty lists, but the run-time system cannot see this.
It is <hask>null (if undefined then [x] else y:ys)</hask> again <hask>undefined</hask>, but we like to have it evaluated to <hask>False</hask>.
Here we need lazy pattern matching as provided by <hask>let</hask>.
<haskell>
let z:zs =
      if b
        then [x]
        else y:ys
in  z:zs
</haskell>
This expression always returns the constructor <hask>(:)</hask> and thus <hask>null</hask> knows that the list is not empty.
However, this is a little bit unsafe, because the <hask>let z:zs</hask> may fail if in the branches of <hask>if</hask> there is an empty list.
This error can only caught at run-time which is bad.
We can avoid it using the single constructor pair type.
<haskell>
let (z,zs) =
      if b
        then (x,[])
        else (y,ys)
in  z:zs
</haskell>
which can be abbreviated to
<haskell>
uncurry (:) (if b then (x,[]) else (y,ys))
</haskell>

Another example is the <hask>inits</hask> function.
In the Haskell 98 report the implementation
<haskell>
inits        :: [a] -> [[a]]
inits []     = [[]]
inits (x:xs) = [[]] ++ map (x:) (inits xs)
</haskell>
is suggested.
However you find that <hask>inits undefined</hask> is undefined, although <hask>inits</hask> always should return the empty list as first element.
The following implementation does exactly this:
<haskell>
inits :: [a] -> [[a]]
inits xt =
   [] :
   case xt of
      [] -> []
      x:xs -> map (x:) (inits xs)
</haskell>
See also the article on [[base cases and identities]].


==== Reader-Writer-State monad ====

I do not know whether the following example can be simplified.
In this form it occurred in a real application, namely the HTTP package.

Consider the following action of the <hask>Control.Monad.RWS</hask> which fetches a certain number of elements from a list.
The state of the monad is the input list we fetch the elements from.
The reader part provides an element which means that the input is consumed.
It is returned as singleton when the caller tries to read from a completely read input.
The writer allows to log some information, however the considered action does not output anything to the log.
<haskell>
getN :: Int -> RWS a [Int] [a] [a]
getN n =
   do input <- get
      if null input
        then asks (:[])
        else let (fetched,rest) = splitAt n input
             in  put rest >> return fetched
</haskell>
As we learned as good imperative programmers, we only call <hask>splitAt</hask> when the input is non-empty, that is, only if there is something to fetch.
This works in even more many corner cases, but not in the following one.
Although <hask>getN</hask> does obviously not log something (i.e. it does not call <hask>tell</hask>), it requires to read the input in order to find out, that nothing was logged:
<haskell>
*Test> (\(_a,_s,w) -> w) $ runRWS (getN 5) '\n' undefined
*** Exception: Prelude.undefined
</haskell>

The problem is again, that <hask>if</hask> checks the emptiness of the input, which is undefined, since the input is undefined.
Thus we must ensure, that the invoked monadic actions are run independent from the input.
Only this way, the run-time system can see that the logging stream is never touched.
We start refactoring by calling <hask>put</hask> independently from <hask>input</hask>'s content.
It works as well for empty lists, since <hask>splitAt</hask> will just return empty lists in this case.
<haskell>
getN :: Int -> RWS a [Int] [a] [a]
getN n =
   do input <- get
      let (fetched,rest) = splitAt n input
      put rest
      if null input
        then asks (:[])
        else return fetched
</haskell>
This doesn't resolve the problem. There is still a choice between <hask>asks</hask> and <hask>return</hask>.
We have to pull out <hask>ask</hask> as well.
<haskell>
getN :: Int -> RWS a [Int] [a] [a]
getN n =
   do input <- get
      let (fetched,rest) = splitAt n input
      put rest
      endOfInput <- ask
      return $
         if null input
           then [endOfInput]
           else fetched
</haskell>
Now things work as expected:
<haskell>
*Test> (\(_a,_s,w) -> w) $ runRWS (getN 5) '\n' undefined
[]
</haskell>
We learn from this example, that sometimes in Haskell it is more efficient to call functions that are not needed under some circumstances.
Always remind, that the [[Do notation considered harmful|do notation]] looks only imperative, but it is not imperative.
E.g., <hask>endOfInput</hask> is only evaluated if the end of the input is really reached.
Thus, the call <hask>ask</hask> does not mean that there is actually an action performed between <hask>put</hask> and <hask>return</hask>.


=== Strict pattern matching in a recursion ===

Consider the <hask>partition</hask> function which sorts elements, that match a predicate, into one list and the non-matching elements into another list.
This function should also work on infinite lists, but the implementation shipped with GHC up to 6.2 [http://www.haskell.org/pipermail/libraries/2004-October/002645.html failed on infinite lists].
What happened?

The reason was that pattern matching was too strict.

Let's first consider the following correct implementation:
<haskell>
partition :: (a -> Bool) -> [a] -> ([a], [a])
partition p =
   foldr
      (\x ~(y,z) ->
         if p x
           then (x : y, z)
           else (y, x : z))
      ([],[])
</haskell>
The usage of <hask>foldr</hask> seems to be reserved for advanced programmers.
Formally <hask>foldr</hask> runs from the end to the start of the list.
However, how can this work if there is a list without an end?
That can be seen when applying the definition of <hask>foldr</hask>.
<haskell>
foldr :: (a -> b -> b) -> b -> [a] -> b
foldr _ b [] = b
foldr f b (a:as) = f a (foldr f b as)
</haskell>
Now we expand this once for an infinite input list, we get
<haskell>
partition p (a:as) =
   (\ ~(y,z) -> if p a then (a:y, z) else (y, a:z)) (foldr ... ([],[]) as)
</haskell>
We see that the whether <hask>a</hask> is prepended to the first or the second list, does only depend on <hask>p a</hask>, and neither on <hask>y</hask> nor on <hask>z</hask>.
The laziness annotation <hask>~</hask> is crucial, since it tells, intuitively spoken,
that we can rely on the recursive call of <hask>foldr</hask> to return a pair and not <hask>undefined</hask>.
Omitting it, would require the evaluation of the whole input list before the first output element can be determined.
This fails for infinite lists and is inefficient for finite lists, and that was the bug in former implementations of <hask>partition</hask>.
Btw. by the expansion you also see, that it would not help to omit the tilde and apply the above 'force' trick to the 'if-then-else' expression.

However, there still remains a small laziness break:
There is an unnecessary decision between the pair constructor of the initial accumulator value <hask>([],[])</hask> and the pair constructors within the <hask>if</hask>.
This can only be avoided by applying a <hask>force</hask> function to the result of <hask>foldr</hask>:
<haskell>
partition :: (a -> Bool) -> [a] -> ([a], [a])
partition p =
   (\ ~(ys,zs) -> (ys,zs)) .
   foldr
      (\x ~(y,z) ->
         if p x
           then (x : y, z)
           else (y, x : z))
      ([],[])
</haskell>


=== List reversal ===

Any use of the list function <hask>reverse</hask> should alert you, since when you access the first element of a reversed list, then all nodes of the input list must be evaluated and stored in memory.
Think twice whether it is really needed.
The articles on [[Infinity and efficiency]] and [[List traversal]]
show how to avoid list reversal.