Maintaining laziness: Difference between revisions
(hGetContents, unsafeInterleaveIO, LazyIO) |
m (New title and various other small changes to input/output section) |
||
(11 intermediate revisions by 5 users not shown) | |||
Line 1: | Line 1: | ||
One of Haskell's main features is [[non-strict semantics]], which | One of Haskell's main features is [[non-strict semantics]], which is implemented by [[lazy evaluation]] in all popular Haskell compilers. | ||
However many Haskell libraries found on [[Hackage]] are implemented | However many Haskell libraries found on [[Hackage]] are implemented as if Haskell were a strict language. | ||
This leads to unnecessary inefficiencies, [[memory leak]]s and, we suspect, unintended semantics. | This leads to unnecessary inefficiencies, [[memory leak]]s and, we suspect, unintended semantics. | ||
In this article we want to go through some techniques on how to check lazy behaviour on functions, | In this article we want to go through some techniques on how to check lazy behaviour on functions, examples of typical constructs which break laziness without need, and finally we want to link to techniques that may yield the same effect without laziness. | ||
examples of typical constructs which break laziness without need, | |||
and finally we want to link to techniques that may yield the same effect without laziness. | |||
== Checking laziness == | == Checking laziness == | ||
=== manual checks === | |||
If you want to check whether a function is lazy enough, you may feed it with undefined values. | If you want to check whether a function is lazy enough, you may feed it with undefined values. | ||
Line 18: | Line 18: | ||
filter even ([0..5] ++ undefined) | filter even ([0..5] ++ undefined) | ||
</haskell> | </haskell> | ||
If the <hask>filter</hask> function is lazy | If the <hask>filter</hask> function is lazy then it keeps generating elements in the first case and it outputs a prefix of the output list, before breaking because of the undefined, in the second case. | ||
then it keeps generating elements in the first case | |||
and it outputs a prefix of the output list, before breaking because of the undefined, in the second case. | |||
An automated unit test can check whether infinite or corrupted input data produces correct prefixes. | An automated unit test can check whether infinite or corrupted input data produces correct prefixes. | ||
Those tests usually do not fail by returning <hask>False</hask> but by leading to undefined results, | Those tests usually do not fail by returning <hask>False</hask> but by leading to undefined results, either explicit <hask>undefined</hask> or an infinite loop. | ||
either explicit <hask>undefined</hask> or an infinite loop. | |||
<haskell> | <haskell> | ||
testFilter0 = filter even [0..100] `isPrefixOf` filter even [0..] | testFilter0 = filter even [0..100] `isPrefixOf` filter even [0..] | ||
Line 30: | Line 27: | ||
testFilter2 = let x = filter even [0..] !! 100 in x==x | testFilter2 = let x = filter even [0..] !! 100 in x==x | ||
testFilter3 = let x = filter even ([0..102]++undefined) !! 50 in x==x | testFilter3 = let x = filter even ([0..102]++undefined) !! 50 in x==x | ||
</haskell> | |||
=== automated checks === | |||
If you are lazy when searching for laziness breakers, you may use the automated tool [http://www.cs.kent.ac.uk/people/staff/oc/StrictCheck/StrictCheck.hs StrictCheck]. | |||
<haskell> | |||
*StrictCheck> test1 10 (unzip :: [(Int,Int)] -> ([Int],[Int])) | |||
Function seems not to be least strict. | |||
Input(s): _|_ | |||
Current output: _|_ | |||
Proposed output: (_|_, _|_) | |||
Continue? | |||
</haskell> | </haskell> | ||
Line 41: | Line 51: | ||
decodeUTF8 :: [Word8] -> Either Message String | decodeUTF8 :: [Word8] -> Either Message String | ||
</haskell> | </haskell> | ||
The <hask>Either</hask> type signals that the function marks decoding failure by using the <hask>Left</hask> constructor of <hask>Either</hask>. | The <hask>Either</hask> type signals that the function marks decoding-failure by using the <hask>Left</hask> constructor of <hask>Either</hask>. | ||
This function cannot be lazy, because when you access the first character of the result, | This function cannot be lazy, because when you access the first character of the result, it must already be computed, whether the result is <hask>Left</hask> or <hask>Right</hask>. | ||
it must already be computed, whether the result is <hask>Left</hask> or <hask>Right</hask>. | |||
For this decision, the complete input must be decoded. | For this decision, the complete input must be decoded. | ||
A better type signature is | A better type signature is | ||
<haskell> | <haskell> | ||
decodeUTF8 :: [Word8] -> (Maybe Message, String) | decodeUTF8 :: [Word8] -> (Maybe Message, String) | ||
</haskell> | </haskell> | ||
where the <hask>String</hask> contains as much characters as could be decoded | where the <hask>String</hask> contains as much characters as could be decoded and <hask>Maybe Message</hask> gives the reason for the stop of the decoding. | ||
and <hask>Maybe Message</hask> gives the reason for the stop of the decoding. | <hask>Nothing</hask> means the input was completely read, <hask>Just msg</hask> means the decoding was aborted for the reason described in <hask>msg</hask>. | ||
<hask>Nothing</hask> means the input was completely read, | |||
<hask>Just msg</hask> means the decoding was aborted for the reason described in <hask>msg</hask>. | |||
If you touch the first element of the pair, the complete decodings is triggered, thus laziness is broken. | If you touch the first element of the pair, the complete decodings is triggered, thus laziness is broken. | ||
This means you should first process the <hask>String</hask> and look at <hask>Maybe Message</hask> afterwards. | This means you should first process the <hask>String</hask> and look at <hask>Maybe Message</hask> afterwards. | ||
Instead of the unspecific pair type you should use the special type for asynchronous exceptions as found in the [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/explicit-exception explicit exception] package. | Instead of the unspecific pair type you should use the special type for asynchronous exceptions as found in the [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/explicit-exception explicit exception] package. | ||
Especially in parsers you may find a function, called Wadler's force function. | Especially in parsers you may find a function, called Wadler's force function. | ||
Line 66: | Line 72: | ||
in Just x | in Just x | ||
</haskell> | </haskell> | ||
It looks like a complicated expression for <hask>y</hask> | It looks like a complicated expression for <hask>y</hask> with an added danger of failing unrecoverably when <hask>y</hask> is not <hask>Just</hask>. | ||
with an added danger of failing unrecoverably when <hask>y</hask> is not <hask>Just</hask>. | Its purpose is to use the lazy pattern matching of <hask>let</hask> and to show to the runtime system, that we expect that <hask>y</hask> is always a <hask>Just</hask>. | ||
Its purpose is to use the lazy pattern matching of <hask>let</hask> | Then the runtime system does not need to wait until it can determine the right constructor but it can proceed immediately. | ||
and to show to the runtime system, that we expect that <hask>y</hask> is always a <hask>Just</hask>. | This way, a function can be made lazy, also if it returns <hask>Maybe</hask>. | ||
Then the runtime system need | |||
This way a function can be made lazy, also if it returns <hask>Maybe</hask>. | |||
It can however fail, if later it turns out, that <hask>y</hask> is actually <hask>Nothing</hask>. <!-- fail how? To be lazy? Or it is some hideous failure like 'head []'? --> | It can however fail, if later it turns out, that <hask>y</hask> is actually <hask>Nothing</hask>. <!-- fail how? To be lazy? Or it is some hideous failure like 'head []'? --> | ||
Using force like functions is sometimes necessary, | Using force-like functions is sometimes necessary, but should be avoided for data types with more than one constructor. | ||
but should be avoided for data types with more than one constructor. | |||
It is better to use an interim data type with one constructor and lift to the multi-constructor datatype when needed. | It is better to use an interim data type with one constructor and lift to the multi-constructor datatype when needed. | ||
Consider parsers of type <hask>StateT [Word8] Maybe a</hask>. | Consider parsers of type <hask>StateT [Word8] Maybe a</hask>. | ||
Line 97: | Line 100: | ||
f (if b then x else y) | f (if b then x else y) | ||
</haskell> | </haskell> | ||
It is <hask>if undefined then f x else f y</hask> is <hask>undefined</hask>, | It is <hask>if undefined then f x else f y</hask> is <hask>undefined</hask>, whereas <hask>f (if b then x else y)</hask> is <hask>f undefined</hask>, which is a difference in [[non-strict semantics]]. | ||
whereas <hask>f (if b then x else y)</hask> is <hask>f undefined</hask>, | |||
which is a difference in [[non-strict semantics]]. | |||
Consider e.g. <hask>if b then 'a':x else 'a':y</hask>. | Consider e.g. <hask>if b then 'a':x else 'a':y</hask>. | ||
It is common source of too much strictness to make decisions too early and thus duplicate code in the decision branches. | It is common source of too much strictness to make decisions too early and thus duplicate code in the decision branches. | ||
Intuitively spoken, the bad thing about [[code duplication]] (stylistic questions put aside) is, | Intuitively spoken, the bad thing about [[code duplication]] (stylistic questions put aside) is, that the run-time system cannot see that in the branches, some things are equal and do it in common before the critical decision. | ||
that the run-time system cannot see that in the branches some things are equal and do it in common before the critical decision. | |||
Actually, the compiler and run-time system could be "improved" to do so, but in order to keep things predictable, they do not do so. | Actually, the compiler and run-time system could be "improved" to do so, but in order to keep things predictable, they do not do so. | ||
Even more, this behaviour is required by theory, | Even more, this behaviour is required by theory, since by pushing decisions to the inner of an expression you change the semantics of the expression. | ||
since by pushing decisions to the inner of an expression you change the semantics of the expression. | |||
So we return to the question, what the programmer actually wants. | So we return to the question, what the programmer actually wants. | ||
Line 118: | Line 117: | ||
is maximally lazy? | is maximally lazy? | ||
It seems so, but actually it is not. In both branches we create non-empty lists, but the run-time system cannot see this. | It seems so, but actually it is not. In both branches we create non-empty lists, but the run-time system cannot see this. | ||
It is <hask>null (if undefined then [x] else y:ys)</hask> again <hask>undefined</hask>, | It is <hask>null (if undefined then [x] else y:ys)</hask> again <hask>undefined</hask>, but we like to have it evaluated to <hask>False</hask>. | ||
but we like to have it evaluated to <hask>False</hask>. | |||
Here we need lazy pattern matching as provided by <hask>let</hask>. | Here we need lazy pattern matching as provided by <hask>let</hask>. | ||
<haskell> | <haskell> | ||
Line 143: | Line 141: | ||
uncurry (:) (if b then (x,[]) else (y,ys)) | uncurry (:) (if b then (x,[]) else (y,ys)) | ||
</haskell> | </haskell> | ||
Another example is the <hask>inits</hask> function. | Another example is the <hask>inits</hask> function. | ||
Line 153: | Line 150: | ||
</haskell> | </haskell> | ||
is suggested. | is suggested. | ||
However you find that <hask>inits undefined</hask> is undefined, | However you find that <hask>inits undefined</hask> is undefined, although <hask>inits</hask> always should return the empty list as first element. | ||
although <hask>inits</hask> always should return the empty list as first element. | |||
The following implementation does exactly this: | The following implementation does exactly this: | ||
<haskell> | <haskell> | ||
Line 170: | Line 166: | ||
I do not know whether the following example can be simplified. | I do not know whether the following example can be simplified. | ||
In this form it | In this form it occurred in a real application, namely the HTTP package. | ||
Consider the following action of the <hask>Control.Monad.RWS</hask> which fetches a certain number of elements from a list. | Consider the following action of the <hask>Control.Monad.RWS</hask> which fetches a certain number of elements from a list. | ||
Line 176: | Line 172: | ||
The reader part provides an element which means that the input is consumed. | The reader part provides an element which means that the input is consumed. | ||
It is returned as singleton when the caller tries to read from a completely read input. | It is returned as singleton when the caller tries to read from a completely read input. | ||
The writer allows to log some information, however the considered action does not output | The writer allows to log some information, however the considered action does not output anything to the log. | ||
<haskell> | <haskell> | ||
getN :: Int -> RWS a [Int] [a] [a] | getN :: Int -> RWS a [Int] [a] [a] | ||
Line 186: | Line 182: | ||
in put rest >> return fetched | in put rest >> return fetched | ||
</haskell> | </haskell> | ||
As we learned as good imperative programmers, we only call <hask>splitAt</hask> when the input is non-empty, | As we learned as good imperative programmers, we only call <hask>splitAt</hask> when the input is non-empty, that is, only if there is something to fetch. | ||
that is, only if there is something to fetch. | This works in even more many corner cases, but not in the following one. | ||
This works even more many corner cases, but not in the following one. | Although <hask>getN</hask> does obviously not log something (i.e. it does not call <hask>tell</hask>), it requires to read the input in order to find out, that nothing was logged: | ||
Although <hask>getN</hask> does obviously not log something (i.e. it does not call <hask>tell</hask>), | |||
it requires to read the input in order to find out, that nothing was logged: | |||
<haskell> | <haskell> | ||
*Test> (\(_a,_s,w) -> w) $ runRWS (getN 5) '\n' undefined | *Test> (\(_a,_s,w) -> w) $ runRWS (getN 5) '\n' undefined | ||
Line 196: | Line 190: | ||
</haskell> | </haskell> | ||
The problem is again, that <hask>if</hask> checks the emptiness of the input, | The problem is again, that <hask>if</hask> checks the emptiness of the input, which is undefined, since the input is undefined. | ||
which is undefined, since the input is undefined. | |||
Thus we must ensure, that the invoked monadic actions are run independent from the input. | Thus we must ensure, that the invoked monadic actions are run independent from the input. | ||
Only this way, the run-time system can see that the logging stream is never touched. | Only this way, the run-time system can see that the logging stream is never touched. | ||
Line 240: | Line 233: | ||
Consider the <hask>partition</hask> function which sorts elements, that match a predicate, into one list and the non-matching elements into another list. | Consider the <hask>partition</hask> function which sorts elements, that match a predicate, into one list and the non-matching elements into another list. | ||
This function should also work on infinite lists, | This function should also work on infinite lists, but the implementation shipped with GHC up to 6.2 [http://www.haskell.org/pipermail/libraries/2004-October/002645.html failed on infinite lists]. | ||
but the implementation shipped with GHC up to 6.2 [http://www.haskell.org/pipermail/libraries/2004-October/002645.html failed on infinite lists]. | |||
What happened? | What happened? | ||
The reason was too strict | The reason was that pattern matching was too strict. | ||
Let's first consider the following correct implementation: | Let's first consider the following correct implementation: | ||
Line 271: | Line 263: | ||
(\ ~(y,z) -> if p a then (a:y, z) else (y, a:z)) (foldr ... ([],[]) as) | (\ ~(y,z) -> if p a then (a:y, z) else (y, a:z)) (foldr ... ([],[]) as) | ||
</haskell> | </haskell> | ||
We see that the whether <hask>a</hask> is prepended to the first or the second list, | We see that the whether <hask>a</hask> is prepended to the first or the second list, does only depend on <hask>p a</hask>, and neither on <hask>y</hask> nor on <hask>z</hask>. | ||
does only depend on <hask>p a</hask>, and neither on <hask>y</hask> nor on <hask>z</hask>. | |||
The laziness annotation <hask>~</hask> is crucial, since it tells, intuitively spoken, | The laziness annotation <hask>~</hask> is crucial, since it tells, intuitively spoken, | ||
that we can rely on the recursive call of <hask>foldr</hask> to return a pair and not <hask>undefined</hask>. | that we can rely on the recursive call of <hask>foldr</hask> to return a pair and not <hask>undefined</hask>. | ||
Line 278: | Line 269: | ||
This fails for infinite lists and is inefficient for finite lists, and that was the bug in former implementations of <hask>partition</hask>. | This fails for infinite lists and is inefficient for finite lists, and that was the bug in former implementations of <hask>partition</hask>. | ||
Btw. by the expansion you also see, that it would not help to omit the tilde and apply the above 'force' trick to the 'if-then-else' expression. | Btw. by the expansion you also see, that it would not help to omit the tilde and apply the above 'force' trick to the 'if-then-else' expression. | ||
However, there still remains a small laziness break: | |||
There is an unnecessary decision between the pair constructor of the initial accumulator value <hask>([],[])</hask> and the pair constructors within the <hask>if</hask>. | |||
This can only be avoided by applying a <hask>force</hask> function to the result of <hask>foldr</hask>: | |||
<haskell> | |||
partition :: (a -> Bool) -> [a] -> ([a], [a]) | |||
partition p = | |||
(\ ~(ys,zs) -> (ys,zs)) . | |||
foldr | |||
(\x ~(y,z) -> | |||
if p x | |||
then (x : y, z) | |||
else (y, x : z)) | |||
([],[]) | |||
</haskell> | |||
=== List reversal === | === List reversal === | ||
Any use of the list function <hask>reverse</hask> should alert you, | Any use of the list function <hask>reverse</hask> should alert you, since when you access the first element of a reversed list, then all nodes of the input list must be evaluated and stored in memory. | ||
since when you access the first element of a reversed list, then all nodes of the input list must be evaluated and stored in memory. | |||
Think twice whether it is really needed. | Think twice whether it is really needed. | ||
The | The articles on [[Infinity and efficiency]] and [[List traversal]] | ||
show how to avoid list reversal. | |||
== | == Lazy input and output == | ||
[https://arxiv.org/pdf/1907.11354 Like in SWI-Prolog], lists can can be used for lazy I/O. | |||
So for ordinary functions, lazily writing data to output is usually not a problem. | |||
Consider the | But in Haskell, lazily reading data as input can require some sort of internal I/O hack and thus caution. | ||
Consider the simple program: | |||
<haskell> | <haskell> | ||
readFile "source" >>= writeFile "target" | readFile "source" >>= writeFile "target" | ||
</haskell> | </haskell> | ||
which copies the file <code>source</code> to the file <code>target</code> with constant memory consumption, | |||
since <hask>readFile</hask> reads the data lazily and <hask>writeFile</hask> writes it as it comes in. | which copies the file <code>source</code> to the file <code>target</code> with constant memory consumption, since <hask>readFile</hask> reads the data lazily and <hask>writeFile</hask> writes it as it comes in. | ||
However it fails badly, when a file | However it fails badly, when a file is to be updated in-place: | ||
<haskell> | <haskell> | ||
readFile "text" >>= writeFile "text" . map toUpper | readFile "text" >>= writeFile "text" . map toUpper | ||
</haskell> | </haskell> | ||
This would work | |||
that is it would read the file contents to memory before | This would only work if <hask>readFile</hask> was strict, that is it would read the file contents in full to memory before passing that data (as a <code>String</code>) back to its caller. | ||
The function <hask>readFile</hask> needs certain hacks: | The function <hask>readFile</hask> needs certain hacks: | ||
* The function <hask>unsafeInterleaveIO</hask> is needed for | * The function <hask>unsafeInterleaveIO</hask> is needed for deferring the calls to <hask>hGetChar</hask> until the characters are actually needed. | ||
* Exceptions that occur while reading the file are raised in the code that writes the result of processing the file content to somewhere | * Exceptions, that occur while reading the file, are raised in the code that writes the result of processing the file content to somewhere i.e. the exceptions produced by <hask>readFile</hask> can occur in code that has nothing to do with file reading and there is no warning, that they might occur there. The [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/explicit-exception explicit exception] package can also help here, by making the reason for the stop of the file read explicit. Exceptions must still be handled in code, that does not read the file, but the fact that they are explicit helps you to not forget it. | ||
* The file must be closed after it is no longer needed. The documentation says, that the file is put into a semi-closed state. | * The file must be closed after it is no longer needed. The documentation says, that the file is put into a semi-closed state. One possibility is that weak references are used, allowing the [[garbage collector]] close the file, once all other references to data in the file have been revoked. However, the garbage collector never works immediately, but in phases. It may be that the file remains open for a long time, maybe until the program exits. The <hask>Data.ByteString.Lazy.readFile</hask> function explicitly closes the file after the last byte is read. The advantage is, that the file is closed immediately. The disadvantage is, that the file is not closed at all, when not all bytes are read. E.g. if a parser encounters a parse error, it has to read the rest of the file anyway, in order to get it closed. | ||
A function | A function that handles the closing of the file for you is <hask>System.IO.withFile</hask>. | ||
You can use it like | You can use it like this: | ||
<haskell> | <haskell> | ||
withFile "source" ReadMode $ \h -> | withFile "source" ReadMode $ \h -> | ||
hGetLine h >>= putStrLn | hGetLine h >>= putStrLn | ||
</haskell> | </haskell> | ||
After the actions inside the <hask>withFile</hask> call, the file is closed. | After the actions inside the <hask>withFile</hask> call, the file is closed. | ||
However this is dangerous: | However this is dangerous: | ||
If you | * If you attempt to lazily read contents from the file out of <hask>withFile</hask>, the file is closed before the data is actually read. | ||
the file is closed before the data is actually read. | |||
Thus, although <hask>withFile "source" ReadMode hGetContents</hask> looks like <hask>readFile</hask>, | Thus, although <hask>withFile "source" ReadMode hGetContents</hask> looks like <hask>readFile</hask>, it is very different: it does not work. | ||
it is very different: | |||
How can you implement a function like <hask>hGetContents</hask> by yourselves? | How can you implement a function like <hask>hGetContents</hask> by yourselves? | ||
Line 326: | Line 334: | ||
However, calling <hask>unsafeInterleaveIO hGetChar</hask> many times would not work, | However, calling <hask>unsafeInterleaveIO hGetChar</hask> many times would not work, | ||
because the order must be preserved. | because the order must be preserved. | ||
For example, in: | |||
<haskell> | |||
hGetContents h >>= putStrLn . drop 10 | |||
</haskell> | |||
the first ten characters from the file are not needed, | |||
but <hask>hGetChar</hask> must be called for the first 10 characters anyway in order to increment the file position. | but <hask>hGetChar</hask> must be called for the first 10 characters anyway in order to increment the file position. | ||
This is achieved by not calling <hask>unsafeInterleaveIO</hask> on <hask>hGetChar</hask> but on the list constructor. | This is achieved by not calling <hask>unsafeInterleaveIO</hask> on <hask>hGetChar</hask> but on the list constructor. | ||
The implementation of <hask>hGetContents</hask> | The implementation of <hask>hGetContents</hask> resembles: | ||
<haskell> | <haskell> | ||
hGetContents h = | hGetContents h = | ||
Line 335: | Line 348: | ||
in go | in go | ||
</haskell> | </haskell> | ||
In contrast to the standard <hask>hGetContents</hask>, this implementation does not close the file | In contrast to the standard <hask>hGetContents</hask>, this implementation does not close the file | ||
(by the way, it does even not handle the end of the file), | (by the way, it does even not handle the end of the file), but the advantage of not relying on some automatism to close the file somewhen is, that you can close the file immediately after you stopped processing its content. | ||
but the advantage of not relying on some automatism to close the file somewhen is, | |||
that you can close the file immediately after you stopped processing its content. | |||
The disadvantage is that you must not forget to close the file and must do it only once. | The disadvantage is that you must not forget to close the file and must do it only once. | ||
So far we have only considered lazy | So far we have only considered lazy reading. | ||
It might also be necessary to trigger write actions when fetching data. | It might also be necessary to trigger write actions when fetching data. | ||
Consider a server-client interaction, where data can only be read, when a request was | Consider a server-client interaction, where data can only be read, when a request was sent before. | ||
It would be | It would be useful if the request is triggered by reading the result from the server. | ||
Such interactions can be programmed using the | Such interactions can be programmed using the {{HackagePackage|id=lazyio}} package. | ||
== Alternatives == | == Alternatives == | ||
From the above issues you see that laziness is a fragile thing. | From the above issues, you see that laziness is a fragile thing. | ||
Make one mistake and a function, carefully developed with laziness in mind, is no longer lazy. | Make one mistake and a function, carefully developed with laziness in mind, is no longer lazy. | ||
The type system will rarely help you hunting laziness breakers, and there is little support by debuggers. | The type system will rarely help you hunting laziness breakers, and there is little support by debuggers. | ||
Thus detecting laziness breakers | Thus, detecting laziness breakers often requires understanding a large portion of code, which is against the idea of modularity. | ||
<!-- ... and knowledge about low-level details of compilation or the runtime system. | <!-- ... and knowledge about low-level details of compilation or the runtime system. | ||
In principle that's not true. | In principle that's not true. | ||
I have argued with how the runtime system may work, | I have argued with how the runtime system may work, but it all follows from the non-strict semantics. | ||
but it all follows from the non-strict semantics. | |||
--> | --> | ||
Maybe for your case you will prefer a different idiom, that achieves the same goals in a safer way. See e.g. the [[Enumerator and iteratee]] pattern. | Maybe for your case you will prefer a different idiom, that achieves the same goals in a safer way. See e.g. the [[Enumerator and iteratee]] pattern. | ||
== See also == | |||
* Haskell-Cafe on [http://www.haskell.org/pipermail/haskell-cafe/2009-January/052940.html Maintaining laziness] | |||
* Haskell-Cafe on [http://www.haskell.org/pipermail/haskell-cafe/2009-January/053966.html How to make code least strict?] | |||
* Blog post [http://conal.net/blog/posts/lazier-function-definitions-by-merging-partial-values/ Lazier function definitions by merging partial values] by Conal Elliott | |||
* [[Laziness is not always good]] | |||
[[Category:Idioms]] | [[Category:Idioms]] |
Latest revision as of 12:02, 19 April 2024
One of Haskell's main features is non-strict semantics, which is implemented by lazy evaluation in all popular Haskell compilers. However many Haskell libraries found on Hackage are implemented as if Haskell were a strict language. This leads to unnecessary inefficiencies, memory leaks and, we suspect, unintended semantics. In this article we want to go through some techniques on how to check lazy behaviour on functions, examples of typical constructs which break laziness without need, and finally we want to link to techniques that may yield the same effect without laziness.
Checking laziness
manual checks
If you want to check whether a function is lazy enough, you may feed it with undefined values.
An undefined value can be undefined
, error "reason"
, or an infinite loop.
The latter one has the advantage that it cannot be hidden by some hacks like "catching" the error in the IO monad.
Examples:
Check whether filter
is lazy:
filter even [0..]
filter even ([0..5] ++ undefined)
If the filter
function is lazy then it keeps generating elements in the first case and it outputs a prefix of the output list, before breaking because of the undefined, in the second case.
An automated unit test can check whether infinite or corrupted input data produces correct prefixes.
Those tests usually do not fail by returning False
but by leading to undefined results, either explicit undefined
or an infinite loop.
testFilter0 = filter even [0..100] `isPrefixOf` filter even [0..]
testFilter1 = filter even [0..100] `isPrefixOf` filter even ([0..102]++undefined)
testFilter2 = let x = filter even [0..] !! 100 in x==x
testFilter3 = let x = filter even ([0..102]++undefined) !! 50 in x==x
automated checks
If you are lazy when searching for laziness breakers, you may use the automated tool StrictCheck.
*StrictCheck> test1 10 (unzip :: [(Int,Int)] -> ([Int],[Int]))
Function seems not to be least strict.
Input(s): _|_
Current output: _|_
Proposed output: (_|_, _|_)
Continue?
Laziness breakers
Maybe, Either, Exceptions
Some laziness breakers are visible in type signatures:
decodeUTF8 :: [Word8] -> Either Message String
The Either
type signals that the function marks decoding-failure by using the Left
constructor of Either
.
This function cannot be lazy, because when you access the first character of the result, it must already be computed, whether the result is Left
or Right
.
For this decision, the complete input must be decoded.
A better type signature is
decodeUTF8 :: [Word8] -> (Maybe Message, String)
where the String
contains as much characters as could be decoded and Maybe Message
gives the reason for the stop of the decoding.
Nothing
means the input was completely read, Just msg
means the decoding was aborted for the reason described in msg
.
If you touch the first element of the pair, the complete decodings is triggered, thus laziness is broken.
This means you should first process the String
and look at Maybe Message
afterwards.
Instead of the unspecific pair type you should use the special type for asynchronous exceptions as found in the explicit exception package.
Especially in parsers you may find a function, called Wadler's force function. It works as follows:
force y =
let Just x = y
in Just x
It looks like a complicated expression for y
with an added danger of failing unrecoverably when y
is not Just
.
Its purpose is to use the lazy pattern matching of let
and to show to the runtime system, that we expect that y
is always a Just
.
Then the runtime system does not need to wait until it can determine the right constructor but it can proceed immediately.
This way, a function can be made lazy, also if it returns Maybe
.
It can however fail, if later it turns out, that y
is actually Nothing
.
Using force-like functions is sometimes necessary, but should be avoided for data types with more than one constructor.
It is better to use an interim data type with one constructor and lift to the multi-constructor datatype when needed.
Consider parsers of type StateT [Word8] Maybe a
.
many :: StateT [Word8] Maybe a -> StateT [Word8] Maybe [a]
which parses as many elements of type a
as possible.
It shall be lazy and thus must be infallible and must not use the Maybe
.
It shall just return an empty list, if parsing of one element fails.
A quick hack would be to define many
using a force
function.
It would be better to show by the type, that many
cannot fail:
many :: StateT [Word8] Maybe a -> StateT [Word8] Identity [a]
Early decision
List construction
Be aware that the following two expressions are not equivalent.
-- less lazy
if b then f x else f y
-- more lazy
f (if b then x else y)
It is if undefined then f x else f y
is undefined
, whereas f (if b then x else y)
is f undefined
, which is a difference in non-strict semantics.
Consider e.g. if b then 'a':x else 'a':y
.
It is common source of too much strictness to make decisions too early and thus duplicate code in the decision branches. Intuitively spoken, the bad thing about code duplication (stylistic questions put aside) is, that the run-time system cannot see that in the branches, some things are equal and do it in common before the critical decision. Actually, the compiler and run-time system could be "improved" to do so, but in order to keep things predictable, they do not do so. Even more, this behaviour is required by theory, since by pushing decisions to the inner of an expression you change the semantics of the expression. So we return to the question, what the programmer actually wants.
Now, do you think this expression
if b
then [x]
else y:ys
is maximally lazy?
It seems so, but actually it is not. In both branches we create non-empty lists, but the run-time system cannot see this.
It is null (if undefined then [x] else y:ys)
again undefined
, but we like to have it evaluated to False
.
Here we need lazy pattern matching as provided by let
.
let z:zs =
if b
then [x]
else y:ys
in z:zs
This expression always returns the constructor (:)
and thus null
knows that the list is not empty.
However, this is a little bit unsafe, because the let z:zs
may fail if in the branches of if
there is an empty list.
This error can only caught at run-time which is bad.
We can avoid it using the single constructor pair type.
let (z,zs) =
if b
then (x,[])
else (y,ys)
in z:zs
which can be abbreviated to
uncurry (:) (if b then (x,[]) else (y,ys))
Another example is the inits
function.
In the Haskell 98 report the implementation
inits :: [a] -> [[a]]
inits [] = [[]]
inits (x:xs) = [[]] ++ map (x:) (inits xs)
is suggested.
However you find that inits undefined
is undefined, although inits
always should return the empty list as first element.
The following implementation does exactly this:
inits :: [a] -> [[a]]
inits xt =
[] :
case xt of
[] -> []
x:xs -> map (x:) (inits xs)
See also the article on base cases and identities.
Reader-Writer-State monad
I do not know whether the following example can be simplified. In this form it occurred in a real application, namely the HTTP package.
Consider the following action of the Control.Monad.RWS
which fetches a certain number of elements from a list.
The state of the monad is the input list we fetch the elements from.
The reader part provides an element which means that the input is consumed.
It is returned as singleton when the caller tries to read from a completely read input.
The writer allows to log some information, however the considered action does not output anything to the log.
getN :: Int -> RWS a [Int] [a] [a]
getN n =
do input <- get
if null input
then asks (:[])
else let (fetched,rest) = splitAt n input
in put rest >> return fetched
As we learned as good imperative programmers, we only call splitAt
when the input is non-empty, that is, only if there is something to fetch.
This works in even more many corner cases, but not in the following one.
Although getN
does obviously not log something (i.e. it does not call tell
), it requires to read the input in order to find out, that nothing was logged:
*Test> (\(_a,_s,w) -> w) $ runRWS (getN 5) '\n' undefined
*** Exception: Prelude.undefined
The problem is again, that if
checks the emptiness of the input, which is undefined, since the input is undefined.
Thus we must ensure, that the invoked monadic actions are run independent from the input.
Only this way, the run-time system can see that the logging stream is never touched.
We start refactoring by calling put
independently from input
's content.
It works as well for empty lists, since splitAt
will just return empty lists in this case.
getN :: Int -> RWS a [Int] [a] [a]
getN n =
do input <- get
let (fetched,rest) = splitAt n input
put rest
if null input
then asks (:[])
else return fetched
This doesn't resolve the problem. There is still a choice between asks
and return
.
We have to pull out ask
as well.
getN :: Int -> RWS a [Int] [a] [a]
getN n =
do input <- get
let (fetched,rest) = splitAt n input
put rest
endOfInput <- ask
return $
if null input
then [endOfInput]
else fetched
Now things work as expected:
*Test> (\(_a,_s,w) -> w) $ runRWS (getN 5) '\n' undefined
[]
We learn from this example, that sometimes in Haskell it is more efficient to call functions that are not needed under some circumstances.
Always remind, that the do notation looks only imperative, but it is not imperative.
E.g., endOfInput
is only evaluated if the end of the input is really reached.
Thus, the call ask
does not mean that there is actually an action performed between put
and return
.
Strict pattern matching in a recursion
Consider the partition
function which sorts elements, that match a predicate, into one list and the non-matching elements into another list.
This function should also work on infinite lists, but the implementation shipped with GHC up to 6.2 failed on infinite lists.
What happened?
The reason was that pattern matching was too strict.
Let's first consider the following correct implementation:
partition :: (a -> Bool) -> [a] -> ([a], [a])
partition p =
foldr
(\x ~(y,z) ->
if p x
then (x : y, z)
else (y, x : z))
([],[])
The usage of foldr
seems to be reserved for advanced programmers.
Formally foldr
runs from the end to the start of the list.
However, how can this work if there is a list without an end?
That can be seen when applying the definition of foldr
.
foldr :: (a -> b -> b) -> b -> [a] -> b
foldr _ b [] = b
foldr f b (a:as) = f a (foldr f b as)
Now we expand this once for an infinite input list, we get
partition p (a:as) =
(\ ~(y,z) -> if p a then (a:y, z) else (y, a:z)) (foldr ... ([],[]) as)
We see that the whether a
is prepended to the first or the second list, does only depend on p a
, and neither on y
nor on z
.
The laziness annotation ~
is crucial, since it tells, intuitively spoken,
that we can rely on the recursive call of foldr
to return a pair and not undefined
.
Omitting it, would require the evaluation of the whole input list before the first output element can be determined.
This fails for infinite lists and is inefficient for finite lists, and that was the bug in former implementations of partition
.
Btw. by the expansion you also see, that it would not help to omit the tilde and apply the above 'force' trick to the 'if-then-else' expression.
However, there still remains a small laziness break:
There is an unnecessary decision between the pair constructor of the initial accumulator value ([],[])
and the pair constructors within the if
.
This can only be avoided by applying a force
function to the result of foldr
:
partition :: (a -> Bool) -> [a] -> ([a], [a])
partition p =
(\ ~(ys,zs) -> (ys,zs)) .
foldr
(\x ~(y,z) ->
if p x
then (x : y, z)
else (y, x : z))
([],[])
List reversal
Any use of the list function reverse
should alert you, since when you access the first element of a reversed list, then all nodes of the input list must be evaluated and stored in memory.
Think twice whether it is really needed.
The articles on Infinity and efficiency and List traversal
show how to avoid list reversal.
Lazy input and output
Like in SWI-Prolog, lists can can be used for lazy I/O. So for ordinary functions, lazily writing data to output is usually not a problem. But in Haskell, lazily reading data as input can require some sort of internal I/O hack and thus caution. Consider the simple program:
readFile "source" >>= writeFile "target"
which copies the file source
to the file target
with constant memory consumption, since readFile
reads the data lazily and writeFile
writes it as it comes in.
However it fails badly, when a file is to be updated in-place:
readFile "text" >>= writeFile "text" . map toUpper
This would only work if readFile
was strict, that is it would read the file contents in full to memory before passing that data (as a String
) back to its caller.
The function readFile
needs certain hacks:
- The function
unsafeInterleaveIO
is needed for deferring the calls tohGetChar
until the characters are actually needed. - Exceptions, that occur while reading the file, are raised in the code that writes the result of processing the file content to somewhere i.e. the exceptions produced by
readFile
can occur in code that has nothing to do with file reading and there is no warning, that they might occur there. The explicit exception package can also help here, by making the reason for the stop of the file read explicit. Exceptions must still be handled in code, that does not read the file, but the fact that they are explicit helps you to not forget it. - The file must be closed after it is no longer needed. The documentation says, that the file is put into a semi-closed state. One possibility is that weak references are used, allowing the garbage collector close the file, once all other references to data in the file have been revoked. However, the garbage collector never works immediately, but in phases. It may be that the file remains open for a long time, maybe until the program exits. The
Data.ByteString.Lazy.readFile
function explicitly closes the file after the last byte is read. The advantage is, that the file is closed immediately. The disadvantage is, that the file is not closed at all, when not all bytes are read. E.g. if a parser encounters a parse error, it has to read the rest of the file anyway, in order to get it closed.
A function that handles the closing of the file for you is System.IO.withFile
.
You can use it like this:
withFile "source" ReadMode $ \h ->
hGetLine h >>= putStrLn
After the actions inside the withFile
call, the file is closed.
However this is dangerous:
- If you attempt to lazily read contents from the file out of
withFile
, the file is closed before the data is actually read.
Thus, although withFile "source" ReadMode hGetContents
looks like readFile
, it is very different: it does not work.
How can you implement a function like hGetContents
by yourselves?
You need to call hGetChar
in a lazy way.
This is achieved by unsafeInterleaveIO
.
However, calling unsafeInterleaveIO hGetChar
many times would not work,
because the order must be preserved.
For example, in:
hGetContents h >>= putStrLn . drop 10
the first ten characters from the file are not needed,
but hGetChar
must be called for the first 10 characters anyway in order to increment the file position.
This is achieved by not calling unsafeInterleaveIO
on hGetChar
but on the list constructor.
The implementation of hGetContents
resembles:
hGetContents h =
let go = unsafeInterleaveIO $ liftM2 (:) (hGetChar h) go
in go
In contrast to the standard hGetContents
, this implementation does not close the file
(by the way, it does even not handle the end of the file), but the advantage of not relying on some automatism to close the file somewhen is, that you can close the file immediately after you stopped processing its content.
The disadvantage is that you must not forget to close the file and must do it only once.
So far we have only considered lazy reading. It might also be necessary to trigger write actions when fetching data. Consider a server-client interaction, where data can only be read, when a request was sent before. It would be useful if the request is triggered by reading the result from the server. Such interactions can be programmed using the lazyio package.
Alternatives
From the above issues, you see that laziness is a fragile thing. Make one mistake and a function, carefully developed with laziness in mind, is no longer lazy. The type system will rarely help you hunting laziness breakers, and there is little support by debuggers.
Thus, detecting laziness breakers often requires understanding a large portion of code, which is against the idea of modularity.
Maybe for your case you will prefer a different idiom, that achieves the same goals in a safer way. See e.g. the Enumerator and iteratee pattern.
See also
- Haskell-Cafe on Maintaining laziness
- Haskell-Cafe on How to make code least strict?
- Blog post Lazier function definitions by merging partial values by Conal Elliott
- Laziness is not always good