HaskellWiki - User contributions [en]

Dealing with binary data

2008-02-02T23:38:23Z

AdamLangley: Add documentation about incremental parsing

== Handling Binary Data with Haskell ==

Many programming problems call for the use of binary formats for compactness,
ease-of-use, compatibility or speed. This page quickly covers some common
libraries for handling binary data in Haskell.

=== Bytestrings ===

Everything else in this tutorial will be based on bytestrings. Normal Haskell
<hask>String</hask> types are linked lists of 32-bit characters. This has a
number of useful properties like coverage of the Unicode space and laziness,
however when it comes to dealing with bytewise data, <hask>String</hask>
involves a space-inflation of about 24x and a large reduction in speed.

Bytestrings are packed arrays of bytes or 8-bit chars. If you have experience
in C, their memory representation would be the same as a <code>uint8_t[]</code>—although bytestrings know their length and don't allow overflows, etc.

There are two major flavours of bytestrings: strict and lazy. Strict
bytestrings are exactly what you would expect—a linear array of bytes in
memory. Lazy bytestrings are a list of strict bytestrings; often this is called
a cord in other languages. When reading a lazy bytestring from a file, the data
will be read chunk by chunk and the file can be larger than the size of memory.
The default chunk size is currently 32K.
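To make the "list of strict bytestrings" picture concrete, here is a minimal sketch that builds a lazy bytestring from two strict chunks and inspects the chunk boundaries. The names <code>lazyFromPieces</code> and <code>chunkSizes</code> are made up for this illustration; <code>fromChunks</code> and <code>toChunks</code> come from <code>Data.ByteString.Lazy</code>.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL

-- A lazy bytestring assembled from two strict chunks (illustrative data).
lazyFromPieces :: BL.ByteString
lazyFromPieces = BL.fromChunks [B.pack [1, 2, 3], B.pack [4, 5]]

-- The length of every strict chunk inside a lazy bytestring.
chunkSizes :: BL.ByteString -> [Int]
chunkSizes = map B.length . BL.toChunks
```

The chunk boundaries are an implementation detail: functions such as <code>BL.length</code> and <code>BL.unpack</code> see one continuous sequence of bytes.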

Within each flavour of bytestring comes the Word8 and Char8 versions. These are
mostly an aid to the type system since they are fundamentally the same size of
element. The Word8 unpacks as a list of <hask>Word8</hask> elements (bytes),
the Char8 unpacks as a list of <hask>Char</hask>, which may be useful if you
want to convert them to a <hask>String</hask>.
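As a small sketch of the two views, the following packs three bytes through the Char8 interface and unpacks them both ways. The names <code>abc</code>, <code>asBytes</code> and <code>asChars</code> are invented for the example.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.Word (Word8)

-- Three bytes, built through the Char8 interface.
abc :: B.ByteString
abc = BC.pack "abc"

-- The Word8 view: a list of raw bytes.
asBytes :: [Word8]
asBytes = B.unpack abc

-- The Char8 view: the same bytes as a list of Char.
asChars :: String
asChars = BC.unpack abc
```

Both modules share one underlying <code>ByteString</code> type, so a value packed with one interface can be unpacked with the other.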

You might want to open the documentation for
[http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString.html strict bytestrings] and
[http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString-Lazy.html lazy bytestrings]
in another tab so that you can follow along.

==== Simple file IO ====

Here's a very simple program which copies a file from standard input to
standard output:

<haskell>module Main where

import qualified Data.ByteString as B

main :: IO ()
main = do
  contents <- B.getContents
  B.putStr contents</haskell>

Note that we are using strict bytestrings here. (It's quite common to import the
<code>ByteString</code> module under the names <code>B</code> or <code>BS</code>.)
Since the bytestrings are strict, the code will read the whole of <code>stdin</code> into
memory and then write it out. If the input was too large this would overflow
the available memory and fail.

Let's see the same program using lazy bytestrings. We are just changing the
imported ByteString module to be the lazy one and calling the exact same
functions from the new module:

<haskell>module Main where

import qualified Data.ByteString.Lazy as BL

main :: IO ()
main = do
  contents <- BL.getContents
  BL.putStr contents</haskell>

This code, because of the lazy bytestrings, will cope with any sized input and
will start producing output before all the input has been read. You can think
of the code as setting up a pipeline, rather than executing in-order, as you
might expect. As <hask>putStr</hask> needs more data, it will cause the lazy
bytestring <hask>contents</hask> to read more until the end of the input is
found.
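The pipeline idea extends to a processing step between reading and writing. This is only a sketch: <code>upcaseByte</code> and <code>upcase</code> are hypothetical names, and the transformation (upper-casing ASCII letters) is chosen purely for illustration.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)

-- Upper-case ASCII letters, leaving every other byte untouched.
upcaseByte :: Word8 -> Word8
upcaseByte b
  | b >= 97 && b <= 122 = b - 32
  | otherwise = b

-- Applied lazily, chunk by chunk, as the consumer demands output.
upcase :: BL.ByteString -> BL.ByteString
upcase = BL.map upcaseByte

-- Wiring it to stdin and stdout keeps the streaming behaviour:
--   main = BL.interact upcase
```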

You should review the [http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString.html documentation]
which lists all the functions which operate on ByteStrings. The documentation
for the various types (lazy Word8, strict Char8, ...) is very similar. You
generally find the same functions in each, with the same names. Remember to
import the modules as <code>qualified</code> and give them different names.

==== The guts of ByteStrings ====

I'll just mention in passing that sometimes you need to do something which would
endanger the referential transparency of ByteStrings. Generally you only need
to do this when using the FFI to interface with C libraries. Should such a need
arise, you can have a look at the
[http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString-Internal.html internal functions] and the
[http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString-Unsafe.html unsafe functions].
Remember that the last set of functions are called unsafe for a reason—misuse
can crash your program!

=== Binary parsing ===

Once you have your data as a bytestring you'll be wanting to parse something
from it. Here you need to install the
<tt>[http://hackage.haskell.org/cgi-bin/hackage-scripts/package/binary-0.4.1 binary]</tt> package. You should read the instructions on
[http://haskell.org/haskellwiki/Cabal/How_to_install_a_Cabal_package how to install a Cabal package] if you haven't done so already.

The <tt>binary</tt> package has three major parts: the <code>Get</code> monad,
the <code>Put</code> monad and a general serialisation for Haskell types. The
latter is like the <tt>pickle</tt> module that you may know from Python—it
has its own serialisation format and I won't be covering it any more here.
However, if you just need to persist some Haskell data structures, it might be
exactly what you want: the documentation is
[http://hackage.haskell.org/packages/archive/binary/0.4.1/doc/html/Data-Binary.html here].

==== The <tt>Get</tt> monad ====

The <tt>Get</tt> monad is a state monad; it keeps some state and each action
updates that state. The state in this case is an offset into the bytestring
which is getting parsed. <tt>Get</tt> parses lazy bytestrings; this is how
packages like
[http://hackage.haskell.org/cgi-bin/hackage-scripts/package/tar-0.1.1.1 tar]
can parse files several gigabytes long in constant memory: they are using a
pipeline of lazy bytestrings. However, this also has a downside. When parsing a
lazy bytestring a parse failure (such as running off the end of the bytestring)
is signified by an exception. Exceptions can only be caught in the IO monad
and, because of laziness, might not be thrown exactly where you expect. If this
is a problem, you probably want a strict version of <tt>Get</tt>, which is
covered below.

Here's an example of using the <tt>Get</tt> monad:

<haskell>import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get
import Data.Word

deserialiseHeader :: Get (Word32, Word32, Word32)
deserialiseHeader = do
  alen <- getWord32be
  plen <- getWord32be
  chksum <- getWord32be
  return (alen, plen, chksum)

main :: IO ()
main = do
  input <- BL.getContents
  print $ runGet deserialiseHeader input</haskell>

This code takes three big-endian, 32-bit unsigned numbers from the input string
and returns them as a tuple. Let's try running it:

<pre>% runhaskell /tmp/example.hs << EOF
heredoc> 123412341235
heredoc> EOF
(825373492,825373492,825373493)</pre>

Makes sense, right? Look what happens if the input is too short:

<pre>% runhaskell /tmp/example.hs << EOF
tooshort
EOF
(1953460083,1752134260,example.hs: too few bytes. Failed reading at byte position 12</pre>

Here an exception was thrown because we ran out of bytes.

So the <tt>Get</tt> monad consists of a set of operations like
<hask>getWord32be</hask> which walk over the input and return some type of
data. You can see the full list of those functions in the
[http://hackage.haskell.org/packages/archive/binary/0.4.1/doc/html/Data-Binary-Get.html documentation].
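Because <tt>Get</tt> is a monad, a value parsed earlier can steer what is parsed next. Here is a minimal sketch for a made-up format (a 16-bit big-endian length followed by that many payload bytes); the name <code>lengthPrefixed</code> is hypothetical.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (Get, getByteString, getWord16be, runGet)

-- Hypothetical format: a 16-bit big-endian length, then that many bytes.
lengthPrefixed :: Get B.ByteString
lengthPrefixed = do
  len <- getWord16be
  getByteString (fromIntegral len)

-- Example use: the length field says 3, so the trailing byte 100 is left unread.
example :: B.ByteString
example = runGet lengthPrefixed (BL.pack [0, 3, 97, 98, 99, 100])
```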

Here's another example; decoding an EOF-terminated list of
numbers just involves recursion:

<haskell>listOfWord16 :: Get [Word16]
listOfWord16 = do
  empty <- isEmpty
  if empty
    then return []
    else do
      v <- getWord16be
      rest <- listOfWord16
      return (v : rest)</haskell>

==== Strict <tt>Get</tt> monad ====

If you're parsing small messages then, firstly, your input isn't going to be a
lazy bytestring but a strict one. That's not really a problem because you can
easily convert between them. However, if you want to handle parse failures you
either have to write your parser very carefully, or you have to deal with the
fact that you can only catch exceptions in the IO monad.

If this is your dilemma, then you need a strict version of the <tt>Get</tt>
monad. It's almost exactly the same, but a parser of type <hask>Get a</hask>
results in <hask>(Either String a, ByteString)</hask> as the result of
<hask>runGet</hask>. That type is a tuple where the first value is ''either'' a
string (an error string from the parse) or the result, and the second value is
the remaining bytestring when the parser finished.

Let's update the first example with this strict version of <tt>Get</tt>. You'll
have to install the
[http://hackage.haskell.org/cgi-bin/hackage-scripts/package/binary-strict-0.2.2 binary-strict]
package for it to work.

<haskell>import qualified Data.ByteString as B
import Data.Binary.Strict.Get
import Data.Word

deserialiseHeader :: Get (Word32, Word32, Word32)
deserialiseHeader = do
  alen <- getWord32be
  plen <- getWord32be
  chksum <- getWord32be
  return (alen, plen, chksum)

main :: IO ()
main = do
  input <- B.getContents
  print $ runGet deserialiseHeader input</haskell>

Note that all we've done is change from lazy bytestrings to strict bytestrings
and change to importing <tt>Data.Binary.Strict.Get</tt>. Now we'll run
it again:

<pre>% runhaskell /tmp/example.hs << EOF
heredoc> 123412341235
heredoc> EOF
(Right (825373492,825373492,825373493),"\n")</pre>

Now we can see that the parser was successful (we got a <tt>Right</tt>) and we
can see that our shell actually added an extra newline on the input (correctly)
and the parser didn't consume that, so it's also returned to us. Now we try it
with a truncated input:

<pre>% runhaskell /tmp/example.hs << EOF
heredoc> tooshort
heredoc> EOF
(Left "too few bytes","\n")</pre>

This time we didn't get an exception, but a <tt>Left</tt> value, which can be
handled in pure code. The remaining bytestring is the same because our
truncated input is 9 bytes long, parsing the first two <tt>Word32</tt>s
consumed 8 bytes and parsing the third failed—at which point we had the last
byte still in the input.

In your parser, you can also call <hask>fail</hask>, with an error string,
which will result in a <tt>Left</tt> value.

That's it; it's otherwise the same as the <tt>Get</tt> monad.
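A similar pure <tt>Either</tt> result can also be obtained without <tt>binary-strict</tt>, assuming a recent version of the <tt>binary</tt> package that provides <code>runGetOrFail</code> (newer than the 0.4.1 release linked on this page). The sketch below is not the <tt>binary-strict</tt> API; the message format and the names <code>magicThenLength</code> and <code>parseMessage</code> are invented for illustration.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (Get, getWord32be, getWord8, runGetOrFail)
import Data.Word (Word32)

-- Hypothetical format: a magic byte 0x7F, then a 32-bit big-endian length.
magicThenLength :: Get Word32
magicThenLength = do
  magic <- getWord8
  if magic /= 0x7F
    then fail "bad magic byte"   -- a parse failure raised by the parser itself
    else getWord32be

-- Run the parser purely, keeping only the error message or the result.
parseMessage :: BL.ByteString -> Either String Word32
parseMessage input =
  case runGetOrFail magicThenLength input of
    Left (_, _, err) -> Left err
    Right (_, _, len) -> Right len
```

Both a call to <hask>fail</hask> and running out of input end up as a <tt>Left</tt>, so the caller can handle them in pure code.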

====Incremental parsing====

If you have to deal with a protocol which isn't length prefixed, or otherwise
chunkable, from the network then you are faced with the problem of knowing when
you have enough data to parse something semantically useful. You could run a
strict <tt>Get</tt> over what you have and catch the truncation result, but
that means re-parsing the same data every time more of it arrives.

Instead, you can use an incremental parser. There's an incremental version of
the <tt>Get</tt> monad in <tt>Data.Binary.Strict.IncrementalGet</tt> (you'll
need the <tt>binary-strict</tt> package).

You use it as normal, but rather than returning an <tt>Either</tt> value, you
get a [http://hackage.haskell.org/packages/archive/binary-strict/0.2.4/doc/html/Data-Binary-Strict-IncrementalGet.html#t%3AResult Result]. Follow that link and read the documentation for <tt>Result</tt>.

It reflects the three outcomes of parsing possibly truncated data. Either the
data is invalid as is, or it's complete, or it's truncated. In the truncated
case you are given a function (called a continuation), to which you can pass
more data, when you get it, and continue the parse. The continuation, again,
returns a <tt>Result</tt> depending on the result of parsing the additional
data as well.
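The same three-outcome shape can be sketched with the incremental interface of the plain <tt>binary</tt> package, assuming a recent version that exports <code>runGetIncremental</code> and the <code>Decoder</code> type. This is a stand-in for illustration, not the <tt>IncrementalGet</tt> API itself, and the helper <code>feed</code> is a hypothetical name.

```haskell
import qualified Data.ByteString as B
import Data.Binary.Get (Decoder (..), getWord32be, pushChunk, runGetIncremental)
import Data.Word (Word32)

-- Feed strict chunks to a decoder until it finishes, fails, or the chunks run out.
--   Left err        : the data is invalid
--   Right (Just x)  : the parse is complete
--   Right Nothing   : the data is truncated so far; more input is needed
feed :: Decoder a -> [B.ByteString] -> Either String (Maybe a)
feed (Done _ _ x) _ = Right (Just x)
feed (Fail _ _ err) _ = Left err
feed (Partial _) [] = Right Nothing
feed d@(Partial _) (c : cs) = feed (d `pushChunk` c) cs

-- A decoder for one big-endian 32-bit word, ready to accept chunks.
word32 :: Decoder Word32
word32 = runGetIncremental getWord32be
```

In a network program the list of chunks would be replaced by reads from a socket, keeping the <code>Partial</code> decoder around between reads.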

====Bit twiddling====

Even with all this monadic goodness, sometimes you just need to move some bits
around. That's perfectly possible in Haskell too. Just import
<tt>Data.Bits</tt> and use the following table.

<table>
<tr><th>Name</th><th>C operator</th><th>Haskell</th></tr>
<tr><td>AND</td><td><tt>&</tt></td><td><hask>.&.</hask></td></tr>
<tr><td>OR</td><td><tt>|</tt></td><td><hask>.|.</hask></td></tr>
<tr><td>XOR</td><td><tt>^</tt></td><td><hask>`xor`</hask></td></tr>
<tr><td>NOT</td><td><tt>~</tt></td><td><hask>`complement`</hask></td></tr>
<tr><td>Left shift</td><td><tt><<</tt></td><td><hask>`shiftL`</hask></td></tr>
<tr><td>Right shift</td><td><tt>>></tt></td><td><hask>`shiftR`</hask></td></tr>
</table>
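Here is a short sketch putting a few of these operators to work. The function names are made up for the example; the operators all come from <tt>Data.Bits</tt>.

```haskell
import Data.Bits (complement, shiftL, shiftR, xor, (.&.), (.|.))
import Data.Word (Word16, Word8)

-- Combine two bytes into a big-endian 16-bit value.
fromBytesBE :: Word8 -> Word8 -> Word16
fromBytesBE hi lo = (fromIntegral hi `shiftL` 8) .|. fromIntegral lo

-- Extract the high and low four bits of a byte.
highNibble, lowNibble :: Word8 -> Word8
highNibble b = b `shiftR` 4
lowNibble b = b .&. 0x0F

-- Clear the bits selected by a mask, and toggle the bits selected by a mask.
clearBits, toggleBits :: Word8 -> Word8 -> Word8
clearBits mask b = b .&. complement mask
toggleBits mask b = b `xor` mask
```

Note that shifts operate within the width of the type, which is why <code>hi</code> is widened with <hask>fromIntegral</hask> before being shifted left.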

====The <tt>BitGet</tt> monad====

As an alternative to bit twiddling, you can also use the <tt>BitGet</tt> monad.
This is another state-like monad, like <tt>Get</tt>, but here the state
includes the current bit-offset in the input. This means that you can easily pull out
unaligned data. Sadly, haddock is currently breaking when trying to generate the
documentation for <tt>BitGet</tt> so I'll start with an example. Again, you'll
need the
[http://hackage.haskell.org/cgi-bin/hackage-scripts/package/binary-strict-0.2.2 binary-strict] package installed.

Here's a description of the header of a DNS packet, direct from RFC 1035:

<pre>                                1  1  1  1  1  1
  0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                      ID                       |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|QR|   Opcode  |AA|TC|RD|RA|   Z    |   RCODE   |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    QDCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ANCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    NSCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ARCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+</pre>

The actual fields don't matter, but here's a function for parsing it:

<haskell>parseHeader :: G.Get Header
parseHeader = do
  id <- G.getWord16be
  flags <- G.getByteString 2
  qdcount <- G.getWord16be >>= return . fromIntegral
  ancount <- G.getWord16be >>= return . fromIntegral
  nscount <- G.getWord16be >>= return . fromIntegral
  arcount <- G.getWord16be >>= return . fromIntegral

  let r = BG.runBitGet flags (do
        isquery <- BG.getBit
        opcode <- BG.getAsWord8 4 >>= parseEnum
        aa <- BG.getBit
        tc <- BG.getBit
        rd <- BG.getBit
        ra <- BG.getBit

        BG.getAsWord8 3
        rcode <- BG.getAsWord8 4 >>= parseEnum

        return $ Header id isquery opcode aa tc rd ra rcode qdcount ancount nscount arcount)

  case r of
    Left error -> fail error
    Right x -> return x</haskell>

Here you can see that only the second line (from the ASCII-art diagram) is
parsed using <tt>BitGet</tt>. An outer <tt>Get</tt> monad is used for
everything else and the bit fields are pulled out with
<hask>getByteString</hask>. Again, <tt>BitGet</tt> is a strict monad and
returns an <tt>Either</tt>, but it doesn't return the remaining bytestring,
just because there's no obvious way to represent a bytestring of a fractional
number of bytes.
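For comparison, the flags line of the diagram can also be decoded with plain <tt>Data.Bits</tt> once it has been read as a big-endian <hask>Word16</hask>. This is a sketch of the bit-twiddling alternative, not of <tt>BitGet</tt>; the <code>Flags</code> record and <code>decodeFlags</code> are hypothetical names, and the Z bits are simply ignored.

```haskell
import Data.Bits (shiftR, testBit, (.&.))
import Data.Word (Word16, Word8)

-- The second line of the header diagram. Bit 0 of the diagram is the most
-- significant bit, which testBit calls bit 15.
data Flags = Flags
  { qr :: Bool             -- diagram bit 0
  , opcode :: Word8        -- diagram bits 1-4
  , aa, tc, rd, ra :: Bool -- diagram bits 5-8
  , rcode :: Word8         -- diagram bits 12-15
  } deriving (Eq, Show)

decodeFlags :: Word16 -> Flags
decodeFlags w = Flags
  { qr = testBit w 15
  , opcode = fromIntegral ((w `shiftR` 11) .&. 0x0F)
  , aa = testBit w 10
  , tc = testBit w 9
  , rd = testBit w 8
  , ra = testBit w 7
  , rcode = fromIntegral (w .&. 0x0F)
  }
```

The shift counts have to be worked out by hand here, which is exactly the bookkeeping that <tt>BitGet</tt> does for you.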

You can see the list of <tt>BitGet</tt> functions and their comments in the
[http://darcs.imperialviolet.org/darcsweb.cgi?r=binary-strict;a=headblob;f=/src/Data/Binary/Strict/BitGet.hs source code].

===Binary generation===

In contrast to parsing binary data, you might want to generate it. This is the
job of the <tt>Put</tt> monad. Follow along with the
[http://hackage.haskell.org/packages/archive/binary/0.4.1/doc/html/Data-Binary-Put.html documentation]
if you like.

The <tt>Put</tt> monad is another state-like monad, but the state is an offset
into a series of buffers where the generated data is placed. All the buffer
creation and handling is done for you, so you can just forget about it. It
results in a lazy bytestring (so you can generate outputs that are larger than memory).

Here's the reverse of our simple <tt>Get</tt> example:

<haskell>import qualified Data.ByteString.Lazy as BL
import Data.Binary.Put

serialiseSomething :: Put
serialiseSomething = do
  putWord32be 1
  putWord16be 2
  putWord8 3

main :: IO ()
main = BL.putStr $ runPut serialiseSomething</haskell>

And running it shows that it's generating the correct serialisation:

<pre>% runhaskell /tmp/example.hs | hexdump -C
00000000  00 00 00 01 00 02 03                              |.......|</pre>
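Since <tt>Put</tt> and <tt>Get</tt> mirror each other, a quick way to check a serialiser is to feed its output straight back into the matching parser. A minimal sketch, with <code>Triple</code>, <code>putTriple</code>, <code>getTriple</code> and <code>roundTrip</code> as made-up names:

```haskell
import Data.Binary.Get (Get, getWord16be, getWord32be, getWord8, runGet)
import Data.Binary.Put (Put, putWord16be, putWord32be, putWord8, runPut)
import Data.Word (Word16, Word32, Word8)

type Triple = (Word32, Word16, Word8)

-- Write the three fields big-endian, in order.
putTriple :: Triple -> Put
putTriple (a, b, c) = do
  putWord32be a
  putWord16be b
  putWord8 c

-- Read them back in the same order and with the same widths.
getTriple :: Get Triple
getTriple = do
  a <- getWord32be
  b <- getWord16be
  c <- getWord8
  return (a, b, c)

-- Serialise and immediately parse again; this should be the identity.
roundTrip :: Triple -> Triple
roundTrip = runGet getTriple . runPut . putTriple
```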

If you want the output of <tt>runPut</tt> to be a strict bytestring, you just
need to convert it with <hask>B.concat $ BL.toChunks $ runPut xyz</hask>.

One limitation of <tt>Put</tt>, due to the nature of the <tt>Builder</tt> monoid
which it works with, is that you can't get the current offset into the output.
This can be an issue with some formats which require you to encode byte offsets
into the file. You have to calculate these byte offsets yourself.
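One way to calculate such a size yourself is to serialise the inner part first, measure it, and then write the length followed by the bytes. This is only a sketch under an invented format (a 32-bit length prefix before a body); <code>body</code> and <code>lengthPrefixedBody</code> are hypothetical names.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Put (Put, putLazyByteString, putWord16be, putWord32be, runPut)

-- Hypothetical body: two 16-bit fields.
body :: Put
body = do
  putWord16be 1
  putWord16be 2

-- Serialise the body separately so its size is known, then emit a 32-bit
-- big-endian length prefix followed by the body bytes.
lengthPrefixedBody :: Put
lengthPrefixedBody = do
  let bytes = runPut body
  putWord32be (fromIntegral (BL.length bytes))
  putLazyByteString bytes
```

The cost of this approach is that the body has to be fully generated before its length can be written, so the length-prefixed part can no longer be streamed.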

=== Other useful packages ===

There are other packages which you should know about, but which are mostly
covered by their documentation:

* [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/network-bytestring-0.1.1 network-bytestring]: for reading and writing bytestrings over the network
* [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/zlib-0.4.0.2 zlib] and [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bzlib-0.4.0.1 bzlib]: for compressed formats
* [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/encoding-0.3 encoding]: for dealing with character encodings
* [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/tar-0.1.1.1 tar]: as an example of lazy parsing/serialisation

Tutorials

2008-01-29T20:56:08Z

AdamLangley: Add link to DealingWithBinaryData

==Introductions to Haskell==

These are the recommended places to start learning, short of buying a textbook.

=== Best places to start ===

;[http://darcs.haskell.org/yaht/yaht.pdf Yet Another Haskell Tutorial]
:By Hal Daume III et al. A recommended tutorial for Haskell that is still under construction but already covers much ground. Also a classic text. Now available [http://en.wikibooks.org/wiki/Haskell/YAHT as a wikibook].

;[http://en.wikibooks.org/wiki/Haskell Haskell Wikibook]
:A communal effort by several authors to produce the definitive Haskell textbook. It's very much a work in progress at the moment, and contributions are welcome.

;[http://halogen.note.amherst.edu/~jdtang/scheme_in_48/tutorial/overview.html Write Yourself a Scheme in 48 Hours in Haskell]
:A Haskell Tutorial, by Jonathan Tang. Most Haskell tutorials on the web seem to take a language-reference-manual approach to teaching. They show you the syntax of the language, a few language constructs, and then have you construct a few simple functions at the interactive prompt. The "hard stuff" of how to write a functioning, useful program is left to the end, or sometimes omitted entirely. This tutorial takes a different tack. You'll start off with command-line arguments and parsing, and progress to writing a fully-functional Scheme interpreter that implements a good-sized subset of R5RS Scheme. Along the way, you'll learn Haskell's I/O, mutable state, dynamic typing, error handling, and parsing features. By the time you finish, you should be fairly fluent in both Haskell and Scheme.

=== More tutorials ===

;[http://www.haskell.org/tutorial/ A Gentle Introduction to Haskell] :By Paul Hudak, John Peterson, and Joseph H. Fasel. The title is misleading. Some knowledge of another functional programming language is expected. The emphasis is on the type system and those features which are really new in Haskell (compared to other functional programming languages). A classic, but not for the faint of heart (it's not so gentle). Also available in [http://gorgonite.developpez.com/livres/traductions/haskell/gentle-haskell/ French] and [http://www.rsdn.ru/article/haskell/haskell_part1.xml Russian].

;[[H-99: Ninety-Nine Haskell Problems]]
:A collection of programming puzzles, with Haskell solutions. Solving these is a great way to get into Haskell programming.

;[http://www.haskell.org/~pairwise/intro/intro.html Haskell Tutorial for C Programmers]
:By Eric Etheridge. From the intro: "This tutorial assumes that the reader is familiar with C/C++, Python, Java, or Pascal. I am writing for you because it seems that no other tutorial was written to help students overcome the difficulty of moving from C/C++, Java, and the like to Haskell."

;[http://www-106.ibm.com/developerworks/edu/os-dw-linuxhask-i.html Beginning Haskell]
:From IBM developerWorks. This tutorial targets programmers of imperative languages wanting to learn about functional programming in the language Haskell. If you have programmed in languages such as C, Pascal, Fortran, C++, Java, Cobol, Ada, Perl, TCL, REXX, JavaScript, Visual Basic, or many others, you have been using an imperative paradigm. This tutorial provides a gentle introduction to the paradigm of functional programming, with specific illustrations in the Haskell 98 language. (Free registration required.)

;[http://www.informatik.uni-bonn.de/~ralf/teaching/Hskurs_toc.html Online Haskell Course]
:By Ralf Hinze (in German).

;[http://www.cs.chalmers.se/~rjmh/tutorials.html Tutorial Papers in Functional Programming]
:A collection of links to other Haskell tutorials, from John Hughes.

;[http://www.cs.ou.edu/cs1323h/textbook/haskell.shtml Two Dozen Short Lessons in Haskell]
:By Rex Page. A draft of a textbook on functional programming, available by ftp. It calls for active participation from readers by omitting material at certain points and asking the reader to attempt to fill in the missing information based on knowledge they have already acquired. The missing information is then supplied on the reverse side of the page.

;[ftp://ftp.geoinfo.tuwien.ac.at/navratil/HaskellTutorial.pdf Haskell-Tutorial]
:By Damir Medak and Gerhard Navratil. The fundamentals of functional languages for beginners.

;[http://video.s-inf.de/#FP.2005-SS-Giesl.(COt).HD_Videoaufzeichnung Video Lectures]
:Lectures (in English) by Jürgen Giesl. About 30 hours in total, and great for learning Haskell. The lectures are 2005-SS-FP.V01 through 2005-SS-FP.V26. Videos 2005-SS-FP.U01 through 2005-SS-FP.U11 are exercise answer sessions, so you probably don't want those.

;[http://www.cs.utoronto.ca/~trebla/fp/ Albert's Functional Programming Course]
:A 15 lesson introduction to most aspects of Haskell.

;[http://www.iceteks.com/articles.php/haskell/1 Introduction to Haskell]
:By Chris Dutton. An "attempt to bring the ideas of functional programming to the masses here, and an experiment in finding ways to make it easy and interesting to follow".

;[http://www.csc.depauw.edu/~bhoward/courses/0203Spring/csc122/haskintro/ An Introduction to Haskell]
:A brief introduction, by Brian Howard.

;[http://web.syntaxpolice.org/lectures/haskellTalk/slides/index.html Introduction to Haskell]
:By Isaac Jones (2003).

;[http://www.linuxjournal.com/article/9096 Translating Haskell into English]
:By Shannon Behrens. A glimpse of the Zen of Haskell, without requiring that readers already be Haskell converts.

;[http://www.shlomifish.org/lecture/Perl/Haskell/slides/ Haskell for Perl Programmers]
:Brief introduction to Haskell, with a view to what Perl programmers are interested in.

;[http://lisperati.com/haskell/ How To Organize a Picnic on a Computer]
:Fun introduction to Haskell, step by step building of a program to seat people at a planned picnic, based on their similarities using data from a survey and a map of the picnic location.

;[http://cs.wwc.edu/KU/PR/Haskell.html Haskell Tutorial]

== Motivation for using Haskell ==

;[http://www.md.chalmers.se/~rjmh/Papers/whyfp.html Why Functional Programming Matters]
:By [http://www.md.chalmers.se/~rjmh/ John Hughes], The Computer Journal, Vol. 32, No. 2, 1989, pp. 98 - 107. Also in: David A. Turner (ed.): Research Topics in Functional Programming, Addison-Wesley, 1990, pp. 17 - 42.<BR> Exposes the advantages of functional programming languages. Demonstrates how higher-order functions and lazy evaluation enable new forms of modularization of programs.

;[[Why Haskell matters]]
:Discussion of the advantages of using Haskell in particular. An excellent article.

;[http://www.cs.ukc.ac.uk/pubs/1997/224/index.html Higher-order + Polymorphic = Reusable]
:By [http://www.cs.ukc.ac.uk/people/staff/sjt/index.html Simon Thompson]. Unpublished, May 1997.<BR> <STRONG>Abstract:</STRONG> This paper explores how certain ideas in object oriented languages have their correspondents in functional languages. In particular we look at the analogue of the iterators of the C++ standard template library. We also give an example of the use of constructor classes which feature in Haskell 1.3 and Gofer.

;[http://www-128.ibm.com/developerworks/java/library/j-cb07186.html Explore functional programming with Haskell]
:Introduction to the benefits of functional programming in Haskell by Bruce Tate.

== Blog articles ==

There are a large number of tutorials covering diverse Haskell topics
published as blogs. Some of the best of these articles are collected
here:

;[[Blog articles]]

==Practical Haskell==

These tutorials examine using Haskell to write complex real-world applications.

;[http://research.microsoft.com/%7Esimonpj/Papers/marktoberdorf Tackling the awkward squad: monadic input/output, concurrency, exceptions, and foreign-language calls in Haskell]
:Simon Peyton Jones. Presented at the 2000 Marktoberdorf Summer School. In "Engineering theories of software construction", ed Tony Hoare, Manfred Broy, Ralf Steinbruggen, IOS Press, ISBN 1-58603-1724, 2001, pp47-96. The standard reference for monadic IO in GHC/Haskell. <br><strong>Abstract:</strong>Functional programming may be beautiful, but to write real applications we must grapple with awkward real-world issues: input/output, robustness, concurrency, and interfacing to programs written in other languages.

;[[Hitchhikers Guide to the Haskell]]
: Tutorial for C/Java/OCaml/... programmers by Dmitry Astapov. From the intro: "This text intends to introduce the reader to the practical aspects of Haskell from the very beginning (plans for the first chapters include: I/O, darcs, Parsec, QuickCheck, profiling and debugging, to mention a few)".

;[http://haskell.org/haskellwiki/IO_inside Haskell I/O inside: Down the Rabbit's Hole]
:By Bulat Ziganshin (2006), a comprehensive tutorial on using the IO monad.

;[http://web.archive.org/web/20060622030538/http://www.reid-consulting-uk.ltd.uk/docs/ffi.html A Guide to Haskell's Foreign Function Interface]
:A guide to using the foreign function interface extension, using the rich set of functions in the Foreign libraries, design issues, and FFI preprocessors.

;[http://blogs.nubgames.com/code/?p=22 Haskell IO for imperative programmers]
:A short introduction to IO from the perspective of an imperative programmer.

;[[A brief introduction to Haskell|A Brief Introduction to Haskell]]
:A translation of the article, [http://www.cs.jhu.edu/~scott/pl/lectures/caml-intro.html Introduction to OCaml], to Haskell.

;[[Roll your own IRC bot]]
:This tutorial is designed as a practical guide to writing real world code in Haskell and hopes to intuitively motivate and introduce some of the advanced features of Haskell to the novice programmer, including monad transformers. Our goal is to write a concise, robust and elegant IRC bot in Haskell.

;[http://j-van-thiel.speedlinq.nl/EddyAhmed/GladeGtk2Hs.html Developing Gnome Apps with Glade]
:For the absolute beginner in both Glade and Gtk2Hs and covers the basics of Glade and how to access a .glade file and widgets in Gtk2Hs. Estimated learning time: 2 hours.

;Applications of Functional Programming
:Colin Runciman and David Wakeling (ed.), UCL Press, 1995, ISBN 1-85728-377-5 HB. From the cover:<blockquote>This book is unique in showcasing real, non-trivial applications of functional programming using the Haskell language. It presents state-of-the-art work from the FLARE project and will be an invaluable resource for advanced study, research and implementation.</blockquote>

;[[DealingWithBinaryData]]
:A guide to bytestrings, the various <tt>Get</tt> monads and the <tt>Put</tt> monad.

===Testing===

;[http://blog.moertel.com/articles/2006/10/31/introductory-haskell-solving-the-sorting-it-out-kata Small overview of QuickCheck]

;[[Introduction to QuickCheck]]

==Reference material==

;[http://haskell.org/haskellwiki/Category:Tutorials A growing list of Haskell tutorials on a diverse range of topics]
:Available on this wiki

;[http://undergraduate.csse.uwa.edu.au/units/230.301/lectureNotes/tourofprelude.html A Tour of the Haskell Prelude (basic functions)]
:By Bernie Pope and Arjan van IJzendoorn.

;[http://cs.anu.edu.au/Student/comp1100/haskell/tourofsyntax.html Tour of the Haskell Syntax]
:By Arjan van IJzendoorn.

;[http://zvon.org/other/haskell/Outputglobal/index.html Haskell Reference]
:By Miloslav Nic.

;[http://members.chello.nl/hjgtuyl/tourdemonad.html A tour of the Haskell Monad functions]
:By Henk-Jan van Tuyl.

;[http://www.cse.unsw.edu.au/~en1000/haskell/inbuilt.html Useful Haskell functions]
:An explanation for beginners of many Haskell functions that are predefined in the Haskell Prelude.

;[http://www.cs.chalmers.se/Cs/Grundutb/Kurser/d1pt/d1pta/ListDoc/ Haskell's Standard List Functions]
:A tour of the standard Haskell functions, directed by what you want to achieve

;[http://haskell.org/ghc/docs/latest/html/libraries/ Documentation for the standard libraries]
:Complete documentation of the standard Haskell libraries.

;[http://www.haskell.org/haskellwiki/Category:Idioms Haskell idioms]
:A collection of articles describing some common Haskell idioms. Often quite advanced.

;[http://www.haskell.org/haskellwiki/Blow_your_mind Useful idioms]
:A collection of short, useful Haskell idioms.

;[http://www.haskell.org/haskellwiki/Programming_guidelines Programming guidelines]
:Some Haskell programming and style conventions.

;[http://www.md.chalmers.se/~rjmh/Combinators/LightningTour/index.htm Lightning Tour of Haskell]
:By John Hughes, as part of a Chalmers programming course

;[http://www.cs.chalmers.se/~augustss/AFP/manuals/haskeller.dvi.gz The Little Haskeller]
:By Cordelia Hall and John Hughes. 9 November 1993, 26 pages. An introduction using the Chalmers Haskell B interpreter (hbi). Beware that it relies very much on the user interface of hbi, which is quite different from that of other Haskell systems, and the tutorials cover Haskell 1.2, not Haskell 98.

;[http://www.cs.uu.nl/people/jeroen/courses/fp-eng.pdf Functional Programming]
:By Jeroen Fokker, 1995. (153 pages, 600 KB). Textbook for learning functional programming with Gofer (an older implementation of Haskell). Here without Chapters 6 and 7.

== Comparisons to other languages ==

Articles contrasting features of Haskell with those of other languages.

;[http://programming.reddit.com/goto?id=nq1k Haskell versus Scheme]
:Mark C. Chu-Carroll, Haskell and Scheme: Which One and Why?

;[http://wiki.python.org/moin/PythonVsHaskell Comparing Haskell and Python]
:A short overview of similarities and differences between Haskell and Python.

;[http://programming.reddit.com/goto?id=nwm2 Monads in OCaml]
:Syntax extension for monads in OCaml

;[http://www.shlomifish.org/lecture/Perl/Haskell/slides/ Haskell for Perl programmers]
:Short intro for perlers

;[[A_brief_introduction_to_Haskell|Introduction to Haskell]] versus [http://www.cs.jhu.edu/~scott/pl/lectures/caml-intro.html Introduction to OCaml].

;[http://www.thaiopensource.com/relaxng/derivative.html An algorithm for RELAX NG validation]
:by James Clark (of RELAX NG fame). Describes an algorithm for validating an XML document against a RELAX NG schema, uses Haskell to describe the algorithm. The algorithm in Haskell and Java is then [http://www.donhopkins.com/drupal/node/117 discussed here].

;[http://mult.ifario.us/articles/2006/10/11/first-steps-with-haskell-for-web-applications Haskell + FastCGI versus Ruby on Rails]
:A short blog entry documenting performance results with Ruby on Rails and Haskell with FastCGI.

;[http://haskell.org/papers/NSWC/jfp.ps Haskell vs. Ada vs. C++ vs. Awk vs. ..., An Experiment in Software Prototyping Productivity] (postscript)
:Paul Hudak and Mark P. Jones, 16 pages.<blockquote>Description of the results of an experiment in which several conventional programming languages, together with the functional language Haskell, were used to prototype a Naval Surface Warfare Center requirement for Geometric Region Servers. The resulting programs and development metrics were reviewed by a committee chosen by the US Navy. The results indicate that the Haskell prototype took significantly less time to develop and was considerably more concise and easier to understand than the corresponding prototypes written in several different imperative languages, including Ada and C++. </blockquote>

;[http://www.osl.iu.edu/publications/prints/2003/comparing_generic_programming03.pdf A Comparative Study of Language Support for Generic Programming] (pdf)
:Ronald Garcia, Jaakko Järvi, Andrew Lumsdaine, Jeremy G. Siek, and Jeremiah Willcock. In Proceedings of the 2003 ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications (OOPSLA'03), October 2003.<blockquote>An interesting comparison of generic programming support across languages, including: Haskell, SML, C++, Java, C#. Haskell supports all constructs described in the paper -- the only language to do so. </blockquote>

;[http://homepages.inf.ed.ac.uk/wadler/realworld/index.html Functional Programming in the Real World]
:A list of functional programs applied to real-world tasks. The main criterion for being real-world is that the program was written primarily to perform some task, not primarily to experiment with functional programming. Functional is used in the broad sense that includes both `pure' programs (no side effects) and `impure' (some use of side effects). Languages covered include CAML, Clean, Erlang, Haskell, Miranda, Scheme, SML, and others.

;[http://www.defmacro.org/ramblings/lisp-in-haskell.html Lisp in Haskell]
:Writing A Lisp Interpreter In Haskell, a tutorial

== Teaching Haskell ==

;[http://www.cs.ukc.ac.uk/pubs/1997/208/index.html Where do I begin? A problem solving approach to teaching functional programming]
:By [http://www.cs.ukc.ac.uk/people/staff/sjt/index.html Simon Thompson]. In Krzysztof Apt, Pieter Hartel, and Paul Klint, editors, First International Conference on Declarative Programming Languages in Education. Springer-Verlag, September 1997. <br> <STRONG>Abstract:</STRONG> This paper introduces a problem solving method for teaching functional programming, based on Polya's `How To Solve It', an introductory investigation of mathematical method. We first present the language independent version, and then show in particular how it applies to the development of programs in Haskell. The method is illustrated by a sequence of examples and a larger case study.

;[http://www.cs.ukc.ac.uk/pubs/1995/214/index.html Functional programming through the curriculum]
:By [http://www.cs.ukc.ac.uk/people/staff/sjt/index.html Simon Thompson] and Steve Hill. In Pieter H. Hartel and Rinus Plasmeijer, editors, Functional Programming Languages in Education, LNCS 1022, pages 85-102. Springer-Verlag, December 1995. <br> <STRONG>Abstract:</STRONG> This paper discusses our experience in using a functional language in topics across the computer science curriculum. After examining the arguments for taking a functional approach, we look in detail at four case studies from different areas: programming language semantics, machine architectures, graphics and formal languages.

;[http://www.cse.unsw.edu.au/~chak/papers/CK02a.html The Risks and Benefits of Teaching Purely Functional Programming in First Year]
:By [http://www.cse.unsw.edu.au/~chak Manuel M. T. Chakravarty] and [http://www.cse.unsw.edu.au/~keller Gabriele Keller]. Journal of Functional Programming 14(1), pp 113-123, 2004. An earlier version of this paper was presented at Functional and Declarative Programming in Education (FDPE02). <br> <strong>Abstract</strong> We argue that teaching purely functional programming as such in freshman courses is detrimental to both the curriculum as well as to promoting the paradigm. Instead, we need to focus on the more general aims of teaching elementary techniques of programming and essential concepts of computing. We support this viewpoint with experience gained during several semesters of teaching large first-year classes (up to 600 students) in Haskell. These classes consisted of computer science students as well as students from other disciplines. We have systematically gathered student feedback by conducting surveys after each semester. This article contributes an approach to the use of modern functional languages in first year courses and, based on this, advocates the use of functional languages in this setting.

==Using monads==

See also the [[Monad]] HaskellWiki page.

===Recommended tutorials===

;[http://www.haskell.org/all_about_monads/html/index.html All About Monads]
:By Jeff Newbern. This tutorial aims to explain the concept of a monad and its application to functional programming in a way that is easy to understand and useful to beginning and intermediate Haskell programmers. Familiarity with the Haskell language is assumed, but no prior experience with monads is required.

;[[Monads as computation]]
:A tutorial which gives a broad overview to motivate the use of monads as an abstraction in functional programming and describe their basic features. It makes an attempt at showing why they arise naturally from some basic premises about the design of a library.

;[[Monads as containers]]
:A tutorial describing monads from a rather different perspective: as an abstraction of container-types, rather than an abstraction of types of computation.

;[http://uebb.cs.tu-berlin.de/~magr/pub/Transformers.en.html Monad Transformers Step by Step]
:By Martin Grabmüller. A small tutorial on using monad transformers. In contrast to others found on the web, it concentrates on using them, not on their implementation.

===Parser===

;[http://www.haskell.org/sitewiki/images/c/c6/ICMI45-paper-en.pdf The Parser monad and other monad (i.e. a monad with state and I/O string)].
:The parser monad is used to build modular, flexible, parsers.

;[http://www.haskell.org/sitewiki/images/c/c6/ICMI45-paper-en.pdf How to build a monadic interpreter in one day] (pdf)
:By Dan Popa. A small tutorial on how to build a language in one day, using the Parser Monad in the front end and a monad with state and I/O string in the back end. Read it if you are interested in learning:
:# language construction and
:# interpreter construction

===More tutorials===

;[http://stefan-klinger.de/files/monadGuide.pdf The Haskell Programmer's Guide to the IO Monad - Don't Panic.]
:By Stefan Klinger. This report scratches the surface of category theory, an abstract branch of algebra, just deep enough to find the monad structure. It seems well written.

;[http://www.prairienet.org/~dsb/monads.htm A (hopefully) painless introduction to monads]
:By Dan Bensen. A straightforward beginner's guide with intuitive explanations and examples.

;[http://www-users.mat.uni.torun.pl/~fly/materialy/fp/haskell-doc/Monads.html What the hell are Monads?]
:By Noel Winstanley. A basic introduction to monads, monadic programming and IO. This introduction is presented by means of examples rather than theory, and assumes a little knowledge of Haskell.

;[http://www.engr.mun.ca/~theo/Misc/haskell_and_monads.htm Monads for the Working Haskell Programmer -- a short tutorial]
:By Theodore Norvell.

;[http://sigfpe.blogspot.com/2006/08/you-could-have-invented-monads-and.html You Could Have Invented Monads! (And Maybe You Already Have.)]
:A short tutorial on monads, introduced from a pragmatic approach, with fewer category theory references

;[http://www.cs.chalmers.se/~augustss/AFP/monads.html Systematic Design of Monads]
:By John Hughes and Magnus Carlsson. Many useful monads can be designed in a systematic way, by successively adding facilities to a trivial monad. The capabilities that can be added in this way include state, exceptions, backtracking, and output. Here we give a brief description of the trivial monad, each kind of extension, and sketches of some interesting operations that each monad supports.

;[[Meet Bob The Monadic Lover]]
:By Andrea Rossato. A by-the-author-supposed-to-be funny and short introduction to Monads, with code but without any reference to category theory: what monads look like and what they are useful for, from the perspective of a ... lover. (There is also the slightly more serious [[The Monadic Way]] by the same author.)

;[http://www.haskell.org/pipermail/haskell-cafe/2006-November/019190.html Monstrous Monads]
:Andrew Pimlott's humorous introduction to monads, using the metaphor of "monsters".

;Computational monads [http://programming.reddit.com/info/ox6s/comments/coxiv part 1] and [http://programming.reddit.com/info/ox6s/comments/coxoh part 2].

;[http://www.loria.fr/~kow/monads/index.html Of monads and space suits]
:By Eric Kow.

;[[The Monadic Way]]

;[http://www.alpheccar.org/fr/posts/show/60 Three kind of monads] : sequencing, side effects or containers

;[[Simple monad examples]]

;[http://en.wikipedia.org/wiki/Monads_in_functional_programming Article on monads on Wikipedia]

;[[IO inside]] page
:Explains why I/O in Haskell is implemented with a monad.

;[http://haskell.org/haskellwiki/Blog_articles#Monads Blog articles]

See also [[Research papers/Monads and arrows]]

==Workshops on advanced functional programming==

;[http://compilers.iecc.com/comparch/article/95-04-024 Advanced Functional Programming: 1st International Spring School on Advanced Functional Programming Techniques], Bastad, Sweden, May 24 - 30, 1995. Tutorial Text (Lecture Notes in Computer Science)

;[http://www.cse.ogi.edu/PacSoft/conf/summerschool96.html Advanced Functional Programming: 2nd International School], Olympia, WA, USA, August 26-30, 1996 Tutorial Text (Lecture Notes in Computer Science)

;[http://alfa.di.uminho.pt/~afp98/ Advanced Functional Programming: 3rd International School], AFP'98, Braga, Portugal, September 12-19, 1998, Revised Lectures (Lecture Notes in Computer Science)

;[http://www.cs.uu.nl/~johanj/afp/afp4/ Advanced Functional Programming: 4th International School], AFP 2002, Oxford, UK, August 19-24, 2002, Revised Lectures (Lecture Notes in Computer Science)

;[http://www.cs.ut.ee/afp04/ Advanced Functional Programming: 5th International School], AFP 2004, Tartu, Estonia, August 14-21, 2004, Revised Lectures (Lecture Notes in Computer Science)

More advanced materials available from the [[Conferences|conference proceedings]], and the [[Research papers]] collection.

[[Category:Tutorials]]

Dealing with binary data

2008-01-29T06:31:07Z

AdamLangley:

== Handling Binary Data with Haskell ==

Many programming problems call for the use of binary formats for compactness,
ease-of-use, compatibility or speed. This page quickly covers some common
libraries for handling binary data in Haskell.

=== ByteStrings ===

Everything else in this tutorial will be based on bytestrings. Normal Haskell
<hask>String</hask> types are linked lists of 32-bit characters. This has a
number of useful properties like coverage of the Unicode space and laziness.
However, when it comes to dealing with byte-wise data, <hask>String</hask>
involves a space-inflation of about 24x and a large reduction in speed.

Bytestrings are packed arrays of bytes or 8-bit chars. If you have experience
in C, their memory representation would be the same as a <code>uint8_t[]</code>
- although bytestrings know their length and don't allow overflows etc.

There are two major flavours of bytestrings: strict and lazy. Strict
bytestrings are exactly what you would expect - a linear array of bytes in
memory. Lazy bytestrings are a list of strict bytestrings; often this is called
a cord in other languages. When reading a lazy bytestring from a file, the data
will be read chunk by chunk and the file can be larger than the size of memory.
The default chunk size is currently 32K.

Within each flavour of bytestring comes the Word8 and Char8 versions. These are
mostly an aid to the type system since they are fundamentally the same size of
element. The Word8 unpacks as a list of <hask>Word8</hask> elements (bytes),
the Char8 unpacks as a list of <hask>Char</hask>, which may be useful if you
want to convert them to <hask>String</hask>s.
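
As a small sketch of the difference between the two views, the same strict bytestring can be unpacked through either module. The value <hask>bs</hask> is only an illustration; <hask>pack</hask> and <hask>unpack</hask> are the functions found in both modules.

<haskell>import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

main :: IO ()
main = do
  -- Build a bytestring from a String, one byte per Char.
  let bs = BC.pack "Hi!"
  print (B.unpack bs)   -- the Word8 view: [72,105,33]
  print (BC.unpack bs)  -- the Char view: "Hi!"
  print (B.length bs)   -- 3</haskell>

Both modules work on the same underlying type, so a value packed with one can be unpacked with the other.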

You might want to open the documentation for
[http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString.html strict bytestrings] and
[http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString-Lazy.html lazy bytestrings]
in another tab so that you can follow along.

==== Simple file IO ====

Here's a very simple program which copies a file from standard input to
standard output:

<haskell>module Main where

import qualified Data.ByteString as B

main :: IO ()
main = do
  contents <- B.getContents
  B.putStr contents</haskell>

Note that we are using strict bytestrings here. (It's quite common to import the
<code>ByteString</code> module under the names <code>B</code> or <code>BS</code>.)
Since the bytestrings are strict, the code will read the whole of stdin into
memory and then write it out. If the input is too large, this will exhaust
the available memory and fail.

Let's see the same program using lazy bytestrings. We are just changing the
imported ByteString module to be the lazy one and calling the exact same
functions from the new module:

<haskell>module Main where

import qualified Data.ByteString.Lazy as BL

main :: IO ()
main = do
  contents <- BL.getContents
  BL.putStr contents</haskell>

This code, because of the lazy bytestrings, will cope with any sized input and
will start producing output before all the input has been read. You can think
of the code as setting up a pipeline, rather than executing in-order, as you
might expect. As <hask>putStr</hask> needs more data, it will cause the lazy
bytestring <hask>contents</hask> to read more until the end of the input is
found.
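
The same pipeline idea applies when the data is transformed on the way through. Here is a minimal sketch, assuming ASCII input: <hask>upperAscii</hask> is a made-up helper, while <hask>BL.interact</hask> and <hask>BL.map</hask> come from the lazy bytestring module.

<haskell>module Main where

import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)

-- Upper-case ASCII letters, leaving all other bytes untouched.
upperAscii :: Word8 -> Word8
upperAscii w
  | w >= 97 && w <= 122 = w - 32
  | otherwise           = w

-- Read stdin lazily, transform it chunk by chunk, and write it to stdout.
main :: IO ()
main = BL.interact (BL.map upperAscii)</haskell>

Because the input is consumed lazily, this should also run in roughly constant memory regardless of the input size.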

You should review the [http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString.html documentation]
which lists all the functions which operate on ByteStrings. The documentation
for the various types (lazy Word8, strict Char8, ...) is very similar. You
will generally find the same functions in each, with the same names. Remember to
import the modules as <code>qualified</code> and give them different names.

==== The guts of ByteStrings ====

I'll just mention in passing that sometimes you need to do something which would
endanger the referential transparency of ByteStrings. Generally you only need
to do this when using the FFI to interface with C libraries. Should such a need
arise, you can have a look at the
[http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString-Internal.html internal functions] and the
[http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString-Unsafe.html unsafe functions].
Remember that the last set of functions are called unsafe for a reason - misuse
can crash your program!

=== Binary parsing ===

Once you have your data as a bytestring you'll be wanting to parse something
from it. Here you need to install the
<tt>[http://hackage.haskell.org/cgi-bin/hackage-scripts/package/binary-0.4.1 binary]</tt> package.
Instructions for installing Cabal packages are out of scope for this tutorial,
but should be fairly easy to find.

The <tt>binary</tt> package has three major parts: the <code>Get</code> monad,
the <code>Put</code> monad and a general serialisation for Haskell types. The
last is like the <tt>pickle</tt> module that you may know from Python - it
has its own serialisation format and I won't be covering it any more here.
However, if you just need to persist some Haskell data structures, it might be
exactly what you want: the documentation is
[http://hackage.haskell.org/packages/archive/binary/0.4.1/doc/html/Data-Binary.html here].

==== The <tt>Get</tt> monad ====

The <tt>Get</tt> monad is a state monad; it keeps some state and each action
updates that state. The state in this case is an offset into the bytestring
which is getting parsed. <tt>Get</tt> parses lazy bytestrings; this is how
packages like
[http://hackage.haskell.org/cgi-bin/hackage-scripts/package/tar-0.1.1.1 tar]
can parse files several gigabytes long in constant memory - they are using a
pipeline of lazy bytestrings. However, this also has a downside. When parsing a
lazy bytestring, a parse failure (such as running off the end of the bytestring)
is signified by an exception. Exceptions can only be caught in the IO monad
and, because of laziness, might not be thrown exactly where you expect. If this
is a problem you probably want a strict version of <tt>Get</tt>, which is
covered below.

Here's an example of using the <tt>Get</tt> monad:

<haskell>import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get
import Data.Word

deserialiseHeader :: Get (Word32, Word32, Word32)
deserialiseHeader = do
  alen <- getWord32be
  plen <- getWord32be
  chksum <- getWord32be
  return (alen, plen, chksum)

main :: IO ()
main = do
  input <- BL.getContents
  print $ runGet deserialiseHeader input</haskell>

This code takes three big-endian, 32-bit unsigned numbers from the input string
and returns them as a tuple. Let's try running it:

<pre>% runhaskell /tmp/example.hs << EOF
heredoc> 123412341235
heredoc> EOF
(825373492,825373492,825373493)</pre>
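
To see where these numbers come from, note that the input is the ASCII text <tt>1234</tt>, i.e. the bytes 0x31 0x32 0x33 0x34, read as one big-endian word. The following sketch redoes that arithmetic by hand; <hask>be32</hask> is a made-up helper and is not part of the <tt>binary</tt> package.

<haskell>import Data.Bits (shiftL, (.|.))
import Data.Char (ord)
import Data.Word (Word32)

-- Fold the characters of an ASCII string into a big-endian 32-bit value:
-- each step shifts the accumulator up one byte and adds the next byte.
be32 :: String -> Word32
be32 = foldl (\acc c -> (acc `shiftL` 8) .|. fromIntegral (ord c)) 0

main :: IO ()
main = do
  print (be32 "1234")  -- 825373492
  print (be32 "1235")  -- 825373493</haskell>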

Makes sense, right? Look what happens if the input is too short:

<pre>% runhaskell /tmp/example.hs << EOF
tooshort
EOF
(1953460083,1752134260,example.hs: too few bytes. Failed reading at byte position 12</pre>

Here an exception was thrown because we ran out of bytes.

So the <tt>Get</tt> monad consists of a set of operations like
<hask>getWord32be</hask> which walk over the input and return some type of
data. You can see the full list of those functions in the
[http://hackage.haskell.org/packages/archive/binary/0.4.1/doc/html/Data-Binary-Get.html documentation].

Here's another example; decoding an EOF-terminated list of
numbers just involves recursion:

<haskell>listOfWord16 :: Get [Word16]
listOfWord16 = do
  empty <- isEmpty
  if empty
    then return []
    else do v <- getWord16be
            rest <- listOfWord16
            return (v : rest)</haskell>
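
The same recursion can be written once for any element parser. This is only a sketch: <hask>listOf</hask> is a made-up name, and the input is a small in-memory lazy bytestring rather than a file.

<haskell>import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (Get, getWord16be, isEmpty, runGet)

-- Run the element parser repeatedly until the input is exhausted.
listOf :: Get a -> Get [a]
listOf getElem = do
  done <- isEmpty
  if done
    then return []
    else do v <- getElem
            rest <- listOf getElem
            return (v : rest)

main :: IO ()
main = print (runGet (listOf getWord16be) (BL.pack [0, 1, 0, 2, 1, 0]))
-- prints [1,2,256]</haskell>

Note that the whole list is built before it is returned, so for very large inputs a different approach may be needed.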

==== Strict <tt>Get</tt> monad ====

If you're parsing small messages then, firstly, your input isn't going to be a
lazy bytestring but a strict one. That's not really a problem because you can
easily convert between them. However, if you want to handle parse failures you
either have to write your parser very carefully, or you have to deal with the
fact that you can only catch exceptions in the IO monad.
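
For reference, here is one way the conversion between the two flavours might look. The names <hask>toLazy</hask> and <hask>toStrict</hask> are defined here for illustration; they are built from <hask>BL.fromChunks</hask>, <hask>BL.toChunks</hask> and <hask>B.concat</hask>.

<haskell>import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL

-- A lazy bytestring is a list of strict chunks, so wrap the one chunk.
toLazy :: B.ByteString -> BL.ByteString
toLazy s = BL.fromChunks [s]

-- Going the other way copies all chunks into one contiguous array.
toStrict :: BL.ByteString -> B.ByteString
toStrict = B.concat . BL.toChunks

main :: IO ()
main = do
  let s = B.pack [1, 2, 3]
  print (toStrict (toLazy s) == s)  -- True</haskell>

Converting to a strict bytestring forces the whole input into memory, so it only makes sense for data that is known to be small.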

If this is your dilemma, then you need a strict version of the <tt>Get</tt>
monad. It's almost exactly the same, but a parser of type <hask>Get a</hask>
results in <hask>(Either String a, ByteString)</hask> as the result of
<hask>runGet</hask>. That type is a tuple where the first value is ''either'' a
string (an error string from the parse) or the result, and the second value is
the remaining bytestring when the parser finished.

Let's update the first example with this strict version of <tt>Get</tt>. You'll
have to install the
[http://hackage.haskell.org/cgi-bin/hackage-scripts/package/binary-strict-0.2.2 binary-strict]
package for it to work.

<haskell>import qualified Data.ByteString as B
import Data.Binary.Strict.Get
import Data.Word

deserialiseHeader :: Get (Word32, Word32, Word32)
deserialiseHeader = do
  alen <- getWord32be
  plen <- getWord32be
  chksum <- getWord32be
  return (alen, plen, chksum)

main :: IO ()
main = do
  input <- B.getContents
  print $ runGet deserialiseHeader input</haskell>

Note that all we've done is change from lazy bytestrings to strict bytestrings
and change to importing <tt>Data.Binary.Strict.Get</tt>. Now we'll run
it again:

<pre>% runhaskell /tmp/example.hs << EOF
heredoc> 123412341235
heredoc> EOF
(Right (825373492,825373492,825373493),"\n")</pre>

Now we can see that the parser was successful (we got a <tt>Right</tt>) and we
can see that our shell actually added an extra newline on the input (correctly)
and the parser didn't consume that, so it's also returned to us. Now we try it
with a truncated input:

<pre>% runhaskell /tmp/example.hs << EOF
heredoc> tooshort
heredoc> EOF
(Left "too few bytes","\n")</pre>

This time we didn't get an exception, but a <tt>Left</tt> value, which can be
handled in pure code. The remaining bytestring is the same because our
truncated input is 9 bytes long: parsing the first two <tt>Word32</tt>s
consumed 8 bytes and parsing the third failed, at which point we had the last
byte still in the input.

In your parser, you can also call <hask>fail</hask>, with an error string,
which will result in a <tt>Left</tt> value.
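
Because the result is an ordinary value, it can be inspected with plain pattern matching. The sketch below does not call the parser itself; it only assumes a result of the shape <hask>(Either String a, ByteString)</hask> described above, and <hask>describe</hask> is a made-up helper.

<haskell>import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- Turn a strict-Get style result into a message, in pure code.
describe :: Show a => (Either String a, B.ByteString) -> String
describe (Left err, rest) =
  "parse failed: " ++ err ++ " (" ++ show (B.length rest) ++ " bytes left)"
describe (Right v, rest) =
  "parsed " ++ show v ++ " (" ++ show (B.length rest) ++ " bytes left)"

main :: IO ()
main = do
  -- Hand-written results standing in for the output of runGet.
  putStrLn (describe (Right (1 :: Int), BC.pack "\n"))
  putStrLn (describe (Left "too few bytes" :: Either String Int, BC.pack "\n"))</haskell>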

That's it; it's otherwise the same as the <tt>Get</tt> monad.

====Bit twiddling====

Even with all this monadic goodness, sometimes you just need to move some bits
around. That's perfectly possible in Haskell too. Just import
<tt>Data.Bits</tt> and use the following table.

<table>
<tr><th>Name</th><th>C operator</th><th>Haskell</th></tr>
<tr><td>AND</td><td><tt>&</tt></td><td><hask>.&.</hask></td></tr>
<tr><td>OR</td><td><tt>|</tt></td><td><hask>.|.</hask></td></tr>
<tr><td>XOR</td><td><tt>^</tt></td><td><hask>`xor`</hask></td></tr>
<tr><td>NOT</td><td><tt>~</tt></td><td><hask>complement</hask></td></tr>
<tr><td>Left shift</td><td><tt><<</tt></td><td><hask>`shiftL`</hask></td></tr>
<tr><td>Right shift</td><td><tt>>></tt></td><td><hask>`shiftR`</hask></td></tr>
</table>
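
As a quick illustration of these operators, here is a sketch that packs two bytes into a 16-bit word and pulls a bit field back out. The helper names <hask>packWord16be</hask> and <hask>bitField</hask> are made up for this example.

<haskell>import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word8, Word16)

-- Combine a high and a low byte into a big-endian 16-bit value.
packWord16be :: Word8 -> Word8 -> Word16
packWord16be hi lo = (fromIntegral hi `shiftL` 8) .|. fromIntegral lo

-- Extract `width` bits, starting `offset` bits above the least
-- significant bit.
bitField :: Int -> Int -> Word16 -> Word16
bitField offset width w = (w `shiftR` offset) .&. ((1 `shiftL` width) - 1)

main :: IO ()
main = do
  let w = packWord16be 0x12 0x34
  print w                 -- 4660, i.e. 0x1234
  print (bitField 4 8 w)  -- 35, i.e. 0x23</haskell>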

====The <tt>BitGet</tt> monad====

As an alternative to bit twiddling, you can also use the <tt>BitGet</tt> monad.
This is another state-like monad, like <tt>Get</tt>, but here the state
includes the current bit-offset in the input. This means that you can easily pull out
unaligned data. Sadly, haddock is currently breaking when trying to generate the
documentation for <tt>BitGet</tt> so I'll start with an example. Again, you'll
need the
[http://hackage.haskell.org/cgi-bin/hackage-scripts/package/binary-strict-0.2.2 binary-strict] package installed.

Here's a description of the header of a DNS packet, direct from RFC 1035:

<pre>                                1  1  1  1  1  1
  0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                      ID                       |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|QR|   Opcode  |AA|TC|RD|RA|   Z    |   RCODE   |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    QDCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ANCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    NSCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ARCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+</pre>

The actual fields don't matter, but here's a function for parsing it:

<haskell>-- G and BG are assumed to be qualified imports of the Get and BitGet
-- modules; Header and parseEnum are not shown here.
parseHeader :: G.Get Header
parseHeader = do
  id <- G.getWord16be
  flags <- G.getByteString 2
  qdcount <- G.getWord16be >>= return . fromIntegral
  ancount <- G.getWord16be >>= return . fromIntegral
  nscount <- G.getWord16be >>= return . fromIntegral
  arcount <- G.getWord16be >>= return . fromIntegral

  -- Only the two flag bytes are parsed bit by bit.
  let r = BG.runBitGet flags (do
        isquery <- BG.getBit
        opcode <- BG.getAsWord8 4 >>= parseEnum
        aa <- BG.getBit
        tc <- BG.getBit
        rd <- BG.getBit
        ra <- BG.getBit

        -- Skip the three reserved Z bits.
        BG.getAsWord8 3
        rcode <- BG.getAsWord8 4 >>= parseEnum

        return $ Header id isquery opcode aa tc rd ra rcode qdcount ancount nscount arcount)

  case r of
    Left error -> fail error
    Right x -> return x</haskell>

Here you can see that only the second line (from the ASCII-art diagram) is
parsed using <tt>BitGet</tt>. An outer <tt>Get</tt> monad is used for
everything else and the bit fields are pulled out with
<hask>getByteString</hask>. Again, <tt>BitGet</tt> is a strict monad and
returns an <tt>Either</tt>, but it doesn't return the remaining bytestring,
just because there's no obvious way to represent a bytestring of a fractional
number of bytes.
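
For comparison, the flag line alone could also be decoded with the <tt>Data.Bits</tt> operators. This is only a sketch: the <hask>Flags</hask> record and <hask>decodeFlags</hask> are made up for illustration, and the opcode and rcode are left as raw numbers rather than being converted to enumerations.

<haskell>import Data.Bits (shiftR, testBit, (.&.))
import Data.Word (Word8, Word16)

-- Hypothetical record holding only the flag line of the header.
data Flags = Flags
  { qr :: Bool, opcode :: Word8, aa :: Bool, tc :: Bool
  , rd :: Bool, ra :: Bool, rcode :: Word8 } deriving Show

-- RFC 1035 numbers bits from the most significant end, so bit n in the
-- diagram is bit (15 - n) as seen by Data.Bits.
decodeFlags :: Word16 -> Flags
decodeFlags w = Flags
  { qr     = testBit w 15
  , opcode = fromIntegral ((w `shiftR` 11) .&. 0xF)
  , aa     = testBit w 10
  , tc     = testBit w 9
  , rd     = testBit w 8
  , ra     = testBit w 7
  , rcode  = fromIntegral (w .&. 0xF)
  }

-- 0x8180 has the QR, RD and RA bits set and everything else clear.
main :: IO ()
main = print (decodeFlags 0x8180)</haskell>

The <tt>BitGet</tt> version avoids this manual offset arithmetic, which is where mistakes tend to creep in.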

You can see the list of <tt>BitGet</tt> functions and their comments in the
[http://darcs.imperialviolet.org/darcsweb.cgi?r=binary-strict;a=headblob;f=/src/Data/Binary/Strict/BitGet.hs source code].

===Binary generation===

In contrast to parsing binary data, you might want to generate it. This is the
job of the <tt>Put</tt> monad. Follow along with the
[http://hackage.haskell.org/packages/archive/binary/0.4.1/doc/html/Data-Binary-Put.html documentation]
if you like.

The <tt>Put</tt> monad is another state-like monad, but the state is an offset
into a series of buffers where the generated data is placed. All the buffer
creation and handling is done for you, so you can just forget about it. It
results in a lazy bytestring (so you can generate outputs that are larger than memory).

Here's the reverse of our simple <tt>Get</tt> example:

<haskell>import qualified Data.ByteString.Lazy as BL
import Data.Binary.Put

serialiseSomething :: Put
serialiseSomething = do
  putWord32be 1
  putWord16be 2
  putWord8 3

main :: IO ()
main = BL.putStr $ runPut serialiseSomething</haskell>

And running it shows that it's generating the correct serialisation:

<pre>% runhaskell /tmp/example.hs | hexdump -C
00000000 00 00 00 01 00 02 03 |.......|</pre>

If you want the output of <tt>runPut</tt> to be a strict bytestring, you just
need to convert it with <hask>B.concat $ BL.toChunks $ runPut xyz</hask>.

One limitation of <tt>Put</tt>, due to the nature of the <tt>Builder</tt> monad
which it works with, is that you can't get the current offset into the output.
This can be an issue with some formats which require you to encode byte offsets
into the file. You have to calculate these byte offsets yourself.
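
One way to calculate such a value yourself is to serialise the inner part first, measure it, and then write it out behind its length. This is a sketch under that assumption; <hask>body</hask> and <hask>withLengthPrefix</hask> are made-up names, while <hask>runPut</hask>, <hask>putWord32be</hask> and <hask>putLazyByteString</hask> come from <tt>Data.Binary.Put</tt>.

<haskell>import qualified Data.ByteString.Lazy as BL
import Data.Binary.Put (Put, putLazyByteString, putWord16be, putWord32be, runPut)

-- Hypothetical payload, used only for illustration.
body :: Put
body = do
  putWord16be 2
  putWord16be 3

-- Serialise the payload on its own so that its length is known, then
-- emit a 32-bit big-endian length followed by the payload bytes.
withLengthPrefix :: Put -> Put
withLengthPrefix p = do
  let bytes = runPut p
  putWord32be (fromIntegral (BL.length bytes))
  putLazyByteString bytes

main :: IO ()
main = print (BL.unpack (runPut (withLengthPrefix body)))
-- prints [0,0,0,4,0,2,0,3]</haskell>

Measuring the length forces the whole payload to be generated, so this trades away some of the laziness of <tt>Put</tt> for that part of the output.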

=== Other useful packages ===

There are other packages which you should know about, but which are mostly
covered by their documentation:

* [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/network-bytestring-0.1.1 network-bytestring]: for reading and writing bytestring from the network
* [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/zlib-0.4.0.2 zlib] and [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bzlib-0.4.0.1 bzlib]: for compressed formats
* [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/encoding-0.3 encoding]: for dealing with character encodings
* [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/tar-0.1.1.1 tar]: as an example of lazy parsing/serialisation

Dealing with binary data

2008-01-29T06:21:57Z

AdamLangley:

== Handling Binary Data with Haskell ==

Many programming problems call for the use of binary formats for compactness,
ease-of-use, compatibility or speed. This page quickly covers some common
libraries for handling binary data in Haskell.

=== ByteStrings ===

Everything else in this tutorial will be based on bytestrings. Normal Haskell
<hask>String</hask> types are linked lists of 32-bit characters. This has a
number of useful properties like coverage of the Unicode space and laziness.
However, when it comes to dealing with byte-wise data, <hask>String</hask>
involves a space-inflation of about 24x and a large reduction in speed.

Bytestrings are packed arrays of bytes or 8-bit chars. If you have experience
in C, their memory representation would be the same as a <code>uint8_t[]</code>
- although bytestrings know their length and don't allow overflows etc.

There are two major flavours of bytestrings: strict and lazy. Strict
bytestrings are exactly what you would expect - a linear array of bytes in
memory. Lazy bytestrings are a list of strict bytestrings; often this is called
a cord in other languages. When reading a lazy bytestring from a file, the data
will be read chunk by chunk and the file can be larger than the size of memory.
The default chunk size is currently 32K.

Within each flavour of bytestring comes the Word8 and Char8 versions. These are
mostly an aid to the type system since they are fundamentally the same size of
element. The Word8 unpacks as a list of <hask>Word8</hask> elements (bytes),
the Char8 unpacks as a list of <hask>Char</hask>, which may be useful if you
want to convert them to <hask>String</hask>s.

You might want to open the documentation for
[http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString.html strict bytestrings] and
[http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString-Lazy.html lazy bytestrings]
in another tab so that you can follow along.

==== Simple file IO ====

Here's a very simple program which copies a file from standard input to
standard output:

<haskell>module Main where

import qualified Data.ByteString as B

main :: IO ()
main = do
  contents <- B.getContents
  B.putStr contents</haskell>

Note that we are using strict bytestrings here. (It's quite common to import the
<code>ByteString</code> module under the names <code>B</code> or <code>BS</code>.)
Since the bytestrings are strict, the code will read the whole of stdin into
memory and then write it out. If the input is too large, this will exhaust
the available memory and fail.

Let's see the same program using lazy bytestrings. We are just changing the
imported ByteString module to be the lazy one and calling the exact same
functions from the new module:

<haskell>module Main where

import qualified Data.ByteString.Lazy as BL

main :: IO ()
main = do
  contents <- BL.getContents
  BL.putStr contents</haskell>

This code, because of the lazy bytestrings, will cope with any sized input and
will start producing output before all the input has been read. You can think
of the code as setting up a pipeline, rather than executing in-order, as you
might expect. As <hask>putStr</hask> needs more data, it will cause the lazy
bytestring <hask>contents</hask> to read more until the end of the input is
found.

You should review the [http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString.html documentation]
which lists all the functions which operate on ByteStrings. The documentation
for the various types (lazy Word8, strict Char8, ...) is very similar. You
will generally find the same functions in each, with the same names. Remember to
import the modules as <code>qualified</code> and give them different names.

==== The Guts of ByteStrings ====

I'll just mention in passing that sometimes you need to do something which would
endanger the referential transparency of ByteStrings. Generally you only need
to do this when using the FFI to interface with C libraries. Should such a need
arise, you can have a look at the
[http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString-Internal.html internal functions] and the
[http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString-Unsafe.html unsafe functions].
Remember that the last set of functions are called unsafe for a reason - misuse
can crash your program!

=== Binary parsing ===

Once you have your data as a bytestring you'll be wanting to parse something
from it. Here you need to install the
<tt>[http://hackage.haskell.org/cgi-bin/hackage-scripts/package/binary-0.4.1 binary]</tt> package.
Instructions for installing Cabal packages are out of scope for this tutorial,
but should be fairly easy to find.

The <tt>binary</tt> package has three major parts: the <code>Get</code> monad,
the <code>Put</code> monad and a general serialisation for Haskell types. The
last is like the <tt>pickle</tt> module that you may know from Python - it
has its own serialisation format and I won't be covering it any more here.
However, if you just need to persist some Haskell data structures, it might be
exactly what you want: the documentation is
[http://hackage.haskell.org/packages/archive/binary/0.4.1/doc/html/Data-Binary.html here].

==== The <tt>Get</tt> monad ====

The <tt>Get</tt> monad is a state monad; it keeps some state and each action
updates that state. The state in this case is an offset into the bytestring
which is getting parsed. <tt>Get</tt> parses lazy bytestrings; this is how
packages like
[http://hackage.haskell.org/cgi-bin/hackage-scripts/package/tar-0.1.1.1 tar]
can parse files several gigabytes long in constant memory - they are using a
pipeline of lazy bytestrings. However, this also has a downside. When parsing a
lazy bytestring, a parse failure (such as running off the end of the bytestring)
is signified by an exception. Exceptions can only be caught in the IO monad
and, because of laziness, might not be thrown exactly where you expect. If this
is a problem you probably want a strict version of <tt>Get</tt>, which is
covered below.

Here's an example of using the <tt>Get</tt> monad:

<haskell>import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get
import Data.Word

deserialiseHeader :: Get (Word32, Word32, Word32)
deserialiseHeader = do
  alen <- getWord32be
  plen <- getWord32be
  chksum <- getWord32be
  return (alen, plen, chksum)

main :: IO ()
main = do
  input <- BL.getContents
  print $ runGet deserialiseHeader input</haskell>

This code reads three big-endian, 32-bit unsigned numbers from the input string
and returns them as a tuple. Let's try running it:

<pre>% runhaskell /tmp/example.hs << EOF
heredoc> 123412341235
heredoc> EOF
(825373492,825373492,825373493)</pre>

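Makes sense, right? Each group of four ASCII characters is read as one 32-bit
number, so the bytes of <tt>1234</tt> (0x31 0x32 0x33 0x34) become 825373492.
Look what happens if the input is too short: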

<pre>% runhaskell /tmp/example.hs << EOF
tooshort
EOF
(1953460083,1752134260,example.hs: too few bytes. Failed reading at byte position 12</pre>

Here an exception was thrown because we ran out of bytes.
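If you have to stay with the lazy <tt>Get</tt>, the exception can be caught in IO. The following is only a sketch: it assumes a reasonably recent <tt>base</tt> where <tt>Control.Exception</tt> provides <hask>SomeException</hask>, and the helper name <hask>safeHeader</hask> is made up. Each component of the tuple is forced inside <hask>evaluate</hask> so that a lazily delayed failure cannot escape later.

<haskell>import qualified Data.ByteString.Lazy as BL
import Control.Exception (SomeException, evaluate, try)
import Data.Binary.Get
import Data.Word

deserialiseHeader :: Get (Word32, Word32, Word32)
deserialiseHeader = do
  alen <- getWord32be
  plen <- getWord32be
  chksum <- getWord32be
  return (alen, plen, chksum)

-- | Run the parser in IO and turn any exception into a Left (hypothetical helper).
safeHeader :: BL.ByteString -> IO (Either String (Word32, Word32, Word32))
safeHeader input = do
  r <- try (evaluate (forceAll (runGet deserialiseHeader input)))
  return $ case r of
    Left e -> Left (show (e :: SomeException))
    Right v -> Right v
  where
    -- Force every field, not just the tuple constructor.
    forceAll t@(a, b, c) = a `seq` b `seq` c `seq` t</haskell>

Catching <hask>SomeException</hask> is deliberately broad here; real code would probably want to catch something narrower.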

So the <tt>Get</tt> monad consists of a set of operations like
<hask>getWord32be</hask> which walk over the input and return some type of
data. You can see the full list of those functions in the
[http://hackage.haskell.org/packages/archive/binary/0.4.1/doc/html/Data-Binary-Get.html documentation].

Here's another example: decoding an EOF-terminated list of
numbers just involves recursion:

<haskell>listOfWord16 :: Get [Word16]
listOfWord16 = do
  empty <- isEmpty
  if empty
    then return []
    else do v <- getWord16be
            rest <- listOfWord16
            return (v : rest)</haskell>

==== Strict <tt>Get</tt> monad ====

If you're parsing small messages then, firstly, your input isn't going to be a
lazy bytestring but a strict one. That's not really a problem because you can
easily convert between them. However, if you want to handle parse failures you
either have to write your parser very carefully, or you have to deal with the
fact that you can only catch exceptions in the IO monad.

If this is your dilemma, then you need a strict version of the <tt>Get</tt>
monad. It's almost exactly the same, but a parser of type <hask>Get a</hask>
results in <hask>(Either String a, ByteString)</hask> as the result of
<hask>runGet</hask>. That type is a tuple where the first value is ''either'' a
string (an error string from the parse) or the result, and the second value is
the remaining bytestring when the parser finished.

Let's update the first example with this strict version of <tt>Get</tt>. You'll
have to install the
[http://hackage.haskell.org/cgi-bin/hackage-scripts/package/binary-strict-0.2.2 binary-strict]
package for it to work.

<haskell>import qualified Data.ByteString as B
import Data.Binary.Strict.Get
import Data.Word

deserialiseHeader :: Get (Word32, Word32, Word32)
deserialiseHeader = do
  alen <- getWord32be
  plen <- getWord32be
  chksum <- getWord32be
  return (alen, plen, chksum)

main :: IO ()
main = do
  input <- B.getContents
  print $ runGet deserialiseHeader input</haskell>

Note that all we've done is change from lazy bytestrings to strict bytestrings
and change the import to <tt>Data.Binary.Strict.Get</tt>. Now we'll run
it again:

<pre>% runhaskell /tmp/example.hs << EOF
heredoc> 123412341235
heredoc> EOF
(Right (825373492,825373492,825373493),"\n")</pre>

Now we can see that the parser was successful (we got a <tt>Right</tt>) and we
can see that our shell actually added an extra newline on the input (correctly)
and the parser didn't consume that, so it's also returned to us. Now we try it
with a truncated input:

<pre>% runhaskell /tmp/example.hs << EOF
heredoc> tooshort
heredoc> EOF
(Left "too few bytes","\n")</pre>

This time we didn't get an exception, but a <tt>Left</tt> value, which can be
handled in pure code. The remaining bytestring is the same because our
truncated input is 9 bytes long: parsing the first two <tt>Word32</tt>s
consumed 8 bytes and parsing the third failed - at which point we had the last
byte still in the input.

In your parser, you can also call <hask>fail</hask>, with an error string,
which will result in a <tt>Left</tt> value.
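As a sketch of calling <hask>fail</hask> from a parser, here is a tiny made-up format: a magic byte followed by a 16-bit length. Note the swap: rather than <tt>binary-strict</tt>, this example uses <hask>runGetOrFail</hask>, which later releases of the lazy <tt>binary</tt> package provide and which also reports failure as a pure <tt>Left</tt>. The format and the names <hask>parseRecord</hask> and <hask>parseRecordPure</hask> are assumptions for illustration.

<haskell>import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get
import Data.Word

-- | Hypothetical format: magic byte 0x7f, then a big-endian 16-bit length.
parseRecord :: Get Word16
parseRecord = do
  magic <- getWord8
  if magic /= 0x7f
    then fail "bad magic byte"
    else getWord16be

-- | Run the parser without exceptions, keeping only the message or the value.
parseRecordPure :: BL.ByteString -> Either String Word16
parseRecordPure input =
  case runGetOrFail parseRecord input of
    Left (_, _, err) -> Left err
    Right (_, _, v) -> Right v</haskell>

Both a wrong magic byte and a truncated input end up as a <tt>Left</tt>, so the caller can treat them uniformly in pure code.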

====Bit twiddling====

Even with all this monadic goodness, sometimes you just need to move some bits
around. That's perfectly possible in Haskell too. Just import
<tt>Data.Bits</tt> and use the following table.

<table>
<tr><th>Name</th><th>C operator</th><th>Haskell</th></tr>
<tr><td>AND</td><td><tt>&</tt></td><td><hask>.&.</hask></td></tr>
<tr><td>OR</td><td><tt>|</tt></td><td><hask>.|.</hask></td></tr>
<tr><td>XOR</td><td><tt>^</tt></td><td><hask>`xor`</hask></td></tr>
<tr><td>NOT</td><td><tt>~</tt></td><td><hask>`complement`</hask></td></tr>
<tr><td>Left shift</td><td><tt><<</tt></td><td><hask>`shiftL`</hask></td></tr>
<tr><td>Right shift</td><td><tt>>></tt></td><td><hask>`shiftR`</hask></td></tr>
</table>
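To show the operators from the table in use, here is a small sketch with three helpers. The function names are made up for this example; only the <tt>Data.Bits</tt> operators come from the library.

<haskell>import Data.Bits ((.&.), (.|.), complement, shiftL, shiftR)
import Data.Word (Word8, Word16)

-- | Combine two bytes into a 16-bit word, high byte first.
fromBytes :: Word8 -> Word8 -> Word16
fromBytes hi lo = (fromIntegral hi `shiftL` 8) .|. fromIntegral lo

-- | Extract a field of the given width, starting at the given bit offset
-- counted from the least significant bit.
bitField :: Int -> Int -> Word16 -> Word16
bitField offset width w = (w `shiftR` offset) .&. ((1 `shiftL` width) - 1)

-- | Clear every bit that is set in the mask.
clearBits :: Word16 -> Word16 -> Word16
clearBits mask w = w .&. complement mask</haskell>

For example, <hask>bitField 8 8 (fromBytes 0x12 0x34)</hask> picks out the high byte again.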

====The <tt>BitGet</tt> monad====

As an alternative to bit twiddling, you can also use the <tt>BitGet</tt> monad.
This is another state-like monad, like <tt>Get</tt>, but here the state
includes the current bit offset in the input. This means that you can easily pull out
unaligned data. Sadly, haddock is currently breaking when trying to generate the
documentation for <tt>BitGet</tt> so I'll start with an example.

Here's a description of the header of a DNS packet, direct from RFC 1035:

<pre>                                1  1  1  1  1  1
  0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                      ID                       |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|QR|   Opcode  |AA|TC|RD|RA|   Z    |   RCODE   |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    QDCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ANCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    NSCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ARCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+</pre>

The actual fields don't matter, but here's a function for parsing it:

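<haskell>-- Assumed imports: the original snippet only shows the G and BG qualifiers.
import qualified Data.Binary.Strict.Get as G
import qualified Data.Binary.Strict.BitGet as BG

-- Header and parseEnum are defined elsewhere and not shown here.
parseHeader :: G.Get Header
parseHeader = do
  id <- G.getWord16be
  flags <- G.getByteString 2
  qdcount <- G.getWord16be >>= return . fromIntegral
  ancount <- G.getWord16be >>= return . fromIntegral
  nscount <- G.getWord16be >>= return . fromIntegral
  arcount <- G.getWord16be >>= return . fromIntegral

  -- Pull the individual bit fields out of the two flag bytes.
  let r = BG.runBitGet flags (do
        isquery <- BG.getBit
        opcode <- BG.getAsWord8 4 >>= parseEnum
        aa <- BG.getBit
        tc <- BG.getBit
        rd <- BG.getBit
        ra <- BG.getBit

        -- Skip the three reserved Z bits.
        _ <- BG.getAsWord8 3
        rcode <- BG.getAsWord8 4 >>= parseEnum

        return $ Header id isquery opcode aa tc rd ra rcode qdcount ancount nscount arcount)

  case r of
    Left err -> fail err
    Right x -> return x</haskell>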

Here you can see that only the second line (from the ASCII-art diagram) is
parsed using <tt>BitGet</tt>. An outer <tt>Get</tt> monad is used for
everything else and the bit fields are pulled out with
<hask>getByteString</hask>. Again, <tt>BitGet</tt> is a strict monad and
returns an <tt>Either</tt>, but it doesn't return the remaining bytestring,
just because there's no obvious way to represent a bytestring of a fractional
number of bytes.
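For comparison, the same sixteen flag bits can be decoded with plain <tt>Data.Bits</tt> once they have been read as a <hask>Word16</hask>. This is a sketch only: the <hask>Flags</hask> record and its field names are invented here and are not the <hask>Header</hask> type used above. Note that the RFC diagram numbers bits from the most significant end, while <hask>testBit</hask> counts from the least significant end, so QR is bit 15.

<haskell>import Data.Bits ((.&.), shiftR, testBit)
import Data.Word (Word8, Word16)

-- | The second 16-bit word of the DNS header (illustrative record).
data Flags = Flags
  { qr :: Bool
  , opcode :: Word8
  , aa :: Bool
  , tc :: Bool
  , rd :: Bool
  , ra :: Bool
  , rcode :: Word8
  } deriving (Eq, Show)

-- | Decode the flag word; the three reserved Z bits (6 to 4) are ignored.
decodeFlags :: Word16 -> Flags
decodeFlags w = Flags
  { qr = testBit w 15
  , opcode = fromIntegral ((w `shiftR` 11) .&. 0xf)
  , aa = testBit w 10
  , tc = testBit w 9
  , rd = testBit w 8
  , ra = testBit w 7
  , rcode = fromIntegral (w .&. 0xf)
  }</haskell>

The trade-off is that every offset has to be worked out by hand, which is exactly the bookkeeping <tt>BitGet</tt> does for you.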

You can see the list of <tt>BitGet</tt> functions and their comments in the
[http://darcs.imperialviolet.org/darcsweb.cgi?r=binary-strict;a=headblob;f=/src/Data/Binary/Strict/BitGet.hs source code].

===Binary generation===

In contrast to parsing binary data, you might want to generate it. This is the
job of the <tt>Put</tt> monad. Follow along with the
[http://hackage.haskell.org/packages/archive/binary/0.4.1/doc/html/Data-Binary-Put.html documentation]
if you like.

The <tt>Put</tt> monad is another state-like monad, but the state is an offset
into a series of buffers where the generated data is placed. All the buffer
creation and handling is done for you, so you can just forget about it. It
results in a lazy bytestring (so you can generate outputs larger than memory).

Here's the reverse of our simple <tt>Get</tt> example:

<haskell>import qualified Data.ByteString.Lazy as BL
import Data.Binary.Put

serialiseSomething :: Put
serialiseSomething = do
  putWord32be 1
  putWord16be 2
  putWord8 3

main :: IO ()
main = BL.putStr $ runPut serialiseSomething</haskell>

And running it shows that it's generating the correct serialisation:

<pre>% runhaskell /tmp/example.hs| hexdump -C
00000000 00 00 00 01 00 02 03 |.......|</pre>

If you want the output of <tt>runPut</tt> to be a strict bytestring, you just
need to convert it with <hask>B.concat $ BL.toChunks $ runPut xyz</hask>.

One limitation of <tt>Put</tt>, due to the nature of the <tt>Builder</tt> monoid
which it works with, is that you can't get the current offset into the output.
This can be an issue with some formats which require you to encode byte offsets
into the file. You have to calculate these byte offsets yourself.
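One way to calculate such a value yourself is to serialise the inner part first, measure it, and then write the size in front of it. The sketch below does this for a made-up length-prefixed frame; <hask>putBody</hask> and <hask>putFrame</hask> are hypothetical names and the layout is only an assumption for illustration.

<haskell>import qualified Data.ByteString.Lazy as BL
import Data.Binary.Put

-- | Hypothetical record body.
putBody :: Put
putBody = do
  putWord16be 2
  putWord8 3

-- | A 32-bit big-endian byte count followed by the body itself.
putFrame :: Put
putFrame = do
  let body = runPut putBody
  putWord32be (fromIntegral (BL.length body))
  putLazyByteString body</haskell>

Measuring the body with <hask>BL.length</hask> forces it to be generated before the prefix is written, so this approach suits small to medium records better than huge streams.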

=== Other useful packages ===

There are other packages which you should know about, but which are mostly
covered by their documentation:

* [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/network-bytestring-0.1.1 network-bytestring]: for reading and writing bytestrings over the network
* [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/zlib-0.4.0.2 zlib] and [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bzlib-0.4.0.1 bzlib]: for compressed formats
* [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/encoding-0.3 encoding]: for dealing with character encodings
* [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/tar-0.1.1.1 tar]: as an example of lazy parsing/serialisation

Dealing with binary data

2008-01-29T06:08:42Z

AdamLangley:

== Handling Binary Data with Haskell ==

Many programming problems call for the use of binary formats for compactness,
ease-of-use, compatibility or speed. This page quickly covers some common
libraries for handling binary data in Haskell.

=== ByteStrings ===

Everything else in this tutorial will be based on bytestrings. Normal Haskell
<hask>String</hask> types are linked lists of 32-bit characters. This has a
number of useful properties like coverage of the Unicode space and laziness,
however when it comes to dealing with byte-wise data, <hask>String</hask>
involves a space-inflation of about 24x and a large reduction in speed.

Bytestrings are packed arrays of bytes or 8-bit chars. If you have experience
in C, their memory representation would be the same as a <code>uint8_t[]</code>
- although bytestrings know their length and don't allow overflows etc.

There are two major flavours of bytestrings: strict and lazy. Strict
bytestrings are exactly what you would expect - a linear array of bytes in
memory. Lazy bytestrings are a list of strict bytestrings; often this is called
a cord in other languages. When reading a lazy bytestring from a file, the data
will be read chunk by chunk and the file can be larger than the size of memory.
The default chunk size is currently 32K.

Within each flavour of bytestring comes the Word8 and Char8 versions. These are
mostly an aid to the type system since they are fundamentally the same size of
element. The Word8 unpacks as a list of <hask>Word8</hask> elements (bytes),
the Char8 unpacks as a list of <hask>Char</hask>, which may be useful if you
want to convert them to <hask>String</hask>s.

You might want to open the documentation for
[http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString.html strict bytestrings] and
[http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString-Lazy.html lazy bytestrings]
in another tab so that you can follow along.

==== Simple file IO ====

Here's a very simple program which copies a file from standard input to
standard output:

<haskell>module Main where

import qualified Data.ByteString as B

main :: IO ()
main = do
  contents <- B.getContents
  B.putStr contents</haskell>

Note that we are using strict bytestrings here. (It's quite common to import the
<code>ByteString</code> module under the names <code>B</code> or <code>BS</code>.)
Since the bytestrings are strict the code will read the whole of stdin into
memory and then write it out. If the input was too large this would overflow
the available memory and fail.

Let's see the same program using lazy bytestrings. We are just changing the
imported ByteString module to be the lazy one and calling the exact same
functions from the new module:

<haskell>module Main where

import qualified Data.ByteString.Lazy as BL

main :: IO ()
main = do
  contents <- BL.getContents
  BL.putStr contents</haskell>

This code, because of the lazy bytestrings, will cope with any sized input and
will start producing output before all the input has been read. You can think
of the code as setting up a pipeline, rather than executing in-order, as you
might expect. As <hask>putStr</hask> needs more data, it will cause the lazy
bytestring <hask>contents</hask> to read more until the end of the input is
found.
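The same pipeline idea works for any function that consumes the lazy bytestring chunk by chunk. As a small sketch, here is a newline counter built on <hask>BL.count</hask>; the name <hask>countNewlines</hask> is made up for this example.

<haskell>import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)

-- | Count the newline bytes (ASCII 10) in a lazy bytestring.
countNewlines :: BL.ByteString -> Int64
countNewlines = BL.count 10</haskell>

Hooked up as <hask>BL.getContents >>= print . countNewlines</hask>, this should only need to hold a chunk or so of the input in memory at a time, since nothing keeps a reference to the chunks already counted.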

You should review the [http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString.html documentation]
which lists all the functions which operate on ByteStrings. The documentation
for the various types (lazy Word8, strict Char8, ...) is all very similar. You
generally find the same functions in each, with the same names. Remember to
import the modules as <code>qualified</code> and give them different names.
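As a sketch of why the different qualified names matter, here are two small conversion helpers that use the strict and lazy modules side by side. The helper names are made up for this example, and newer versions of the <tt>bytestring</tt> package may already ship equivalents.

<haskell>import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL

-- | Collapse all the chunks of a lazy bytestring into one strict bytestring.
toStrict :: BL.ByteString -> B.ByteString
toStrict = B.concat . BL.toChunks

-- | Wrap a strict bytestring as a single-chunk lazy bytestring.
toLazy :: B.ByteString -> BL.ByteString
toLazy s = BL.fromChunks [s]</haskell>

Without the qualifiers, names such as <hask>concat</hask> and <hask>pack</hask> would clash with each other and with the Prelude.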

==== The Guts of ByteStrings ====

I'll just mention in passing that sometimes you need to do something which would
endanger the referential transparency of ByteStrings. Generally you only need
to do this when using the FFI to interface with C libraries. Should such a need
arise, you can have a look at the
[http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString-Internal.html internal functions] and the
[http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString-Unsafe.html unsafe functions].
Remember that the last set of functions are called unsafe for a reason - misuse
can crash your program!

=== Binary parsing ===

Once you have your data as a bytestring you'll be wanting to parse something
from it. Here you need to install the
<tt>[http://hackage.haskell.org/cgi-bin/hackage-scripts/package/binary-0.4.1 binary]</tt> package.
Instructions for installing Cabal packages are out of scope for this tutorial,
but should be fairly easy to find.

The <tt>binary</tt> package has three major parts: the <code>Get</code> monad,
the <code>Put</code> monad and a general serialisation for Haskell types. The
last of these is like the <tt>pickle</tt> module that you may know from Python - it
has its own serialisation format and I won't be covering it any more here.
However, if you just need to persist some Haskell data structures, it might be
exactly what you want: the documentation is
[http://hackage.haskell.org/packages/archive/binary/0.4.1/doc/html/Data-Binary.html here].

==== The <tt>Get</tt> monad ====

The <tt>Get</tt> monad is a state monad; it keeps some state and each action
updates that state. The state in this case is an offset into the bytestring
which is getting parsed. <tt>Get</tt> parses lazy bytestrings; this is how
packages like
[http://hackage.haskell.org/cgi-bin/hackage-scripts/package/tar-0.1.1.1 tar]
can parse files several gigabytes long in constant memory - they are using a
pipeline of lazy bytestrings. However, this also has a downside. When parsing a
lazy bytestring, a parse failure (such as running off the end of the bytestring)
is signified by an exception. Exceptions can only be caught in the IO monad
and, because of laziness, might not be thrown exactly where you expect. If this
is a problem you probably want a strict version of <tt>Get</tt>, which is
covered below.

Here's an example of using the <tt>Get</tt> monad:

<haskell>import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get
import Data.Word

deserialiseHeader :: Get (Word32, Word32, Word32)
deserialiseHeader = do
  alen <- getWord32be
  plen <- getWord32be
  chksum <- getWord32be
  return (alen, plen, chksum)

main :: IO ()
main = do
  input <- BL.getContents
  print $ runGet deserialiseHeader input</haskell>

This code reads three big-endian, 32-bit unsigned numbers from the input string
and returns them as a tuple. Let's try running it:

<pre>% runhaskell /tmp/example.hs << EOF
heredoc> 123412341235
heredoc> EOF
(825373492,825373492,825373493)</pre>

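Makes sense, right? Each group of four ASCII characters is read as one 32-bit
number, so the bytes of <tt>1234</tt> (0x31 0x32 0x33 0x34) become 825373492.
Look what happens if the input is too short: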

<pre>% runhaskell /tmp/example.hs << EOF
tooshort
EOF
(1953460083,1752134260,example.hs: too few bytes. Failed reading at byte position 12</pre>

Here an exception was thrown because we ran out of bytes.

So the <tt>Get</tt> monad consists of a set of operations like
<hask>getWord32be</hask> which walk over the input and return some type of
data. You can see the full list of those functions in the
[http://hackage.haskell.org/packages/archive/binary/0.4.1/doc/html/Data-Binary-Get.html documentation].

Here's another example: decoding an EOF-terminated list of
numbers just involves recursion:

<haskell>listOfWord16 :: Get [Word16]
listOfWord16 = do
  empty <- isEmpty
  if empty
    then return []
    else do v <- getWord16be
            rest <- listOfWord16
            return (v : rest)</haskell>

==== Strict <tt>Get</tt> monad ====

If you're parsing small messages then, firstly, your input isn't going to be a
lazy bytestring but a strict one. That's not really a problem because you can
easily convert between them. However, if you want to handle parse failures you
either have to write your parser very carefully, or you have to deal with the
fact that you can only catch exceptions in the IO monad.

If this is your dilemma, then you need a strict version of the <tt>Get</tt>
monad. It's almost exactly the same, but a parser of type <hask>Get a</hask>
results in <hask>(Either String a, ByteString)</hask> as the result of
<hask>runGet</hask>. That type is a tuple where the first value is ''either'' a
string (an error string from the parse) or the result, and the second value is
the remaining bytestring when the parser finished.

Let's update the first example with this strict version of <tt>Get</tt>. You'll
have to install the
[http://hackage.haskell.org/cgi-bin/hackage-scripts/package/binary-strict-0.2.2 binary-strict]
package for it to work.

<haskell>import qualified Data.ByteString as B
import Data.Binary.Strict.Get
import Data.Word

deserialiseHeader :: Get (Word32, Word32, Word32)
deserialiseHeader = do
  alen <- getWord32be
  plen <- getWord32be
  chksum <- getWord32be
  return (alen, plen, chksum)

main :: IO ()
main = do
  input <- B.getContents
  print $ runGet deserialiseHeader input</haskell>

Note that all we've done is change from lazy bytestrings to strict bytestrings
and change the import to <tt>Data.Binary.Strict.Get</tt>. Now we'll run
it again:

<pre>% runhaskell /tmp/example.hs << EOF
heredoc> 123412341235
heredoc> EOF
(Right (825373492,825373492,825373493),"\n")</pre>

Now we can see that the parser was successful (we got a <tt>Right</tt>) and we
can see that our shell actually added an extra newline on the input (correctly)
and the parser didn't consume that, so it's also returned to us. Now we try it
with a truncated input:

<pre>% runhaskell /tmp/example.hs << EOF
heredoc> tooshort
heredoc> EOF
(Left "too few bytes","\n")</pre>

This time we didn't get an exception, but a <tt>Left</tt> value, which can be
handled in pure code. The remaining bytestring is the same because our
truncated input is 9 bytes long: parsing the first two <tt>Word32</tt>s
consumed 8 bytes and parsing the third failed - at which point we had the last
byte still in the input.

In your parser, you can also call <hask>fail</hask>, with an error string,
which will result in a <tt>Left</tt> value.

====Bit twiddling====

Even with all this monadic goodness, sometimes you just need to move some bits
around. That's perfectly possible in Haskell too. Just import
<tt>Data.Bits</tt> and use the following table.

<table>
<tr><th>Name</th><th>C operator</th><th>Haskell</th></tr>
<tr><td>AND</td><td><tt>&</tt></td><td><hask>.&.</hask></td></tr>
<tr><td>OR</td><td><tt>|</tt></td><td><hask>.|.</hask></td></tr>
<tr><td>XOR</td><td><tt>^</tt></td><td><hask>`xor`</hask></td></tr>
<tr><td>NOT</td><td><tt>~</tt></td><td><hask>`complement`</hask></td></tr>
<tr><td>Left shift</td><td><tt><<</tt></td><td><hask>`shiftL`</hask></td></tr>
<tr><td>Right shift</td><td><tt>>></tt></td><td><hask>`shiftR`</hask></td></tr>
</table>

====The <tt>BitGet</tt> monad====

As an alternative to bit twiddling, you can also use the <tt>BitGet</tt> monad.
This is another state-like monad, like <tt>Get</tt>, but here the state
includes the current bit offset in the input. This means that you can easily pull out
unaligned data. Sadly, haddock is currently breaking when trying to generate the
documentation for <tt>BitGet</tt> so I'll start with an example.

Here's a description of the header of a DNS packet, direct from RFC 1035:

<pre>                                1  1  1  1  1  1
  0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                      ID                       |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|QR|   Opcode  |AA|TC|RD|RA|   Z    |   RCODE   |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    QDCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ANCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    NSCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ARCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+</pre>

The actual fields don't matter, but here's a function for parsing it:

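<haskell>-- Assumed imports: the original snippet only shows the G and BG qualifiers.
import qualified Data.Binary.Strict.Get as G
import qualified Data.Binary.Strict.BitGet as BG

-- Header and parseEnum are defined elsewhere and not shown here.
parseHeader :: G.Get Header
parseHeader = do
  id <- G.getWord16be
  flags <- G.getByteString 2
  qdcount <- G.getWord16be >>= return . fromIntegral
  ancount <- G.getWord16be >>= return . fromIntegral
  nscount <- G.getWord16be >>= return . fromIntegral
  arcount <- G.getWord16be >>= return . fromIntegral

  -- Pull the individual bit fields out of the two flag bytes.
  let r = BG.runBitGet flags (do
        isquery <- BG.getBit
        opcode <- BG.getAsWord8 4 >>= parseEnum
        aa <- BG.getBit
        tc <- BG.getBit
        rd <- BG.getBit
        ra <- BG.getBit

        -- Skip the three reserved Z bits.
        _ <- BG.getAsWord8 3
        rcode <- BG.getAsWord8 4 >>= parseEnum

        return $ Header id isquery opcode aa tc rd ra rcode qdcount ancount nscount arcount)

  case r of
    Left err -> fail err
    Right x -> return x</haskell>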

Here you can see that only the second line (from the ASCII-art diagram) is
parsed using <tt>BitGet</tt>. An outer <tt>Get</tt> monad is used for
everything else and the bit fields are pulled out with
<hask>getByteString</hask>. Again, <tt>BitGet</tt> is a strict monad and
returns an <tt>Either</tt>, but it doesn't return the remaining bytestring,
just because there's no obvious way to represent a bytestring of a fractional
number of bytes.

You can see the list of <tt>BitGet</tt> functions and their comments in the
[http://darcs.imperialviolet.org/darcsweb.cgi?r=binary-strict;a=headblob;f=/src/Data/Binary/Strict/BitGet.hs source code].

Dealing with binary data

2008-01-29T05:54:49Z

AdamLangley:

Dealing with binary data

2008-01-29T05:53:54Z

AdamLangley:

Dealing with binary data

2008-01-29T05:52:30Z

AdamLangley:

Dealing with binary data

2008-01-29T03:37:20Z

AdamLangley:

Dealing with binary data

2008-01-29T03:15:03Z

AdamLangley:

Dealing with binary data

2008-01-28T23:08:37Z

AdamLangley:

Dealing with binary data

2008-01-28T20:18:03Z

AdamLangley: Incremental saving: page not ready

User:AdamLangley

2008-01-28T20:17:06Z

AdamLangley:

[[DealingWithBinaryData]]

User:AdamLangley

2008-01-28T20:16:49Z

AdamLangley:

DealingWithBinaryData

User:AdamLangley

2008-01-28T20:16:26Z

AdamLangley:

[DealingWithBinaryData]