Dealing with binary data
Revision as of 06:08, 29 January 2008
1 Handling Binary Data with Haskell
Many programming problems call for the use of binary formats for compactness, ease-of-use, compatibility or speed. This page quickly covers some common libraries for handling binary data in Haskell.
1.1 Bytestrings
Everything else in this tutorial will be based on bytestrings. Normal Haskell String values are linked lists of characters. This gives them a number of useful properties, like coverage of the Unicode space and laziness. However, when it comes to dealing with byte-wise data, the String type involves a space inflation of about 24x and a large reduction in speed.
Bytestrings are packed arrays of bytes or 8-bit chars. If you have experience in C, their memory representation would be the same as a uint8_t[] - although bytestrings know their length and don't allow overflows etc.
There are two major flavours of bytestrings: strict and lazy. Strict bytestrings are exactly what you would expect - a linear array of bytes in memory. Lazy bytestrings are a list of strict bytestrings; in other languages this is often called a cord. When reading a lazy bytestring from a file, the data will be read chunk by chunk, so the file can be larger than the available memory. The default chunk size is currently 32K.
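The chunk structure of a lazy bytestring can be inspected directly. The following is a minimal sketch, not taken from the original tutorial: the names chunked, chunks and flattened are made up for illustration, and it assumes only the bytestring package.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as BL

-- Build a lazy bytestring out of three strict chunks.
chunked :: BL.ByteString
chunked = BL.fromChunks (map BC.pack ["bin", "ary ", "data"])

-- Recover the underlying list of strict chunks.
chunks :: [B.ByteString]
chunks = BL.toChunks chunked

-- Concatenate the chunks into a single strict bytestring.
-- This forces the whole content into memory at once.
flattened :: B.ByteString
flattened = B.concat chunks
```

Here the chunks are tiny and chosen by hand; when reading from a file or stdin the library picks the chunk boundaries itself.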
Within each flavour of bytestring come the Word8 and Char8 versions. These are mostly an aid to the type system, since they are fundamentally the same size of element. The Word8 version unpacks as a list of Word8 elements (bytes), while the Char8 version unpacks as a list of Char.
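As a small illustration of the two views, the sketch below packs a value through the Char8 interface and unpacks it through both. The names greeting, asBytes and asChars are hypothetical.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.Word (Word8)

-- One strict bytestring; both modules operate on the same type.
greeting :: B.ByteString
greeting = BC.pack "Hi!"

-- The Word8 interface unpacks to raw bytes.
asBytes :: [Word8]
asBytes = B.unpack greeting

-- The Char8 interface unpacks to characters, one per byte.
asChars :: String
asChars = BC.unpack greeting
```

Note that the Char8 interface only handles 8-bit characters, so it is best kept for ASCII-style data.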
1.1.1 Simple file IO
Here's a very simple program which copies a file from standard input to standard output:

    module Main where

    import qualified Data.ByteString as B

    main :: IO ()
    main = do
      contents <- B.getContents
      B.putStr contents
Note that we are using strict bytestrings here. (It's quite common to import the ByteString module under the name B or BS.) Since the bytestrings are strict, the code will read the whole of stdin into memory and then write it out. If the input is too large, this will exhaust the available memory and fail.
Let's see the same program using lazy bytestrings. We are just changing the imported ByteString module to be the lazy one and calling the exact same functions from the new module:
    module Main where

    import qualified Data.ByteString.Lazy as BL

    main :: IO ()
    main = do
      contents <- BL.getContents
      BL.putStr contents
This code, because of the lazy bytestrings, will cope with any sized input and will start producing output before all the input has been read. You can think of the code as setting up a pipeline, rather than executing in order, as you might expect. As putStr demands more data, more of the lazy bytestring is read, until the end of the input is reached.
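The pipeline idea extends to transforming the data on the way through. The following sketch is an illustration rather than part of the original tutorial: toUpperAscii, shout and filterStdin are hypothetical names, and BL.interact is used to connect stdin to stdout.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)

-- Upper-case ASCII letters and leave every other byte alone.
toUpperAscii :: Word8 -> Word8
toUpperAscii w
  | w >= 97 && w <= 122 = w - 32
  | otherwise           = w

-- A pure transformation; on a lazy bytestring it works chunk by chunk.
shout :: BL.ByteString -> BL.ByteString
shout = BL.map toUpperAscii

-- Read stdin lazily, transform it, and write the result to stdout.
filterStdin :: IO ()
filterStdin = BL.interact shout
```

A program would use this with main = filterStdin. Because shout is pure, it can be tested without doing any IO.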
You should review the documentation, which lists all the functions that operate on ByteStrings. The documentation for the various types (lazy Word8, strict Char8, ...) is all very similar. You will generally find the same functions in each, with the same names. Remember to import the modules qualified and give them different names.
1.1.2 The Guts of ByteStrings
I'll just mention in passing that sometimes you need to do something which would endanger the referential transparency of ByteStrings. Generally you only need to do this when using the FFI to interface with C libraries. Should such a need arise, you can have a look at the internal functions and the unsafe functions. Remember that the last set of functions are called unsafe for a reason - misuse can crash your program!
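To give a flavour of the unsafe functions, here is a minimal sketch using unsafeUseAsCStringLen from Data.ByteString.Unsafe, which passes a pointer to the bytestring's own buffer to a callback without copying. The function sumBytes is a hypothetical example; in real code the pointer would usually be handed to a C function instead.

```haskell
import qualified Data.ByteString as B
import Data.ByteString.Unsafe (unsafeUseAsCStringLen)
import Data.Word (Word8)
import Foreign.Storable (peekByteOff)

-- Sum the bytes by reading them through the raw pointer.
-- The callback must neither write through the pointer nor keep it
-- after returning, otherwise referential transparency is broken.
sumBytes :: B.ByteString -> IO Int
sumBytes bs = unsafeUseAsCStringLen bs $ \(ptr, len) -> do
  bytes <- mapM (\i -> peekByteOff ptr i :: IO Word8) [0 .. len - 1]
  return (sum (map fromIntegral bytes))
```

The safe counterpart, useAsCStringLen in Data.ByteString, copies the data first and is the better default when the cost of a copy is acceptable.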
1.2 Binary parsing
Once you have your data as a bytestring you'll be wanting to parse something from it. Here you need to install the binary package. Instructions for installing Cabal packages are out of scope for this tutorial, but should be fairly easy to find.
The binary package has three major parts: the Get monad, the Put monad and a general serialisation for Haskell types. The latter is like the pickle module that you may know from Python - it has its own serialisation format and I won't be covering it any more here. However, if you just need to persist some Haskell data structures, it might be exactly what you want; see the package documentation for details.
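For completeness, a round trip through the general serialisation looks roughly like the sketch below. The type Inventory and the values stock, serialised and restored are hypothetical; encode and decode come from Data.Binary, which already provides instances for lists, tuples, String and Int.

```haskell
import Data.Binary (decode, encode)
import qualified Data.ByteString.Lazy as BL

-- Some ordinary Haskell data to persist.
type Inventory = [(String, Int)]

stock :: Inventory
stock = [("apples", 3), ("pears", 12)]

-- encode produces a lazy bytestring in the package's own format.
serialised :: BL.ByteString
serialised = encode stock

-- decode reads it back; the result type selects the instance.
restored :: Inventory
restored = decode serialised
```

The bytes in serialised are only meant to be read back by decode at the same type, not to match any external file format.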
1.2.1 The Get monad
The Get monad is a state monad; it keeps some state and each action updates that state. The state in this case is an offset into the bytestring which is being parsed. Get parses lazy bytestrings; this is how packages like tar can parse files several gigabytes long in constant memory - they are using a pipeline of lazy bytestrings. However, this also has a downside. When parsing a lazy bytestring, a parse failure (such as running off the end of the bytestring) is signified by an exception. Exceptions can only be caught in the IO monad and, because of laziness, might not be thrown exactly where you expect. If this is a problem, you probably want a strict version of Get, which is covered below.
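The offset that Get carries around can be observed with bytesRead. This is a small sketch to make the state visible; the parser name offsets and the input sample are made up for illustration.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (Get, bytesRead, getWord16be, getWord32be, runGet)
import Data.Int (Int64)

-- Record the offset before and after each read.
offsets :: Get (Int64, Int64, Int64)
offsets = do
  start <- bytesRead
  _ <- getWord32be          -- consumes 4 bytes
  afterWord32 <- bytesRead
  _ <- getWord16be          -- consumes 2 more bytes
  afterWord16 <- bytesRead
  return (start, afterWord32, afterWord16)

-- Six arbitrary input bytes.
sample :: BL.ByteString
sample = BL.pack [0, 1, 2, 3, 4, 5]
```

Evaluating runGet offsets sample gives (0,4,6): each action advanced the offset by the size of the value it read.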
Here's an example of using the Get monad:
    import qualified Data.ByteString.Lazy as BL
    import Data.Binary.Get
    import Data.Word

    deserialiseHeader :: Get (Word32, Word32, Word32)
    deserialiseHeader = do
      alen <- getWord32be
      plen <- getWord32be
      chksum <- getWord32be
      return (alen, plen, chksum)

    main :: IO ()
    main = do
      input <- BL.getContents
      print $ runGet deserialiseHeader input
This code takes three big-endian, 32-bit unsigned numbers from the input and returns them as a tuple. Let's try running it:
    % runhaskell /tmp/example.hs << EOF
    heredoc> 123412341235
    heredoc> EOF
    (825373492,825373492,825373493)
Makes sense, right? Look what happens if the input is too short:
    % runhaskell /tmp/example.hs << EOF
    tooshort
    EOF
    (1953460083,1752134260,example.hs: too few bytes. Failed reading at byte position 12
Here an exception was thrown because we ran out of bytes.
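If you do want to catch such a failure, it has to happen in IO, and the result has to be forced inside the handler so that the exception is not thrown later. The following is a minimal sketch under those assumptions; tryHeader and forceTriple are hypothetical names, and try and evaluate come from Control.Exception.

```haskell
import Control.Exception (SomeException, evaluate, try)
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (Get, getWord32be, runGet)
import Data.Word (Word32)

deserialiseHeader :: Get (Word32, Word32, Word32)
deserialiseHeader = do
  alen <- getWord32be
  plen <- getWord32be
  chksum <- getWord32be
  return (alen, plen, chksum)

-- Run the parser and turn a parse exception into a Left value.
tryHeader :: BL.ByteString -> IO (Either String (Word32, Word32, Word32))
tryHeader input = do
  result <- try (evaluate (forceTriple (runGet deserialiseHeader input)))
  return $ case result of
    Left err    -> Left (show (err :: SomeException))
    Right value -> Right value
  where
    -- Force every component, so a lazily deferred failure surfaces here.
    forceTriple (a, b, c) = a `seq` b `seq` c `seq` (a, b, c)
```

Catching SomeException is deliberately broad for the sake of a short example; real code would normally catch a narrower exception type.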
So the Get monad consists of a set of operations like getWord32be, each of which consumes part of the input and returns some data. You can see the full list of those functions in the documentation.
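The Put monad is the mirror image of Get: its operations append to an output bytestring instead of consuming an input. As a sketch, here is a writer for the same three-field header; serialiseHeader and encoded are hypothetical names, while putWord32be and runPut come from Data.Binary.Put.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (Get, getWord32be, runGet)
import Data.Binary.Put (Put, putWord32be, runPut)
import Data.Word (Word32)

-- Write three big-endian 32-bit numbers.
serialiseHeader :: (Word32, Word32, Word32) -> Put
serialiseHeader (alen, plen, chksum) = do
  putWord32be alen
  putWord32be plen
  putWord32be chksum

-- The matching reader, as in the earlier example.
deserialiseHeader :: Get (Word32, Word32, Word32)
deserialiseHeader = do
  alen <- getWord32be
  plen <- getWord32be
  chksum <- getWord32be
  return (alen, plen, chksum)

-- Twelve bytes: 00 00 00 01, 00 00 00 02, 00 00 00 03.
encoded :: BL.ByteString
encoded = runPut (serialiseHeader (1, 2, 3))
```

Running runGet deserialiseHeader encoded gives back (1,2,3).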
Here's another example; decoding an EOF-terminated list of numbers just involves recursion:

    listOfWord16 :: Get [Word16]
    listOfWord16 = do
      empty <- isEmpty
      if empty
        then return []
        else do
          v <- getWord16be
          rest <- listOfWord16
          return (v : rest)
1.2.2 Strict Get monad
If you're parsing small messages then, firstly, your input isn't going to be a lazy bytestring but a strict one. That's not really a problem because you can easily convert between them. However, if you want to handle parse failures you either have to write your parser very carefully, or you have to deal with the fact that you can only catch exceptions in the IO monad.
If this is your dilemma, then you need a strict version of the Get monad. It's almost exactly the same, but running a parser of type Get a returns a pair of type (Either String a, ByteString). The first value is either a string (an error string from the parse) or the result, and the second value is the remaining bytestring when the parser finished.
Let's update the first example with this strict version of Get. You'll have to install the binary-strict package for it to work.
    import qualified Data.ByteString as B
    import Data.Binary.Strict.Get
    import Data.Word

    deserialiseHeader :: Get (Word32, Word32, Word32)
    deserialiseHeader = do
      alen <- getWord32be
      plen <- getWord32be
      chksum <- getWord32be
      return (alen, plen, chksum)

    main :: IO ()
    main = do
      input <- B.getContents
      print $ runGet deserialiseHeader input
Note that all we've done is change from lazy bytestrings to strict bytestrings and change to importing Data.Binary.Strict.Get. Now we'll run it again:
    % runhaskell /tmp/example.hs << EOF
    heredoc> 123412341235
    heredoc> EOF
    (Right (825373492,825373492,825373493),"\n")
Now we can see that the parser was successful (we got a Right) and we can see that our shell actually added an extra newline on the input (correctly) and the parser didn't consume that, so it's also returned to us. Now we try it with a truncated input:
    % runhaskell /tmp/example.hs << EOF
    heredoc> tooshort
    heredoc> EOF
    (Left "too few bytes","\n")
This time we didn't get an exception, but a Left value, which can be handled in pure code. The remaining bytestring is the same because our truncated input is 9 bytes long: parsing the first two Word32s consumed 8 bytes and parsing the third failed, at which point we had the last byte still in the input. In your parser, you can also call fail with an error message, which will result in a Left value.
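As an aside that goes beyond binary-strict: recent versions of the binary package itself provide runGetOrFail in Data.Binary.Get, which also reports a parse failure as a Left value instead of an exception. The sketch below uses it in place of the strict Get monad; header and safeHeader are hypothetical names, and the input is still a lazy bytestring.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import Data.Binary.Get (Get, getWord32be, runGetOrFail)
import Data.Word (Word32)

header :: Get (Word32, Word32, Word32)
header = do
  alen <- getWord32be
  plen <- getWord32be
  chksum <- getWord32be
  return (alen, plen, chksum)

-- runGetOrFail returns the unconsumed input, the number of bytes
-- consumed, and either an error message or the parsed value.
safeHeader :: BL.ByteString -> Either String (Word32, Word32, Word32)
safeHeader input =
  case runGetOrFail header input of
    Left (_rest, _offset, message) -> Left message
    Right (_rest, _offset, value)  -> Right value

-- The two inputs used in the transcripts above.
goodInput, shortInput :: BL.ByteString
goodInput = BLC.pack "123412341235\n"
shortInput = BLC.pack "tooshort\n"
```

With these inputs, safeHeader goodInput is a Right holding the same three numbers as before, and safeHeader shortInput is a Left.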
1.2.3 Bit twiddling
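Even with all this monadic goodness, sometimes you just need to move some bits around. That's perfectly possible in Haskell too. Just import Data.Bits and use its operations: .&. (bitwise AND), .|. (bitwise OR), xor, complement (bitwise NOT), shiftL (left shift) and shiftR (right shift).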
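A few typical uses of the Data.Bits operations are sketched below. The helper names (fromBytes, nibbleAt, invertLowNibble, clearLowNibble) are made up for this example.

```haskell
import Data.Bits (complement, shiftL, shiftR, xor, (.&.), (.|.))
import Data.Word (Word16, Word8)

-- Combine two bytes into one big-endian 16-bit value.
fromBytes :: Word8 -> Word8 -> Word16
fromBytes hi lo = (fromIntegral hi `shiftL` 8) .|. fromIntegral lo

-- Extract the 4-bit field whose lowest bit is at position off.
nibbleAt :: Int -> Word16 -> Word16
nibbleAt off w = (w `shiftR` off) .&. 0xF

-- Flip the low four bits of a byte.
invertLowNibble :: Word8 -> Word8
invertLowNibble w = w `xor` 0x0F

-- Zero the low four bits of a byte.
clearLowNibble :: Word8 -> Word8
clearLowNibble w = w .&. complement 0x0F
```

For instance, fromBytes 0x12 0x34 is 0x1234, and nibbleAt 4 0x1234 is 3.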
1.2.4 The BitGet monad
As an alternative to bit twiddling, you can also use the BitGet monad. This is another state-like monad, like Get, but here the state includes the current bit offset in the input. This means that you can easily pull out unaligned data. Sadly, haddock is currently breaking when trying to generate the documentation for BitGet, so I'll start with an example.
Here's a description of the header of a DNS packet, direct from RFC 1035:
                                    1  1  1  1  1  1
      0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                      ID                       |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |QR|   Opcode  |AA|TC|RD|RA|   Z    |   RCODE   |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    QDCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    ANCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    NSCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    ARCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
The actual fields don't matter, but here's a function for parsing it:
    parseHeader :: G.Get Header
    parseHeader = do
      id <- G.getWord16be
      flags <- G.getByteString 2
      qdcount <- G.getWord16be >>= return . fromIntegral
      ancount <- G.getWord16be >>= return . fromIntegral
      nscount <- G.getWord16be >>= return . fromIntegral
      arcount <- G.getWord16be >>= return . fromIntegral
      let r = BG.runBitGet flags (do
                isquery <- BG.getBit
                opcode <- BG.getAsWord8 4 >>= parseEnum
                aa <- BG.getBit
                tc <- BG.getBit
                rd <- BG.getBit
                ra <- BG.getBit
                BG.getAsWord8 3
                rcode <- BG.getAsWord8 4 >>= parseEnum
                return $ Header id isquery opcode aa tc rd ra rcode qdcount ancount nscount arcount)
      case r of
        Left error -> fail error
        Right x -> return x
Here you can see that only the second line (from the ASCII-art diagram) is parsed using BitGet. An outer Get monad is used for everything else, and the bit fields are pulled out with getByteString and handed to runBitGet. runBitGet returns an Either, but it doesn't return the remaining bytestring, just because there's no obvious way to represent a bytestring of a fractional number of bytes.
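For comparison, the same flags line can be decoded with plain Data.Bits once it has been read as a Word16. This is a sketch of the bit-twiddling alternative, not of BitGet itself; the Flags record and decodeFlags are hypothetical, and bit 15 is taken to be the leftmost bit (QR) in the diagram.

```haskell
import Data.Bits (shiftR, testBit, (.&.))
import Data.Word (Word16, Word8)

-- The fields of the second line of the DNS header (Z is skipped).
data Flags = Flags
  { qr     :: Bool
  , opcode :: Word8
  , aa     :: Bool
  , tc     :: Bool
  , rd     :: Bool
  , ra     :: Bool
  , rcode  :: Word8
  } deriving (Eq, Show)

-- Pick each field out of the big-endian 16-bit flags word.
decodeFlags :: Word16 -> Flags
decodeFlags w = Flags
  { qr     = testBit w 15
  , opcode = fromIntegral ((w `shiftR` 11) .&. 0xF)
  , aa     = testBit w 10
  , tc     = testBit w 9
  , rd     = testBit w 8
  , ra     = testBit w 7
  , rcode  = fromIntegral (w .&. 0xF)
  }
```

The BitGet version reads the fields in order and never mentions a bit position, which is what makes it easier to check against the diagram.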
You can see the list of BitGet functions and their comments in the source code.