Difference between revisions of "Dealing with binary data"
AdamLangley (talk | contribs) |
AdamLangley (talk | contribs) m |
Line 75:
  found.
− You should review the
+ You should review the [http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString.html documentation]
  which lists all the functions which operate on ByteStrings. The documentation
  for the various types (lazy Word8, strict Char8, ...) are all very similar. You
Line 87:
  to do this when using the FFI to interface with C libraries. Should such a need
  arise, you have have a look at the
−
+ [http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString-Internal.html internal functions] and the
−
+ [http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString-Unsafe.html unsafe functions].
  Remember that the last set of functions are called unsafe for a reason - misuse
  can crash you program!.
Line 96:
  Once you have your data as a bytestring you'll be wanting to parse something
  from it. Here you need to install the
− <tt>
+ <tt>[http://hackage.haskell.org/cgi-bin/hackage-scripts/package/binary-0.4.1 binary]</tt> package.
  Instructions for installing Cabal packages are out of scope for this tutorial.
Line 105:
  However, if you just need to persist some Haskell data structures, it might be
  exactly what you want: the documentation is
−
+ [http://hackage.haskell.org/packages/archive/binary/0.4.1/doc/html/Data-Binary.html here]
  ==== The <tt>Get</tt> monad ====
Revision as of 03:15, 29 January 2008
Handling Binary Data with Haskell
Many programming problems call for the use of binary formats for compactness, ease-of-use, compatibility or speed. This page quickly covers some common libraries for handling binary data in Haskell.
ByteStrings
Everything else in this tutorial will be based on bytestrings. Normal Haskell String types are linked lists of 32-bit characters. This has a number of useful properties, like coverage of the Unicode space and laziness. However, when it comes to dealing with byte-wise data, a String involves a space inflation of about 24x and a large reduction in speed.
Bytestrings are packed arrays of bytes or 8-bit chars. If you have experience in C, their memory representation would be the same as a uint8_t[], although bytestrings know their length and don't allow overflows etc.
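As a minimal sketch of the "packed array that knows its length" idea, the following builds a strict bytestring from a list of bytes. The names greeting, greetingLength and byteAt are made up for this illustration; B.pack, B.length and B.index come from Data.ByteString.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- The ASCII bytes for "Hi!" packed into one contiguous strict ByteString.
greeting :: B.ByteString
greeting = B.pack [72, 105, 33]

-- The length is stored alongside the bytes, so no traversal is needed.
greetingLength :: Int
greetingLength = B.length greeting

-- Hypothetical helper: bounds-checked indexing that returns Nothing
-- instead of reading past the end of the array.
byteAt :: B.ByteString -> Int -> Maybe Word8
byteAt bs i
  | i >= 0 && i < B.length bs = Just (B.index bs i)
  | otherwise                 = Nothing
```

Note that B.index on its own raises an error for an out-of-range index rather than reading stray memory, which is the "don't allow overflows" part.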
There are two major flavours of bytestrings: strict and lazy. Strict bytestrings are exactly what you would expect: a linear array of bytes in memory. Lazy bytestrings are a list of strict bytestrings; this structure is often called a cord in other languages. When reading a lazy bytestring from a file, the data will be read chunk by chunk, so the file can be larger than the size of memory. The default chunk size is currently 32K.
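The relationship between the two flavours can be sketched as follows. The names twoChunks, chunkCount and flatten are invented for this example; BL.fromChunks and BL.toChunks are the library functions that convert between a lazy bytestring and its list of strict chunks.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL

-- A lazy bytestring assembled from two strict chunks.
twoChunks :: BL.ByteString
twoChunks = BL.fromChunks [B.pack [1, 2, 3], B.pack [4, 5]]

-- Count the strict chunks that make up a lazy bytestring.
chunkCount :: BL.ByteString -> Int
chunkCount = length . BL.toChunks

-- Collapse a lazy bytestring into a single strict one.
-- This forces the whole content into memory at once.
flatten :: BL.ByteString -> B.ByteString
flatten = B.concat . BL.toChunks
```

Flattening is only sensible when you know the data fits in memory; otherwise it is usually better to stay with the lazy type.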
Within each flavour of bytestring come the Word8 and Char8 versions. These are mostly an aid to the type system, since the elements are fundamentally the same size. The Word8 version unpacks as a list of Word8 elements (bytes), while the Char8 version unpacks as a list of Char, which may be useful if you want to convert them to Strings.
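To make the difference concrete, here is a small sketch that views the same strict bytestring through both interfaces. The names sample, asBytes and asChars are only for illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.Word (Word8)

-- Both modules work on the same ByteString type; only the element
-- type seen by pack and unpack differs.
-- Note: BC.pack truncates any Char above '\255' to 8 bits.
sample :: B.ByteString
sample = BC.pack "abc"

-- Word8 view: the raw byte values.
asBytes :: [Word8]
asBytes = B.unpack sample   -- [97, 98, 99]

-- Char8 view: each byte read as a Char.
asChars :: String
asChars = BC.unpack sample  -- "abc"
```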
You might want to open the documentation for strict bytestrings and lazy bytestrings in another tab so that you can follow along.
Simple file IO
Here's a very simple program which copies a file from standard input to standard output:
module Main where

import qualified Data.ByteString as B

main :: IO ()
main = do
  contents <- B.getContents
  B.putStr contents
Note that we are using strict bytestrings here. (It's quite common to import the ByteString module under the names B or BS.)
Since the bytestrings are strict, the code will read the whole of stdin into memory and then write it out. If the input were too large, this would exhaust the available memory and fail.
Let's see the same program using lazy bytestrings. We are just changing the imported ByteString module to be the lazy one and calling the exact same functions from the new module:
module Main where

import qualified Data.ByteString.Lazy as BL

main :: IO ()
main = do
  contents <- BL.getContents
  BL.putStr contents
This code, because of the lazy bytestrings, will cope with input of any size and will start producing output before all the input has been read. You can think of the code as setting up a pipeline, rather than executing in order as you might expect. As putStr needs more data, it will cause the lazy bytestring contents to read more until the end of the input is found.
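The pipeline idea extends naturally to transforming the data on its way through. The sketch below is one possible variation, not part of the original programs: toUpperByte, upperAscii and runFilter are hypothetical names, while BL.map and BL.interact are functions from Data.ByteString.Lazy.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)

-- Upper-case a single ASCII byte; all other bytes pass through unchanged.
toUpperByte :: Word8 -> Word8
toUpperByte w
  | w >= 97 && w <= 122 = w - 32   -- 'a'..'z' become 'A'..'Z'
  | otherwise           = w

-- A pure stage of the pipeline, applied to the lazy bytestring.
upperAscii :: BL.ByteString -> BL.ByteString
upperAscii = BL.map toUpperByte

-- Read stdin lazily, transform it, and write the result to stdout.
runFilter :: IO ()
runFilter = BL.interact upperAscii
```

Setting main = runFilter gives a filter that, like the copy program above, should not need to hold the whole input in memory.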
You should review the documentation, which lists all the functions which operate on ByteStrings. The documentation for the various types (lazy Word8, strict Char8, ...) is all very similar. You generally find the same functions in each, with the same names. Remember to import the modules as qualified and give them different names.
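As an illustration of why the qualified imports matter, this sketch pulls in all four modules side by side. The aliases BC and BLC and the value names are just one possible choice; the point is that pack and length exist under the same name in each module.

```haskell
import qualified Data.ByteString            as B
import qualified Data.ByteString.Char8      as BC
import qualified Data.ByteString.Lazy       as BL
import qualified Data.ByteString.Lazy.Char8 as BLC

-- Same function name (pack), different module, different result type.
strictHello :: B.ByteString
strictHello = BC.pack "hello"

lazyHello :: BL.ByteString
lazyHello = BLC.pack "hello"

-- B.length returns an Int.
strictLen :: Int
strictLen = B.length strictHello

-- BL.length returns an Int64, so it is converted here.
lazyLen :: Int
lazyLen = fromIntegral (BL.length lazyHello)
```

Without the qualification, names such as length and pack would clash with each other and with the Prelude.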
The Guts of ByteStrings
I'll just mention in passing that sometimes you need to do something which would endanger the referential transparency of ByteStrings. Generally you only need to do this when using the FFI to interface with C libraries. Should such a need arise, you can have a look at the internal functions and the unsafe functions. Remember that the last set of functions are called unsafe for a reason: misuse can crash your program!
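For a flavour of what the unsafe interface looks like, here is a cautious sketch using unsafeUseAsCStringLen from Data.ByteString.Unsafe, which hands a callback a raw pointer to the bytes without copying them. The helper name firstByteViaPtr is made up for this example, and real FFI code would pass the pointer to a C function instead of peeking at it from Haskell.

```haskell
import qualified Data.ByteString as B
import Data.ByteString.Unsafe (unsafeUseAsCStringLen)
import Data.Word (Word8)
import Foreign.Storable (peekByteOff)

-- Hypothetical helper: read the first byte through the raw pointer.
-- The pointer must not be used after the callback returns, and nothing
-- may be written through it, or referential transparency is lost.
firstByteViaPtr :: B.ByteString -> IO (Maybe Word8)
firstByteViaPtr bs =
  unsafeUseAsCStringLen bs $ \(ptr, len) ->
    if len <= 0
      then pure Nothing                  -- never dereference for empty input
      else Just <$> peekByteOff ptr 0    -- read one byte at offset 0
```

The length check is the caller's responsibility here; the library will not stop you from reading past the end.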
Binary parsing
Once you have your data as a bytestring you'll be wanting to parse something from it. Here you need to install the binary package. Instructions for installing Cabal packages are out of scope for this tutorial.
The binary package has three major parts: the Get monad, the Put monad and a general serialisation for Haskell types. The latter is like the pickle module that you may know from Python: it has its own serialisation format and I won't be covering it any more here. However, if you just need to persist some Haskell data structures, it might be exactly what you want: the documentation is at http://hackage.haskell.org/packages/archive/binary/0.4.1/doc/html/Data-Binary.html
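For completeness, a minimal sketch of that general serialisation might look like the following. The Record type and the saveRecord and loadRecord names are invented for this example; encode and decode are the functions exported by Data.Binary, and the instances for Int, String and pairs ship with the package.

```haskell
import Data.Binary (encode, decode)
import qualified Data.ByteString.Lazy as BL

-- Hypothetical record type, chosen only because its parts already
-- have Binary instances.
type Record = (Int, String)

-- Serialise to a lazy bytestring in the package's own format.
saveRecord :: Record -> BL.ByteString
saveRecord = encode

-- Deserialise again. decode raises an error on malformed input,
-- so only feed it data produced by encode for the same type.
loadRecord :: BL.ByteString -> Record
loadRecord = decode
```

A round trip such as loadRecord (saveRecord (7, "seven")) gives back the original pair. The format is specific to the binary package, so it is not suitable for reading formats defined elsewhere.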