Library/AltBinary
From HaskellWiki
Latest revision as of 19:18, 27 December 2006
AltBinary is the part of the Streams library that implements binary I/O and serialization facilities. Here is a list of the features implemented in this lib:
 class-based AltBinary interface plus emulation of the NewBinary library interface
 compatibility with Hugs and GHC, with GHC-specific speed optimizations
 (de)serialization speed of 20-60 MB/sec on a 1 GHz CPU
 free intermixing of text and binary I/O on the same Stream
 support for byte-aligned and bit-aligned, little-endian and big-endian serialization using the same interface
 data files written are CPU-independent, e.g. you can serialize data on a 32-bit little-endian CPU and then read it back on a 64-bit big-endian one
 classical Binary class with "get" and "put_" functions defines the default representation for each type
 get/put_ use fixed-size encoding for Int8..Word64, but variable-length encoding for Int/Word/Integer (including the encoding of array bounds and list lengths)
 any integral value can be saved with an explicitly specified size using the functions "putBits bits bh value" and "getBits bits bh" (their shortcuts putBit/putWord8...putWord64/getBit/... are also supported for all integral types)
 get/put_ use UTF-8 encoding for strings/chars
 Binary class instances (i.e. get/put_ implementations) for Bounded Enum types, Storable types, arrays, maps and sets
 lots of alternative representations for Strings, lists and arrays, including the ability to use a user-supplied function to read/write each list/array element, such as "putMArrayWith (putBits 15) h arr". For example, "putString0With putLatin1Char h s" implements ASCIIZ string encoding
 Template Haskell can be used to automatically derive new Binary instances
 ability to serialize data to any Stream, including Handles, raw files, memory-mapped files, memory and string buffers, pipes, network sockets and so on
 after all, it can work in any monad. The only thing required for it to work is something supporting the ByteStream interface, i.e. implementing the vGetByte and vPutByte operations in some monad
... and I still haven't mentioned some features such as the encode (encode :: a -> String) and decode functions. I think that I implemented in this lib everything that anyone (except for Einar ;)) ever imagined :)
I still can't finish the documentation for AltBinary, partially because it contains so many features!
Below are the results of my 3 attempts to do it. They can't be considered real docs; they are more like sketches and ideas about how the library can be used. Sorry
1 First attempt: ByteStream
The ByteStream interface provides the primitive byte I/O operations vGetByte and vPutByte. All binary I/O and serialization facilities are built on top of them. As a result, you can use these facilities on any ByteStream, ranging from a file to a string buffer.
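The idea that everything rests on a byte-level primitive can be sketched with a toy class. This is not the library's actual ByteStream class: `ByteSink`, `putByte`, `Collect`, `collected` and `putWord16beVia` are hypothetical names invented for this illustration, and only the output side is modelled.

```haskell
import Data.Bits (shiftR)
import Data.Word (Word8)

-- Hypothetical stand-in for a byte-level output interface.
class Monad m => ByteSink m where
  putByte :: Word8 -> m ()

-- A pure "stream" that simply collects the bytes written to it.
newtype Collect a = Collect { runCollect :: (a, [Word8]) }

instance Functor Collect where
  fmap f (Collect (a, w)) = Collect (f a, w)

instance Applicative Collect where
  pure a = Collect (a, [])
  Collect (f, w1) <*> Collect (a, w2) = Collect (f a, w1 ++ w2)

instance Monad Collect where
  Collect (a, w1) >>= k =
    let Collect (b, w2) = k a
    in  Collect (b, w1 ++ w2)

instance ByteSink Collect where
  putByte b = Collect ((), [b])

-- The same interface in IO; here each byte is just printed.
instance ByteSink IO where
  putByte b = putStrLn ("byte: " ++ show b)

-- A 16-bit big-endian write expressed only through putByte,
-- so it works in every monad that has a ByteSink instance.
putWord16beVia :: ByteSink m => Int -> m ()
putWord16beVia x = do
  putByte (fromIntegral (x `shiftR` 8))  -- high byte first
  putByte (fromIntegral x)               -- then the low byte

-- Bytes produced by an action run in the pure Collect monad.
collected :: Collect a -> [Word8]
collected = snd . runCollect
```

In GHCi, `collected (putWord16beVia 0x0102)` evaluates to `[1,2]`, while running `putWord16beVia 0x0102` directly in IO prints the same two bytes.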
Please note that all binary I/O operations come in pairs with names putXXX and getXXX. I will introduce only the "putXXX" operations; just keep in mind that each operation has its "getXXX" twin.
First, the lowest-level layer includes the operations putWord16le, putWord32le, putWord64le and putWord16be, putWord32be, putWord64be. Together with the vPutByte operation they allow you to write little-endian and big-endian values of any size and even mix little-endian and big-endian representations:
    h <- openBinaryFD "test" WriteMode
    vPutByte h (1::Int)
    putWord16le h (2::Int)
    putWord32be h (3::Int)
    putWord64le h (4::Int)
    vClose h
    h <- openBinaryFD "test" ReadMode
    a <- vGetByte h
    b <- getWord16le h
    c <- getWord32be h
    d <- getWord64le h
    vClose h
    print (a::Int, b::Int, c::Int, d::Int)
All these operations work fine not only with Int, but with any integral type (Word, Integer, Int8, Word64 and so on). This allows you to read/write any integral value without explicit type conversion. On the other hand, if you write literal constants using these functions, you will need to specify their type (say, Int) explicitly.
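To make the byte layouts concrete, here is a small pure model of little-endian and big-endian encoding. It is only a sketch using base; `bytesLE`, `bytesBE`, `fromLE`, `fromBE` and `exampleFile` are names made up for this illustration and are not part of the library.

```haskell
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Word (Word8)

-- The low n bytes of a value, least significant byte first.
bytesLE :: Integral a => Int -> a -> [Word8]
bytesLE n x =
  [ fromIntegral (toInteger x `shiftR` (8 * i)) | i <- [0 .. n - 1] ]

-- The same bytes, most significant byte first.
bytesBE :: Integral a => Int -> a -> [Word8]
bytesBE n = reverse . bytesLE n

-- Reassemble an unsigned value from little-endian bytes.
fromLE :: [Word8] -> Integer
fromLE = foldr (\b acc -> acc `shiftL` 8 .|. toInteger b) 0

-- Reassemble an unsigned value from big-endian bytes.
fromBE :: [Word8] -> Integer
fromBE = fromLE . reverse

-- The byte sequence one would expect from the mixed-endian example above:
-- one byte, a 16-bit LE value, a 32-bit BE value and a 64-bit LE value.
exampleFile :: [Word8]
exampleFile = concat
  [ [1]
  , bytesLE 2 (2 :: Int)
  , bytesBE 4 (3 :: Int)
  , bytesLE 8 (4 :: Int)
  ]
```

Under this model `exampleFile` is `[1,2,0,0,0,0,3,4,0,0,0,0,0,0,0]`, i.e. 15 bytes in total.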
2 Second attempt
The AltBinary library offers 4 levels of binary I/O, built on top of each other:
 Byte I/O (vGetByte and vPutByte)
 Integral values I/O (getWordXX and putWordXX)
 Data structures I/O (over 100 operations :) )
 Serialization API (get and put_)
We will study them all sequentially, starting from the lowest level.
2.1 Byte I/O
The lowest level, byte I/O, does not differ significantly from Char I/O. All Streams support the vGetByte and vPutByte operations, either directly or via a buffering transformer. These operations have rather general types:
    vGetByte :: (Stream m h, Enum a) => h -> m a
    vPutByte :: (Stream m h, Enum a) => h -> a -> m ()
This allows you to read/write any integral or enumeration value without additional type conversions (of course, the values must lie in the 0..255 range).
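The effect of the `Enum a` constraint can be sketched in pure code: any Enum value maps to a byte through `fromEnum`, and back through `toEnum`. The helpers `toByte` and `fromByte` are hypothetical, and returning `Nothing` for out-of-range values is a choice of this sketch; the text above does not say how the library reacts to such values.

```haskell
import Data.Word (Word8)

-- Convert any Enum value to a byte, rejecting values outside 0..255.
toByte :: Enum a => a -> Maybe Word8
toByte x
  | n >= 0 && n <= 255 = Just (fromIntegral n)
  | otherwise          = Nothing
  where n = fromEnum x

-- Convert a byte back to an Enum value.
-- Partial: toEnum fails if the target type has no such value.
fromByte :: Enum a => Word8 -> a
fromByte = toEnum . fromIntegral

-- An enumeration used only for this example.
data Color = Red | Green | Blue deriving (Show, Eq, Enum)
```

For instance, `toByte Green` is `Just 1`, `toByte 'A'` is `Just 65`, and `fromByte 2 :: Color` is `Blue`.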
Together with other Stream operations, such as vIsEOF, vTell/vSeek and vGetBuf/vPutBuf, this allows you to write any program that operates on binary data. You can freely mix byte and text I/O on one Stream:
    main = do vPutByte stdout (1::Int)
              vPutStrLn stdout "text"
              vPutBuf stdout buf bufsize
2.2 Integral values / bit sequences I/O
The core of this API is two generalized operations:
    getBits bits h
    putBits bits h value
`getBits` reads a certain number of bits from the given BinaryStream and returns them as a value of any integral type (Int, Word8, Integer and so on). `putBits` writes the given value as a certain number of bits. The `value`, again, may be of any integral type.
These two operations can behave in one of 4 ways, depending on the answers to two questions:
 are integral values written as big-endian or little-endian?
 are the values written bit-aligned or byte-aligned?
The library allows you to select any combination of answers to these questions. The `h` parameter in these operations represents a BinaryStream, and there are 4 ways to open a BinaryStream on top of a plain Stream:
    binaryStream <- openByteAligned stream     -- big-endian
    binaryStream <- openByteAlignedLE stream   -- little-endian
    binaryStream <- openBitAligned stream      -- big-endian
    binaryStream <- openBitAlignedLE stream    -- little-endian
Moreover, to simplify your work, a Stream by itself can also be used as a BinaryStream; in this case the byte-aligned big-endian representation is used. So, you can write, for example:
    putBits 16 stdout (0::Int)
or
    bh <- openByteAlignedLE stdout
    putBits 16 bh (0::Int)
There is also the operation `flushBits h`, which aligns the BinaryStream on a byte boundary. On output it fills the rest of the current byte with zero bits; on input it skips the remaining bits of the current byte. Of course, this operation does nothing on byte-aligned BinaryStreams.
There are also "shortcut" operations that read/write a fixed number of bits:
    getBit h
    getWord8 h
    getWord16 h
    getWord32 h
    getWord64 h
    putBit h value
    putWord8 h value
    putWord16 h value
    putWord32 h value
    putWord64 h value
Although these operations look like mere shortcuts for partial applications of getBits/putBits, they work somewhat faster. In contrast to other binary I/O libraries, each of these operations can accept/return values of any integral type.
You can freely mix text I/O, byte I/O and bit I/O as long as you don't forget to call `flushBits` after bit-aligned chunks of I/O:
    main = do putWord32 stdout (1::Int)              -- byte-aligned big-endian
              stdoutLE <- openByteAlignedLE stdout
              putWord32 stdoutLE (1::Int)            -- byte-aligned little-endian
              putBits 15 stdoutLE (1::Int)           -- byte-aligned little-endian
              stdoutBitsLE <- openBitAlignedLE stdout
              putBit stdoutBitsLE (1::Int)           -- bit-aligned little-endian
              putBits 15 stdoutBitsLE (1::Int)       -- bit-aligned little-endian
              flushBits stdoutBitsLE
              vPutStrLn stdout "text"
              stdoutBits <- openBitAligned stdout
              putBit stdoutBits (1::Int)             -- bit-aligned big-endian
              putBits 15 stdoutBits (1::Int)         -- bit-aligned big-endian
              flushBits stdoutBits
When you request to write, say, 15 bits to a byte-aligned BinaryStream, a whole number of bytes is written. In particular, each `putBit` operation on a byte-aligned BinaryStream writes a whole byte to the stream, while the same operation on a bit-aligned stream fills one bit at a time.
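The difference between the two alignments can be modelled in pure code. The sketch below covers only the big-endian case with the high bits of each byte filled first; `Field`, `fieldBits`, `packBits`, `bitAligned` and `byteAligned` are invented names, and placing the padding of a byte-aligned field in its leading bits is an assumption of this sketch rather than documented library behaviour.

```haskell
import Data.Bits (testBit)
import Data.Word (Word8)

-- A field is a bit width plus a value; only the low `width` bits are used.
type Field = (Int, Integer)

-- The bits of one field, most significant bit first.
fieldBits :: Field -> [Bool]
fieldBits (w, v) = [ testBit v i | i <- [w - 1, w - 2 .. 0] ]

-- Pack bits into bytes, high bit first. The last byte is padded with
-- zero bits, which models what flushBits does on output.
packBits :: [Bool] -> [Word8]
packBits [] = []
packBits bs = toByte (take 8 (chunk ++ repeat False)) : packBits rest
  where
    (chunk, rest) = splitAt 8 bs
    toByte = foldl (\acc b -> acc * 2 + (if b then 1 else 0)) 0

-- Bit-aligned model: fields are packed back to back.
bitAligned :: [Field] -> [Word8]
bitAligned = packBits . concatMap fieldBits

-- Byte-aligned model: every field is widened to a whole number of bytes.
byteAligned :: [Field] -> [Word8]
byteAligned = concatMap widen
  where
    widen (w, v) = packBits (fieldBits (8 * ((w + 7) `div` 8), v `mod` 2 ^ w))
```

With a 1-bit field followed by a 15-bit field, `bitAligned [(1,1),(15,1)]` gives the two bytes `[128,1]`, whereas `byteAligned [(1,1),(15,1)]` gives the three bytes `[1,0,1]`.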
But that is not the whole story yet! There are also operations that allow you to intermix little-endian and big-endian I/O:
    getWord16le h
    getWord32le h
    getWord64le h
    putWord16le h value
    putWord32le h value
    putWord64le h value
    getWord16be h
    getWord32be h
    getWord64be h
    putWord16be h value
    putWord32be h value
    putWord64be h value
For example, you can write:
    main = do putWord32le stdout (1::Int)   -- byte-aligned little-endian
              putWord16be stdout (1::Int)   -- byte-aligned big-endian
Please note that `h` in these operations is a Stream, not a BinaryStream. Actually, these operations just perform a fixed number of vGetByte or vPutByte operations and, strictly speaking, they belong in the previous section.
There are also combinator versions of the `open*` operations that automatically perform `flushBits` at the end:
    withBitAlignedLE stdout $ \h -> do
        putBit h (1::Int)        -- bit-aligned little-endian
        putBits 15 h (1::Int)    -- bit-aligned little-endian
I should also say that you can perform all the Stream operations on any BinaryStream, and bit-aligned streams will flush themselves before performing any I/O or seeking operations. For example:
    h <- openBitAligned stdout
    vPutStr h "text"
    putBit h (1::Int)
    vPutByte h (1::Int)      -- `flushBits` will be automatically
                             -- called before this operation
    putWord16le h (1::Int)   -- little-endian format will be used here despite
                             -- the big-endianness of the BinaryStream itself
2.3 Serialization API
This part is really small! :) There are just two operations:
    get h
    put_ h a
where `h` is a BinaryStream. These operations read and write the binary representation of any value belonging to the class Binary.
3 Third attempt
3.1 Emulation of Binary interface
This library implements 2 interfaces: Binary and AltBinary. The first interface allows you to use this library as a drop-in replacement for the well-known Binary and NewBinary libs. All you need to do is replace the "import Data.Binary" statement with either
    import Data.Binary.ByteAligned
or
    import Data.Binary.BitAligned
depending on what type of access you need. In the first case, the representation of any data value is written/read as a whole number of bytes; in the second case, data values may cross byte boundaries and, for example, Bools are packed 8 values per byte. Please note that despite the interface emulation, this library and the original Binary lib use different representations for most data types.
3.2 AltBinary interface
    let s = encode ("11",123::Int,[1..10::Int])
    print (decode s::(String,Int,[Int]))
3.2.1 Types of binary streams: bit/byte-aligned, little/big-endian
AltBinary is the "native" interface of this library for (de)serializing data. It provides the same operations `get` and `put_` to read/write data, but allows you to use them directly on Handles and any other streams:
    import Data.AltBinary
    h <- openBinaryFile "test" WriteMode
    put_ h [1..100::Int]
    hClose h
    h <- openBinaryFile "test" ReadMode
    x <- get h :: IO [Int]
    print x
If you need bit-aligned serialization, use the `openBitAligned` stream transformer:
    h <- openBinaryFile "test" WriteMode >>= openBitAligned
    put_ h "string"
    put_ h True
    vClose h
Of course, to read this data back you also need to use `openBitAligned`:
    h <- openBinaryFile "test" ReadMode >>= openBitAligned
    x <- get h :: IO String
    y <- get h :: IO Bool
    print (x,y)
The above code writes data in big-endian format. If you need little-endian formats, use the following transformers:
    h <- openBinaryFile "test" WriteMode >>= openByteAlignedLE
and
    h <- openBinaryFile "test" WriteMode >>= openBitAlignedLE
for byte-aligned and bit-aligned access, respectively.
You can also mix binary and text I/O on the same stream, with only one requirement: call "flushBits h" after you have used the stream for some bit-aligned I/O:
    h <- openBinaryFile "test" WriteMode >>= openBitAligned
    put_ h True
    flushBits h
    vPutStr h "string"
    vClose h
It's also possible to use different types of binary streams on top of one Stream:
    h <- openBinaryFile "test" WriteMode
    bh <- openBitAligned h
    put_ bh True
    flushBits bh
    bh <- openByteAlignedLE h
    vPutStr bh "string"
    vClose h
... if you ever need this :)
3.2.2 getBits/putBits; Binary instances for Bool, Maybe, Either
The `get` and `put_` operations are enough if you only need to save some values in a Stream and then restore them. But to assemble/parse data in some particular format, you will need lower-level functions, such as `getBits` and `putBits`, which transfer exactly the specified number of bits:
    putBits 32 h (123::Int)
    x <- getBits 32 h :: IO Int
If you call putBits on a byte-aligned stream with a number of bits that is not divisible by 8, a whole number of bytes is occupied. In particular, putBit on a byte-aligned stream occupies an entire byte.
This makes it possible to use the same (de)serialization code, and in particular the same definitions of Binary instances, for both byte-aligned and bit-aligned streams! For example, the following definition:
    instance Binary Bool where
        put_ h x = putBit h $! (fromEnum x)
        get h = do x <- getBit h; return $! (toEnum x)
encodes Bool values with just one bit in bit-aligned streams, but uses a whole byte in byte-aligned ones. Further, the serialization code for Maybe types uses Bool values:
    instance Binary a => Binary (Maybe a) where
        put_ bh (Just a) = do put_ bh True; put_ bh a
        put_ bh Nothing  = do put_ bh False
        get bh = do flag <- get bh
                    if flag then do a <- get bh; return (Just a)
                            else return Nothing
As a result, the representation of `Maybe a` uses just one more bit than the representation of type `a` in bit-aligned streams, and a whole extra byte otherwise. The same holds for Either types.
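The one-extra-bit claim for the bit-aligned case can be checked with a toy model in which a serialized value is simply a list of bits. The class `ToBits` and its methods `toBits` and `fromBits` are hypothetical and only mimic the shape of the Binary instances shown above; they are not the library's API.

```haskell
import Data.Bits (testBit)
import Data.Word (Word8)

-- Toy model of bit-aligned serialization: a value becomes a list of bits.
class ToBits a where
  toBits   :: a -> [Bool]
  fromBits :: [Bool] -> Maybe (a, [Bool])  -- decoded value plus unread bits

-- A Bool occupies exactly one bit.
instance ToBits Bool where
  toBits b = [b]
  fromBits (b : rest) = Just (b, rest)
  fromBits []         = Nothing

-- A Word8 occupies eight bits, most significant bit first.
instance ToBits Word8 where
  toBits w = [ testBit w i | i <- [7, 6 .. 0] ]
  fromBits bs
    | length chunk == 8 = Just (foldl step 0 chunk, rest)
    | otherwise         = Nothing
    where
      (chunk, rest) = splitAt 8 bs
      step acc b = acc * 2 + (if b then 1 else 0)

-- Maybe a: a one-bit flag, followed by the payload only for Just.
instance ToBits a => ToBits (Maybe a) where
  toBits (Just a) = True : toBits a
  toBits Nothing  = [False]
  fromBits bs = do
    (flag, rest) <- fromBits bs
    if flag
      then do (a, rest') <- fromBits rest
              return (Just a, rest')
      else return (Nothing, rest)
```

Here `length (toBits (Just (5 :: Word8)))` is 9, one bit more than the 8 bits of the payload, and `toBits (Nothing :: Maybe Word8)` is the single bit `[False]`.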
3.2.3 getWord8..putWord64; Binary instances for Int8..Word64
The most widespread uses of getBits/putBits are for 1/8/16/32/64 bits, so there are specialized (and sometimes more efficient) versions of these functions, called putBit, putWord8...putWord64 (and of course their get... counterparts). Please note that all these functions accept arguments (or return values) of any Integral type (i.e. types that are instances of the Integral class: Int, Integer, Word, Int8..Word64), so you don't need to convert types if you want, for example, to encode an Int as an 8-bit value:
    putWord8 h (length "test")
These fixed-bits routines are used in the definitions of Binary instances for types with fixed sizes: Int8...Word64. The types Int, Word and Integer by default use a variable-sized representation, which is described later. If you need to read or write values of these types using a fixed-size representation, use the appropriate fixed-bits procedures instead of get/put_:
    putWord16 h (1::Int)
    putWord32 h (2::Word)
    putWord64 h (3::Integer)
The same rule applies if you need to write a fixed-size value with a non-default number of bits:
    putWord8 h (4::Int32)
The functions putWord16..putWord64 use the big-endian representation, also known as network byte order; it is the byte order used natively on PowerPC/Sparc processors. In this format, the representation of a value starts from its most significant bytes. If you use a bit-aligned stream, the high bits of each byte are also filled first. If you need little-endian (native for Intel processors) formats, putWord16le..putWord64le are at your service.
3.2.4 putBounded; Binary instances for Bounded Enum types
The next pair of functions uses the minimal possible number of bits to encode values in a given range [min..max]:
    putBounded min max h x
    x <- getBounded min max h
They also support values of any Integral type. These functions are used to provide default Binary instances for all Bounded Enum types (i.e. types which support both the Bounded and Enum interfaces). For example, you can declare:
    data Color = Red | Green | Blue deriving (Bounded, Enum)
and now you can use get/put_ on Colors; Color values will be encoded using 2 bits in bit-aligned streams (of course, a whole byte will be used in byte-aligned streams).
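The "2 bits for Color" figure follows from counting how many bits are needed to distinguish all values in the range. The sketch below works this out; `bitsForRange`, `bitsForEnum` and `encodeBounded` are made-up names, and storing the offset `x - min` is an assumption about how such an encoding could work, not a statement about the library's format.

```haskell
-- Smallest number of bits that can distinguish all values in lo..hi.
bitsForRange :: Integer -> Integer -> Int
bitsForRange lo hi = length (takeWhile (< count) (iterate (* 2) 1))
  where count = hi - lo + 1

-- The same count for any Bounded Enum type; the argument is used only
-- to select the type.
bitsForEnum :: (Bounded a, Enum a) => a -> Int
bitsForEnum x =
  bitsForRange (toInteger (fromEnum (minBound `asTypeOf` x)))
               (toInteger (fromEnum (maxBound `asTypeOf` x)))

-- Assumed encoding: the bit width together with the offset from the
-- lower bound.
encodeBounded :: Integer -> Integer -> Integer -> (Int, Integer)
encodeBounded lo hi x = (bitsForRange lo hi, x - lo)

-- The example type from the text (with Show and Eq added for convenience).
data Color = Red | Green | Blue deriving (Show, Eq, Bounded, Enum)
```

`bitsForEnum Red` is 2, since three values do not fit in one bit, and `bitsForRange 1 5` is 3.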
3.2.5 putUnsigned/putInteger/putLength; Binary instances for Int/Integer/Word
putUnsigned provides a variable-sized encoding that can represent any non-negative Integral value using the minimal possible number of bytes. It uses a 7+1 encoding, i.e. 7 bits in each byte hold bits of the actual value, and the highest bit is used to mark the last byte in the sequence. So, values in the range 0..127 are encoded using one byte, values in the range 128..2^14-1 using two bytes, and so on.
putInteger is about the same, but also allows encoding negative values, so -64..63 is encoded with one byte, -2^13..2^13-1 with two bytes...
putLength is a synonym for putUnsigned, used to represent the lengths of various containers: strings, lists, arrays and so on.
put_ uses putInteger to encode Int and Integer, and putUnsigned to encode Word. I don't use a fixed-size representation for Int and Word because that would produce data incompatible between 32-bit and 64-bit platforms. I also don't use GHC's internal representation of Integer to speed up (de)serialization because that would produce data incompatible with other Haskell compilers. But if you need to (de)serialize a large number of Integers quickly, you should use the putGhcInteger/getGhcInteger procedures, described later. Of course, this way your program will become compatible only with the GHC compiler.
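A 7+1 encoding of the kind described above can be sketched as follows. The text fixes neither the order of the 7-bit groups nor the polarity of the marker bit, so this sketch assumes the low group comes first and a set high bit means "more bytes follow". The zig-zag mapping used for signed values is likewise an assumption, chosen because it reproduces the stated one-byte range -64..63; the library may use a different scheme. All names ending in `Sketch`, plus `zigZag` and `unZigZag`, are hypothetical.

```haskell
import Data.Bits (shiftL, shiftR, testBit, (.&.), (.|.))
import Data.Word (Word8)

-- Encode a non-negative value, 7 bits per byte, low group first.
-- The high bit is set on every byte except the last one.
putUnsignedSketch :: Integer -> [Word8]
putUnsignedSketch n
  | n < 0     = error "putUnsignedSketch: negative input"
  | n < 128   = [fromIntegral n]
  | otherwise = fromIntegral (n .&. 127 .|. 128) : putUnsignedSketch (n `shiftR` 7)

-- Decode one value, returning it together with the unread bytes.
getUnsignedSketch :: [Word8] -> Maybe (Integer, [Word8])
getUnsignedSketch [] = Nothing
getUnsignedSketch (b : rest)
  | testBit b 7 = do
      (hi, rest') <- getUnsignedSketch rest
      return (toInteger (b .&. 127) .|. (hi `shiftL` 7), rest')
  | otherwise = Just (toInteger b, rest)

-- Map signed values to non-negative ones: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
zigZag :: Integer -> Integer
zigZag n
  | n >= 0    = 2 * n
  | otherwise = negate (2 * n) - 1

-- Inverse of zigZag.
unZigZag :: Integer -> Integer
unZigZag u
  | even u    = u `div` 2
  | otherwise = negate ((u + 1) `div` 2)

-- Signed variant built from the two pieces above.
putIntegerSketch :: Integer -> [Word8]
putIntegerSketch = putUnsignedSketch . zigZag

getIntegerSketch :: [Word8] -> Maybe (Integer, [Word8])
getIntegerSketch bs = do
  (u, rest) <- getUnsignedSketch bs
  return (unZigZag u, rest)
```

Under these assumptions `putUnsignedSketch 127` is the single byte `[127]`, `putUnsignedSketch 128` is `[128,1]`, and `putIntegerSketch (-64)` still fits in one byte while `putIntegerSketch 64` needs two.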
3.2.6 Binary instances for Char and String (unwritten)
3.2.7 Lists support (unwritten)
3.2.8 Arrays support
This library supports (de)serialization for all array types included in the standard hierarchical libraries, plus PArr arrays, which are supported only by GHC. Immutable array types can be (de)serialized to any Stream (just like lists); mutable arrays can be (de)serialized only in the corresponding monad (where the array can be read/modified), i.e. an IOArray can be get/put only to a Stream belonging to the IO monad, and an STArray can be get/put only to a Stream belonging to the same state monad. All that is done automatically; just use the put_ or get operation on the corresponding array.
If you read an array, you may need (or may not need, depending on the surrounding code) to specify its type explicitly, say:
    arr <- get h :: IO (Array Int Int32)
Besides the automatic support for all array types in the put_/get operations, there is also a huge number of "low-level" array (de)serialization routines. First, there are the routines
    putIArray h arr
    putMArray h arr
which can be used to write to the Stream any array that is an instance of the IArray or MArray class, respectively (the first class contains all immutable arrays: Array, UArray, DiffArray, DiffUArray; the second contains all the other, mutable arrays: IOArray, IOUArray, STArray, STUArray, StorableArray). The corresponding operations to read these arrays require you to explicitly pass the bounds of the array being read:
    arr <- getIArray h bounds
    arr <- getMArray h bounds
Note that these operations are not full analogues of put_/get, which write and read array bounds automatically. These operations are more low-level: they read/write only the array elements. Also note that, just like with the `get` operation, you may need to specify the type of the array being read:
    arr <- getIArray h (0,9) :: IO (Array Int Int32)
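Why the reading side needs the bounds can be seen with plain Data.Array: the elements alone do not determine the array. This is only a pure illustration; `elementsOnly`, `rebuild` and `sample` are hypothetical names and no stream is involved.

```haskell
import Data.Array (Array, bounds, elems, listArray)

-- What a low-level "put" would emit: only the elements, in index order.
elementsOnly :: Array Int e -> [e]
elementsOnly = elems

-- What the matching "get" needs: bounds supplied by the caller, plus the
-- elements read back from the stream.
rebuild :: (Int, Int) -> [e] -> Array Int e
rebuild = listArray

-- A sample array indexed from 5 to 7.
sample :: Array Int Char
sample = listArray (5, 7) "abc"
```

`rebuild (bounds sample) (elementsOnly sample)` reproduces `sample`, while `rebuild (0, 2) (elementsOnly sample)` holds the same elements but is a different array.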
Second, you can read/write array elements with an explicitly specified (de)serialization procedure for the array elements instead of the default one provided by the Binary class. To achieve this, add the `With` suffix to the routine name and specify the procedure to read or write array elements as the first argument:
    putIArrayWith putUnsigned h arr
    putMArrayWith (putBits 15) h arr
    arr <- getIArrayWith getWord8 h bounds
    arr <- getMArrayWith (getBounded 1 5) h bounds
Of course, you can also provide your own read/write procedures, as long as they have the same types as the standard get/put_ functions.
There are also variants of all the get operations that take a `size` parameter instead of `bounds` and create arrays with bounds (0,size-1::Int). Their names have an `N` at the end of the procedure name, but before `With`:
    arr <- getIArrayN h 10 :: IO (Array Int Int32)
    arr <- getMArrayNWith getWord32 h 10 :: IO (IOArray Int Int)
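The combination of an `N` variant with a `With` variant can be sketched over a toy input model: read `n` elements with a caller-supplied element reader and index them from 0 to n-1. `Reader`, `readWord8`, `readWord16be` and `getArrayNWithSketch` are invented for this illustration; the real operations work on Streams inside a monad rather than on byte lists.

```haskell
import Data.Array (Array, bounds, elems, listArray)
import Data.Word (Word8)

-- A toy element reader: consumes bytes and returns a value plus the rest.
type Reader e = [Word8] -> Maybe (e, [Word8])

-- Read one byte as an Int.
readWord8 :: Reader Int
readWord8 (b : rest) = Just (fromIntegral b, rest)
readWord8 []         = Nothing

-- Read two bytes as a big-endian 16-bit value.
readWord16be :: Reader Int
readWord16be (hi : lo : rest) = Just (fromIntegral hi * 256 + fromIntegral lo, rest)
readWord16be _                = Nothing

-- Read n elements (n assumed non-negative) with the given reader and
-- build an array with bounds (0, n-1).
getArrayNWithSketch :: Reader e -> Int -> Reader (Array Int e)
getArrayNWithSketch readElem n = go n []
  where
    go 0 acc rest = Just (listArray (0, n - 1) (reverse acc), rest)
    go k acc rest = do
      (x, rest') <- readElem rest
      go (k - 1) (x : acc) rest'

-- Helper for inspecting a result: bounds, elements and unread input.
describe :: Maybe (Array Int e, [Word8]) -> Maybe ((Int, Int), [e], [Word8])
describe = fmap (\(a, rest) -> (bounds a, elems a, rest))
```

For example, `describe (getArrayNWithSketch readWord16be 2 [0,1,1,0,9])` is `Just ((0,1),[1,256],[9])`, and asking for more elements than the input holds yields `Nothing`.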
These operations in some way duplicate the similar list procedures.
Finally, some of the `get` operations have versions specialized to specific type constructors. For example, `getMArrayN` has `getIOArrayN` and `getIOUArrayN` variants which can read only IOArray/IOUArray, respectively. It's just a trick to avoid the need to specify array types in `get` operations; say, instead of:
    arr <- getIArrayN h 10 :: IO (Array Int Int32)
one can write
    arr <- getArrayN h 10
These are nothing more than handy shortcuts. The only exception is the operations to read `UArray`, which are not specializations of the corresponding `IArray` operations, but use a faster algorithm and work only in the IO monad. If you need to read a `UArray` in any other monad, please use the general `IArray` operations instead (in any case, the compiler will ensure proper use via type checking).
So far I haven't said anything about the specific operations for (de)serialization of parallel arrays (available only in GHC via the module GHC.PArr).