Difference between revisions of "The Monad.Reader/Issue2/Bzlib2Binding"
m |
|||
(6 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
⚫ | |||
− | '''This article needs reformatting! Please help tidy it up.'''--[[User:WouterSwierstra|WouterSwierstra]] 14:14, 9 May 2008 (UTC) |
||
⚫ | |||
⚫ | |||
⚫ | |||
− | [[BR]] |
||
− | ''[[Date]]'' |
||
− | |||
− | '''Abstract.''' |
||
The article presents a Haskell FFI binding for the C library Bzlib2. |
The article presents a Haskell FFI binding for the C library Bzlib2. |
||
This article also exists as a [http://www.student.dtu.dk/~s022018/Bzlib2Hs/tmrartikel.ps PostScript file]. |
This article also exists as a [http://www.student.dtu.dk/~s022018/Bzlib2Hs/tmrartikel.ps PostScript file]. |
||
− | == |
+ | ==Introduction== |
This text is about the Haskell Foreign Function Interface (FFI) which |
This text is about the Haskell Foreign Function Interface (FFI) which |
||
− | is part of the standard library. |
+ | is part of the standard library. It enables Haskell functions to call |
procedures in other languages and other languages to call Haskell |
procedures in other languages and other languages to call Haskell |
||
− | functions. |
+ | functions. This text is for those who like me have found it difficult |
to cross the gap in the FFI documentation between trivial examples and |
to cross the gap in the FFI documentation between trivial examples and |
||
reference texts. |
reference texts. |
||
Line 20: | Line 15: | ||
Having a basic understanding of the FFI is really worthwhile, because it |
Having a basic understanding of the FFI is really worthwhile, because it |
||
adds a new dimension to the language toolbox and gives access to a vast |
adds a new dimension to the language toolbox and gives access to a vast |
||
− | amount of non-Haskell libraries in Haskell programs. |
+ | amount of non-Haskell libraries in Haskell programs. Many libraries |
have already had more or less complete Haskell bindings through the FFI |
have already had more or less complete Haskell bindings through the FFI |
||
− | including [ |
+ | including [[WxHaskell]], |
− | [ |
+ | [[Gtk2Hs]], |
− | [http://www.haskell.org/HOpenGL/ HOpenGL], SDL and PCRE. |
+ | [http://www.haskell.org/HOpenGL/ HOpenGL], [[SDL]] and [[PCRE]]. |
This text is nowhere near a complete introduction for the FFI, in fact |
This text is nowhere near a complete introduction for the FFI, in fact |
||
it is rather limited, but it's intended to answering some of the basic |
it is rather limited, but it's intended to answering some of the basic |
||
question I myself had, when I first tried to understand the |
question I myself had, when I first tried to understand the |
||
− | functioning of the FFI. |
+ | functioning of the FFI. I'll describe a simple binding I made for |
the Bzip2 library for data (de)compression, and I encourage the |
the Bzip2 library for data (de)compression, and I encourage the |
||
− | reader to download the binding and study and play around with it. |
+ | reader to download the binding and study and play around with it. Of |
course patches are welcome. |
course patches are welcome. |
||
We will end up with an interface consisting of the two useful |
We will end up with an interface consisting of the two useful |
||
procedures: |
procedures: |
||
− | + | <haskell> |
|
readBzip2 :: Handle -> IO String |
readBzip2 :: Handle -> IO String |
||
writeBzip2 :: Handle -> String -> IO () |
writeBzip2 :: Handle -> String -> IO () |
||
+ | </haskell> |
||
− | }}} |
||
− | == |
+ | ==Prerequisites== |
You definitely want to know where to find documentation on the FFI, so |
You definitely want to know where to find documentation on the FFI, so |
||
here is a small list of useful references: |
here is a small list of useful references: |
||
⚫ | |||
− | |||
⚫ | |||
⚫ | |||
⚫ | |||
⚫ | |||
⚫ | |||
− | * FfiCookbook |
||
− | * FfiTutorial |
||
To play around with the library you need to have the binding package, |
To play around with the library you need to have the binding package, |
||
− | libbzip2 (from the Bzip2 package) version 1.0 and GHC 6.4. |
+ | libbzip2 (from the Bzip2 package) version 1.0 and GHC 6.4. I'm sure |
other versions of GHC will work, but I Haven't tested it. |
other versions of GHC will work, but I Haven't tested it. |
||
Line 58: | Line 50: | ||
library. |
library. |
||
− | == |
+ | ==Get on with it== |
− | The binding is in a Darcs repository, which can be fetched and |
+ | The binding is in a [[Darcs]] repository, which can be fetched and |
build like this: |
build like this: |
||
+ | <tt><pre> |
||
− | {{{ |
||
− | + | $ darcs get "http://www.student.dtu.dk/~s022018/Bzlib2Hs" |
|
− | + | $ cd Bzlib2Hs/ |
|
− | + | $ ghc -c Bzip2.lhs |
|
− | + | $ ghc --make Main.hs -lbz2 |
|
− | + | $ ./a.out |
|
+ | </pre></tt> |
||
− | }}} |
||
− | If you want to subsequently try it out interactively in GHCi say: |
+ | If you want to subsequently try it out interactively in GHCi say: |
+ | <tt><pre> |
||
− | {{{ |
||
− | + | $ ghci Bzip2.lhs -lbz2 |
|
+ | </pre></tt> |
||
− | }}} |
||
⚫ | |||
− | |||
⚫ | |||
libbzip2 is the library written by Julian R Seward used in bzip2(1) |
libbzip2 is the library written by Julian R Seward used in bzip2(1) |
||
− | for compressing and decompressing files in the bzip2 data format. |
+ | for compressing and decompressing files in the bzip2 data format. The |
interface file is named bzlib.h and can be found somewhere in your |
interface file is named bzlib.h and can be found somewhere in your |
||
system usually in /usr/include/ or /usr/local/include. On GoboLinux |
system usually in /usr/include/ or /usr/local/include. On GoboLinux |
||
− | it's /System/Links/Headers/bzlib.h. |
+ | it's /System/Links/Headers/bzlib.h. It provides many of procedures of |
− | which we will only need use the high level ones. |
+ | which we will only need use the high level ones. They a still a bit |
complicated, so in the following they are briefly described. |
complicated, so in the following they are briefly described. |
||
⚫ | |||
− | |||
⚫ | |||
The high level procedures is comprised of the following signatures as |
The high level procedures is comprised of the following signatures as |
||
seen in the header file: |
seen in the header file: |
||
− | + | <haskell>#!syntax c |
|
typedef void BZFILE; |
typedef void BZFILE; |
||
− | BZFILE *BZ2_bzReadOpen ( int *bzerror, FILE *f, |
+ | BZFILE *BZ2_bzReadOpen ( int *bzerror, FILE *f, |
− | + | int small, int verbosity, |
|
− | + | void *unused, int nUnused ); |
|
int BZ2_bzRead ( int *bzerror, BZFILE *b, void *buf, int len ); |
int BZ2_bzRead ( int *bzerror, BZFILE *b, void *buf, int len ); |
||
Line 99: | Line 89: | ||
void BZ2_bzReadClose ( int *bzerror, BZFILE *b ); |
void BZ2_bzReadClose ( int *bzerror, BZFILE *b ); |
||
− | BZFILE *BZ2_bzWriteOpen ( int *bzerror, FILE *f, |
+ | BZFILE *BZ2_bzWriteOpen ( int *bzerror, FILE *f, |
− | + | int blockSize100k, int verbosity, |
|
− | + | int workFactor ); |
|
void BZ2_bzWrite ( int *bzerror, BZFILE *b, void *buf, int len ); |
void BZ2_bzWrite ( int *bzerror, BZFILE *b, void *buf, int len ); |
||
void BZ2_bzWriteClose ( int *bzerror, BZFILE* f, |
void BZ2_bzWriteClose ( int *bzerror, BZFILE* f, |
||
− | + | int abandon, |
|
− | + | unsigned int* nbytes_in, |
|
− | + | unsigned int* nbytes_out ); |
|
+ | </haskell> |
||
− | }}} |
||
− | The names of the procedures are pretty descriptive. |
+ | The names of the procedures are pretty descriptive. The open |
procedures opens a stream `BZFILE` of data from the file handle |
procedures opens a stream `BZFILE` of data from the file handle |
||
`f`, which refers to a file already opened for reading or writing |
`f`, which refers to a file already opened for reading or writing |
||
respectively. The arguments `small`, `verbosity`, |
respectively. The arguments `small`, `verbosity`, |
||
`unused` and `nUnused`, |
`unused` and `nUnused`, |
||
− | `blockSize100k`, `verbosity` and `workFactor` |
+ | `blockSize100k`, `verbosity` and `workFactor` |
are for our simple binding not so not so important, and will get |
are for our simple binding not so not so important, and will get |
||
reasonable default values of 0, 0, `NULL`, 0, 9, 0 and 30 |
reasonable default values of 0, 0, `NULL`, 0, 9, 0 and 30 |
||
− | respectively. |
+ | respectively. Refer to the Bzip2 documentation for explanations of the values. |
The read procedure reads at most `len` uncompressed bytes into the |
The read procedure reads at most `len` uncompressed bytes into the |
||
buffer `buf` from the stream `BZFILE` of compressed bytes and |
buffer `buf` from the stream `BZFILE` of compressed bytes and |
||
− | returns the number of bytes read. |
+ | returns the number of bytes read. If the logical end-of-stream marker |
was read, `bzerror` is set to `BZ_STREAM_END`. |
was read, `bzerror` is set to `BZ_STREAM_END`. |
||
Line 135: | Line 125: | ||
stream `f`, and then frees all memory created when opening and |
stream `f`, and then frees all memory created when opening and |
||
writing to |
writing to |
||
− | `f`. |
+ | `f`. It's used to clean up after all error situations when writing. The arguments |
`abandon`, `nbytes_in` and `nbytes_out` are also not relevant to us, and will just get reasonable default values of 0, `NULL` and `NULL`. |
`abandon`, `nbytes_in` and `nbytes_out` are also not relevant to us, and will just get reasonable default values of 0, `NULL` and `NULL`. |
||
Line 141: | Line 131: | ||
accordingly. |
accordingly. |
||
⚫ | |||
− | |||
⚫ | |||
Now where we are a bit familiar with the Bzip2 library procedures, we |
Now where we are a bit familiar with the Bzip2 library procedures, we |
||
− | can introduce some concepts of the FFI. |
+ | can introduce some concepts of the FFI. Making the binding will |
commence in three steps: |
commence in three steps: |
||
− | #Find appropriate Haskell types for the C types. |
+ | # Find appropriate Haskell types for the C types. |
− | #Write an foreign import declaration for each procedure. |
+ | # Write an foreign import declaration for each procedure. |
− | #Write wrapper procedures to get a nice abstraction and a clean interface. |
+ | # Write wrapper procedures to get a nice abstraction and a clean interface. |
These steps are explained in the following. |
These steps are explained in the following. |
||
− | === |
+ | ===Corresponding Haskell and C types=== |
To be able to bridge the gap |
To be able to bridge the gap |
||
between C world and Haskell world, the FFI has defined Haskell types |
between C world and Haskell world, the FFI has defined Haskell types |
||
which corresponds to the normal C data types like integers, reals, |
which corresponds to the normal C data types like integers, reals, |
||
− | arrays and so on. |
+ | arrays and so on. The whole lot is listed in the [http://www.haskell.org/ghc/docs/6.4/html/libraries/base/Foreign.C.html library documentation]. For convenience the types needed for this |
binding is listed in the following table: |
binding is listed in the following table: |
||
− | || C type || Haskell type || |
+ | || C type || Haskell type || |
− | || FILE |
+ | || FILE || CFile || |
− | || int |
+ | || int || CInt || |
− | || unzigned int || CUInt || |
+ | || unzigned int || CUInt || |
− | || char* |
+ | || char* || CString || |
− | || ''a''* |
+ | || ''a''* || Ptr ''a'' || |
− | || void |
+ | || void || () || |
On the right is the Haskell type which corresponds to the C type on |
On the right is the Haskell type which corresponds to the C type on |
||
− | the left. |
+ | the left. The integer types are straight forward. The type |
`CFile` is an opaque data constructor and can thus only be |
`CFile` is an opaque data constructor and can thus only be |
||
− | constructed in a C program an then passed to Haskell. |
+ | constructed in a C program an then passed to Haskell. The |
`CString` is actually just an alias for `Ptr CChar` and the |
`CString` is actually just an alias for `Ptr CChar` and the |
||
− | void type is represented by the Haskell type `()`. |
+ | void type is represented by the Haskell type `()`. As an example |
the C type for a pointer to a value of any type, `void*` can be |
the C type for a pointer to a value of any type, `void*` can be |
||
represented in Haskell as `Ptr ()`. |
represented in Haskell as `Ptr ()`. |
||
⚫ | |||
− | |||
⚫ | |||
As an exercise before proceeding go back and try to write Haskell type |
As an exercise before proceeding go back and try to write Haskell type |
||
signatures corresponding to the C procedure signatures given above. |
signatures corresponding to the C procedure signatures given above. |
||
Generally the foreign import declaration looks like: |
Generally the foreign import declaration looks like: |
||
− | + | <haskell> |
|
− | foreign import ccall "[ |
+ | foreign import ccall "[header name.h] C_name" Haskell_name :: |
− | + | Haskell_names_argument_types -> IO return_type |
|
+ | </haskell> |
||
− | }}} |
||
It consists of the `foreign import` keywords followed by a |
It consists of the `foreign import` keywords followed by a |
||
keyword, `ccall`, telling the compiler to use the C calling |
keyword, `ccall`, telling the compiler to use the C calling |
||
− | convention on the procedure. |
+ | convention on the procedure. Then comes a string with an optional |
name of a header file containing the signature of the procedure and the |
name of a header file containing the signature of the procedure and the |
||
− | procedure name. |
+ | procedure name. Then follows the Haskell function name, we want to |
give the procedure and then it's Haskell type which in our case |
give the procedure and then it's Haskell type which in our case |
||
returns something wrapped in the ''IO'' Monad, since the procedures have |
returns something wrapped in the ''IO'' Monad, since the procedures have |
||
side effects. |
side effects. |
||
− | == |
+ | ==The Haskell binding== |
The road is now paved for understanding the binding which is situated |
The road is now paved for understanding the binding which is situated |
||
− | in `Bzlib2.lhs` in the Darcs repository. |
+ | in `Bzlib2.lhs` in the Darcs repository. Let's go over the main |
− | parts one at a time. |
+ | parts one at a time. Some new things will be introduced along the way |
− | and explained. |
+ | and explained. The first thing in the file is the import |
− | declarations. |
+ | declarations. |
− | + | <haskell> |
|
{-# OPTIONS -fffi #-} |
{-# OPTIONS -fffi #-} |
||
Line 214: | Line 202: | ||
import Foreign.C.String (CString, newCString, peekCStringLen) |
import Foreign.C.String (CString, newCString, peekCStringLen) |
||
import Foreign.Marshal.Alloc (malloc, mallocBytes) |
import Foreign.Marshal.Alloc (malloc, mallocBytes) |
||
+ | </haskell> |
||
− | }}} |
||
Since the library has to be passed a handle to an already open file, |
Since the library has to be passed a handle to an already open file, |
||
the problem quickly arises as who to obtain a `Ptr CFile`, when |
the problem quickly arises as who to obtain a `Ptr CFile`, when |
||
− | we cannot construct a `CFile` from Haskell. |
+ | we cannot construct a `CFile` from Haskell. The solution lies in |
a little trick with which we get a obtain a such through these steps: |
a little trick with which we get a obtain a such through these steps: |
||
− | `Handle` -> `Fd` -> |
+ | `Handle` -> `Fd` -> |
`Ptr CFile`: |
`Ptr CFile`: |
||
− | + | <haskell> |
|
− | -- FILE *fdopen(int fildes, const char *mode); |
+ | -- FILE *fdopen(int fildes, const char *mode); |
foreign import ccall "" fdopen :: Fd -> CString -> IO (Ptr CFile) |
foreign import ccall "" fdopen :: Fd -> CString -> IO (Ptr CFile) |
||
handleToCFile :: Handle -> String -> IO (Ptr CFile) |
handleToCFile :: Handle -> String -> IO (Ptr CFile) |
||
handleToCFile h m = |
handleToCFile h m = |
||
− | + | do iomode <- newCString m |
|
− | + | fd <- handleToFd h |
|
− | + | fdopen fd iomode |
|
+ | </haskell> |
||
− | }}} |
||
Here we wrap the standard C procedure `fdopen` in a function, |
Here we wrap the standard C procedure `fdopen` in a function, |
||
− | `handleToCFile`, which takes a handle to an already open file |
+ | `handleToCFile`, which takes a handle to an already open file |
(for instance obtained with `openFile` from `System.IO`) |
(for instance obtained with `openFile` from `System.IO`) |
||
together with a string which can be ``r'' for read mode or ``w'' for |
together with a string which can be ``r'' for read mode or ``w'' for |
||
− | write mode and is passed to `fdopen`. |
+ | write mode and is passed to `fdopen`. Since the latter takes a |
− | `CString` and not a normal Haskell string we need to convert it |
+ | `CString` and not a normal Haskell string we need to convert it |
with the function `newCString`. |
with the function `newCString`. |
||
⚫ | |||
− | |||
⚫ | |||
Here are the foreign import declarations for all the C procedures |
Here are the foreign import declarations for all the C procedures |
||
− | introduced in the beginning. |
+ | introduced in the beginning. |
− | + | <haskell> |
|
− | bz_OK = 0 |
+ | bz_OK = 0 -- #define BZ_OK 0 |
− | bz_STREAM_END = 4 |
+ | bz_STREAM_END = 4 -- #define BZ_STREAM_END 4 |
− | type BZFile = () |
+ | type BZFile = () -- typedef void BZFILE; |
− | foreign import ccall "bzip.h BZ2_bzReadOpen" bzReadOpen :: |
+ | foreign import ccall "bzip.h BZ2_bzReadOpen" bzReadOpen :: |
− | + | Ptr CInt -> Ptr CFile -> CInt -> CInt -> Ptr () -> CInt -> IO (Ptr BZFile) |
|
− | foreign import ccall "bzip.h BZ2_bzRead" bzRead :: |
+ | foreign import ccall "bzip.h BZ2_bzRead" bzRead :: |
− | + | Ptr CInt -> Ptr BZFile -> Ptr () -> CInt -> IO CInt |
|
− | foreign import ccall "bzip.h BZ2_bzReadClose" bzReadClose :: |
+ | foreign import ccall "bzip.h BZ2_bzReadClose" bzReadClose :: |
− | + | Ptr CInt -> Ptr BZFile -> IO () |
|
+ | </haskell> |
||
− | }}} |
||
− | This should be straight forward by now. |
+ | This should be straight forward by now. The `bzip.h` defines |
some constants, which we will use, so they are declared |
some constants, which we will use, so they are declared |
||
− | correspondingly in Haskell. |
+ | correspondingly in Haskell. Notice that the type alias for |
`BZFile` and the corresponding C definition in the comment. |
`BZFile` and the corresponding C definition in the comment. |
||
⚫ | |||
− | |||
⚫ | |||
Now we can wrap those procedures in a higher abstraction closer to the |
Now we can wrap those procedures in a higher abstraction closer to the |
||
− | way a Haskell programmer would think. |
+ | way a Haskell programmer would think. `readBzip2` takes a handle ''h'' |
− | of a file, which is already opened for reading, and returns its decompressed |
+ | of a file, which is already opened for reading, and returns its decompressed |
content as a string. |
content as a string. |
||
− | + | <haskell> |
|
readBzip2 :: Handle -> IO String |
readBzip2 :: Handle -> IO String |
||
− | readBzip2 h = |
+ | readBzip2 h = |
− | + | let len = 1024 |
|
− | + | verbosity = 0 |
|
− | + | small = 0 |
|
− | + | unused = nullPtr |
|
− | + | nUnused = 0 |
|
+ | in |
||
− | in |
||
− | + | do bzerror <- malloc |
|
− | + | buf <- mallocBytes len -- How big should this be to be efficient? |
|
− | + | cfile <- handleToCFile h "r" |
|
− | + | bzfile <- bzReadOpen bzerror cfile verbosity small unused nUnused |
|
− | + | s <- readIter bzerror bzfile buf len |
|
− | + | bzReadClose bzerror bzfile |
|
− | + | return s |
|
− | |||
readIter :: Ptr CInt -> Ptr BZFile -> Ptr () -> Int -> IO String |
readIter :: Ptr CInt -> Ptr BZFile -> Ptr () -> Int -> IO String |
||
readIter bzerror bzfile buf len = |
readIter bzerror bzfile buf len = |
||
− | + | do nbytes <- bzRead bzerror bzfile buf (fromIntegral len) |
|
− | + | s <- peekCStringLen (castPtr buf, fromIntegral nbytes) |
|
− | + | e <- peek bzerror |
|
− | + | if e==bz_STREAM_END |
|
− | + | then return s |
|
− | + | else do s' <- readIter bzerror bzfile buf len |
|
− | + | return (s++s') |
|
+ | </haskell> |
||
− | }}} |
||
The first procedure defines the default constants, then allocates some |
The first procedure defines the default constants, then allocates some |
||
− | variables and opens the handle for reading compressed content. |
+ | variables and opens the handle for reading compressed content. Then |
it calls the second procedure which loops reading bytes into the |
it calls the second procedure which loops reading bytes into the |
||
− | supplied buffer and returning it in a Haskell string. |
+ | supplied buffer and returning it in a Haskell string. In the end the |
handle is closed and the accumulated string is returned. |
handle is closed and the accumulated string is returned. |
||
− | The value `nullPtr` is the name of `NULL` in C. |
+ | The value `nullPtr` is the name of `NULL` in C. The |
polymorphic function `malloc` returns a pointer to a freshly |
polymorphic function `malloc` returns a pointer to a freshly |
||
allocated value, and since `bzerror` is a `Ptr CInt`, |
allocated value, and since `bzerror` is a `Ptr CInt`, |
||
− | `malloc` will allocate an integer. |
+ | `malloc` will allocate an integer. `mallocBytes` allocates |
− | an array of bytes and returns a pointer to the first byte. |
+ | an array of bytes and returns a pointer to the first byte. We can |
convert between most integral Haskell and C types using |
convert between most integral Haskell and C types using |
||
− | `fromIntegral` as seen in `readIter`. |
+ | `fromIntegral` as seen in `readIter`. |
− | Two new procedures are `peekCStringLen` and `peek`. |
+ | Two new procedures are `peekCStringLen` and `peek`. The |
latter returns the value pointed to by it's pointer argument; it's |
latter returns the value pointed to by it's pointer argument; it's |
||
− | dereferencing the pointer. |
+ | dereferencing the pointer. Like `malloc` it's polymorphic a thus |
− | derives it's correct type and behavior from the context. |
+ | derives it's correct type and behavior from the context. The former |
− | is more complicated. |
+ | is more complicated. Here's it's type: |
− | + | <haskell> |
|
peekCStringLen :: (Ptr CChar, Int) -> IO String |
peekCStringLen :: (Ptr CChar, Int) -> IO String |
||
+ | </haskell> |
||
− | }}} |
||
It takes a pair of a character array and a length ''n'' and delivers the |
It takes a pair of a character array and a length ''n'' and delivers the |
||
− | first ''n'' bytes from the array as a Haskell string. |
+ | first ''n'' bytes from the array as a Haskell string. Because of the |
type conflict between `buf`, which has type `Ptr ()` and the |
type conflict between `buf`, which has type `Ptr ()` and the |
||
`Ptr CChar` in `peekCStringLen`, we need to explicitly cast |
`Ptr CChar` in `peekCStringLen`, we need to explicitly cast |
||
Line 327: | Line 312: | ||
I rather sure that the entire file will be read in before the string |
I rather sure that the entire file will be read in before the string |
||
− | is returned, which obviously is a place for improvement. |
+ | is returned, which obviously is a place for improvement. Can you see |
how that can be done? |
how that can be done? |
||
⚫ | |||
− | |||
⚫ | |||
− | |||
⚫ | |||
⚫ | |||
should not cause surprises any more: |
should not cause surprises any more: |
||
− | + | <haskell> |
|
− | foreign import ccall "bzip.h BZ2_bzWriteOpen" bzWriteOpen :: |
+ | foreign import ccall "bzip.h BZ2_bzWriteOpen" bzWriteOpen :: |
− | + | Ptr CInt -> Ptr CFile -> CInt -> CInt -> CInt -> IO (Ptr BZFile) |
|
− | foreign import ccall "bzip.h BZ2_bzWrite" bzWrite :: |
+ | foreign import ccall "bzip.h BZ2_bzWrite" bzWrite :: |
− | + | Ptr CInt -> Ptr BZFile -> Ptr () -> CInt -> IO () |
|
− | foreign import ccall "bzip.h BZ2_bzWriteClose" bzWriteClose :: |
+ | foreign import ccall "bzip.h BZ2_bzWriteClose" bzWriteClose :: |
− | + | Ptr CInt -> Ptr BZFile -> CInt -> Ptr CUInt -> Ptr CUInt -> IO () |
|
+ | </haskell> |
||
− | }}} |
||
The write procedure takes a handle to a file already opened for |
The write procedure takes a handle to a file already opened for |
||
writing and a string and compresses the string into the file. |
writing and a string and compresses the string into the file. |
||
− | + | <haskell> |
|
writeBzip2 :: Handle -> String -> IO () |
writeBzip2 :: Handle -> String -> IO () |
||
writeBzip2 h s = |
writeBzip2 h s = |
||
− | + | let len = length s |
|
− | + | blockSize100k = 9 |
|
− | + | verbosity = 0 |
|
− | + | workFactor = 30 |
|
− | + | abandon = 0 |
|
− | + | nbytes_in = nullPtr |
|
− | + | nbytes_out = nullPtr |
|
− | + | in |
|
− | + | do bzerror <- malloc |
|
− | + | cs <- newCString s |
|
− | + | cfile <- handleToCFile h "w" |
|
− | + | bzfile <- bzWriteOpen bzerror cfile blockSize100k verbosity workFactor |
|
− | + | bzWrite bzerror bzfile (castPtr cs) (fromIntegral len) |
|
− | + | bzWriteClose bzerror bzfile abandon nbytes_in nbytes_out |
|
+ | </haskell> |
||
− | }}} |
||
− | Most of the things in here were explained in the read section. |
+ | Most of the things in here were explained in the read section. The |
procedure `newCString s` just takes a Haskell string and returns |
procedure `newCString s` just takes a Haskell string and returns |
||
a pointer to a freshly allocated character array initialised with the |
a pointer to a freshly allocated character array initialised with the |
||
− | string. |
+ | string. |
The reflecting reader might already have realized that the input |
The reflecting reader might already have realized that the input |
||
string is allocated in a buffer in one big chunk, which could be very |
string is allocated in a buffer in one big chunk, which could be very |
||
− | big and in fact is as big as the file itself. |
+ | big and in fact is as big as the file itself. Also neither of the |
procedures handles error conditions reported through the variable |
procedures handles error conditions reported through the variable |
||
`bzerror`. |
`bzerror`. |
||
⚫ | |||
− | |||
− | |||
⚫ | |||
We've written a wrapper for the C library Bzlib2 with a very simple |
We've written a wrapper for the C library Bzlib2 with a very simple |
||
− | and useful interface consisting of the two procedures |
+ | and useful interface consisting of the two procedures |
− | + | <haskell> |
|
− | readBzip2 :: Handle -> IO String |
+ | readBzip2 :: Handle -> IO String |
writeBzip2 :: Handle -> String -> IO () |
writeBzip2 :: Handle -> String -> IO () |
||
+ | </haskell> |
||
− | }}} |
||
Although simple the binding has some severe shortcomings, |
Although simple the binding has some severe shortcomings, |
||
namely lack of error checking and lack of laziness in the strings |
namely lack of error checking and lack of laziness in the strings |
||
− | passed to and from the procedures. |
+ | passed to and from the procedures. Error checking was avoided to keep |
the implementation uncluttered and could be implemented by after each |
the implementation uncluttered and could be implemented by after each |
||
call which can raise an error to observe the value of `bzerror` |
call which can raise an error to observe the value of `bzerror` |
||
− | with `e <- peek bzerror` and acting appropriately. |
+ | with `e <- peek bzerror` and acting appropriately. The laziness |
− | problem is harder; when should we call |
+ | problem is harder; when should we call |
− | `bzReadClose bzerror bzfile` if we only read in half of the file |
+ | `bzReadClose bzerror bzfile` if we only read in half of the file |
− | and moving on? |
+ | and moving on? In the case of writing it's a tad easier. Write a |
loop which lazily consumes the input string and writes a small chunk |
loop which lazily consumes the input string and writes a small chunk |
||
to the files at the time. |
to the files at the time. |
||
Patches and comments are most welcome. |
Patches and comments are most welcome. |
||
+ | |||
− | ---- |
||
+ | [[Category: Article]] |
||
− | CategoryArticle |
Latest revision as of 06:03, 1 June 2022
A binding for Bzlib2
- by Peter Eriksen for The Monad.Reader, Issue Two
The article presents a Haskell FFI binding for the C library Bzlib2. This article also exists as a PostScript file.
Introduction
This text is about the Haskell Foreign Function Interface (FFI) which is part of the standard library. It enables Haskell functions to call procedures in other languages and other languages to call Haskell functions. This text is for those who like me have found it difficult to cross the gap in the FFI documentation between trivial examples and reference texts.
Having a basic understanding of the FFI is really worthwhile, because it adds a new dimension to the language toolbox and gives access to a vast amount of non-Haskell libraries in Haskell programs. Many libraries have already had more or less complete Haskell bindings through the FFI including WxHaskell, Gtk2Hs, HOpenGL, SDL and PCRE.
This text is nowhere near a complete introduction for the FFI, in fact it is rather limited, but it's intended to answering some of the basic question I myself had, when I first tried to understand the functioning of the FFI. I'll describe a simple binding I made for the Bzip2 library for data (de)compression, and I encourage the reader to download the binding and study and play around with it. Of course patches are welcome.
We will end up with an interface consisting of the two useful procedures:
readBzip2 :: Handle -> IO String
writeBzip2 :: Handle -> String -> IO ()
Prerequisites
You definitely want to know where to find documentation on the FFI, so here is a small list of useful references:
- The Haskell 98 Foreign Function Interface 1.0: An Addendum to the Haskell 98 Report
- Haskell Hierarchical Libraries Foreign package
- Tackling the awkward squad: monadic input/output, concurrency, exceptions, and foreign-language calls in Haskell, p. 46
To play around with the library you need to have the binding package, libbzip2 (from the Bzip2 package) version 1.0 and GHC 6.4. I'm sure other versions of GHC will work, but I Haven't tested it.
Some knowledge of C is required since we'll make a binding for a C library.
Get on with it
The binding is in a Darcs repository, which can be fetched and build like this:
$ darcs get "http://www.student.dtu.dk/~s022018/Bzlib2Hs" $ cd Bzlib2Hs/ $ ghc -c Bzip2.lhs $ ghc --make Main.hs -lbz2 $ ./a.out
If you want to subsequently try it out interactively in GHCi say:
$ ghci Bzip2.lhs -lbz2
The library
libbzip2 is the library written by Julian R Seward used in bzip2(1) for compressing and decompressing files in the bzip2 data format. The interface file is named bzlib.h and can be found somewhere in your system usually in /usr/include/ or /usr/local/include. On GoboLinux it's /System/Links/Headers/bzlib.h. It provides many of procedures of which we will only need use the high level ones. They a still a bit complicated, so in the following they are briefly described.
Signatures
The high level procedures is comprised of the following signatures as seen in the header file:
#!syntax c
typedef void BZFILE;
BZFILE *BZ2_bzReadOpen ( int *bzerror, FILE *f,
int small, int verbosity,
void *unused, int nUnused );
int BZ2_bzRead ( int *bzerror, BZFILE *b, void *buf, int len );
void BZ2_bzReadClose ( int *bzerror, BZFILE *b );
BZFILE *BZ2_bzWriteOpen ( int *bzerror, FILE *f,
int blockSize100k, int verbosity,
int workFactor );
void BZ2_bzWrite ( int *bzerror, BZFILE *b, void *buf, int len );
void BZ2_bzWriteClose ( int *bzerror, BZFILE* f,
int abandon,
unsigned int* nbytes_in,
unsigned int* nbytes_out );
The names of the procedures are pretty descriptive. The open procedures opens a stream `BZFILE` of data from the file handle `f`, which refers to a file already opened for reading or writing respectively. The arguments `small`, `verbosity`, `unused` and `nUnused`, `blockSize100k`, `verbosity` and `workFactor` are for our simple binding not so not so important, and will get reasonable default values of 0, 0, `NULL`, 0, 9, 0 and 30 respectively. Refer to the Bzip2 documentation for explanations of the values.
The read procedure reads at most `len` uncompressed bytes into the buffer `buf` from the stream `BZFILE` of compressed bytes and returns the number of bytes read. If the logical end-of-stream marker was read, `bzerror` is set to `BZ_STREAM_END`.
The write procedure compresses and writes `len` bytes from the buffer `buf` to the stream `b`.
The read close procedure frees all memory created when opening and reading from `b` and is also used to clean up after all error situations when reading.
The write close procedure flushes all data still hanging around the stream `f`, and then frees all memory created when opening and writing to `f`. It's used to clean up after all error situations when writing. The arguments `abandon`, `nbytes_in` and `nbytes_out` are also not relevant to us, and will just get reasonable default values of 0, `NULL` and `NULL`.
In all procedures if an error occurs `bzerror` is set accordingly.
The Foreign Function Interface
Now where we are a bit familiar with the Bzip2 library procedures, we can introduce some concepts of the FFI. Making the binding will commence in three steps:
- Find appropriate Haskell types for the C types.
- Write an foreign import declaration for each procedure.
- Write wrapper procedures to get a nice abstraction and a clean interface.
These steps are explained in the following.
Corresponding Haskell and C types
To be able to bridge the gap between C world and Haskell world, the FFI has defined Haskell types which corresponds to the normal C data types like integers, reals, arrays and so on. The whole lot is listed in the library documentation. For convenience the types needed for this binding is listed in the following table:
|| C type || Haskell type || || FILE || CFile || || int || CInt || || unzigned int || CUInt || || char* || CString || || a* || Ptr a || || void || () ||
On the right is the Haskell type which corresponds to the C type on the left. The integer types are straight forward. The type `CFile` is an opaque data constructor and can thus only be constructed in a C program an then passed to Haskell. The `CString` is actually just an alias for `Ptr CChar` and the void type is represented by the Haskell type `()`. As an example the C type for a pointer to a value of any type, `void*` can be represented in Haskell as `Ptr ()`.
The foreign import declaration
As an exercise before proceeding go back and try to write Haskell type signatures corresponding to the C procedure signatures given above. Generally the foreign import declaration looks like:
foreign import ccall "[header name.h] C_name" Haskell_name ::
Haskell_names_argument_types -> IO return_type
It consists of the `foreign import` keywords followed by a keyword, `ccall`, telling the compiler to use the C calling convention on the procedure. Then comes a string with an optional name of a header file containing the signature of the procedure and the procedure name. Then follows the Haskell function name, we want to give the procedure and then it's Haskell type which in our case returns something wrapped in the IO Monad, since the procedures have side effects.
The Haskell binding
The road is now paved for understanding the binding which is situated in `Bzlib2.lhs` in the Darcs repository. Let's go over the main parts one at a time. Some new things will be introduced along the way and explained. The first thing in the file is the import declarations.
{-# OPTIONS -fffi #-}
module Bzip2 where
import Control.Monad
import System.IO
import System.Posix.IO
import System.Posix.Types
import Foreign.Ptr
import Foreign.Storable
import Foreign.C.Types (CFile, CInt, CUInt)
import Foreign.C.String (CString, newCString, peekCStringLen)
import Foreign.Marshal.Alloc (malloc, mallocBytes)
Since the library has to be passed a handle to an already open file, the problem quickly arises as who to obtain a `Ptr CFile`, when we cannot construct a `CFile` from Haskell. The solution lies in a little trick with which we get a obtain a such through these steps: `Handle` -> `Fd` -> `Ptr CFile`:
-- FILE *fdopen(int fildes, const char *mode);
foreign import ccall "" fdopen :: Fd -> CString -> IO (Ptr CFile)
handleToCFile :: Handle -> String -> IO (Ptr CFile)
handleToCFile h m =
do iomode <- newCString m
fd <- handleToFd h
fdopen fd iomode
Here we wrap the standard C procedure `fdopen` in a function, `handleToCFile`, which takes a handle to an already open file (for instance obtained with `openFile` from `System.IO`) together with a string which can be ``r for read mode or ``w for write mode and is passed to `fdopen`. Since the latter takes a `CString` and not a normal Haskell string we need to convert it with the function `newCString`.
Wrapping the read procedures
Here are the foreign import declarations for all the C procedures introduced in the beginning.
bz_OK = 0 -- #define BZ_OK 0
bz_STREAM_END = 4 -- #define BZ_STREAM_END 4
type BZFile = () -- typedef void BZFILE;
foreign import ccall "bzip.h BZ2_bzReadOpen" bzReadOpen ::
Ptr CInt -> Ptr CFile -> CInt -> CInt -> Ptr () -> CInt -> IO (Ptr BZFile)
foreign import ccall "bzip.h BZ2_bzRead" bzRead ::
Ptr CInt -> Ptr BZFile -> Ptr () -> CInt -> IO CInt
foreign import ccall "bzip.h BZ2_bzReadClose" bzReadClose ::
Ptr CInt -> Ptr BZFile -> IO ()
This should be straight forward by now. The `bzip.h` defines some constants, which we will use, so they are declared correspondingly in Haskell. Notice that the type alias for `BZFile` and the corresponding C definition in the comment.
The read wrapper procedures
Now we can wrap those procedures in a higher abstraction closer to the way a Haskell programmer would think. `readBzip2` takes a handle h of a file, which is already opened for reading, and returns its decompressed content as a string.
readBzip2 :: Handle -> IO String
readBzip2 h =
let len = 1024
verbosity = 0
small = 0
unused = nullPtr
nUnused = 0
in
do bzerror <- malloc
buf <- mallocBytes len -- How big should this be to be efficient?
cfile <- handleToCFile h "r"
bzfile <- bzReadOpen bzerror cfile verbosity small unused nUnused
s <- readIter bzerror bzfile buf len
bzReadClose bzerror bzfile
return s
readIter :: Ptr CInt -> Ptr BZFile -> Ptr () -> Int -> IO String
readIter bzerror bzfile buf len =
do nbytes <- bzRead bzerror bzfile buf (fromIntegral len)
s <- peekCStringLen (castPtr buf, fromIntegral nbytes)
e <- peek bzerror
if e==bz_STREAM_END
then return s
else do s' <- readIter bzerror bzfile buf len
return (s++s')
The first procedure defines the default constants, then allocates some variables and opens the handle for reading compressed content. Then it calls the second procedure which loops reading bytes into the supplied buffer and returning it in a Haskell string. In the end the handle is closed and the accumulated string is returned.
The value `nullPtr` is the name of `NULL` in C. The polymorphic function `malloc` returns a pointer to a freshly allocated value, and since `bzerror` is a `Ptr CInt`, `malloc` will allocate an integer. `mallocBytes` allocates an array of bytes and returns a pointer to the first byte. We can convert between most integral Haskell and C types using `fromIntegral` as seen in `readIter`.
Two new procedures are `peekCStringLen` and `peek`. The latter returns the value pointed to by it's pointer argument; it's dereferencing the pointer. Like `malloc` it's polymorphic a thus derives it's correct type and behavior from the context. The former is more complicated. Here's it's type:
peekCStringLen :: (Ptr CChar, Int) -> IO String
It takes a pair of a character array and a length n and delivers the first n bytes from the array as a Haskell string. Because of the type conflict between `buf`, which has type `Ptr ()` and the `Ptr CChar` in `peekCStringLen`, we need to explicitly cast the pointer type using the polymorphic `castPtr`.
I rather sure that the entire file will be read in before the string is returned, which obviously is a place for improvement. Can you see how that can be done?
Wrapping the write procedures
Now to the corresponding write procedures. The foreign import declarations should not cause surprises any more:
foreign import ccall "bzip.h BZ2_bzWriteOpen" bzWriteOpen ::
Ptr CInt -> Ptr CFile -> CInt -> CInt -> CInt -> IO (Ptr BZFile)
foreign import ccall "bzip.h BZ2_bzWrite" bzWrite ::
Ptr CInt -> Ptr BZFile -> Ptr () -> CInt -> IO ()
foreign import ccall "bzip.h BZ2_bzWriteClose" bzWriteClose ::
Ptr CInt -> Ptr BZFile -> CInt -> Ptr CUInt -> Ptr CUInt -> IO ()
The write procedure takes a handle to a file already opened for writing and a string and compresses the string into the file.
writeBzip2 :: Handle -> String -> IO ()
writeBzip2 h s =
let len = length s
blockSize100k = 9
verbosity = 0
workFactor = 30
abandon = 0
nbytes_in = nullPtr
nbytes_out = nullPtr
in
do bzerror <- malloc
cs <- newCString s
cfile <- handleToCFile h "w"
bzfile <- bzWriteOpen bzerror cfile blockSize100k verbosity workFactor
bzWrite bzerror bzfile (castPtr cs) (fromIntegral len)
bzWriteClose bzerror bzfile abandon nbytes_in nbytes_out
Most of the things in here were explained in the read section. The procedure `newCString s` just takes a Haskell string and returns a pointer to a freshly allocated character array initialised with the string.
The reflecting reader might already have realized that the input string is allocated in a buffer in one big chunk, which could be very big and in fact is as big as the file itself. Also neither of the procedures handles error conditions reported through the variable `bzerror`.
Conclusion
We've written a wrapper for the C library Bzlib2 with a very simple and useful interface consisting of the two procedures
readBzip2 :: Handle -> IO String
writeBzip2 :: Handle -> String -> IO ()
Although simple the binding has some severe shortcomings, namely lack of error checking and lack of laziness in the strings passed to and from the procedures. Error checking was avoided to keep the implementation uncluttered and could be implemented by after each call which can raise an error to observe the value of `bzerror` with `e <- peek bzerror` and acting appropriately. The laziness problem is harder; when should we call `bzReadClose bzerror bzfile` if we only read in half of the file and moving on? In the case of writing it's a tad easier. Write a loop which lazily consumes the input string and writes a small chunk to the files at the time.
Patches and comments are most welcome.