BerkeleyDBXML

Introduction

If you are using Berkeley DB, and not Berkeley DB XML, then please skip to the Berkeley DB section.

Berkeley DB XML is a powerful, fully transactional, XML-based database that uses XQuery (a W3C standard) as its query language. (Berkeley DB XML does NOT use SQL.)

This page is an introduction/tutorial on writing a multi-threaded Berkeley DB XML application in Haskell. It is intended for Haskell programmers who are new to Berkeley DB XML.

I hope you will consider the advantages of using an XML database over the traditional SQL database in your application. However, note that Berkeley DB and DB XML are non-free for commercial use.

Obtaining and building the packages

Downloads

Berkeley DB XML is easy to build. On Unix, the ./buildall.sh script will build everything for you, including Berkeley DB, and put the resulting image into the 'install' directory. You can then copy this directory's contents to an install location of your choice.

On a GNU/Linux system, you may want to add the 'lib' directory of this install location under /etc/ld.so.conf.d/ then run "ldconfig". This will allow the system to find the Berkeley DB XML libraries. If you don't do this, you will have to set the environment variable LD_LIBRARY_PATH.

If you are using a Unix system, your system may already have a sufficiently recent version of Berkeley DB. In this case, it is better to use this and build Berkeley DB XML only. The commands for this are as follows:

./buildall.sh --build-one=xerces
./buildall.sh --build-one=xqilla
./buildall.sh --build-one=dbxml --with-berkeleydb-prefix=/usr

To test your installation, see if you can run the 'dbxml' command from the install image's bin directory. This is an interactive utility that allows you to run database queries and view the results.

The Berkeley DB XML binding for Haskell is a standard Cabal package. Its README file gives installation instructions.

The binding

This tutorial uses a Haskell binding for DB XML that sticks closely to Berkeley DB XML's C++ interface, so we are programming at a fairly low level.

DB XML would lend itself to the development of higher-level wrappers. For example, someone could write a drop-in replacement for STM (Software Transactional Memory) that uses DBXML to give persistent storage.

Adventure game example

In the Berkeley DB XML binding distribution, you will find a tiny adventure game under examples/adventure/. This tutorial will refer to the code in this example.

Note that this game is multi-user and the game world, including player locations, is stored persistently, so it survives a re-start of the adventure server.

Here is an example session:

blackh@amentet:~/temp/BerkeleyDBXML-0.5/examples/adventure$ ./adventure
Adventure server - please telnet into port 1888
tidying up 0 cadavers
Creating the game world...

blackh@amentet:~$ telnet localhost 1888
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Welcome to 'DB/XML Haskell binding' adventure by Stephen Blackheath
Please enter your name.
> Stephen
Welcome for the first time, Stephen.

For help, please type "help".

You are on a wide, white sandy beach. A bright blue ocean stretches to the
horizon. Along the beach to the north you can see some large rocks. There is
thick jungle to the west.
You can see
  a starfish
> get starfish
You pick up a starfish.
> west
You are in a dense jungle.
You can see
  a tall, twisty tree
> drop starfish
You drop a starfish.
> look
You are in a dense jungle.
You can see
  a starfish
  a tall, twisty tree
>

Berkeley DB

DB XML is built on top of Berkeley DB. This section describes the concepts specific to DB.

DB Environment

Berkeley DB is not client-server like most SQL databases. It accesses its database files in a local directory.

An "environment" is a directory with various odd-looking files such as __db.001 and log.0000000001, as well as various database files that your application has created. An application would normally have only one environment, where it would store all its databases. Database transactions operate within an environment, so a single transaction can span multiple database files in the same environment. This also means that you can update both Berkeley DB files and Berkeley DB XML files in a single transaction.

When DB_THREAD is enabled, the environment will work safely with multi-threaded applications, and also multiple processes accessing the databases at the same time. This means you can run the 'dbxml' command-line utility while your application is running.

When the DB_INIT_LOG flag is enabled, the environment contains a transaction log that forms part of the databases. Do not delete these files, or you will corrupt your databases. Also when DB_INIT_LOG is enabled, you cannot move your database files from one environment to another. The recommended way to do this is to use Berkeley's dbxml_dump/dbxml_load for DB XML files and db_dump/db_load for DB files.

It is safe, however, to delete databases you have created without deleting the environment. The environment will detect this and adjust accordingly. You can, of course, start with a clean slate by deleting all the environment files and databases.

A production application must periodically call dbEnv_txn_checkpoint to clear old data from the log.* files. (The adventure game example does not do this.)

Here is some example code which will open an existing environment, or create it if it doesn't exist. These are the flags to use for a transactional, multi-threaded application:

    dbenv <- dbEnv_create []

    -- Enable automatic deadlock detection.
    dbEnv_set_lk_detect dbenv DB_LOCK_DEFAULT

    dbEnv_open dbenv "." [DB_CREATE,DB_INIT_LOCK,DB_INIT_LOG,DB_INIT_MPOOL,
        DB_INIT_TXN,DB_THREAD,DB_RECOVER] 0

Deadlock detection

Berkeley DB will automatically detect deadlocks for you, allowing you to re-start the deadlocked transaction. Because of the way Berkeley DB has been engineered, deadlock detection is not optional in multi-threaded applications. It is absolutely impossible to avoid deadlocks by the traditional method of carefully controlling the order of locking, because Berkeley DB will lock whole pages, which means it will unpredictably lock more than you told it to.

Your application needs one and only one lock detector thread or process running per environment. dbEnv_set_lk_detect is an easy way to spawn one such thread. See the Berkeley DB documentation for other ways.

If your application has more than one process, you can't do it the way this example does it. You would need to manage things so only one lock detector was running.

Because of deadlock detection, your code must detect deadlocks and re-start the transaction if they are found. Here is some code to do this:

-- Execute the specified code within a database transaction, automatically
-- re-trying if a deadlock is detected.
inTransaction :: XmlManager -> (XmlTransaction -> IO a) -> IO a
inTransaction mgr code = inTransaction_ mgr code 0
    where
        inTransaction_ mgr code retryCount = do
            trans <- xmlManager_createTransaction mgr []
            catch
                (do
                        result <- code trans
                        xmlTransaction_commit trans
                        return result
                    )
                (\err -> do
                        xmlTransaction_abort trans
                        let dbErr = getDbError err
                        if (dbErr == Just DB_LOCK_DEADLOCK) && (retryCount < 20)
                            then do
                                 hPutStrLn stderr "<<retry deadlocked thread>>"
                                 inTransaction_ mgr code (retryCount+1)
                            else ioError err
                    )

This shows the use of a function getDbError which is specific to this Haskell binding. When Berkeley DB returns an error code, the binding will throw a Haskell ioError. The getDbError function extracts the Berkeley DB error code from the ioError. A similar function exists for DB XML-level errors.

Remember that the code above pre-supposes that you have started a deadlock detector. If this hasn't happened, the application will stall and never throw DB_LOCK_DEADLOCK.

Because your transaction can be re-started, you should not do any normal I/O inside your transaction. It would be even better if (like in Software Transactional Memory) the transactional code runs in a monad of its own that prevents normal access to the IO monad.

Environment recovery

Before you start your application, you must run a database recovery to return the database to a consistent state, in case of a dirty shutdown. This can either be done with the db_recover command line utility, or by specifying the DB_RECOVER flag to dbEnv_open.

An environment recovery must run without any other processes accessing the database environment. Therefore it must be performed before you start your application.

Because we are using the DB_RECOVER flag to do our recovery, we could not run multiple processes of 'adventure' at the same time unmodified. If we wanted this application to work with multiple processes, both the DB_RECOVER flag and the dbEnv_set_lk_detect call would need to be removed and run separately before the application was started.

Berkeley DB XML

All the important topics are covered in "Getting Started with Berkeley DB XML" guide that comes with the Berkeley DB XML distribution, so I will only cover more Haskell-specific things here.

Berkeley DBXML returns its document contents as a strict ByteString containing XML text. You need to use an XML library of some kind to handle these. The Haskell binding leaves you free to choose your own XML library. (I am also the developer of the hexpat and hexpat-pickle packages, so you might consider using them since I designed them to fit nicely with BerkeleyDBXML.)

Please take a look at the source code of the adventure example included with the DB XML binding distribution. Here are some examples from it:

Example 1: Querying documents

Here is an example that covers a lot of ground: The "query" function from the adventure game:

collectM :: Monad m => m (Maybe a) -> m [a]
collectM valueM = do
    value <- valueM
    case value of
        Just item -> do
            rest <- collectM valueM
            return (item:rest)
        Nothing -> do
            return []

query_ :: (XmlManager, XmlContainer, XmlTransaction) -> PU [UNode String] p
      -> String -> [(String, XmlValue)] -> [DbXmlFlag] -> IO [(XmlDocument, p)]
query_ (mgr, cont, trans) pickler queryText params flags = do
    qctx <- xmlManager_createQueryContext mgr LiveValues Eager
    let collection = xmlContainer_getName cont
    xmlQueryContext_setDefaultCollection qctx collection
    forM params $ \(name, value) -> do
        xmlQueryContext_setVariableValue qctx name value
    res <- xmlManager_query mgr (Just trans) queryText qctx flags
    docs <- collectM (xmlResults_next res)
    records <- forM docs $ \doc -> do
                text <- xmlDocument_getContent doc
                value <- case unpickleXML' defaultParserOptions (xpRoot pickler) text of
                    Left err -> fail $ "unpickle failed: "++err
                    Right value -> return value
                return (doc, value)
    return records

query :: XmlPickler [UNode String] p => (XmlManager, XmlContainer, XmlTransaction) -> PU [UNode String] p
      -> String -> [(String, XmlValue)] -> IO [p]
query ctx pickler queryText params = liftM (map snd) $ query_ ctx pickler queryText params []

The 'query' function is a helper that calls 'query_' and returns the results as Haskell data structures only, discarding the XmlDocument objects. (XmlDocuments are useful as a reference to a document for updating or deleting.)

Now look at 'query_'. First, we create a query context. This holds the variable assignments used in the XQuery. For example, if we call 'query' like this...

    items <- query db xpItem "collection()/item[@location=$loc]"
        [("loc", xmlString loc)]

...then we push "loc" and its value into the query context, so the XQuery parser can resolve the variable $loc. This query says "give me all documents with a top-level tag of <item> containing a 'location' attribute matching $loc".

xmlQueryContext_setDefaultCollection allows the XQuery to refer our document container as just "collection()" rather than having to name it explicitly in the XQuery string.

Then we run the query, and use a helper called 'collectM' to extract the results from the XmlResults object and return them as a list of XmlDocument objects.

The last step is iterate over the returned documents, using hexpat-pickle's unpickle functionality to translate the XML document into Haskell data structures.

Example 2: Updating a document

-- | Query with write lock. Returned document allows the document to be updated
-- without having to specify its document name.
queryUpdate :: XmlPickler [UNode String] p => (XmlManager, XmlContainer, XmlTransaction) -> PU [UNode String] p
      -> String -> [(String, XmlValue)] -> IO [(XmlDocument, p)]
queryUpdate ctx pickler queryText params = query_ ctx pickler queryText params [DB_FLAG DB_RMW]

update :: forall p . XmlPickler [UNode String] p =>
          (XmlManager, XmlContainer, XmlTransaction)
       -> XmlDocument
       -> p
       -> IO ()
update (mgr, cont, trans) doc p = do
    xmlDocument_setContent doc (pickleXML' (xpRoot xpickle :: PU (UNode String) p) p)
    uctx <- xmlManager_createUpdateContext mgr
    xmlContainer_updateDocument cont (Just trans) doc uctx

'queryUpdate' works like 'query' in the previous example, except that it sets the DB_RMW flag, which means you get a write (exclusive) lock instead of the default read (non-exclusive) lock. Using a write lock makes no difference to the semantics of the transaction: It is just as atomic if you use a read lock. But, write locks can reduce the probability of transaction re-starts due to deadlocks, and so they improve efficiency when updating.

A caller would pass the XmlDocument returned by queryUpdate to 'update', along with a modified version of the Haskell data structure p.

'update' then pickles p, and proceeds to stuff the resulting XML string into the XmlDocument that queryUpdate gave us, and then issues the update.

Unicode

UTF-8 encoding is used throughout to encode Unicode text. All String arguments and return values in the binding are in Unicode, except for XML text, which is returned as 8-bit characters in a String data type. Your XML library will convert this to Unicode for you.

The function xmlValue_asString is a case where the caller has to make the right choice. xmlValue_asString converts an XmlValue to a Unicode Haskell String. This is appropriate if you are fetching the text contents of an XML tag, for instance.

However, if you are fetching XML text, you will want to call xmlValue_asString8Bit. This leaves out the conversion from UTF-8 to Unicode, so you can let your XML library convert this to Unicode.

Conclusion

I hope this gets you started writing DB XML applications. If you have any questions (so I can improve this page), or wish to report bugs in the Haskell binding, please contact me at Stephen Blackheath's anti-spam page.

--Blackh 10:28, 1 October 2008 (UTC)