Hexpat/

From HaskellWiki

hexpat XML parser library

hexpat is an XML parser library available on Hackage. Here is an example to get you started:

-- | A "hello world" example of hexpat that lazily parses a document, printing
-- it to standard out.

import Text.XML.Expat.Tree
import Text.XML.Expat.Format
import System.Environment
import System.Exit
import System.IO
import qualified Data.ByteString.Lazy as L

main :: IO ()
main = do
    args <- getArgs
    case args of
        [filename] -> process filename
        -- Any other argument count is a usage error.
        _          -> do
            hPutStrLn stderr "Usage: helloworld <file.xml>"
            exitWith $ ExitFailure 1

process :: String -> IO ()
process filename = do
    inputText <- L.readFile filename
    -- Note: Because we're not using the tree, Haskell can't infer the type of
    -- strings we're using so we need to tell it explicitly with a type signature.
    let (xml, mErr) = parse defaultParserOptions inputText :: (UNode String, Maybe XMLParseError)
    -- Process document before handling error, so we get lazy processing.
    L.hPutStr stdout $ format xml
    putStrLn ""
    case mErr of
        Nothing -> return ()
        Just err -> do
            hPutStrLn stderr $ "XML parse failed: "++show err
            exitWith $ ExitFailure 2

Speed of hexpat

What differentiates hexpat from other XML libraries is speed. The graphs below show benchmarks against two other XML libraries, xml and HaXml.

--Blackh 23:34, 26 March 2009 (UTC)

Graphs

Hexpat-benchmark.non-threaded.png

Hexpat-benchmark.threaded.png

Notes

hexpat pays a penalty on the threaded runtime that the other libraries don't, which shrinks its advantage slightly.

The benchmarks are calculated in the following way:

  • The graphs show results for 67 XML files ranging in size from 1 to 100 kilobytes. These files were chosen at random, with an even spread of sizes, from my Ubuntu system.
  • For each parser, we parse the XML to a tree structure, then use 'rnf' from Control.Parallel.Strategies to force evaluation of the tree.
  • For each parser/file, we parse the same file repeatedly for 200 ms, then divide total CPU time by the number of iterations. This gives mean CPU time per iteration.
  • The entire test suite is run 5 times, and the median of each result is taken.
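
The timing procedure above can be sketched roughly as follows. This is an illustration only, not the actual benchmark code: the names meanCpuTime, median, benchmark and forcing are hypothetical, and it uses only the base library. In the real benchmark the forcing function would be 'rnf' applied to the parsed tree.

import Control.Exception (evaluate)
import Control.Monad (replicateM)
import Data.List (sort)
import System.CPUTime (getCPUTime)

-- | Run an action that produces a value, then force the value with the
-- given function (for example 'rnf'), so the work is really done.
-- Hypothetical helper, not part of hexpat.
forcing :: (a -> ()) -> IO a -> IO ()
forcing force produce = produce >>= evaluate . force

-- | Repeat an action until at least budgetPs picoseconds of CPU time have
-- elapsed, then return the mean CPU time per iteration in picoseconds.
meanCpuTime :: Integer -> IO () -> IO Double
meanCpuTime budgetPs action = do
    start <- getCPUTime
    let loop :: Int -> IO (Int, Integer)
        loop n = do
            action
            now <- getCPUTime
            let elapsed = now - start
            if elapsed >= budgetPs
                then return (n + 1, elapsed)
                else loop (n + 1)
    (iters, elapsed) <- loop 0
    return (fromIntegral elapsed / fromIntegral iters)

-- | Median of a non-empty list (mean of the two middle values when the
-- length is even).
median :: [Double] -> Double
median xs
    | null xs   = error "median: empty list"
    | odd n     = sorted !! mid
    | otherwise = (sorted !! (mid - 1) + sorted !! mid) / 2
  where
    sorted = sort xs
    n      = length xs
    mid    = n `div` 2

-- | Repeat the whole measurement several times and keep the median.
benchmark :: Int -> Integer -> IO () -> IO Double
benchmark runs budgetPs action =
    fmap median (replicateM runs (meanCpuTime budgetPs action))

With these definitions, a 200 ms budget repeated 5 times would be written as benchmark 5 (200 * 10 ^ 9) action, since getCPUTime reports picoseconds. Note that GHC may share the result of a pure computation between iterations, so the action has to redo the parse each time for the numbers to mean anything.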

There are three reasons why hexpat is faster than the other libraries:

  • hexpat uses expat to do the parsing, which is a very fast parser written in C
  • hexpat's tree structure contains less information than the other libraries, so hexpat is doing less work
  • hexpat can optionally use the Data.Text data type, which is faster than the standard Haskell String. (Both data types are shown on the graph for hexpat.)

Versions/hardware used:

  • Linux amentet 2.6.27-7-generic #1 SMP Tue Nov 4 19:33:06 UTC 2008 x86_64 GNU/Linux
  • Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz
  • The Glorious Glasgow Haskell Compilation System, version 6.10.1.20090314
  • xml-1.3.4
  • HaXml-1.13.3
  • hexpat-0.5

Note: The HXT library does not work on the version of GHC I am using. The error is:

benchmark: error: a C finalizer called back into Haskell.
   use Foreign.Concurrent.newForeignPtr for Haskell finalizers.

I will re-run the benchmark when this is fixed.

I am one of the authors of hexpat, so I am biased, and I do not know of any switches that would make the other libraries run at their best speed. So that you can check for yourself, here is the software that runs the benchmarks, including raw results for the graphs above:

File:Hexpat-benchmark.tar.bz2