HXT/Conversion of Haskell data from/to XML

From HaskellWiki
Revision as of 14:34, 19 April 2008 by UweSchmidt (talk | contribs)

Jump to: navigation, search

Serializing and deserializing Haskell data to/from XML

With so called pickler functions and arrows, it becomes rather easy and straight forward to convert native Haskell values to XML and vice versa. The module Text.XML.HXT.Arrow.Pickle and submodules contain a set of picklers (conversion functions) for simple data types and pickler combinators for complex types.

The idea: XML pickler

For conversion of native Haskell data from and to external representations, there are two functions necessary, one for generating the external representation and one for reading/parsing the representation. The read/show pair often form such a pair of functions.

A so called pickler is a data value with two such conversion functions, but because it's necessary to apply a whole sequence of conversion functions, there is a state that has to be updated during encoding and decoding the external representation. So the simplest form of a pickler converting between a type a and a sequence of Chars looks like this.

type St    = [Char]

data PU a  = PU { appPickle   :: (a, St) -> St
		, appUnPickle :: St -> (a, St)

Andrew Kennedy has described in a programming pearl paper [1], how to define primitive picklers and a set of pickler combinators to de-/serialize from/to (Byte-)Strings.

The HXT picklers are an adaptation of these pickler combinators. The difference to Andrew Kennedys approach is, that the target is not a list of Chars but a list of XmlTrees. The basic picklers will convert data into XML text nodes. New are the picklers for creating elements and attributes.

The HXT pickler type is defined as follows

data St		= St { attributes :: [XmlTree]
		     , contents   :: [XmlTree]

data PU a	= PU { appPickle   :: (a, St) -> St
		     , appUnPickle :: St -> (Maybe a, St)
		     , theSchema   :: Schema

In XML there are two places for storing informations, the attributes and the contents. Furthermore the pickler contains a third component for type information. This enables the derivation of a DTD from a set of picklers.

But we will see, that with the predefined picklers and the combinators we don't have to look very much into these internals. Let's start with an example.

Example: Processing football league data

The XML data structure

From the set of HXT/Practical example we'll take the data structure from HXT/Practical/Simple2 dealing with football league data. First let's have an idea about the structure of the XML data. Here is a part of the example XML data

<SEASON YEAR="1998">
  <LEAGUE NAME="National League">
    <DIVISION NAME="East">
      <TEAM CITY="Atlanta" NAME="Braves">
        <PLAYER GIVEN_NAME="Marty" SURNAME="Malloy"
            POSITION="Second Base" GAMES="11"
            GAMES_STARTED="8" AT_BATS="28" RUNS="3"
            HITS="5" DOUBLES="1" TRIPLES="0"
            HOME_RUNS="1" RBI="1" STEALS="0"
        <PLAYER GIVEN_NAME="Ozzie" SURNAME="Guillen"
            POSITION="Shortstop" GAMES="83"
            GAMES_STARTED="59" AT_BATS="264" RUNS="35"
            HITS="73" DOUBLES="15" TRIPLES="1"
            HOME_RUNS="1" RBI="22" STEALS="1"
        <PLAYER GIVEN_NAME="Danny" ... HIT_BY_PITCH="0">
        <PLAYER GIVEN_NAME="Gerald" ...>
      <TEAM CITY="Florida" NAME="Marlins">
      <TEAM CITY="Montreal" NAME="Expos">
      <TEAM CITY="New York" NAME="Mets">
      <TEAM CITY="Philadelphia" NAME="Phillies">
  <LEAGUE NAME="American League">
    <DIVISION NAME="East">
    <DIVISION NAME="Central">

The Haskell data model

Let's first analyze the underlying data model and then define an appropriate set of Haskell data type for internal representation.

  • The root type is a Season, consisting of a year an a set of Leagues
  • The Leagues are all identified by a String and consist of a set of Divisions, so it's a Map.
  • The Divisions are also ideitfied by a String and consist of a list of Teams, so it's also a Map
  • A Team has three components, a teamName, a city, and a list of Players
  • A Player has a lot of attributes, we will simplify the internal model a bit, we will just include six fields, the firstName, the lastName, the position, atBats, hits and era. All others will be ignored.

So the Haskell data model looks like this

import Data.Map (Map, fromList, toList)

data Season = Season
    { sYear    :: Int
    , sLeagues :: Leagues
	      deriving (Show, Eq)

type Leagues   = Map String Divisions

type Divisions = Map String [Team]

data Team = Team
    { teamName :: String
    , city     :: String
    , players  :: [Player]
	    deriving (Show, Eq)
data Player = Player
    { firstName :: String
    , lastName  :: String
    , position  :: String
    , atBats    :: Maybe Int
    , hits      :: Maybe Int
    , era       :: Maybe Float
	      deriving (Show, Eq)

The predefined picklers

In HXT here is a class XmlPickler defining a single function xpickle for overloading the xpickle name.

class XmlPickler a where
    xpickle :: PU a

For the simple data types there is an instance for XmlPickler, which uses the primitive pickler xpPrim for conversion from and to XML text nodes. This primitive pickler is available for all types supporting read and show.

instance XmlPickler Int where
    xpickle = xpPrim

instance XmlPickler Integer where
    xpickle = xpPrim


For composite data there are predefined pickler combinators for tuples, lists and Maybe types.

instance (XmlPickler a, XmlPickler b) => XmlPickler (a,b) where
    xpickle = xpPair xpickle xpickle

instance XmlPickler a => XmlPickler [a] where
    xpickle = xpList xpickle

instance XmlPickler a => XmlPickler (Maybe a) where
    xpickle = xpOption xpickle
  • xpPair take two picklers and builds up a pickler for a tuple type. There are also pickler combinators for triples, 4- and 5- tuples.
  • xpList takes a pickler for an element type and gives a list pickler
  • xpOption takes a pickler and returns a pickler for optional values.

Furthermore we need pickler for generating/reading element and attribute nodes

  • xpElem generates/parses an XML element node
  • xpAttr generates/parses an attribute node

Most of the other structured data is pickled/unpickled by converting the data to/from tuples, lists and options. This is done by a wrapper pickler xpWrap.

Constructing the example picklers

For every Haskell type we will define a pickler.

For the own data types we will declare instances of XmlPickler

instance XmlPickler Season where
    xpickle = xpSeason

instance XmlPickler Team where
    xpickle = xpTeam

instance XmlPickler Player where
    xpickle = xpPlayer

Then the picklers are developed top down starting with xpSeason.

xpSeason	:: PU Season
    = xpElem "SEASON" $
      xpWrap ( uncurry Season
	     , \ s -> (sYear s, sLeagues s)) $
      xpPair (xpAttr "YEAR" xpickle) xpLeagues

A Season value is mapped onto an element SEASON with xpElem. This constructs/reads the XML SEASON element. The two components of Season are wrapped into a pair with xpWrap. xpWrap needs a pair of functions for a 1-1 mapping between Season and (Int, Leagues). The first component of the pair, the year is mapped onto an attribute YEAR, the attribute value is handled with the predefined pickler for Int. The second one, the Leagues are handled by xpLeagues.

xpLeagues	:: PU Leagues
    = xpWrap ( fromList
	     , toList ) $
      xpList $
      xpElem "LEAGUE" $
      xpPair (xpAttr "NAME" xpText) xpDivisions

xpLeagues has to deal with a Map value. This can't done directly, but the Map value is converted to/from a list of pairs with xpWrap and (fromList, toList). Then the xpList is applied for the list of pairs. Each pair will be represented by an LEAGUE element, the name is mapped to an attribute NAME, the divisions are handled by xpDivisions.

xpDivisions	:: PU Divisions
    = xpWrap ( fromList
	     , toList
	     ) $
      xpList $
      xpElem "DIVISION" $
      xpPair (xpAttr "NAME" xpText) xpickle

The divisions are pickled by the same pattern as the leagues.

xpTeam	:: PU Team
    = xpElem "TEAM" $
      xpWrap ( uncurry3 Team
	     , \ t -> (teamName t, city t, players t)) $
      xpTriple (xpAttr "NAME" xpText) (xpAttr "CITY" xpText) (xpList xpickle)

With the teams we have to wrap the three components into a 3-tuple with xpWrap and then pickle a triple of two attributes and a list of players.

xpPlayer	:: PU Player
    = xpElem "PLAYER" $
      xpWrap ( \ ((f,l,p),(a,h,e)) -> Player f l p a h e
	     , \ t -> ((firstName t, lastName t, position t),(atBats t, hits t, era t))) $
      xpPair (xpTriple (xpAttr "GIVEN_NAME" xpText)
	               (xpAttr "SURNAME"    xpText)
	               (xpAttr "POSITION"   xpText))
             (xpTriple (xpOption (xpAttr "AT_BATS" xpickle))
	               (xpOption (xpAttr "HITS"    xpickle))
	               (xpOption (xpAttr "ERA"     xpPrim )))

The Player pickler looks a bit clumsy. A Player is mapped to an element PLAYER. But because of the many components, six in this case, we wrap a Player value in a pair of triples to use the predefined picklers xpPair and xpTriple. When needing picklers for more than five components, it's straight forward to derive e.g. an 'xp10Tuple`` from the sources of xpTriple and others.

New in this case is the use of xpOption for mapping Maybe values onto optional attributes.

The other attributes used in the input, are ignored during unpickling the XML, but this is the only place where the pickler is tolerant with wrong XML.

A simple application

import Text.XML.HXT.Arrow

-- ...

main	:: IO ()
    = do
      runX ( xunpickleDocument xpSeason [ (a_validate,v_0)
					, (a_trace, v_1)
					, (a_remove_whitespace,v_1)
					, (a_preserve_comment, v_0)
					] "simple2.xml"
	     xpickleDocument xpSeason [ (a_indent, v_1)
				      ] "new-simple2.xml"
      return ()

-- the dummy for processing the unpickled data

processSeason	:: IOSArrow Season Season
    = arrIO ( \ x -> do {print x ; return x})

This application reads in the complete data used in HXT/Practical/Simple2 from file simple2.xml and unpickles it into a Season value. This value is processed (dummy: print out) by processSeason and pickled again into new-simple2.xml

The unpickled value, when formated a bit, looks like this

      { sYear = 1998
      , sLeagues = fromList
	[ ( "American League"
	  , fromList
	    [ ( "Central"
	      , [ Team { teamName = "White Sox"
		       , city = "Chicago"
		       , players = []}
		, ...
	    , ( "East"
	      , [ Team { teamName = "Orioles"
		       , city = "Baltimore"
		       , players = []}
		, ...
	    , ( "West"
	      , [ Team { teamName = "Angels"
		       , city = "Anaheim"
		       , players = []}
		, ...
	, ( "National League"
	  , fromList
	    [ ( "Central"
	      , [ Team { teamName = "Cubs"
		       , city = "Chicago"
		       , players = []}
		, ...
	    , ( "East"
	      , [ Team { teamName = "Braves"
		       , city = "Atlanta"
		       , players =
			 [ Player { firstName = "Marty"
				  , lastName = "Malloy"
				  , position = "Second Base"
				  , atBats = Just 28
				  , hits = Just 5
				  , era = Nothing}
			 , Player { firstName = "Ozzie"
				  , lastName = "Guillen"
				  , position = "Shortstop"
				  , atBats = Just 264
				  , hits = Just 73
				  , era = Nothing}
			 , ...
		, ...
	    , ( "West"
	      , [ Team { teamName = "Diamondbacks"
		       , city = "Arizona"
		       , players = []}
		, ...

A few words of advice

These picklers are a powerful tool for de-/serializing from/to XML. Only a few lines of code are needed for serializing as well as for deserializing. But they are absolutely intolerant when dealing with none valid XML. They are intended to read machine generated XML, ideally generated by the same pickler. When unpickling hand written or by foreign tools generated XML, please validate the XML before reading, preferably with RelaxNG or XML Schema, because of the more powerful type system than those with DTDs.

When designing picklers, one must be careful to put enough markup into the XML structure, to read the XML back without the need for a lookahead and without any ambiguities. The simplest case of a not working pickler is a pair of primitve picklers e.g. for some text. In this case the text is written out and concatenated into a single string, when parsing the XML, there will only be a single text and the pickler will fail because of a missing value for the second component. So at least every primitive pickler must be combined with an xpElem or xpAttr.

Please do not try to convert a whole large database into a single XML file with this approach. This will run into memory problems when reading the data, because of the DOM approach used in HXT. In the HXT distribution, there is a test case in the examples dir performance, where the pickling and unpickling is done with XML documents containing 2 million elements. This is the limit for a 1G Intel box (tested with ghc 6.8).

There are two strategies to overcome these limitations. The first is a SAX like approach, reading in simple tags and text elements and not building a tree structure, but writing the data instantly into a database. For this approach the Tagsoup package can be useful. The disadvantage is the programming effort for collecting and converting the data.

The second and recommended way is, to split the whole bunch of data into smaller pieces, unpickle these and link the resulting documents together by the use of 'hrefs.