Difference between revisions of "HXT"

Revision as of 12:51, 19 September 2006

A gentle introduction to the Haskell XML Toolbox

The Haskell XML Toolbox (HXT) is a collection of tools for processing XML with Haskell. The core component of the Haskell XML Toolbox is a domain specific language, consisting of a set of combinators, for processing XML trees in a simple and elegant way. The combinator library is based on the concept of arrows. The main component is a validating and namespace aware XML-Parser that supports almost fully the XML 1.0 Standard. Extensions are a validator for RelaxNG and an XPath evaluator.

Background

The Haskell XML Toolbox bases on the ideas of HaXml and HXML but introduces a more general approach for processing XML with Haskell. HXT uses a generic data model for representing XML documents, including the DTD subset, entity references, CData parts and processing instructions. This data model makes it possible to use tree transformation functions as a uniform design of XML processing steps from parsing, DTD processing, entity processing, validation, namespace propagation, content processing and output.

The basic concepts

The basic data strutures

Processing of XML is a task of processing tree structures. This is can be done in Haskell in a very elegant way by defining an appropriate tree data type, a Haskell DOM (document object model) structure. The tree structure in HXT is a rose tree with a special XNode data type for storing the XML node information.

The generally useful tree structure (NTree) is separated from the node type (XNode). This allows for reusing the tree structure and the tree traversal and maipulation functions in other applications.

type NTree a  = NTree a [NTree a]     -- rose tree

data XNode    = XText String          -- plain text node
              | ...
              | XTag QName XmlTrees   -- element name and list of attributes
              | XAttr QName           -- attribute name
              | ...

type QName    = ...                   -- qualified name

type XmlTree  = NTree XNode

type XmlTrees = [XmlTree]

The concept of filters

Selecting, transforming and generating trees often requires routines, which compute not only a single result tree, but a (possibly empty) list of (sub-)trees. This leads to the idea of XML filters like in HaXml. Filters are functions, which take an XML tree as input and compute a list of result trees.

type XmlFilter = XmlTree -> [XmlTree]

More generally we can define a filter as

type Filter a b = a -> [b]

We will do this abstraction later, when introducing arrows. Many of the functions in the following motivating examples can be generalised this way. But for getting the idea, the XmlFilter is sufficient.

The filter functions are used so frequently, that the idea of defining a domain specific language with filters as the basic processing units comes up. In such a DSL the basic filters are predicates, selectors, constructors and transformers, all working on the HXT DOM tree structure. For a DSL it becomes neccessary to define an appropriate set of combinators for building more complex functions from simpler ones. Of course filter composition, like (.) becomes one of the most frequently used combinators. there are more complex filters for traversal of a whole tree and selection or transformation of several nodes. We will see a few first examples in the following part.

The first task is to build filters from pure functions, to define a lift operator. Pure functions are liftet to filters in the following way:

Predicates are lifted by mapping False to the empty list and True to the single element list, containing the input tree.

p   :: XmlTree -> Bool		-- pure function
p t =  ...

pf  :: XmlTree -> [XmlTree]	-- or XmlFilter
pf t
  | p t       = [t]
  | otherwise = []

The combinator for this type of lifting is called isA, it works on any type and is defined as

isA  :: (a -> Bool) -> (a -> [a])
isA p x
  | p x       = [x]
  | otherwise = []

A predicate for filtering text nodes looks like this

isText		              :: XmlFilter       -- XmlTree -> [XmlTrees]
isText t@(NTree (XText _) _) =  [t]
isText _                     =  []

Transformers, function that map a tree into another tree, are lifted in a trivial way:

f	:: XmlTree -> XmlTree
f t	=  exp(t)

ff	:: XmlTree -> [XmlTree]
ff t	= [exp(t)]

This basic function is called arr, it comes from the Control.Arrow module of the basic library package of ghc.

Partial functions, functions that can't always compute a result, are usually liftet to totally defined filters:

f	:: XmlTree -> XmlTree
f t
  | p t	      =  expr(t)
  | otherwise = error "f not defined"

ff	:: XmlFilter
ff t
  | p t       = [expr(t)]
  | otherwise = []

This is a rather comfotable situation, with these filters we don't have to deal with illegal argument errors. Illegal arguments are just mapped to the empty list.

When processing trees, there's often the case, that no, exactly one, or more than one result is possible. These functions, returning a set of results are often a bit imprecisely called nondeterministic functions. These functions, e.g. selecting all children of a node or all grandchildren, are exactly our filters. In this context lists instead of sets of values are the appropriate result type, because the ordering in XML is important and duplicates are possible.

Working with filters is rather similar to working with binary relations, and working with relations is rather natural and comfortable, database people do know this very well.

Two first examples for working with nondeterministic functions are selecting the children and the grandchildren of an XmlTree which can be implemented by

getChildren	 :: XmlFilter
getChildren (NTree n cs)
  = cs

getGrandChildren :: XmlFilter
getGrandChildren (NTree n cs)
  = concat [ getChildren c | c <- cs ]

Filter combinators

Composition of filters (like function composition) is the most important combinator. We will use the infix operator (>>>) for filter composition and reverse the arguments, so we can read composition sequences from left to right, like with pipes in Unix. Composition is defined as follows:

(>>>)	    :: XmlFilter -> XmlFilter -> XmlFilter

(f >>> g) t =  concat [g t' | t' <- f t]

This definition corresponds 1-1 to the composition of binary relations. With help of the (>>>) operator the definition of getGrandChildren becomes rather simple:

getGrandChildren :: XmlFilter
getGrandChildren = getChildren >>> getChilden

Selecting all text nodes of the children of an element can also be formulated very easily with the help of (>>>)

getTextChildren :: XmlFilter
getTextChildren = getChildren >>> isText

In case of predicate filter the (>>>) serves as a logical and operator, or from the relational view as an intersection operator: isA p1 >>> isA p2 selects all values for which p1 and p2 both hold.

The dual operator to (>>>) is the locical or, (thinking in sets: The union operator). For this we define a sum operator (<+>). The sum of two filters is defined as follows:

(<+>)	    :: XmlFilter -> XmlFilter -> XmlFilter

(f <+> g) t =  f t ++ g t

Example: isA p1 <+> isA p2 is the locical or for filter.

Combining elementary filters with (>>>) and (<+>) leads to more complex functionality. For example, selecting all text nodes within two levels of depth (in left to right order) can be formulated with:

getTextChildren2 :: XmlFilter
getTextChildren2 = getChildren >>> ( isText <+> ( getChildren >>> isText ) )

Exercise: Are these filters equivalent or what's the difference between the two filters?

getChildren >>> ( isText <+> ( getChildren >>> isText ) )

( getChildren >>> isText ) <+> ( getChildren >>> getChildren >>> isText )

Of course we need choice combinators. The first idea is an if-then-else filter, built up from three simpler filters. But often it's easier and more elegant to work with simpler binary combinators for choice. So we will introduce the simpler ones first.

One of these choice combinators is called orElse and is defined as follows:

orElse	:: XmlFilter -> XmlFilter -> XmlFilter
orElse f g t
  | null res1 = g t
  | otherwise = res1
  where
  res1 = f t

The meaning is the following: If f computes a none empty list as result, f succeeds and this list is the result, else g is applied to the input and this yields the result. There are two other simple choice combinators usually written in infix notation, g `guards` f and f `when` g:

guards	:: XmlFilter -> XmlFilter -> XmlFilter
guards g f t
  | null (g t) = []
  | otherwise  = f t

when	:: XmlFilter -> XmlFilter -> XmlFilter
when f g t
  | null (g t) = [t]
  | otherwise  = f t

These choice operators become useful when transforming and manipulation trees.

Tree traversal filter

A very basic operation on tree structures is the traversal of all nodes and the selection and/or transformation of nodes. Theses traversal filters serve as control structures for processing whole trees. They correspond to the map and fold combinators for lists.

The simplest traversal filter does a top down search of all nodes with a special feature. This filter, called deep, is defined as follows:

deep	:: XmlFilter -> XmlFilter -> XmlFilter
deep f  = f `orElse` (getChildren >>> deep f)

When a predicate filter is applied to deep, a top down search is done and all subtrees satisfying the predicate are collected. The descent into the tree stops, when a subtree is found because of the use of orElse.

Example: Selecting all plain text nodes of a document can be formulated with:

deep isText

Example: Selecting all "top level" tables in a HTML documents looks like this:

deep (isElem >>> hasName "table")

A variant of deep, called multi, performs a complete search, where the tree traversal does not stop, when a node is found.

multi	:: XmlFilter -> XmlFilter -> XmlFilter
multi f  = f <+> (getChildren >>> multi f)

Example: Selecting all tables in a HTML document, even nested ones, multi has to be used instead of deep:

multi (isElem >>> hasName "table")

Arrows

We've already seen, that the filters a -> [b] are a very powerful and sometimes a more elgant way to process XML than pure function. This is the good news. The bad news is, that filter are not general enough. Of course we sometimes want to do some I/O and we want to stay in the filter level. So we need something like

type XmlIOFilter = XmlTree -> IO [XmlTree]

for working in the IO monad.

Sometimes it's apropriate to thread some state through the computation like in state monads. This leads to a type like

type XmlStateFilter state = state -> XmlTree -> (state, [XmlTree])

And in real world applications we need both extenstions at the same time. Of course I/O is neccessary but usually there are also some global options and variables for controlling the computations. In HXT, for instance there are variables for controlling trace output, options for setting the default encoding scheme for input data and a base URI for accessing documents, which are addressed in a content or in a DTD part by relative URIs. So we need something like

type XmlIOStateFilter state = state -> XmlTree -> IO (state, [XmlTree])

We want to work with all four filter variants, and in the future perhaps with even more general filters, but of course not with four sets of filter names, e.g. deep, deepST, deepIO, deepIOST.

This is the point where newtypes and classes come in. Classes are needed for overloading names and newtypes are needed to declare instances. Further the restriction of XmlTree as argument and result type is not neccessary and hinders reuse in many cases.

A filter discussed above has all features of an arrow. Arrows are introduced for generalising the concept of functions and function combination to more general kinds of computation than pure functions.

A basic set of combinators for arrows is defined in the classes in the Control.Arrow module, containing the above mentioned (>>>), (<+>), arr.

In HXT the additional classes for filters working with lists as result type are defined in Control.Arrow.ArrowList. The choice operators are in Control.Arrow.ArrowIf, tree filters, like getChildren, deep, multi, ... in Control.Arrow.ArrowTree and the elementary XML specific filters in Text.XML.HXT.XmlArrow.

In HXT there are four types instanciated with these classes for pure list arrows, list arrows with a state, list arrows with IO and list arrows with a state and IO.

newtype LA      a b = LA    { runLA  :: (a -> [b]) }

newtype SLA   s a b = SLA   { runSLA :: (s -> a -> (s, [b])) }

newtype IOLA    a b = IOLA  { runIOLA :: (a -> IO [b]) }

newtype IOSLA s a b = IOSLA { runIOSLA :: (s -> a -> IO (s, [b])) }

The first one and the last one are those used most frequently in the toolbox, and of course there are lifting functions for converting the special arrows into the more general ones.

Don't worry about all these conceptional details. Let's have a look into some Hello world examples.

Getting started: Hello world examples

copyXML

The first complete example is a program for copying an XML document

module Main
where

import Text.XML.HXT.Arrow
import System.Environment

main :: IO ()
main
    = do
      [src, dst] <- getArgs
      runX ( readDocument [(a_validate, v_0)] src
	     >>>
	     writeDocument [] dst
	   )
      return ()

The interesting part of this example is the call of runX. runX executes an arrow. This arrow is one of the more powerful list arrows with IO and a HXT system state.

The arrow itself is a composition of readDocument and writeDocument. readDocument is an arrow for reading, DTD processing and validation of documents. Its behaviour can be controlled by a list of options. Here we turn off the validation step. The src, a file name or an URI is read and parsed and a document tree is build. This tree is piped into the output arrow. This one also is controlled by a set of options. Here all the defaults are used. writeDocument converts the tree into a string and writes it to the dst.

We've omitted here the boring stuff of option parsing and error handling.

Compilation and a test run looks like this:

hobel > ghc -o copyXml -package hxt CopyXML.hs
hobel > cat hello.xml
<hello>world</hello>
hobel > copyXml hello.xml -
<?xml version="1.0" encoding="UTF-8"?>
<hello>world</hello>
hobel >

The mini XML document in file hello.xml is read and a document tree is build. Then this tree is converted into a string and written to standard output (filename: -). It is decorated with an XML declaration containing the version and the output encoding.

For processing HTML documents there is a HTML parser, which tries to parse and interprete rather anything as HTML. The HTML parser can be selected by calling

readDocument [(a_parse_html, v_1), ...]

with the apropriate option.

Pattern for a main program

A more realistic pattern for a simple Unix filter like program has the following structure:

module Main
where

import Text.XML.HXT.Arrow

import System.IO
import System.Environment
import System.Console.GetOpt
import System.Exit

main :: IO ()
main
    = do
      argv <- getArgs
      (al, src, dst) <- cmdlineOpts argv
      [rc]  <- runX (application al src dst)
      if rc >= c_err
	 then exitWith (ExitFailure (0-1))
	 else exitWith ExitSuccess

-- | the dummy for the boring stuff of option evaluation,
-- usually done with 'System.Console.GetOpt'

cmdlineOpts 	:: [String] -> IO (Attributes, String, String)
cmdlineOpts argv
    = return ([(a_validate, v_0)], argv!!0, argv!!1)

-- | the main arrow

application	:: Attributes -> String -> String -> IOSArrow b Int
application al src dst
    = readDocument al src
      >>>
      processChildren (processDocumentRootElement `when` isElem)  -- (1)
      >>>
      writeDocument al dst
      >>>
      getErrStatus


-- | the dummy for the real processing: the identity filter

processDocumentRootElement	:: IOSArrow XmlTree XmlTree
processDocumentRootElement
    = this         -- substitute this by the real application

This program has the same functionality as our first example, but it separates the arrow from the boring option evaluation and return code computation.

The interesing line is (1). readDocument generates a tree structure with a so called extra root node. This root node is a node above the XML document root element. The node above the XML document root element is neccessary because of possible other elements on the same tree level as the XML root, for instance comments, processing instructions or whitespace.

Furthermore the artificial root node serves for storing meta information about the document in the attribute list, like the document name, the encoding scheme, the HTTP transfer headers and other information.

To process the real XML root element, we have to take the children of the root node, select the XML root element and process this, but remain all other children unchanged. This is done with processChildren and the when choice operator. processChildren applies a filter elementwise to all children of a node. All results form processing the list of children from the result node.

The structure of internal document tree can be made visible e.g. by adding the option pair (a_show_tree, v_1) to the writeDocument arrow. This will emit the tree in a readable text representation instead of the real document.

In the next section we will give examples for the processDocumentRootElement arrow.

Selection examples

Selecting text from an HTML document

Selecting all the plain text of an XML/HTML document can be formulated with

selectAllText	:: ArrowXml a => a XmlTree XmlTree
selectAllText
    = deep isText

deep traverses the whole tree, stops the traversal when a node is a text node (isText) and returns all the text nodes. There are two other traversal operators deepest and multi, In this case, where the selected nodes are all leave, these would give the same result.

Selecting text and ALT attribute values

Let's take a bit more complex tast: We want to select all text, but also the values of the alt attributes of image tags.

selectAllTextAndAltValues	:: ArrowXml a => a XmlTree XmlTree
selectAllTextAndAltValues
    = deep
      ( isText                       -- (1)
	<+>
	( isElem >>> hasName "img"   -- (2)
	  >>>
	  getAttrValue "alt"         -- (3)
	  >>>
	  mkText                     -- (4)
	)
      )

The whole tree is searched for text nodes (1) and for image elements (2), from the image elements the alt attribute values are selected as plain text (3), this text is transformed into a text node (4).

Selecting text and ALT attributes values (2)

Let's refine the above filter one step further. The text from the alt attributes shall be marked in the output by surrounding double square brackets. Empty alt values shall be ignored.

selectAllTextAndRealAltValues	:: ArrowXml a => a XmlTree XmlTree
selectAllTextAndRealAltValues
    = deep
      ( isText
	<+>
	( isElem >>> hasName "img"
	  >>>
	  getAttrValue "alt"
	  >>>
	  isA significant            -- (1)
	  >>>
	  arr addBrackets            -- (2)
	  >>>
	  mkText
	)
      )
    where
    significant :: String -> Bool
    significant = not . all (`elem` " \n\r\t")

    addBrackets :: String -> String
    addBrackets s
	=  " [[ " ++ s ++ " ]] "

This example shows two combinators for building arrows from pure functions. The first one isA removes all empty or whitespace values from alt attributes (1), the other arr lifts the editing function to the arrow level (2).

Document construction examples

The Hello World document

The first document, of course, is a Hello World document:

helloWorld	:: ArrowXml a => a XmlTree XmlTree
helloWorld
    = mkelem "html" []              -- (1)
      [ mkelem "head" []
	[ mkelem "title" []
	  [ txt "Hello World" ]     -- (2)
	]
      , mkelem "body"
	[ sattr "class" "haskell" ] -- (3)
	[ mkelem "h1" []
	  [ txt "Hello World" ]     -- (4)
	]
      ]

The main arrows for document construction are mkelem and it's variants (selem, aelem, eelem) for element creation, attr and sattr for attributes and mktext and txt for text nodes.

mkelem takes three arguments, the element name (or tag name), a list of arrows for the construction of attributes, not empty in (3), and a list of arrows for the contents. Text content is generated in (2) and (4).

When writeDocument is called with the indent option ((a_indent, v_1)) the following output is generated (if the indent option is not given, the whole document would appears on a single line):

<?xml version="1.0" encoding="UTF-8"?>
<html>
  <head>
    <title>Hello World</title>
  </head>
  <body class="haskell">
    <h1>Hello World</h1>
  </body>
</html>

The code can be shortened a bit by using some of the convenient functions:

helloWorld2	:: ArrowXml a => a XmlTree XmlTree
helloWorld2
    = selem "html"
      [ selem "head"
	[ selem "title"
	  [ txt "Hello World" ]
	]
      , mkelem "body"
	[ sattr "class" "haskell" ]
	[ selem "h1"
	  [ txt "Hello World" ]
	]
      ]

In the above two examples the arrow input is totally ignored, because of the use of the constant arrow txt "...".

A HTML page with information about all images within a page

A bit more interesting task is the construction of a page containg a table of all images within a page inclusive image URLs, geometry and ALT attributes.

The program for this has a frame similar to the helloWorld program, but the rows of the table must be filled in from the input document. In the first step we will generate a table with a single row containing the URL.

imageTable	:: ArrowXml a => a XmlTree XmlTree
imageTable
    = selem "html"
      [ selem "head"
	[ selem "title"
	  [ txt "Images in Page" ]
	]
      , selem "body"
	[ selem "h1"
	  [ txt "Images in Page" ]
	, selem "table"
	  [ collectImages           -- (1)
	    >>>
	    genTableRows            -- (2)
	  ]
	]
      ]
    where
    collectImages                   -- (1)
	= deep ( isElem
                 >>>
                 hasName "img"
               )
    genTableRows                    -- (2)
	= selem "tr"
	  [ selem "td"
	    [ getAttrValue "src" >>> mkText ]
	  ]

With (1) the image elements are collected, and with (2) the HTML code for an image element is build.

Applied to http://www.haskell.org/ we get the following result (at the time writing this page):

<html>
  <head>
    <title>Images in Page</title>
  </head>
  <body>
    <h1>Images in Page</h1>
    <table>
      <tr>
        <td>/haskellwiki_logo.png</td>
      </tr>
      <tr>
        <td>/sitewiki/images/1/10/Haskelllogo-small.jpg</td>
      </tr>
      <tr>
        <td>/haskellwiki_logo_small.png</td>
      </tr>
    </table>
  </body>
</html>

Transformation examples

Decorating external references of an HTML document

In the following examples, we want to decorate the external references in an HTML page by a small icon, like it's done in many wikis. For this task the document tree has to be traversed, all parts except the intersting A-Elements remain unchanged. At the end of the list of children of an A-Element we add an image element.

Here is the first version:

addRefIcon	:: ArrowXml a => a XmlTree XmlTree
addRefIcon
    = processTopDown                       -- (1)
      ( addImg                             -- (2)
	`when`
	isExternalRef                      -- (3)
      )
    where
    isExternalRef                          -- (4)
	= isElem
	  >>>
	  hasName "a"
          >>>
	  hasAttr "href"
	  >>>
	  getAttrValue "href"
	  >>>
	  isA isExtRef
	where
	isExtRef                           -- (4.1)
	    = isPrefixOf "http:"           -- or something more precise

    addImg
	= replaceChildren                  -- (5)
	  ( getChildren                    -- (6)
	    <+>
	    imgElement                     -- (7)
	  )

    imgElement
	= mkelem "img"                     -- (8)
	  [ sattr "src" "/icons/ref.png"   -- (9)
	  , sattr "alt" "external ref"
	  ] []                             -- (10)

The traversal is done with processTopDown (1). This arrow applies an arrow to all nodes of the whole document tree. The transformation arrow applies the addImg (2) to all A-elements (3),(4). This arrow uses a bit simplified test (4.1) for external URLs. addImg manipulates all children (5) of the A-elements by selecting the current children (6) and adding an image element (7). The image element is constructed with mkelem (8). This takes an element name, a list of arrows for computing the attributes and a list of arrows for computing the contents. The content of the image element is empty (10). The attributes are constructed with sattr (9). sattr ignores the arrow input and builds an attribute form the name value pair of arguments.

More complex examples

to be done