Difference between revisions of "HXT"

From HaskellWiki
Jump to: navigation, search
 
Line 6: Line 6:
   
 
The Haskell XML Toolbox bases on the ideas of [http://www.cs.york.ac.uk/fp/HaXml/ HaXml] and [http://www.flightlab.com/~joe/hxml/ HXML] but introduces a more general approach for processing XML with Haskell. HXT uses a generic data model for representing XML documents, including the DTD subset, entity references, CData parts and processing instructions. This data model makes it possible to use tree transformation functions as a uniform design of XML processing steps from parsing, DTD processing, entity processing, validation, namespace propagation, content processing and output.
 
The Haskell XML Toolbox bases on the ideas of [http://www.cs.york.ac.uk/fp/HaXml/ HaXml] and [http://www.flightlab.com/~joe/hxml/ HXML] but introduces a more general approach for processing XML with Haskell. HXT uses a generic data model for representing XML documents, including the DTD subset, entity references, CData parts and processing instructions. This data model makes it possible to use tree transformation functions as a uniform design of XML processing steps from parsing, DTD processing, entity processing, validation, namespace propagation, content processing and output.
  +
  +
== The basic concepts ==
  +
  +
=== The basic data strutures ===
  +
  +
Processing of XML is a task of processing tree structures. This is can be done in Haskell in a very elegant way by defining an appropriate tree data type, a Haskell DOM (document object model) structure. The tree structure in HXT is a rose tree with a special XNode data type for storing the XML node information.
  +
  +
The generally useful tree structure (NTree) is separated from the node type (XNode). This allows for reusing the tree structure and the tree traversal and maipulation functions in other applications.
  +
  +
<haskell>
  +
type NTree a = NTree a [NTree a] -- rose tree
  +
  +
data XNode = XText String -- plain text node
  +
| ...
  +
| XTag QName XmlTrees -- element name and list of attributes
  +
| XAttr QName -- attribute name
  +
| ...
  +
  +
type QName = ... -- qualified name
  +
  +
type XmlTree = NTree XNode
  +
  +
type XmlTrees = [XmlTree]
  +
</haskell>
  +
  +
=== The concept of filters ===
  +
  +
Selecting, transforming and generating trees often requires routines, which compute not only a single result tree, but a (possibly empty) list of (sub-)trees. This leads to the idea of XML filters like in HaXml. Filters are functions, which take an XML tree as input and compute a list of result trees.
  +
  +
<haskell>
  +
type XmlFilter = XmlTree -> [XmlTree]
  +
</haskell>
  +
  +
More generally we can define a filter as
  +
  +
<haskell>
  +
type Filter a b = a -> [b]
  +
</haskell>
  +
  +
We will do this abstraction later, when introducing arrows. Many of the functions in the following motivating examples can be generalised this way. But for getting the idea, the <hask>XmlFilter</hask> is sufficient.
  +
  +
The filter functions are used so frequently, that the idea of defining a domain specific language with filters as the basic processing units comes up. In such a DSL the basic filters are predicates, selectors, constructors and transformers, all working on the HXT DOM tree structure. For a DSL it becomes neccessary to define an appropriate set of combinators for building more complex functions from simpler ones. Of course filter composition, like (.) becomes one of the most frequently used combinators. there are more complex filters for traversal of a whole tree and selection or transformation of several nodes. We will see a few first examples in the following part.
  +
  +
The first task is to build filters from pure functions, to define a lift operator. Pure functions are liftet to filters in the following way:
  +
  +
Predicates are lifted by mapping False to the empty list and True to the single element list, containing the input tree.
  +
  +
<haskell>
  +
p :: XmlTree -> Bool -- pure function
  +
p t = ...
  +
  +
pf :: XmlTree -> [XmlTree] -- or XmlFilter
  +
pf t
  +
| p t = [t]
  +
| otherwise = []
  +
</haskell>
  +
  +
The combinator for this type of lifting is called <hask>isA</hask>, it works on any type and is defined as
  +
  +
<haskell>
  +
isA :: (a -> Bool) -> (a -> [a])
  +
isA p x
  +
| p x = [x]
  +
| otherwise = []
  +
</haskell>
  +
  +
A predicate for filtering text nodes looks like this
  +
  +
<haskell>
  +
isText :: XmlFilter -- XmlTree -> [XmlTrees]
  +
isText t@(NTree (XText _) _) = [t]
  +
isText _ = []
  +
</haskell>
  +
  +
Transformers, function that map a tree into another tree, are lifted in a trivial way:
  +
  +
<haskell>
  +
f :: XmlTree -> XmlTree
  +
f t = exp(t)
  +
  +
ff :: XmlTree -> [XmlTree]
  +
ff t = [exp(t)]
  +
</haskell>
  +
  +
This basic function is called <hask>arr</hask>, it comes from the Control.Arrow module of the basic library package of ghc.
  +
  +
Partial functions, functions that can't always compute a result, are usually liftet to totally defined filters:
  +
  +
<haskell>
  +
f :: XmlFilter -- XmlTree -> [XmlTree]
  +
f t
  +
| p t = expr(t)
  +
| otherwise = error "f not defined"
  +
  +
ff :: XmlFilter
  +
ff t
  +
| p t = [expr(t)]
  +
| otherwise = []
  +
</haskell>
  +
  +
This is a rather comfotable situation, with these filters we don't have to deal with illegal argument errors. Illegal arguments are just mapped to the empty list.
  +
  +
When processing trees, there's often the case, that no, exactly one, or more than one result is possible. These functions, returning a set of results are often a bit imprecisely called ''nondeterministic'' functions. These functions, e.g. selecting all children of a node or all grandchildren, are exactly our filters. In this context lists instead of sets of values are the appropriate result type, because the ordering in XML is important and duplicates are possible.
  +
  +
''Working with filters is rather similar to working with binary relations, and working with relations is rather natural and comfortable, database people do know this very well.''
  +
  +
Two first examples for working with ''nondeterministic'' functions are selecting the children and the grandchildren of an XmlTree which can be implemented by
  +
  +
<haskell>
  +
getChildren :: XmlFilter
  +
getChildren (NTree n cs)
  +
= cs
  +
  +
getGrandChildren :: XmlFilter
  +
getGrandChildren (NTree n cs)
  +
= concat [ getChildren c | c <- cs ]
  +
</haskell>

Revision as of 14:41, 15 September 2006

The Haskell XML Toolbox (HXT) is a collection of tools for processing XML with Haskell. The core component of the Haskell XML Toolbox is a domain specific language, consisting of a set of combinators, for processing XML trees in a simple and elegant way. The combinator library is based on the concept of arrows. The main component is a validating and namespace aware XML-Parser that supports almost fully the XML 1.0 Standard. Extensions are a validator for RelaxNG and an XPath evaluator.

Background

The Haskell XML Toolbox bases on the ideas of HaXml and HXML but introduces a more general approach for processing XML with Haskell. HXT uses a generic data model for representing XML documents, including the DTD subset, entity references, CData parts and processing instructions. This data model makes it possible to use tree transformation functions as a uniform design of XML processing steps from parsing, DTD processing, entity processing, validation, namespace propagation, content processing and output.

The basic concepts

The basic data strutures

Processing of XML is a task of processing tree structures. This is can be done in Haskell in a very elegant way by defining an appropriate tree data type, a Haskell DOM (document object model) structure. The tree structure in HXT is a rose tree with a special XNode data type for storing the XML node information.

The generally useful tree structure (NTree) is separated from the node type (XNode). This allows for reusing the tree structure and the tree traversal and maipulation functions in other applications.

type NTree a  = NTree a [NTree a]     -- rose tree

data XNode    = XText String          -- plain text node
              | ...
              | XTag QName XmlTrees   -- element name and list of attributes
              | XAttr QName           -- attribute name
              | ...

type QName    = ...                   -- qualified name

type XmlTree  = NTree XNode

type XmlTrees = [XmlTree]

The concept of filters

Selecting, transforming and generating trees often requires routines, which compute not only a single result tree, but a (possibly empty) list of (sub-)trees. This leads to the idea of XML filters like in HaXml. Filters are functions, which take an XML tree as input and compute a list of result trees.

type XmlFilter = XmlTree -> [XmlTree]

More generally we can define a filter as

type Filter a b = a -> [b]

We will do this abstraction later, when introducing arrows. Many of the functions in the following motivating examples can be generalised this way. But for getting the idea, the XmlFilter is sufficient.

The filter functions are used so frequently, that the idea of defining a domain specific language with filters as the basic processing units comes up. In such a DSL the basic filters are predicates, selectors, constructors and transformers, all working on the HXT DOM tree structure. For a DSL it becomes neccessary to define an appropriate set of combinators for building more complex functions from simpler ones. Of course filter composition, like (.) becomes one of the most frequently used combinators. there are more complex filters for traversal of a whole tree and selection or transformation of several nodes. We will see a few first examples in the following part.

The first task is to build filters from pure functions, to define a lift operator. Pure functions are liftet to filters in the following way:

Predicates are lifted by mapping False to the empty list and True to the single element list, containing the input tree.

p   :: XmlTree -> Bool		-- pure function
p t =  ...

pf  :: XmlTree -> [XmlTree]	-- or XmlFilter
pf t
  | p t       = [t]
  | otherwise = []

The combinator for this type of lifting is called isA, it works on any type and is defined as

isA  :: (a -> Bool) -> (a -> [a])
isA p x
  | p x       = [x]
  | otherwise = []

A predicate for filtering text nodes looks like this

isText		              :: XmlFilter       -- XmlTree -> [XmlTrees]
isText t@(NTree (XText _) _) =  [t]
isText _                     =  []

Transformers, function that map a tree into another tree, are lifted in a trivial way:

f	:: XmlTree -> XmlTree
f t	=  exp(t)

ff	:: XmlTree -> [XmlTree]
ff t	= [exp(t)]

This basic function is called arr, it comes from the Control.Arrow module of the basic library package of ghc.

Partial functions, functions that can't always compute a result, are usually liftet to totally defined filters:

f	:: XmlFilter		-- XmlTree -> [XmlTree]
f t
  | p t	      =  expr(t)
  | otherwise = error "f not defined"

ff	:: XmlFilter
ff t
  | p t       = [expr(t)]
  | otherwise = []

This is a rather comfotable situation, with these filters we don't have to deal with illegal argument errors. Illegal arguments are just mapped to the empty list.

When processing trees, there's often the case, that no, exactly one, or more than one result is possible. These functions, returning a set of results are often a bit imprecisely called nondeterministic functions. These functions, e.g. selecting all children of a node or all grandchildren, are exactly our filters. In this context lists instead of sets of values are the appropriate result type, because the ordering in XML is important and duplicates are possible.

Working with filters is rather similar to working with binary relations, and working with relations is rather natural and comfortable, database people do know this very well.

Two first examples for working with nondeterministic functions are selecting the children and the grandchildren of an XmlTree which can be implemented by

getChildren	 :: XmlFilter
getChildren (NTree n cs)
  = cs

getGrandChildren :: XmlFilter
getGrandChildren (NTree n cs)
  = concat [ getChildren c | c <- cs ]