Difference between revisions of "HXT"

From HaskellWiki
Jump to navigation Jump to search
m
(clarify an explanation, and attach it to the code it explains)
 
(66 intermediate revisions by 25 users not shown)
Line 1: Line 1:
[[Category:Tools]] [[Category:Tutorials]]
+
[[Category:Web]]
  +
[[Category:XML]]
  +
[[Category:Tools]]
  +
[[Category:Tutorials]]
  +
[[Category:Libraries]]
   
== A gentle introduction to the Haskell XML Toolbox ==
+
== A Gentle Introduction to the Haskell XML Toolbox ==
   
The [http://www.fh-wedel.de/~si/HXmlToolbox/index.html Haskell XML Toolbox (HXT)] is a collection of tools for processing XML with Haskell. The core component of the Haskell XML Toolbox is a domain specific language, consisting of a set of combinators, for processing XML trees in a simple and elegant way. The combinator library is based on the concept of arrows. The main component is a validating and namespace aware XML-Parser that supports almost fully the XML 1.0 Standard. Extensions are a validator for RelaxNG and an XPath evaluator.
+
The [http://www.fh-wedel.de/~si/HXmlToolbox/index.html Haskell XML Toolbox (HXT)] is a collection of tools for processing XML with Haskell. The core component of the Haskell XML Toolbox is a domain specific language consisting of a set of combinators for processing XML trees in a simple and elegant way. The combinator library is based on the concept of arrows. The main component is a validating and namespace-aware XML Parser that supports the XML 1.0 Standard almost completely. Extensions are a validator for RelaxNG and an XPath evaluator.
   
 
__TOC__
 
__TOC__
Line 9: Line 13:
 
== Background ==
 
== Background ==
   
The Haskell XML Toolbox bases on the ideas of [http://www.cs.york.ac.uk/fp/HaXml/ HaXml] and [http://www.flightlab.com/~joe/hxml/ HXML] but introduces a more general approach for processing XML with Haskell. HXT uses a generic data model for representing XML documents, including the DTD subset, entity references, CData parts and processing instructions. This data model makes it possible to use tree transformation functions as a uniform design of XML processing steps from parsing, DTD processing, entity processing, validation, namespace propagation, content processing and output.
+
The Haskell XML Toolbox is based on the ideas of [http://www.cs.york.ac.uk/fp/HaXml/ HaXml] and [http://www.flightlab.com/~joe/hxml/ HXML], but introduces a more general approach for processing XML with Haskell. HXT uses a generic data model for representing XML documents, including the DTD subset, entity references, CData parts and processing instructions. This data model makes it possible to use tree transformation functions as a uniform design of XML processing steps from parsing, DTD processing, entity processing, validation, namespace propagation, content processing and output.
  +
  +
HXT has grown over the years. Components for XPath, XSLT, validation
  +
with RelaxNG, picklers for conversion from/to native Haskell data,
  +
lazy parsing with tagsoup, input via curl and native Haskell HTTP
  +
and others have been added. This has led to a rather large package
  +
with a lot of dependencies.
  +
  +
To make the toolbox more modular and to reduce the dependencies on
  +
other packages, hxt has been split
  +
into various smaller packages since version 9.0.0.
   
 
== Resources ==
 
== Resources ==
   
  +
=== Home Page and Repository ===
; [http://www.fh-wedel.de/~si/HXmlToolbox/index.html HXT Home] :
 
  +
; [http://www.fh-wedel.de/~si/HXmlToolbox/HXT-7.0.tar.gz HXT-7.0.tar.gz] : lastest release
 
; [http://darcs.fh-wedel.de/hxt/ darcs.fh-wedel.de/hxt] : darcs repository with head revision of HXT
+
;[http://www.fh-wedel.de/~si/HXmlToolbox/index.html HXT]: The project home for HXT
  +
;[http://github.com/UweSchmidt/hxt HXT on GitHub]: The git source repository on github for all HXT packages
; [http://darcs.fh-wedel.de/hxt/doc/hdoc_arrow/ Arrow API] : Haddock documentation of head revision with links to source files
 
  +
; [http://darcs.fh-wedel.de/hxt/doc/hdoc/ Complete API] : Haddock documentation with arrows and old API based on filters
 
  +
=== Packages ===
  +
All packages are available on Hackage.
  +
  +
==== HXT-related packages ====
  +
;[http://hackage.haskell.org/package/hxt hxt]:The package [http://hackage.haskell.org/package/hxt hxt] forms the core of the toolbox. It contains a validating XML parser and a HTML parser, which tries to read any text as HTML, a DSL for processing, transforming and generating XML/HTML, and so called pickler for conversion from/to XML and native Haskell data.
  +
;[http://hackage.haskell.org/package/HandsomeSoup HandsomeSoup]: HandsomeSoup adds CSS selectors to HXT.
  +
;[https://hackage.haskell.org/package/hxt-css hxt-css]: Another CSS selector engine for HXT.
  +
;[http://hackage.haskell.org/package/hxt-http hxt-http]: Native HTTP support is contained in [http://hackage.haskell.org/package/hxt-http hxt-http] and depends on package [http://hackage.haskell.org/package/HTTP HTTP].
  +
;[http://hackage.haskell.org/package/hxt-curl hxt-curl]:HTTP support via libCurl and package [http://hackage.haskell.org/package/curl curl] is in [http://hackage.haskell.org/package/hxt-curl hxt-curl].
  +
;[http://hackage.haskell.org/package/hxt-tagsoup hxt-tagsoup]:The lazy tagsoup parser can be found in package [http://hackage.haskell.org/package/hxt-tagsoup hxt-tagsoup], only this package depends on Neil Mitchell's [http://hackage.haskell.org/package/tagsoup tagsoup].
  +
;[http://hackage.haskell.org/package/hxt-xpath hxt-xpath]:
  +
;[http://hackage.haskell.org/package/hxt-xslt hxt-xslt]:
  +
;[http://hackage.haskell.org/package/hxt-relaxng hxt-relaxng]: The XPath-, XSLT- and RelaxNG-extensions are separated into [http://hackage.haskell.org/package/hxt-xpath hxt-xpath], [http://hackage.haskell.org/package/hxt-xslt hxt-xslt] and [http://hackage.haskell.org/package/hxt-relaxng hxt-relaxng].
  +
  +
==== More general packages ====
  +
There are some basic functionalities, which are not only of interest for HXT, but can be useful for other none XML/HTML related projects. These have been separated too.
  +
;[http://hackage.haskell.org/package/hxt-charproperties hxt-charproperties]: defines XML- and Unicode character class properties.
  +
;[http://hackage.haskell.org/package/hxt-unicode hxt-unicode]:contains decoding function from various encoding schemes to Unicode. The difference of these functions compared to most of those available on hackage are, that these functions are lazy even in the case of encoding errors (thanks to Henning Thielemann).
  +
;[http://hackage.haskell.org/package/hxt-regex-xmlschema hxt-regex-xmlschema]: contains a lightweight and efficient regex-library. There is full Unicode support, the standard syntax defined in the XML-Schema doc is supported, and there are extensions available for intersection, difference, exclusive OR. The package is self contained, no other regex library is required. The Wiki page [[Regular expressions for XML Schema]] describes the theory behind this regex library and the extensions and gives some usage examples.
  +
;[http://hackage.haskell.org/package/hxt-cache hxt-cache]: A cache for storing parsed XML/HTML pages in binary from. This is used in the Holumbus searchengine framework and the Hayoo! API search for speeding up the repeated indexing of pages.
  +
  +
=== Installation ===
  +
When installing hxt with cabal, one does not have to deal with all the
  +
basic packages. Just a <code>cabal install hxt</code> does the work for the core toolbox. When HTTP access is required, install at least one of
  +
the packages hxt-curl or hxt-http. All other packages can be installed
  +
on demand any time later.
  +
  +
=== Upgrade from HXT versions < 9.0 ===
  +
  +
HXT-9 is not downwards compatible. The splitting into smaller
  +
packages required some internal reorganisation and changes of some type
  +
declarations. To use the main features of the core package, import
  +
<hask>Text.XML.HXT.Core</hask> instead of <hask>Text.XML.HXT.Arrow</hask>.
  +
  +
The second major change was the kind of configuration and option handling.
  +
This was done previously by lists of key-value-pairs implemented as strings.
  +
The growing number of options and the untyped option values have led to
  +
unreliable code. With HXT-9, options are represented in a more type-safe manner by functions instead of strings, so option handling has to be modified when switching to the new version. For examples, see [[#copyXML|copyXML]] and
  +
[[#Pattern for a main program|Pattern for a main program]] below.
   
 
== The basic concepts ==
 
== The basic concepts ==
Line 28: Line 81:
   
 
<haskell>
 
<haskell>
type NTree a = NTree a [NTree a] -- rose tree
+
data NTree a = NTree a [NTree a] -- rose tree
   
 
data XNode = XText String -- plain text node
 
data XNode = XText String -- plain text node
Line 45: Line 98:
 
=== The concept of filters ===
 
=== The concept of filters ===
   
Selecting, transforming and generating trees often requires routines, which compute not only a single result tree, but a (possibly empty) list of (sub-)trees. This leads to the idea of XML filters like in HaXml. Filters are functions, which take an XML tree as input and compute a list of result trees.
+
Selecting, transforming and generating trees often requires routines which compute not only a single result tree, but a possibly empty list of (sub-)trees. This leads to the idea of XML filters like in HaXml. Filters are functions which take an XML tree as input and compute a list of result trees:
   
 
<haskell>
 
<haskell>
Line 51: Line 104:
 
</haskell>
 
</haskell>
   
More generally we can define a filter as
+
More generally, we can define a filter as:
   
 
<haskell>
 
<haskell>
Line 57: Line 110:
 
</haskell>
 
</haskell>
   
We will do this abstraction later, when introducing arrows. Many of the functions in the following motivating examples can be generalised this way. But for getting the idea, the <hask>XmlFilter</hask> is sufficient.
+
<p></p>We will do this abstraction later when introducing arrows. Many of the functions in the following motivating examples can be generalized this way. But for getting the idea, the <hask>XmlFilter</hask> is sufficient.
   
The filter functions are used so frequently, that the idea of defining a domain specific language with filters as the basic processing units comes up. In such a DSL the basic filters are predicates, selectors, constructors and transformers, all working on the HXT DOM tree structure. For a DSL it becomes necessary to define an appropriate set of combinators for building more complex functions from simpler ones. Of course filter composition, like (.) becomes one of the most frequently used combinators. there are more complex filters for traversal of a whole tree and selection or transformation of several nodes. We will see a few first examples in the following part.
+
<p></p>The filter functions are used so frequently that the idea of defining a domain specific language with filters as the basic processing units comes up. In such a DSL the basic filters are predicates, selectors, constructors and transformers, all working on the HXT DOM tree structure. For a DSL it becomes necessary to define an appropriate set of combinators for building more complex functions from simpler ones. Of course filter composition, like <hask>(.)</hask>, becomes one of the most frequently used combinators. There are more complex filters for traversal of a whole tree and selection or transformation of several nodes. We will see a few first examples in the following part.
   
The first task is to build filters from pure functions, to define a lift operator. Pure functions are lifted to filters in the following way:
+
<p></p>To build filters from pure functions, we need a way to lift them to filters. For predicates this is done by mapping <hask>False</hask> to the empty list and <hask>True</hask> to a singleton list containing the input tree:<p></p>
 
Predicates are lifted by mapping False to the empty list and True to the single element list, containing the input tree.
 
   
 
<haskell>
 
<haskell>
p :: XmlTree -> Bool -- pure function
+
p :: XmlTree -> Bool -- pure function
 
p t = ...
 
p t = ...
   
pf :: XmlTree -> [XmlTree] -- or XmlFilter
+
pf :: XmlTree -> [XmlTree] -- or XmlFilter
  +
pf t | p t = [t]
pf t
 
| p t = [t]
+
| otherwise = []
| otherwise = []
 
 
</haskell>
 
</haskell>
   
The combinator for this type of lifting is called <hask>isA</hask>, it works on any type and is defined as
+
<p></p>The combinator for this kind of lifting is called <hask>isA</hask>, and it works on any type. It is defined as:<p></p>
   
 
<haskell>
 
<haskell>
 
isA :: (a -> Bool) -> (a -> [a])
 
isA :: (a -> Bool) -> (a -> [a])
isA p x
+
isA p x | p x = [x]
| p x = [x]
+
| otherwise = []
| otherwise = []
 
 
</haskell>
 
</haskell>
   
A predicate for filtering text nodes looks like this
+
As an example, a predicate for filtering text nodes looks like this:
   
 
<haskell>
 
<haskell>
isXText :: XmlFilter -- XmlTree -> [XmlTrees]
+
isXText :: XmlTree -> Bool
isXText t@(NTree (XText _) _) = [t]
+
isXText (NTree (XText _) _) = True
isXText _ = []
+
isXText _ = False
  +
  +
isXTextF :: XmlFilter -- XmlTree -> [XmlTree]
  +
isXTextF = isA isXText
 
</haskell>
 
</haskell>
   
Transformers, function that map a tree into another tree, are lifted in a trivial way:
+
Transformers, i.e. functions that map a tree into another tree, are lifted in a trivial way via singleton lists:
   
 
<haskell>
 
<haskell>
f :: XmlTree -> XmlTree
+
f :: XmlTree -> XmlTree
f t = exp(t)
+
f t = exp(t)
   
ff :: XmlTree -> [XmlTree]
+
ff :: XmlTree -> [XmlTree]
ff t = [exp(t)]
+
ff t = [exp(t)]
 
</haskell>
 
</haskell>
   
This basic function is called <hask>arr</hask>, it comes from the Control.Arrow module of the basic library package of ghc.
+
<p></p>This basic function is called <hask>arr</hask>, it comes from the <hask>Control.Arrow</hask> module of the <hask>base</hask> package of GHC. Partial functions, i.e functions that can't always compute a result, are usually lifted to totally defined filters where illegal arguments are mapped to the empty list:
 
Partial functions, functions that can't always compute a result, are usually lifted to totally defined filters:
 
   
 
<haskell>
 
<haskell>
f :: XmlTree -> XmlTree
+
f :: XmlTree -> XmlTree
  +
f t | p t = expr(t)
f t
 
| p t = expr(t)
+
| otherwise = error "f not defined"
| otherwise = error "f not defined"
 
   
ff :: XmlFilter
+
ff :: XmlFilter
  +
ff t | p t = [expr(t)]
ff t
 
| p t = [expr(t)]
+
| otherwise = []
| otherwise = []
 
 
</haskell>
 
</haskell>
   
  +
When processing trees, there's often the case that no result, exactly one result, or more than one result is possible. These functions, returning a set of results, are often a bit imprecisely called ''non-deterministic'' functions. Such functions, e.g. selecting all children of a node or all grandchildren, are exactly our filters. In this context, lists instead of sets of values are the appropriate result type, because the ordering in XML is important and duplicates are possible. Examples for working with ''non-deterministic'' functions are selecting the children and the grandchildren of an XmlTree, which can be implemented as follows:
This is a rather comfortable situation, with these filters we don't have to deal with illegal argument errors. Illegal arguments are just mapped to the empty list.
 
 
When processing trees, there's often the case, that no, exactly one, or more than one result is possible. These functions, returning a set of results are often a bit imprecisely called ''nondeterministic'' functions. These functions, e.g. selecting all children of a node or all grandchildren, are exactly our filters. In this context lists instead of sets of values are the appropriate result type, because the ordering in XML is important and duplicates are possible.
 
 
Working with filters is rather similar to working with binary relations, and working with relations is rather natural and comfortable, database people do know this very well.
 
 
Two first examples for working with ''nondeterministic'' functions are selecting the children and the grandchildren of an XmlTree which can be implemented by
 
   
 
<haskell>
 
<haskell>
getChildren :: XmlFilter
+
getChildren :: XmlFilter
getChildren (NTree n cs)
+
getChildren (NTree n cs) = cs
= cs
 
   
 
getGrandChildren :: XmlFilter
 
getGrandChildren :: XmlFilter
getGrandChildren (NTree n cs)
+
getGrandChildren (NTree n cs) = concat [ getChildren c | c <- cs ]
= concat [ getChildren c | c <- cs ]
 
 
</haskell>
 
</haskell>
  +
  +
Working with filters is rather similar to working with binary relations, which is rather natural and comfortable, database people know this very well.
   
 
=== Filter combinators ===
 
=== Filter combinators ===
   
Composition of filters (like function composition) is the most important combinator. We will use the infix operator <hask>(>>>)</hask> for filter composition and reverse the arguments, so we can read composition sequences from left to right, like with pipes in Unix. Composition is defined as follows:
+
Composition of filters, like function composition, is the most important combinator. We will use the infix operator <hask>(>>>)</hask> for filter composition and reverse the arguments, so we can read composition sequences from left to right, like pipes in Unix. Composition is defined as follows:<p></p>
   
 
<haskell>
 
<haskell>
(>>>) :: XmlFilter -> XmlFilter -> XmlFilter
+
(>>>) :: XmlFilter -> XmlFilter -> XmlFilter
 
 
(f >>> g) t = concat [g t' | t' <- f t]
 
(f >>> g) t = concat [g t' | t' <- f t]
 
</haskell>
 
</haskell>
   
This definition corresponds 1-1 to the composition of binary relations. With help of the <hask>(>>>)</hask> operator the definition of <hask>getGrandChildren</hask> becomes rather simple:
+
<p></p>This definition corresponds 1:1 to the composition of binary relations. With the help of the <hask>(>>>)</hask> operator, the definition of <hask>getGrandChildren</hask> becomes rather simple:<p></p>
   
 
<haskell>
 
<haskell>
Line 153: Line 194:
 
</haskell>
 
</haskell>
   
Selecting all text nodes of the children of an element can also be formulated very easily with the help of <hask>(>>>)</hask>
+
<p></p>Selecting all text nodes of the children of an element can also be formulated very easily with the help of <hask>(>>>)</hask>:<p></p>
   
 
<haskell>
 
<haskell>
Line 160: Line 201:
 
</haskell>
 
</haskell>
   
In case of predicate filter the <hask>(>>>)</hask> serves as a logical and operator, or from the relational view as an intersection operator: <hask>isA p1 >>> isA p2</hask> selects all values for which p1 and p2 both hold.
+
<p></p>When used to combine predicate filters, the <hask>(>>>)</hask> serves as a logical "and" operator or, from the relational view, as an intersection operator: <hask>isA p1 >>> isA p2</hask> selects all values for which p1 and p2 both hold. The dual operator to <hask>(>>>)</hask> is <hask>(<+>)</hask>, the logical "or" for filters:<p></p>
 
The dual operator to <hask>(>>>)</hask> is the logical or, (thinking in sets: The union operator). For this we define a sum operator <hask>(<+>)</hask>. The sum of two filters is defined as follows:
 
   
 
<haskell>
 
<haskell>
(<+>) :: XmlFilter -> XmlFilter -> XmlFilter
+
(<+>) :: XmlFilter -> XmlFilter -> XmlFilter
  +
(f <+> g) t = f t ++ g t -- not literally ++, rather the set union operator
 
  +
-- so without duplicates
(f <+> g) t = f t ++ g t
 
 
</haskell>
 
</haskell>
   
  +
<p></p>Combining elementary filters with <hask>(>>>)</hask> and <hask>(<+>)</hask> leads to more complex functionality. For example, selecting all text nodes within two levels of depth (in left to right order) can be formulated with:<p></p>
Example: <hask>isA p1 <+> isA p2</hask> is the locical or for filter.
 
 
Combining elementary filters with (>>>) and (<+>) leads to more complex functionality. For example, selecting all text nodes within two levels of depth (in left to right order) can be formulated with:
 
   
 
<haskell>
 
<haskell>
Line 179: Line 216:
 
</haskell>
 
</haskell>
   
  +
<p></p><blockquote>
'''Exercise:''' Are these filters equivalent or what's the difference between the two filters?
 
  +
'''Exercise:''' Are these filters equivalent? If not, what is the difference?
   
 
<haskell>
 
<haskell>
Line 186: Line 224:
 
( getChildren >>> isXText ) <+> ( getChildren >>> getChildren >>> isXText )
 
( getChildren >>> isXText ) <+> ( getChildren >>> getChildren >>> isXText )
 
</haskell>
 
</haskell>
  +
</blockquote><p></p>
   
Of course we need choice combinators. The first idea is an if-then-else filter,
+
Of course we need choice combinators, too. The first idea is an if-then-else filter, built from three simpler filters. But often it's easier and more elegant to work with simpler binary combinators for choice. So we will introduce the simpler ones first. One of these choice combinators is called <hask>orElse</hask> and is defined as follows:<p></p>
built up from three simpler filters. But often it's easier and more elegant to work with simpler binary combinators for choice. So we will introduce the simpler ones first.
 
 
One of these choice combinators is called <hask>orElse</hask> and is defined as
 
follows:
 
   
 
<haskell>
 
<haskell>
orElse :: XmlFilter -> XmlFilter -> XmlFilter
+
orElse :: XmlFilter -> XmlFilter -> XmlFilter
orElse f g t
+
orElse f g t | null res1 = g t
  +
| otherwise = res1
| null res1 = g t
 
| otherwise = res1
+
where res1 = f t
where
 
res1 = f t
 
 
</haskell>
 
</haskell>
   
The meaning is the following: If f computes a none empty list as result, f succeeds and this list is the result, else g is applied to the input and this yields the result. There are two other simple choice combinators usually written in infix notation, <hask> g `guards` f</hask> and <hask>f `when` g</hask>:
+
<p></p>If <hask>f</hask> computes a non-empty list as result, <hask>f</hask> succeeds and this list is the result. Otherwise <hask>g</hask> is applied to the input, and this yields the overall result.
  +
  +
There are two other simple choice combinators usually written in infix notation, <hask> g `guards` f</hask> and <hask>f `when` g</hask>, which are useful when transforming and manipulating trees:<p></p>
   
 
<haskell>
 
<haskell>
guards :: XmlFilter -> XmlFilter -> XmlFilter
+
guards :: XmlFilter -> XmlFilter -> XmlFilter
guards g f t
+
guards g f t | null (g t) = []
| null (g t) = []
+
| otherwise = f t
| otherwise = f t
 
   
when :: XmlFilter -> XmlFilter -> XmlFilter
+
when :: XmlFilter -> XmlFilter -> XmlFilter
when f g t
+
when f g t | null (g t) = [t]
| null (g t) = [t]
+
| otherwise = f t
| otherwise = f t
 
 
</haskell>
 
</haskell>
 
These choice operators become useful when transforming and manipulation trees.
 
   
 
=== Tree traversal filter ===
 
=== Tree traversal filter ===
   
A very basic operation on tree structures is the traversal of all nodes and the selection and/or transformation of nodes. Theses traversal filters serve as control structures for processing whole trees. They correspond to the map and fold combinators for lists.
+
A very basic operation on tree structures is the traversal of all nodes and the selection and/or transformation of nodes. These traversal filters serve as control structures for processing whole trees. They correspond to the map and fold combinators for lists.
   
 
The simplest traversal filter does a top down search of all nodes with a special feature. This filter, called <hask>deep</hask>, is defined as follows:
 
The simplest traversal filter does a top down search of all nodes with a special feature. This filter, called <hask>deep</hask>, is defined as follows:
Line 229: Line 260:
 
</haskell>
 
</haskell>
   
When a predicate filter is applied to <hask>deep</hask>, a top down search is done and all subtrees satisfying the predicate are collected. The descent into the tree stops, when a subtree is found because of the use of <hask>orElse</hask>.
+
When a predicate filter is applied to <hask>deep</hask>, a top down search is done and all subtrees satisfying the predicate are collected. The descent into the tree stops when a subtree is found, because of the use of <hask>orElse</hask>.
   
 
'''Example:''' Selecting all plain text nodes of a document can be formulated with:
 
'''Example:''' Selecting all plain text nodes of a document can be formulated with:
Line 244: Line 275:
 
</haskell>
 
</haskell>
   
A variant of <hask>deep</hask>, called <hask>multi</hask>, performs a complete search, where the tree traversal does not stop, when a node is found.
+
A variant of <hask>deep</hask>, called <hask>multi</hask>, performs a complete search, where the tree traversal does not stop when a node is found.
   
 
<haskell>
 
<haskell>
Line 257: Line 288:
 
=== Arrows ===
 
=== Arrows ===
   
We've already seen, that the filters <hask>a -> [b]</hask> are a very
+
We've already seen that the filters <hask>a -> [b]</hask> are a very powerful and sometimes a more elegant way to process XML than pure functions, which is the good news. The bad news is that filters are not general enough. We e.g. want to do some I/O while staying at the filter level, so for the IO monad we need something like:<p></p>
powerful and sometimes a more elegant way to process XML than pure
 
function. This is the good news. The bad news is, that filter are not
 
general enough. Of course we sometimes want to do some I/O and we want
 
to stay in the filter level. So we need something like
 
   
 
<haskell>
 
<haskell>
Line 267: Line 294:
 
</haskell>
 
</haskell>
   
  +
Sometimes it's appropriate to thread some state through the computation like in state monads, which leads to a type like:
for working in the IO monad.
 
 
Sometimes it's appropriate to thread some state through the computation
 
like in state monads. This leads to a type like
 
   
 
<haskell>
 
<haskell>
Line 276: Line 300:
 
</haskell>
 
</haskell>
   
  +
Real-world applications might need both extensions at the same time: Of course I/O is necessary but usually there are also some global options and variables for controlling the computations. In HXT for instance there are variables for controlling trace output, options for setting the default encoding scheme for input data and a base URI for accessing documents, which are addressed in a content or in a DTD part by relative URIs. So we need something like:
And in real world applications we need both extensions at the same
 
time. Of course I/O is necessary but usually there are also some
 
global options and variables for controlling the computations. In HXT,
 
for instance there are variables for controlling trace output, options
 
for setting the default encoding scheme for input data and a base URI
 
for accessing documents, which are addressed in a content or in a DTD
 
part by relative URIs. So we need something like
 
   
 
<haskell>
 
<haskell>
Line 288: Line 306:
 
</haskell>
 
</haskell>
   
  +
<p></p>We want to work with all four filter variants, and in the future perhaps with even more general filters, but of course not with four sets of filter names, e.g. <hask>deep</hask>, <hask>deepST</hask>, <hask>deepIO</hask>, <hask>deepIOST</hask>. This is the point where <hask>newtype</hask>s and <hask>class</hask>es come in. Classes are needed for overloading names and <hask>newtype</hask>s are needed to declare instances. Furthermore, the restriction of <hask>XmlTree</hask> as the argument and result type is not necessary and hinders reuse in many cases.
We want to work with all four filter variants, and in the future
 
perhaps with even more general filters, but of course not with four
 
sets of filter names, e.g. <hask>deep, deepST, deepIO, deepIOST</hask>.
 
   
  +
A filter as discussed above has all features of an arrow. Arrows are introduced for generalizing the concept of functions and function combination to more general kinds of computation.
This is the point where <hask>newtype</hask>s and <hask>class</hask>es
 
come in. Classes are needed for overloading names and
 
<hask>newtype</hask>s are needed to declare instances. Further the
 
restriction of <hask>XmlTree</hask> as argument and result type is
 
not neccessary and hinders reuse in many cases.
 
   
  +
A basic set of combinators for arrows is defined in the classes in the <hask>Control.Arrow</hask> module, containing the above mentioned <hask>(>>>), (<+>)</hask> and <hask>arr</hask>.
A filter discussed above has all features of an arrow. Arrows are
 
introduced for generalising the concept of functions and function
 
combination to more general kinds of computation than pure functions.
 
   
  +
In HXT, the additional classes for filters working with lists as the result type are defined in <hask>Control.Arrow.ArrowList</hask>. The choice operators are in <hask>Control.Arrow.ArrowIf</hask>, tree filters (like <hask>getChildren</hask>, <hask>deep</hask>, or <hask>multi</hask>) in <hask>Control.Arrow.ArrowTree</hask>, and the elementary XML-specific filters in <hask>Text.XML.HXT.XmlArrow</hask>.
A basic set of combinators for arrows is defined in the classes in the
 
<hask>Control.Arrow</hask> module, containing the above mentioned <hask>(>>>), (<+>), arr</hask>.
 
   
  +
In HXT, there are four types instantiated with these classes for pure list arrows, list arrows with a state, list arrows with IO and list arrows with a state and IO.
In HXT the additional classes for filters working with lists as result type are
 
defined in <hask>Control.Arrow.ArrowList</hask>. The choice operators are
 
in <hask>Control.Arrow.ArrowIf</hask>, tree filters, like <hask>getChildren, deep, multi, ...</hask> in
 
<hask>Control.Arrow.ArrowTree</hask> and the elementary XML specific
 
filters in <hask>Text.XML.HXT.XmlArrow</hask>.
 
 
In HXT there are four types instantiated with these classes for
 
pure list arrows, list arrows with a state, list arrows with IO
 
and list arrows with a state and IO.
 
   
 
<haskell>
 
<haskell>
newtype LA a b = LA { runLA :: (a -> [b]) }
+
newtype LA a b = LA { runLA :: ( a -> [b]) }
  +
newtype SLA s a b = SLA { runSLA :: (s -> a -> (s, [b])) }
 
newtype SLA s a b = SLA { runSLA :: (s -> a -> (s, [b])) }
+
newtype IOLA a b = IOLA { runIOLA :: ( a -> IO [b]) }
 
newtype IOLA a b = IOLA { runIOLA :: (a -> IO [b]) }
 
 
 
newtype IOSLA s a b = IOSLA { runIOSLA :: (s -> a -> IO (s, [b])) }
 
newtype IOSLA s a b = IOSLA { runIOSLA :: (s -> a -> IO (s, [b])) }
 
</haskell>
 
</haskell>
   
The first one and the last one are those used most frequently in the
+
The first and last one are those used most frequently in the toolbox, and of course there are lifting functions for converting general arrows into more specific arrows. Don't worry about all these conceptual details. Let's have a look at some ''Hello world'' examples.
toolbox, and of course there are lifting functions for converting the
 
special arrows into the more general ones.
 
 
Don't worry about all these conceptional details. Let's have a look into some
 
''Hello world'' examples.
 
   
 
== Getting started: Hello world examples ==
 
== Getting started: Hello world examples ==
Line 336: Line 329:
 
=== copyXML ===
 
=== copyXML ===
   
The first complete example is a program for
+
The first complete example is a program for copying an XML document:
copying an XML document
 
   
 
<haskell>
 
<haskell>
module Main
+
module Main where
where
 
   
import Text.XML.HXT.Arrow
 
 
import System.Environment
 
import System.Environment
  +
import Text.XML.HXT.Core
  +
import Text.XML.HXT.Curl
   
 
main :: IO ()
 
main :: IO ()
main
+
main = do
= do
+
[src, dst] <- getArgs
  +
runX $
[src, dst] <- getArgs
 
runX ( readDocument [(a_validate, v_0)] src
+
readDocument [withValidate no
  +
,withCurl []
>>>
 
writeDocument [] dst
+
] src
)
+
>>>
  +
writeDocument [withIndent yes
return ()
 
  +
,withOutputEncoding isoLatin1
  +
] dst
  +
return ()
 
</haskell>
 
</haskell>
   
  +
<p></p>The interesting part of this example is the call of <hask>runX</hask>, which executes an arrow. This arrow is one of the more powerful list arrows with IO and an HXT system state. The arrow itself is a composition of <hask>readDocument</hask> and <hask>writeDocument</hask>.
The interesting part of this example is
 
the call of <hask>runX</hask>. <hask>runX</hask> executes an
 
arrow. This arrow is one of the more powerful list arrows with IO and
 
a HXT system state.
 
   
  +
<p></p><hask>readDocument</hask> is an arrow for reading, DTD processing and validation of documents. Its behaviour can be controlled by a list of system options. Here we turn off the validation step. The <hask>src</hask>, a file name or an URI, is read and parsed and a document tree is built. The input option <hask>withCurl []</hask> enables reading via HTTP. For using this option, the extra package hxt-curl must be installed, and <hask>withCurl</hask> must be imported by <hask>import Text.XML.HXT.Curl</hask>. If only file access is necessary, this option and the import can be dropped. In that case the program does not depend on the libcurl binding.
The arrow itself is a composition of <hask>readDocument</hask> and
 
<hask>writeDocument</hask>.
 
<hask>readDocument</hask> is an arrow for reading, DTD processing and
 
validation of documents. Its behaviour can be controlled by a list of
 
options. Here we turn off the validation step. The <hask>src</hask>, a file
 
name or an URI is read and parsed and a document tree is built. This
 
tree is ''piped'' into the output arrow. This one also is
 
controlled by a set of options. Here all the defaults are used.
 
<hask>writeDocument</hask> converts the tree into a string and writes
 
it to the <hask>dst</hask>.
 
   
  +
<p></p>The tree is then ''piped'' into the output arrow, which is once again controlled by a set of system options. The <hask>withIndent</hask> option controls the output formatting (here indentation is switched on) and the <hask>withOutputEncoding</hask> option switches the output to the ISO-Latin1 encoding. <hask>writeDocument</hask> converts the tree into a string and writes it to <hask>dst</hask>. The available options for reading and writing can be found in the module <hask>Text.XML.HXT.Arrow.XmlState.SystemConfig</hask>.
We've omitted here the boring stuff of option parsing and error
 
handling.
 
   
Compilation and a test run looks like this:
+
We've omitted the boring stuff of option parsing and error handling. Compilation and a test run looks like this:
   
 
<pre>
 
<pre>
hobel > ghc -o copyXml -package hxt CopyXML.hs
+
hobel > ghc --make -o copyXml CopyXML.hs
 
hobel > cat hello.xml
 
hobel > cat hello.xml
<hello>world</hello>
+
<hello><haskell>world</haskell></hello>
 
hobel > copyXml hello.xml -
 
hobel > copyXml hello.xml -
<?xml version="1.0" encoding="UTF-8"?>
+
<?xml version="1.0" encoding="ISO-8859-1"?>
<hello>world</hello>
+
<hello>
  +
<haskell>world</haskell>
  +
</hello>
 
hobel >
 
hobel >
 
</pre>
 
</pre>
   
The mini XML document in file <tt>hello.xml</tt> is read and
+
The mini XML document in file <tt>hello.xml</tt> is read, and a document tree is built. Then this tree is converted into a string and written to standard output (filename: <tt>-</tt>). It is decorated with an XML declaration containing the version and the output encoding.
a document tree is built. Then this tree is converted into a string
 
and written to standard output (filename: <tt>-</tt>). It is decorated
 
with an XML declaration containing the version and the output
 
encoding.
 
   
For processing HTML documents there is a HTML parser, which tries to
+
For processing HTML documents there is a HTML parser which tries to parse and interpret almost anything as HTML. This can be selected by using:<p></p>
parse and interprete rather anything as HTML. The HTML parser can be
 
selected by calling
 
   
  +
<haskell>
<hask>readDocument [(a_parse_html, v_1), ...]</hask>
 
  +
readDocument [withParseHTML yes, ...]
 
  +
</haskell>
with the apropriate option.
 
   
 
=== Pattern for a main program ===
 
=== Pattern for a main program ===
Line 411: Line 389:
 
where
 
where
   
import Text.XML.HXT.Arrow
+
import Text.XML.HXT.Core
  +
import Text.XML.HXT.... -- further HXT packages
   
 
import System.IO
 
import System.IO
Line 431: Line 410:
 
-- usually done with 'System.Console.GetOpt'
 
-- usually done with 'System.Console.GetOpt'
   
cmdlineOpts :: [String] -> IO (Attributes, String, String)
+
cmdlineOpts :: [String] -> IO (SysConfigList, String, String)
 
cmdlineOpts argv
 
cmdlineOpts argv
= return ([(a_validate, v_0)], argv!!0, argv!!1)
+
= return ([withValidate no], argv!!0, argv!!1)
   
 
-- | the main arrow
 
-- | the main arrow
   
application :: Attributes -> String -> String -> IOSArrow b Int
+
application :: SysConfigList -> String -> String -> IOSArrow b Int
application al src dst
+
application cfg src dst
  +
= configSysVars cfg -- (0)
= readDocument al src
 
  +
>>>
  +
readDocument [] src
 
>>>
 
>>>
 
processChildren (processDocumentRootElement `when` isElem) -- (1)
 
processChildren (processDocumentRootElement `when` isElem) -- (1)
 
>>>
 
>>>
writeDocument al dst
+
writeDocument [] dst -- (3)
 
>>>
 
>>>
 
getErrStatus
 
getErrStatus
Line 458: Line 439:
 
but it separates the arrow from the boring option evaluation and
 
but it separates the arrow from the boring option evaluation and
 
return code computation.
 
return code computation.
  +
  +
In line (0) the system is configured with the list of options.
  +
These options are then used as defaults for all read and write operation.
  +
The options can be overwritten for single read/write calls
  +
by putting config options into the parameter list of the
  +
read/write function calls.
   
 
The interesing line is (1).
 
The interesing line is (1).
Line 480: Line 467:
   
 
The structure of internal document tree can be made visible
 
The structure of internal document tree can be made visible
e.g. by adding the option pair <hask>(a_show_tree, v_1)</hask> to the
+
e.g. by adding the option <hask>withShowTree yes</hask> to the
<hask>writeDocument</hask> arrow. This will emit the tree in a readable
+
<hask>writeDocument</hask> arrow in (3).
  +
This will emit the tree in a readable
 
text representation instead of the real document.
 
text representation instead of the real document.
   
 
In the next section we will give examples for the
 
In the next section we will give examples for the
 
<hask>processDocumentRootElement</hask> arrow.
 
<hask>processDocumentRootElement</hask> arrow.
  +
  +
=== Tracing ===
  +
  +
There are tracing facilities to observe the actions performed
  +
and to show intermediate results
  +
  +
<haskell>
  +
application :: SysConfigList -> String -> String -> IOSArrow b Int
  +
application cfg src dst
  +
= configSysVars (withTrace 1 : cfg) -- (0)
  +
>>>
  +
traceMsg 1 "start reading document" -- (1)
  +
>>>
  +
readDocument [] src
  +
>>>
  +
traceMsg 1 "document read, start processing" -- (2)
  +
>>>
  +
processChildren (processDocumentRootElement `when` isElem)
  +
>>>
  +
traceMsg 1 "document processed" -- (3)
  +
>>>
  +
writeDocument [] dst
  +
>>>
  +
getErrStatus
  +
</haskell>
  +
  +
In (0) the system trace level is set to 1, in default level 0
  +
all trace messages are suppressed. The three trace messages (1)-(3)
  +
will be issued, but also readDocument and writeDocument will
  +
log their activities.
  +
  +
How a whole document and the internal tree structure can be traced,
  +
is shown in the following example
  +
  +
<haskell>
  +
...
  +
>>>
  +
processChildren (processDocumentRootElement `when` isElem)
  +
>>>
  +
withTraceLevel 4 (traceDoc "resulting document") -- (1)
  +
>>>
  +
...
  +
</haskell>
  +
  +
In (1) the trace level is locally set to the highest level 4.
  +
traceDoc will then issue the trace message, the document formatted
  +
as XML, and the internal DOM tree of the document.
   
 
== Selection examples ==
 
== Selection examples ==
Line 497: Line 532:
 
selectAllText :: ArrowXml a => a XmlTree XmlTree
 
selectAllText :: ArrowXml a => a XmlTree XmlTree
 
selectAllText
 
selectAllText
= deep isXText
+
= deep isText
 
</haskell>
 
</haskell>
   
 
<hask>deep</hask> traverses the whole tree, stops the traversal when
 
<hask>deep</hask> traverses the whole tree, stops the traversal when
a node is a text node (<hask>isXText</hask>) and returns all the text nodes.
+
a node is a text node (<hask>isText</hask>) and returns all the text nodes.
 
There are two other traversal operators <hask>deepest</hask> and <hask>multi</hask>,
 
There are two other traversal operators <hask>deepest</hask> and <hask>multi</hask>,
 
In this case, where the selected nodes are all leaves, these would give the same result.
 
In this case, where the selected nodes are all leaves, these would give the same result.
Line 514: Line 549:
 
selectAllTextAndAltValues
 
selectAllTextAndAltValues
 
= deep
 
= deep
( isXText -- (1)
+
( isText -- (1)
 
<+>
 
<+>
 
( isElem >>> hasName "img" -- (2)
 
( isElem >>> hasName "img" -- (2)
Line 528: Line 563:
 
the alt attribute values are selected as plain text (3), this text is transformed into a text node (4).
 
the alt attribute values are selected as plain text (3), this text is transformed into a text node (4).
   
=== Selecting text and ALT attributes values (2) ===
+
=== Selecting text and ALT attribute values (2) ===
   
 
Let's refine the above filter one step further. The text from the alt attributes shall be marked in the output
 
Let's refine the above filter one step further. The text from the alt attributes shall be marked in the output
Line 537: Line 572:
 
selectAllTextAndRealAltValues
 
selectAllTextAndRealAltValues
 
= deep
 
= deep
( isXText
+
( isText
 
<+>
 
<+>
 
( isElem >>> hasName "img"
 
( isElem >>> hasName "img"
Line 594: Line 629:
 
root [] [helloWorld] -- (1)
 
root [] [helloWorld] -- (1)
 
>>>
 
>>>
writeDocument [(a_indent, v_1)] "hello.xml" -- (2)
+
writeDocument [withIndent yes] "hello.xml" -- (2)
 
</haskell>
 
</haskell>
   
Line 602: Line 637:
 
<hask>writeDocument</hask> and its variants always expect
 
<hask>writeDocument</hask> and its variants always expect
 
a whole document tree with such a root node. Before writing, the document is
 
a whole document tree with such a root node. Before writing, the document is
indented (<hask>(a_indent, v_1)</hask>)) by inserting extra whitespace
+
indented (<hask>withIndent yes</hask>)) by inserting extra whitespace
 
text nodes, and an XML declaration with version and encoding is added. If the indent option is not given, the whole document would appears on a single line:
 
text nodes, and an XML declaration with version and encoding is added. If the indent option is not given, the whole document would appears on a single line:
   
Line 642: Line 677:
   
 
A bit more interesting task is the construction of a page
 
A bit more interesting task is the construction of a page
containg a table of all images within a page inclusive image URLs, geometry and ALT attributes.
+
containing a table of all images within a page inclusive image URLs, geometry and ALT attributes.
   
 
The program for this has a frame similar to the <hask>helloWorld</hask> program,
 
The program for this has a frame similar to the <hask>helloWorld</hask> program,
Line 793: Line 828:
   
 
The generated HTML page is not yet very useful, because it usually
 
The generated HTML page is not yet very useful, because it usually
contains relativ HREFs to the images, so the links do not work.
+
contains relative HREFs to the images, so the links do not work.
 
We have to transform the SRC attribute values into absolute URLs.
 
We have to transform the SRC attribute values into absolute URLs.
 
This can be done with the following code:
 
This can be done with the following code:
Line 838: Line 873:
 
attribute value node (7).
 
attribute value node (7).
   
Because of the use of the use of the global HXT state in <hask>mkAbsURI</hask>
+
Because of the use of the global HXT state in <hask>mkAbsURI</hask>
 
<hask>mkAbsRef</hask> and <hask>imageTable2</hask> need to have the more specialized signature <hask>IOStateArrow s XmlTree XmlTree</hask>.
 
<hask>mkAbsRef</hask> and <hask>imageTable2</hask> need to have the more specialized signature <hask>IOStateArrow s XmlTree XmlTree</hask>.
   
Line 905: Line 940:
 
=== Transform external references into absolute references ===
 
=== Transform external references into absolute references ===
   
In the following example we will develop a program for
+
In the following example we will develop a program for editing a HTML page such that all references to external documents (images, hypertext refs, style refs, ...) become absolute references. We will see some new, but very useful combinators in the solution.
editing a HTML page such that all references to external documents
 
(images, hypertext refs, style refs, ...) become absolute references.
 
We will see some new, but very useful combinators in the solution.
 
   
  +
The task seems to be rather trivial. In a tree traversal all references are edited with respect to the document base. But in HTML there is a BASE element, allowed in the content of HEAD with a HREF attribute, which defines the document base. Again this href can be a relative URL.
The task seems to be rather trivial. In a tree travaersal
 
all references are edited with respect to the document base.
 
But in HTML there is a BASE element, allowed in the content of HEAD
 
with a HREF attribute, which defines the document base. Again this
 
href can be a relative URL.
 
   
We start the development with the editing arrow. This gets
+
We start the development with the editing arrow. This gets the real document base as argument.
the real document base as argument.
 
   
 
<haskell>
 
<haskell>
Line 939: Line 966:
 
</haskell>
 
</haskell>
   
The tree is traversed (1) and for every A element the attribute
+
The tree is traversed (1) and for every A element the attribute list is processed (2). All HREF attribute values (4) are manipulated by <hask>changeAttrValue</hask> called with a string function (5). <hask>expandURIString</hask> is a pure function defined in HXT for computing an absolute URI.
  +
list is processed (2). All HREF attribute values (4) are manipulated
 
by <hask>changeAttrValue</hask> called with a string function (5).
 
<hask>expandURIString</hask> is a pure function defined in HXT for computing
 
an absolut URI.
 
 
In this first step we only edit A-HREF attribute values. We will refine this
 
In this first step we only edit A-HREF attribute values. We will refine this
 
later.
 
later.
Line 976: Line 1,000:
 
with <hask>getBaseURI</hask> (6) like in examples above. The resulting
 
with <hask>getBaseURI</hask> (6) like in examples above. The resulting
 
pair of strings is piped into <hask>expandURI</hask> (7), the arrow version of
 
pair of strings is piped into <hask>expandURI</hask> (7), the arrow version of
<hask>expandURIString</hask>. This arrow ((1) to (7)) fails in the absense
+
<hask>expandURIString</hask>. This arrow ((1) to (7)) fails in the absence
 
of a BASE element. in this case we take the plain document base (8).
 
of a BASE element. in this case we take the plain document base (8).
 
The selection of the BASE elements is not yet very handy. We will define
 
The selection of the BASE elements is not yet very handy. We will define
 
a more general and elegant function later, allowing an element path as selection argument.
 
a more general and elegant function later, allowing an element path as selection argument.
   
  +
In the third step, we will combine the to arrows. For this we will use a new combinator <hask>($<)</hask>. The need for this new combinator is the following: We need the arrow input (the document) two times, once for computing the document base, and second for editing the whole document, and we want to compute the extra string parameter for editing of course with the above defined arrow.
In the third step, we will combine the to arrows. For this we will use
 
a new combinator <hask>($<)</hask>. The need for this new combinator
 
is the following: We need the arrow input (the document) two times,
 
once for computing the document base, and second for editing the
 
whole document, and we want to compute the extra string parameter
 
for editing of course with the above defined arrow.
 
   
The combined arrow, our main arrow, looks like this
+
The combined arrow, our main arrow, looks like this:
   
 
<haskell>
 
<haskell>
Line 1,002: Line 1,021:
 
this pattern occurs rather frequently, so ($<) becomes very useful.
 
this pattern occurs rather frequently, so ($<) becomes very useful.
   
  +
Programming with arrows is one style of point free programming. Point free programming often becomes unhandy when values are used more than once. One solution is the special arrow syntax supported by ghc and others, similar to the do notation for monads. But for many simple cases the <hask>($<)</hask> combinator and its variants <hask>($<<), ($<<<), ($<<<<), ($<$)</hask> is sufficient.
Programming with arrows is one style of point free programming. Point free
 
programming often becomes unhandy when values are used more than once.
 
One solution is the special arrow syntax supported by ghc and others, similar to the do notation for monads. But for many simple cases the <hask>($<)</hask> combinator and it's variants <hask>($<<), ($<<<), ($<<<<), ($<$)</hask>
 
is sufficient.
 
   
To complete the development of the example, a last step is neccessary:
+
To complete the development of the example, a last step is necessary: the removal of the redundant BASE element.
The removal of the redundant BASE element.
 
   
 
<haskell>
 
<haskell>
Line 1,030: Line 1,045:
 
</haskell>
 
</haskell>
   
In this function the children of the HEAD element are searched for
+
In this function the children of the HEAD element are searched for a BASE element. This is removed by aplying the null arrow <hask>none</hask> to the input, returning always the empty list. <hask>none `when` ...</hask> is the pattern for deleting nodes from a tree.
a BASE element. This is removed by aplying the null arrow <hask>none</hask>
 
to the input, returning always the empty list.
 
<hask>none `when` ...</hask> is the pattern for deleting nodes from a tree.
 
   
The <hask>computeBaseRef</hask> function defined above contains an arrow pattern
+
The <hask>computeBaseRef</hask> function defined above contains an arrow pattern for selecting the right sub-tree that is rather common in HXT applications
for selecting the right subtree that is rather common in HXT applications
 
   
 
<haskell>
 
<haskell>
Line 1,051: Line 1,062:
 
</haskell>
 
</haskell>
   
For this pattern we will define a convenient function creating the
+
For this pattern we will define a convenient function creating the arrow for selection:
arrow for selection
 
   
 
<haskell>
 
<haskell>
Line 1,062: Line 1,072:
 
</haskell>
 
</haskell>
   
The name list is mapped to the element checking arrow (2),
+
The name list is mapped to the element checking arrow (2), the resulting list of arrows is folded with <hask>getChildren</hask> into a single arrow. <hask>computeBaseRef</hask> can then be simplified and becomes more readable:
the resulting list of arrows is folded with <hask>getChildren</hask>
 
into a single arrow. <hask>computeBaseRef</hask> can then be simplified
 
and becomes more readable:
 
   
 
<haskell>
 
<haskell>
Line 1,078: Line 1,085:
 
</haskell>
 
</haskell>
   
An even more general and flexible technic are the XPath expressions
+
An even more general and flexible technique can be found with the XPath expressions available for selection of document parts defined in the module <hask>Text.XML.HXT.Arrow.XmlNodeSet</hask>.
available for selection of document parts defined in the module
 
<hask>Text.XML.HXT.Arrow.XmlNodeSet</hask>.
 
   
With XPath <hask>computeBaseRef</hask> can be simplified to
+
With XPath <hask>computeBaseRef</hask> can be simplified to:
   
 
<haskell>
 
<haskell>
Line 1,093: Line 1,098:
 
</haskell>
 
</haskell>
   
Even the attribute selection can be expressed by XPath,
+
Even the attribute selection can be expressed by XPath, so (1) and (2) can be combined into
so (1) and (2) can be combined into
 
   
 
<haskell>
 
<haskell>
Line 1,102: Line 1,106:
 
</haskell>
 
</haskell>
   
The extra <hask>xshow</hask> is here required to convert the
+
The extra <hask>xshow</hask> is here required to convert the XPath result, an XmlTree, into a string.
XPath result, an XmlTree, into a string.
 
   
  +
XPath defines a full language for selecting parts of an XML document. Sometimes it is simpler to make selections of this type, but the XPath evaluation in general is more expensive in time and space than a simple combination of arrowsas we've seen it in <hask>getDescendends</hask>.
XPath defines a
 
full language for selecting parts of an XML document.
 
Sometimes it's rather comfortable to make selections of this
 
type, but the XPath evaluation in general is more expensive
 
in time and space than a simple combination of arrows, like we've
 
seen it in <hask>getDescendends</hask>.
 
   
 
=== Transform external references into absolute references: Refinement ===
 
=== Transform external references into absolute references: Refinement ===
Line 1,185: Line 1,183:
 
== More complex examples ==
 
== More complex examples ==
   
  +
=== Serialization and deserialisation to/from XML ===
''' to be done '''
 
   
=== Automatic read/writing between xml and Haskell data types ===
+
Examples can be found in [[HXT/Conversion of Haskell data from/to XML]]
 
'''Question''': is there any way to write/read Haskell types to/from XML in HXT? HaXml has readXml and showXml, but I can't find any similar mechanism in HXT. Help! -- AlsonKemp
 
 
==== Serializing to Xml ====
 
 
We can create an HXT tree from a single-layer data class as follows:
 
 
<haskell>
 
import IO
 
import Char
 
import Text.XML.HXT.Arrow
 
import Data.Generics
 
 
-- our data class we'll convert into xml
 
data Config =
 
Config { username :: String,
 
logNumDays :: Int,
 
oleDbString :: String }
 
deriving (Show, Typeable,Data)
 
 
-- helper function adapted from http://www.defmacro.org/ramblings/haskell-web.html
 
-- (gshow replaced by gshow')
 
introspectData :: Data a => a -> [(String, String)]
 
introspectData a = zip fields (gmapQ gshow' a)
 
where fields = constrFields $ toConstr a
 
 
gshow' :: Data a => a -> String
 
gshow' t = fromMaybe (showConstr(toConstr t)) (cast t)
 
 
-- function to create xml string from single-layer Haskell data type
 
xmlSerialize object = "<" ++ show(toConstr object) ++ ">" ++
 
foldr (\(a,b) x -> x ++ "<" ++ a ++ ">" ++ b ++ "</" ++ a ++ ">") "" ( introspectData object )
 
++ "</" ++ show(toConstr object) ++ ">"
 
 
-- function to create HXT tree arrow from single-layer Haskell data type:
 
createHxtArrow object = runLA( constA ( xmlSerialize object ) >>> xread)
 
 
-- create a config object to serialize:
 
 
createConfig = Config { username = "test", logNumDays = 3, oleDbString = "qsdf" }
 
 
-- test function, using our Config data type
 
testConversion = createHxtArrow( createConfig ) ()
 
</haskell>
 
 
-- hughperkins
 
 
==== Deserializing from Xml ====
 
 
Here's a solution to deserialize a simple haskell data type containing Strings and Ints.
 
 
It's not really pretty, but it works.
 
 
Basically, we just convert the incoming xml into gread-compatible format, then use gread :-D
 
 
Currently it works for a simple single-layer Haskell data type containing Ints and Strings. You can add new child data types by adding to the case statement in xmlToGShowFormat.
 
 
If someone has a more elegant solution, please let me know ( hughperkins@gmail.com )
 
 
<haskell>
 
module ParseXml
 
where
 
 
import IO
 
import Char
 
import List
 
import Maybe
 
import Data.Generics hiding (Unit)
 
import Text.XML.HXT.Arrow hiding (when)
 
 
data Config = Config{ name :: String, age :: Int }
 
--data Config = Config{ age :: Int }
 
deriving( Data, Show, Typeable, Ord, Eq, Read )
 
 
createConfig = Config "qsdfqsdf" 3
 
--createConfig = Config 3
 
gshow' :: Data a => a -> String
 
gshow' t = fromMaybe (showConstr(toConstr t)) (cast t)
 
 
-- helper function from http://www.defmacro.org/ramblings/haskell-web.html
 
introspectData :: Data a => a -> [(String, String)]
 
introspectData a = zip fields (gmapQ gshow' a)
 
where fields = constrFields $ toConstr a
 
 
-- function to create xml string from single-layer Haskell data type
 
xmlSerialize object = "<" ++ show(toConstr object) ++ ">" ++
 
foldr (\(a,b) x -> x ++ "<" ++ a ++ ">" ++ b ++ "</" ++ a ++ ">") "" ( introspectData object )
 
++ "</" ++ show(toConstr object) ++ ">"
 
 
-- parse xml to HXT tree, and obtain the value of node "fieldname"
 
-- returns a string
 
getValue xml fieldname | length(resultlist) > 0 = Just (head resultlist)
 
| otherwise = Nothing
 
where resultlist = (runLA ( constA xml >>> xread >>> deep ( hasName fieldname ) >>> getChildren >>> getText ))[]
 
 
 
  +
=== Practical examples of HXT ===
-- parse templateobject to get list of field names
 
-- apply these to xml to get list of values
 
-- return (fieldnames list, value list)
 
xmlToGShowFormat :: Data a => String -> a -> String
 
xmlToGShowFormat xml templateobject =
 
go
 
where mainconstructorname = (showConstr $ toConstr templateobject)
 
fields = constrFields $ toConstr templateobject
 
values = map ( \fieldname -> getValue xml fieldname ) fields
 
datatypes = gmapQ (dataTypeOf) templateobject
 
constrs = gmapQ (toConstr) templateobject
 
datatypereps = gmapQ (dataTypeRep . dataTypeOf) templateobject
 
fieldtogshowformat (value,datatyperep) = case datatyperep of
 
IntRep -> "(" ++ fromJust value ++ ")"
 
_ -> show(fromJust value)
 
formattedfieldlist = map (fieldtogshowformat) (zip values datatypereps)
 
go = "(" ++ mainconstructorname ++ " " ++ (concat $ intersperse " " formattedfieldlist ) ++ ")"
 
   
  +
More complex and complete examples of HXT in action
xmlDeserialize xml templateobject = fst $ head $ gread( xmlToGShowFormat xml templateobject)
 
  +
can be found in [[HXT/Practical]]
   
  +
=== The Complete Guide To Working With HTML ===
dotest = xmlDeserialize (xmlSerialize createConfig) createConfig :: Config
 
dotest' = xmlDeserialize ("<Config><age>12</age><name>test name!</name></Config>") createConfig :: Config
 
</haskell>
 
   
  +
Tutorial and Walkthrough: http://adit.io/posts/2012-04-14-working_with_HTML_in_haskell.html
-- hughperkins
 

Latest revision as of 00:20, 27 April 2016


A Gentle Introduction to the Haskell XML Toolbox

The Haskell XML Toolbox (HXT) is a collection of tools for processing XML with Haskell. The core component of the Haskell XML Toolbox is a domain specific language consisting of a set of combinators for processing XML trees in a simple and elegant way. The combinator library is based on the concept of arrows. The main component is a validating and namespace-aware XML Parser that supports the XML 1.0 Standard almost completely. Extensions are a validator for RelaxNG and an XPath evaluator.

Background

The Haskell XML Toolbox is based on the ideas of HaXml and HXML, but introduces a more general approach for processing XML with Haskell. HXT uses a generic data model for representing XML documents, including the DTD subset, entity references, CData parts and processing instructions. This data model makes it possible to use tree transformation functions as a uniform design of XML processing steps from parsing, DTD processing, entity processing, validation, namespace propagation, content processing and output.

HXT has grown over the years. Components for XPath, XSLT, validation with RelaxNG, picklers for conversion from/to native Haskell data, lazy parsing with tagsoup, input via curl and native Haskell HTTP and others have been added. This has led to a rather large package with a lot of dependencies.

To make the toolbox more modular and to reduce the dependencies on other packages, hxt has been split into various smaller packages since version 9.0.0.

Resources

Home Page and Repository

HXT
The project home for HXT
HXT on GitHub
The git source repository on github for all HXT packages

Packages

All packages are available on Hackage.

HXT-related packages

hxt
The package hxt forms the core of the toolbox. It contains a validating XML parser and a HTML parser, which tries to read any text as HTML, a DSL for processing, transforming and generating XML/HTML, and so called pickler for conversion from/to XML and native Haskell data.
HandsomeSoup
HandsomeSoup adds CSS selectors to HXT.
hxt-css
Another CSS selector engine for HXT.
hxt-http
Native HTTP support is contained in hxt-http and depends on package HTTP.
hxt-curl
HTTP support via libCurl and package curl is in hxt-curl.
hxt-tagsoup
The lazy tagsoup parser can be found in package hxt-tagsoup, only this package depends on Neil Mitchell's tagsoup.
hxt-xpath
hxt-xslt
hxt-relaxng
The XPath-, XSLT- and RelaxNG-extensions are separated into hxt-xpath, hxt-xslt and hxt-relaxng.

More general packages

There are some basic functionalities, which are not only of interest for HXT, but can be useful for other none XML/HTML related projects. These have been separated too.

hxt-charproperties
defines XML- and Unicode character class properties.
hxt-unicode
contains decoding function from various encoding schemes to Unicode. The difference of these functions compared to most of those available on hackage are, that these functions are lazy even in the case of encoding errors (thanks to Henning Thielemann).
hxt-regex-xmlschema
contains a lightweight and efficient regex-library. There is full Unicode support, the standard syntax defined in the XML-Schema doc is supported, and there are extensions available for intersection, difference, exclusive OR. The package is self contained, no other regex library is required. The Wiki page Regular expressions for XML Schema describes the theory behind this regex library and the extensions and gives some usage examples.
hxt-cache
A cache for storing parsed XML/HTML pages in binary from. This is used in the Holumbus searchengine framework and the Hayoo! API search for speeding up the repeated indexing of pages.

Installation

When installing hxt with cabal, one does not have to deal with all the basic packages. Just a cabal install hxt does the work for the core toolbox. When HTTP access is required, install at least one of the packages hxt-curl or hxt-http. All other packages can be installed on demand any time later.

Upgrade from HXT versions < 9.0

HXT-9 is not downwards compatible. The splitting into smaller packages required some internal reorganisation and changes of some type declarations. To use the main features of the core package, import Text.XML.HXT.Core instead of Text.XML.HXT.Arrow.

The second major change was the kind of configuration and option handling. This was done previously by lists of key-value-pairs implemented as strings. The growing number of options and the untyped option values have led to unreliable code. With HXT-9, options are represented in a more type-safe manner by functions instead of strings, so option handling has to be modified when switching to the new version. For examples, see copyXML and Pattern for a main program below.

The basic concepts

The basic data structures

Processing of XML is a task of processing tree structures. This is can be done in Haskell in a very elegant way by defining an appropriate tree data type, a Haskell DOM (document object model) structure. The tree structure in HXT is a rose tree with a special XNode data type for storing the XML node information.

The generally useful tree structure (NTree) is separated from the node type (XNode). This allows for reusing the tree structure and the tree traversal and manipulation functions in other applications.

data NTree a  = NTree a [NTree a]     -- rose tree

data XNode    = XText String          -- plain text node
              | ...
              | XTag QName XmlTrees   -- element name and list of attributes
              | XAttr QName           -- attribute name
              | ...

type QName    = ...                   -- qualified name

type XmlTree  = NTree XNode

type XmlTrees = [XmlTree]

The concept of filters

Selecting, transforming and generating trees often requires routines which compute not only a single result tree, but a possibly empty list of (sub-)trees. This leads to the idea of XML filters like in HaXml. Filters are functions which take an XML tree as input and compute a list of result trees:

type XmlFilter = XmlTree -> [XmlTree]

More generally, we can define a filter as:

type Filter a b = a -> [b]

We will do this abstraction later when introducing arrows. Many of the functions in the following motivating examples can be generalized this way. But for getting the idea, the XmlFilter is sufficient.

The filter functions are used so frequently that the idea of defining a domain specific language with filters as the basic processing units comes up. In such a DSL the basic filters are predicates, selectors, constructors and transformers, all working on the HXT DOM tree structure. For a DSL it becomes necessary to define an appropriate set of combinators for building more complex functions from simpler ones. Of course filter composition, like (.), becomes one of the most frequently used combinators. There are more complex filters for traversal of a whole tree and selection or transformation of several nodes. We will see a few first examples in the following part.

To build filters from pure functions, we need a way to lift them to filters. For predicates this is done by mapping False to the empty list and True to a singleton list containing the input tree:

p   :: XmlTree -> Bool          -- pure function
p t =  ...

pf  :: XmlTree -> [XmlTree]     -- or XmlFilter
pf t | p t       = [t]
     | otherwise = []

The combinator for this kind of lifting is called isA, and it works on any type. It is defined as:

isA  :: (a -> Bool) -> (a -> [a])
isA p x | p x       = [x]
        | otherwise = []

As an example, a predicate for filtering text nodes looks like this:

isXText :: XmlTree -> Bool
isXText (NTree (XText _) _) = True
isXText _                   = False

isXTextF :: XmlFilter           -- XmlTree -> [XmlTree]
isXTextF = isA isXText

Transformers, i.e. functions that map a tree into another tree, are lifted in a trivial way via singleton lists:

f    :: XmlTree -> XmlTree
f t  =  exp(t)

ff   :: XmlTree -> [XmlTree]
ff t = [exp(t)]

This basic function is called arr, it comes from the Control.Arrow module of the base package of GHC. Partial functions, i.e functions that can't always compute a result, are usually lifted to totally defined filters where illegal arguments are mapped to the empty list:
f :: XmlTree -> XmlTree
f t  | p t       = expr(t)
     | otherwise = error "f not defined"

ff :: XmlFilter
ff t | p t       = [expr(t)]
     | otherwise = []

When processing trees, there's often the case that no result, exactly one result, or more than one result is possible. These functions, returning a set of results, are often a bit imprecisely called non-deterministic functions. Such functions, e.g. selecting all children of a node or all grandchildren, are exactly our filters. In this context, lists instead of sets of values are the appropriate result type, because the ordering in XML is important and duplicates are possible. Examples for working with non-deterministic functions are selecting the children and the grandchildren of an XmlTree, which can be implemented as follows:

getChildren :: XmlFilter
getChildren (NTree n cs)  = cs

getGrandChildren :: XmlFilter
getGrandChildren (NTree n cs) = concat [ getChildren c | c <- cs ]

Working with filters is rather similar to working with binary relations, which is rather natural and comfortable, database people know this very well.

Filter combinators

Composition of filters, like function composition, is the most important combinator. We will use the infix operator (>>>) for filter composition and reverse the arguments, so we can read composition sequences from left to right, like pipes in Unix. Composition is defined as follows:

(>>>) :: XmlFilter -> XmlFilter -> XmlFilter
(f >>> g) t =  concat [g t' | t' <- f t]

This definition corresponds 1:1 to the composition of binary relations. With the help of the (>>>) operator, the definition of getGrandChildren becomes rather simple:

getGrandChildren :: XmlFilter
getGrandChildren = getChildren >>> getChildren

Selecting all text nodes of the children of an element can also be formulated very easily with the help of (>>>):

getTextChildren :: XmlFilter
getTextChildren = getChildren >>> isXText

When used to combine predicate filters, the (>>>) serves as a logical "and" operator or, from the relational view, as an intersection operator: isA p1 >>> isA p2 selects all values for which p1 and p2 both hold. The dual operator to (>>>) is (<+>), the logical "or" for filters:

(<+>) :: XmlFilter -> XmlFilter -> XmlFilter
(f <+> g) t =  f t ++ g t -- not literally ++, rather the set union operator
  -- so without duplicates

Combining elementary filters with (>>>) and (<+>) leads to more complex functionality. For example, selecting all text nodes within two levels of depth (in left to right order) can be formulated with:

getTextChildren2 :: XmlFilter
getTextChildren2 = getChildren >>> ( isXText <+> ( getChildren >>> isXText ) )

Exercise: Are these filters equivalent? If not, what is the difference?

getChildren >>> ( isXText <+> ( getChildren >>> isXText ) )

( getChildren >>> isXText ) <+> ( getChildren >>> getChildren >>> isXText )

Of course we need choice combinators, too. The first idea is an if-then-else filter, built from three simpler filters. But often it's easier and more elegant to work with simpler binary combinators for choice. So we will introduce the simpler ones first. One of these choice combinators is called orElse and is defined as follows:

orElse :: XmlFilter -> XmlFilter -> XmlFilter
orElse f g t | null res1 = g t
             | otherwise = res1
  where res1 = f t

If f computes a non-empty list as result, f succeeds and this list is the result. Otherwise g is applied to the input, and this yields the overall result. There are two other simple choice combinators usually written in infix notation, g `guards` f and f `when` g, which are useful when transforming and manipulating trees:

guards :: XmlFilter -> XmlFilter -> XmlFilter
guards g f t | null (g t) = []
             | otherwise  = f t

when :: XmlFilter -> XmlFilter -> XmlFilter
when f g t | null (g t) = [t]
           | otherwise  = f t

Tree traversal filter

A very basic operation on tree structures is the traversal of all nodes and the selection and/or transformation of nodes. These traversal filters serve as control structures for processing whole trees. They correspond to the map and fold combinators for lists.

The simplest traversal filter does a top down search of all nodes with a special feature. This filter, called deep, is defined as follows:

deep	:: XmlFilter -> XmlFilter
deep f  = f `orElse` (getChildren >>> deep f)

When a predicate filter is applied to deep, a top down search is done and all subtrees satisfying the predicate are collected. The descent into the tree stops when a subtree is found, because of the use of orElse.

Example: Selecting all plain text nodes of a document can be formulated with:

deep isXText

Example: Selecting all "top level" tables in a HTML documents looks like this:

deep (isElem >>> hasName "table")

A variant of deep, called multi, performs a complete search, where the tree traversal does not stop when a node is found.

multi	:: XmlFilter -> XmlFilter
multi f  = f <+> (getChildren >>> multi f)

Example: Selecting all tables in a HTML document, even nested ones, multi has to be used instead of deep:

multi (isElem >>> hasName "table")

Arrows

We've already seen that the filters a -> [b] are a very powerful and sometimes a more elegant way to process XML than pure functions, which is the good news. The bad news is that filters are not general enough. We e.g. want to do some I/O while staying at the filter level, so for the IO monad we need something like:

type XmlIOFilter = XmlTree -> IO [XmlTree]

Sometimes it's appropriate to thread some state through the computation like in state monads, which leads to a type like:

type XmlStateFilter state = state -> XmlTree -> (state, [XmlTree])

Real-world applications might need both extensions at the same time: Of course I/O is necessary but usually there are also some global options and variables for controlling the computations. In HXT for instance there are variables for controlling trace output, options for setting the default encoding scheme for input data and a base URI for accessing documents, which are addressed in a content or in a DTD part by relative URIs. So we need something like:

type XmlIOStateFilter state = state -> XmlTree -> IO (state, [XmlTree])

We want to work with all four filter variants, and in the future perhaps with even more general filters, but of course not with four sets of filter names, e.g. deep, deepST, deepIO, deepIOST. This is the point where newtypes and classes come in. Classes are needed for overloading names and newtypes are needed to declare instances. Furthermore, the restriction of XmlTree as the argument and result type is not necessary and hinders reuse in many cases.

A filter as discussed above has all features of an arrow. Arrows are introduced for generalizing the concept of functions and function combination to more general kinds of computation.

A basic set of combinators for arrows is defined in the classes in the Control.Arrow module, containing the above mentioned (>>>), (<+>) and arr.

In HXT, the additional classes for filters working with lists as the result type are defined in Control.Arrow.ArrowList. The choice operators are in Control.Arrow.ArrowIf, tree filters (like getChildren, deep, or multi) in Control.Arrow.ArrowTree, and the elementary XML-specific filters in Text.XML.HXT.XmlArrow.

In HXT, there are four types instantiated with these classes for pure list arrows, list arrows with a state, list arrows with IO and list arrows with a state and IO.

newtype LA      a b = LA    { runLA    :: (     a ->        [b])  }
newtype SLA   s a b = SLA   { runSLA   :: (s -> a ->    (s, [b])) }
newtype IOLA    a b = IOLA  { runIOLA  :: (     a -> IO     [b])  }
newtype IOSLA s a b = IOSLA { runIOSLA :: (s -> a -> IO (s, [b])) }

The first and last one are those used most frequently in the toolbox, and of course there are lifting functions for converting general arrows into more specific arrows. Don't worry about all these conceptual details. Let's have a look at some Hello world examples.

Getting started: Hello world examples

copyXML

The first complete example is a program for copying an XML document:

module Main where

import System.Environment
import Text.XML.HXT.Core
import Text.XML.HXT.Curl

main :: IO ()
main = do
  [src, dst] <- getArgs
  runX $
    readDocument [withValidate no
                 ,withCurl []
                 ] src
    >>>
    writeDocument [withIndent yes
                  ,withOutputEncoding isoLatin1
                  ] dst
  return ()

The interesting part of this example is the call of runX, which executes an arrow. This arrow is one of the more powerful list arrows with IO and an HXT system state. The arrow itself is a composition of readDocument and writeDocument.

readDocument is an arrow for reading, DTD processing and validation of documents. Its behaviour can be controlled by a list of system options. Here we turn off the validation step. The src, a file name or an URI, is read and parsed and a document tree is built. The input option withCurl [] enables reading via HTTP. For using this option, the extra package hxt-curl must be installed, and withCurl must be imported by import Text.XML.HXT.Curl. If only file access is necessary, this option and the import can be dropped. In that case the program does not depend on the libcurl binding.

The tree is then piped into the output arrow, which is once again controlled by a set of system options. The withIndent option controls the output formatting (here indentation is switched on) and the withOutputEncoding option switches the output to the ISO-Latin1 encoding. writeDocument converts the tree into a string and writes it to dst. The available options for reading and writing can be found in the module Text.XML.HXT.Arrow.XmlState.SystemConfig.

We've omitted the boring stuff of option parsing and error handling. Compilation and a test run looks like this:

hobel > ghc --make -o copyXml CopyXML.hs
hobel > cat hello.xml
<hello><haskell>world</haskell></hello>
hobel > copyXml hello.xml -
<?xml version="1.0" encoding="ISO-8859-1"?>
<hello>
  <haskell>world</haskell>
</hello>
hobel >

The mini XML document in file hello.xml is read, and a document tree is built. Then this tree is converted into a string and written to standard output (filename: -). It is decorated with an XML declaration containing the version and the output encoding.

For processing HTML documents there is a HTML parser which tries to parse and interpret almost anything as HTML. This can be selected by using:

readDocument [withParseHTML yes, ...]

Pattern for a main program

A more realistic pattern for a simple Unix filter like program has the following structure:

module Main
where

import Text.XML.HXT.Core
import Text.XML.HXT....   -- further HXT packages

import System.IO
import System.Environment
import System.Console.GetOpt
import System.Exit

main :: IO ()
main
    = do
      argv <- getArgs
      (al, src, dst) <- cmdlineOpts argv
      [rc]  <- runX (application al src dst)
      if rc >= c_err
	 then exitWith (ExitFailure (0-1))
	 else exitWith ExitSuccess

-- | the dummy for the boring stuff of option evaluation,
-- usually done with 'System.Console.GetOpt'

cmdlineOpts 	:: [String] -> IO (SysConfigList, String, String)
cmdlineOpts argv
    = return ([withValidate no], argv!!0, argv!!1)

-- | the main arrow

application	:: SysConfigList -> String -> String -> IOSArrow b Int
application cfg src dst
    = configSysVars cfg                                           -- (0)
      >>>
      readDocument [] src
      >>>
      processChildren (processDocumentRootElement `when` isElem)  -- (1)
      >>>
      writeDocument [] dst                                        -- (3)
      >>>
      getErrStatus


-- | the dummy for the real processing: the identity filter

processDocumentRootElement	:: IOSArrow XmlTree XmlTree
processDocumentRootElement
    = this         -- substitute this by the real application

This program has the same functionality as our first example, but it separates the arrow from the boring option evaluation and return code computation.

In line (0) the system is configured with the list of options. These options are then used as defaults for all read and write operation. The options can be overwritten for single read/write calls by putting config options into the parameter list of the read/write function calls.

The interesing line is (1). readDocument generates a tree structure with a so called extra root node. This root node is a node above the XML document root element. The node above the XML document root element is neccessary because of possible other elements on the same tree level as the XML root, for instance comments, processing instructions or whitespace.

Furthermore the artificial root node serves for storing meta information about the document in the attribute list, like the document name, the encoding scheme, the HTTP transfer headers and other information.

To process the real XML root element, we have to take the children of the root node, select the XML root element and process this, but remain all other children unchanged. This is done with processChildren and the when choice operator. processChildren applies a filter elementwise to all children of a node. All results form processing the list of children from the result node.

The structure of internal document tree can be made visible e.g. by adding the option withShowTree yes to the writeDocument arrow in (3). This will emit the tree in a readable text representation instead of the real document.

In the next section we will give examples for the processDocumentRootElement arrow.

Tracing

There are tracing facilities to observe the actions performed and to show intermediate results

application	:: SysConfigList -> String -> String -> IOSArrow b Int
application cfg src dst
    = configSysVars (withTrace 1 : cfg)                           -- (0)
      >>>
      traceMsg 1 "start reading document"                         -- (1)
      >>>
      readDocument [] src
      >>>
      traceMsg 1 "document read, start processing"                -- (2)
      >>>
      processChildren (processDocumentRootElement `when` isElem)
      >>>
      traceMsg 1 "document processed"                             -- (3)
      >>>
      writeDocument [] dst
      >>>
      getErrStatus

In (0) the system trace level is set to 1, in default level 0 all trace messages are suppressed. The three trace messages (1)-(3) will be issued, but also readDocument and writeDocument will log their activities.

How a whole document and the internal tree structure can be traced, is shown in the following example

      ...
      >>>
      processChildren (processDocumentRootElement `when` isElem)
      >>>
      withTraceLevel 4 (traceDoc "resulting document")      -- (1)
      >>>
      ...

In (1) the trace level is locally set to the highest level 4. traceDoc will then issue the trace message, the document formatted as XML, and the internal DOM tree of the document.

Selection examples

Selecting text from an HTML document

Selecting all the plain text of an XML/HTML document can be formulated with

selectAllText	:: ArrowXml a => a XmlTree XmlTree
selectAllText
    = deep isText

deep traverses the whole tree, stops the traversal when a node is a text node (isText) and returns all the text nodes. There are two other traversal operators deepest and multi, In this case, where the selected nodes are all leaves, these would give the same result.

Selecting text and ALT attribute values

Let's take a bit more complex task: We want to select all text, but also the values of the alt attributes of image tags.

selectAllTextAndAltValues	:: ArrowXml a => a XmlTree XmlTree
selectAllTextAndAltValues
    = deep
      ( isText                       -- (1)
	<+>
	( isElem >>> hasName "img"   -- (2)
	  >>>
	  getAttrValue "alt"         -- (3)
	  >>>
	  mkText                     -- (4)
	)
      )

The whole tree is searched for text nodes (1) and for image elements (2), from the image elements the alt attribute values are selected as plain text (3), this text is transformed into a text node (4).

Selecting text and ALT attribute values (2)

Let's refine the above filter one step further. The text from the alt attributes shall be marked in the output by surrounding double square brackets. Empty alt values shall be ignored.

selectAllTextAndRealAltValues	:: ArrowXml a => a XmlTree XmlTree
selectAllTextAndRealAltValues
    = deep
      ( isText
	<+>
	( isElem >>> hasName "img"
	  >>>
	  getAttrValue "alt"
	  >>>
	  isA significant            -- (1)
	  >>>
	  arr addBrackets            -- (2)
	  >>>
	  mkText
	)
      )
    where
    significant :: String -> Bool
    significant = not . all (`elem` " \n\r\t")

    addBrackets :: String -> String
    addBrackets s
	=  " [[ " ++ s ++ " ]] "

This example shows two combinators for building arrows from pure functions. The first one isA removes all empty or whitespace values from alt attributes (1), the other arr lifts the editing function to the arrow level (2).

Document construction examples

The Hello World document

The first document, of course, is a Hello World document:

helloWorld	:: ArrowXml a => a XmlTree XmlTree
helloWorld
    = mkelem "html" []              -- (1)
      [ mkelem "head" []
	[ mkelem "title" []
	  [ txt "Hello World" ]     -- (2)
	]
      , mkelem "body"
	[ sattr "class" "haskell" ] -- (3)
	[ mkelem "h1" []
	  [ txt "Hello World" ]     -- (4)
	]
      ]

The main arrows for document construction are mkelem and it's variants (selem, aelem, eelem) for element creation, attr and sattr for attributes and mktext and txt for text nodes. mkelem takes three arguments, the element name (or tag name), a list of arrows for the construction of attributes, not empty in (3), and a list of arrows for the contents. Text content is generated in (2) and (4).

To write this document to a file use the following arrow

root [] [helloWorld]                         -- (1)
>>>
writeDocument [withIndent yes] "hello.xml"   -- (2)

When this arrow is executed, the helloWorld document is wrapped into a so called root node (1). This complete document is written to "hello.xml" (2). writeDocument and its variants always expect a whole document tree with such a root node. Before writing, the document is indented (withIndent yes)) by inserting extra whitespace text nodes, and an XML declaration with version and encoding is added. If the indent option is not given, the whole document would appears on a single line:

<?xml version="1.0" encoding="UTF-8"?>
<html>
  <head>
    <title>Hello World</title>
  </head>
  <body class="haskell">
    <h1>Hello World</h1>
  </body>
</html>

The code can be shortened a bit by using some of the convenient functions:

helloWorld2	:: ArrowXml a => a XmlTree XmlTree
helloWorld2
    = selem "html"
      [ selem "head"
	[ selem "title"
	  [ txt "Hello World" ]
	]
      , mkelem "body"
	[ sattr "class" "haskell" ]
	[ selem "h1"
	  [ txt "Hello World" ]
	]
      ]

In the above two examples the arrow input is totally ignored, because of the use of the constant arrow txt "...".

A page about all images within a HTML page

A bit more interesting task is the construction of a page containing a table of all images within a page inclusive image URLs, geometry and ALT attributes.

The program for this has a frame similar to the helloWorld program, but the rows of the table must be filled in from the input document. In the first step we will generate a table with a single column containing the URL of the image.

imageTable	:: ArrowXml a => a XmlTree XmlTree
imageTable
    = selem "html"
      [ selem "head"
	[ selem "title"
	  [ txt "Images in Page" ]
	]
      , selem "body"
	[ selem "h1"
	  [ txt "Images in Page" ]
	, selem "table"
	  [ collectImages           -- (1)
	    >>>
	    genTableRows            -- (2)
	  ]
	]
      ]
    where
    collectImages                   -- (1)
	= deep ( isElem
                 >>>
                 hasName "img"
               )
    genTableRows                    -- (2)
	= selem "tr"
	  [ selem "td"
	    [ getAttrValue "src" >>> mkText ]
	  ]

With (1) the image elements are collected, and with (2) the HTML code for an image element is built.

Applied to http://www.haskell.org/ we get the following result (at the time writing this page):

<html>
  <head>
    <title>Images in Page</title>
  </head>
  <body>
    <h1>Images in Page</h1>
    <table>
      <tr>
        <td>/haskellwiki_logo.png</td>
      </tr>
      <tr>
        <td>/sitewiki/images/1/10/Haskelllogo-small.jpg</td>
      </tr>
      <tr>
        <td>/haskellwiki_logo_small.png</td>
      </tr>
    </table>
  </body>
</html>

When generating HTML, often there are constant parts within the page, in the example e.g. the page header. It's possible to write these parts as a string containing plain HTML and then read this with a simple XML contents parser called xread.

The example above could then be rewritten as

imageTable
    = selem "html"
      [ pageHeader
      , ...
      ]
    where
    pageHeader
	= constA "<head><title>Images in Page</title></head>"
	  >>>
	  xread
    ...

xread is a very primitive arrow. It does not run in the IO monad, so it can be used in any context, but therefore the error handling is very limited. xread parses an XML element content.

A page about all images within a HTML page: 1. Refinement

The next refinement step is the extension of the table such that it contains four columns, one for the image itself, one for the URL, the geometry and the ALT text. The extended getTableRows has the following form:

    genTableRows
	= selem "tr"
	  [ selem "td"			-- (1)
	    [ this                      -- (1.1)
	    ]
	  , selem "td"                  -- (2)
	    [ getAttrValue "src"
	      >>>
	      mkText
	      >>>
	      mkelem "a"                -- (2.1)
	      [ attr "href" this ]
	      [ this ]
	    ]
	  , selem "td"                  -- (3)
	    [ ( getAttrValue "width"
		&&&                     -- (3.1)
		getAttrValue "height"
	      )
	      >>>
	      arr2 geometry             -- (3.2)
	      >>>
	      mkText
	    ]
	  , selem "td"                  -- (4)
	    [ getAttrValue "alt"
	      >>>
	      mkText
	    ]
	  ]
        where
	geometry :: String -> String -> String
	geometry "" ""
	    = ""
	geometry w h
	    = w ++ "x" ++ h

In (1) the identity arrow this is used for inserting the whole image element (this value) into the first column. (2) is the column from the previous example but the URL has been made active by embedding the URL in an A-element (2.1). In (3) there are two new combinators, (&&&) (3.1) is an arrow for applying two arrows to the same input and combine the results into a pair. arr2 works like arr but it lifts a binary function into an arrow accepting a pair of values. arr2 f is a shortcut for arr (uncurry f). So width and height are combined into an X11 like geometry spec. (4) adds the ALT-text.

A page about all images within a HTML page: 2. Refinement

The generated HTML page is not yet very useful, because it usually contains relative HREFs to the images, so the links do not work. We have to transform the SRC attribute values into absolute URLs. This can be done with the following code:

imageTable2	:: IOStateArrow s XmlTree XmlTree
imageTable2
    = ...
      ...
	, selem "table"
	  [ collectImages
	    >>>
	    mkAbsImageRef                          -- (1)
	    >>>
	    genTableRows
	  ]
       ...

mkAbsImageRef :: IOStateArrow s XmlTree XmlTree    -- (1)
mkAbsImageRef
    = processAttrl ( mkAbsRef                      -- (2)
                     `when`
                     hasName "src"                 -- (3)
                   )
      where
      mkAbsRef                                     -- (4)
	  = replaceChildren
	    ( xshow getChildren                    -- (5)
	      >>>
	      ( mkAbsURI `orElse` this )           -- (6)
	      >>>
	      mkText                               -- (7)
	    )

The imageTable2 is extended by an arrow mkAbsImageRef (1). This arrow uses the global system state of HXT, in which the base URL of a document is stored. For editing the SRC attribute value, the attribute list of the image elements is processed with processAttrl. With the `when` hasName "src" only SRC attributes are manipulated (3). The real work is done in (4): The URL is selected with getChildren, a text node, and converted into a string (xshow), the URL is transformed into an absolute URL with mkAbsURI (6). This arrow may fail, e.g. in case of illegal URLs. In this case the URL remains unchanged (`orElse` this). The resulting String value is converted into a text node forming the new attribute value node (7).

Because of the use of the global HXT state in mkAbsURI mkAbsRef and imageTable2 need to have the more specialized signature IOStateArrow s XmlTree XmlTree.

Transformation examples

Decorating external references of an HTML document

In the following examples, we want to decorate the external references in an HTML page by a small icon, like it's done in many wikis. For this task the document tree has to be traversed, all parts except the intersting A-Elements remain unchanged. At the end of the list of children of an A-Element we add an image element.

Here is the first version:

addRefIcon	:: ArrowXml a => a XmlTree XmlTree
addRefIcon
    = processTopDown                       -- (1)
      ( addImg                             -- (2)
	`when`
	isExternalRef                      -- (3)
      )
    where
    isExternalRef                          -- (4)
	= isElem
	  >>>
	  hasName "a"
          >>>
	  hasAttr "href"
	  >>>
	  getAttrValue "href"
	  >>>
	  isA isExtRef
	where
	isExtRef                           -- (4.1)
	    = isPrefixOf "http:"           -- or something more precise

    addImg
	= replaceChildren                  -- (5)
	  ( getChildren                    -- (6)
	    <+>
	    imgElement                     -- (7)
	  )

    imgElement
	= mkelem "img"                     -- (8)
	  [ sattr "src" "/icons/ref.png"   -- (9)
	  , sattr "alt" "external ref"
	  ] []                             -- (10)

The traversal is done with processTopDown (1). This arrow applies an arrow to all nodes of the whole document tree. The transformation arrow applies the addImg (2) to all A-elements (3),(4). This arrow uses a bit simplified test (4.1) for external URLs. addImg manipulates all children (5) of the A-elements by selecting the current children (6) and adding an image element (7). The image element is constructed with mkelem (8). This takes an element name, a list of arrows for computing the attributes and a list of arrows for computing the contents. The content of the image element is empty (10). The attributes are constructed with sattr (9). sattr ignores the arrow input and builds an attribute form the name value pair of arguments.

Transform external references into absolute references

In the following example we will develop a program for editing a HTML page such that all references to external documents (images, hypertext refs, style refs, ...) become absolute references. We will see some new, but very useful combinators in the solution.

The task seems to be rather trivial. In a tree traversal all references are edited with respect to the document base. But in HTML there is a BASE element, allowed in the content of HEAD with a HREF attribute, which defines the document base. Again this href can be a relative URL.

We start the development with the editing arrow. This gets the real document base as argument.

mkAbsHRefs	:: ArrowXml a => String -> a XmlTree XmlTree
mkAbsHRefs base
    = processTopDown editHRef                   -- (1)
    where
    editHRef
	= processAttrl                          -- (3)
            ( changeAttrValue (absHRef base)    -- (5)
	      `when`
	      hasName "href"                    -- (4)
	     )
	  `when`
	  ( isElem >>> hasName "a" )            -- (2)
	where

	absHRef	:: String -> String -> String   -- (5)
	absHRef base url
	    = fromMaybe url . expandURIString url $ base

The tree is traversed (1) and for every A element the attribute list is processed (2). All HREF attribute values (4) are manipulated by changeAttrValue called with a string function (5). expandURIString is a pure function defined in HXT for computing an absolute URI.

In this first step we only edit A-HREF attribute values. We will refine this later.

The second step is the complete computation of the base URL.

computeBaseRef	:: IOStateArrow s XmlTree String
computeBaseRef
    = ( ( ( isElem >>> hasName "html"      -- (0)
            >>>
            getChildren                    -- (1)
	    >>>
	    isElem >>> hasName "head"      -- (2)
	    >>>
	    getChildren                    -- (3)
	    >>>
	    isElem >>> hasName "base"      -- (4)
	    >>>
	    getAttrValue "href"            -- (5)
	  )
	  &&&
	  getBaseURI                       -- (6)
	)
	>>> expandURI                      -- (7)
      )
      `orElse` getBaseURI                  -- (8)

Input to this arrow is the HTML element, (0) to (5) is the arrow for selecting the BASE elements HREF value, parallel to this the system base URL is read with getBaseURI (6) like in examples above. The resulting pair of strings is piped into expandURI (7), the arrow version of expandURIString. This arrow ((1) to (7)) fails in the absence of a BASE element. in this case we take the plain document base (8). The selection of the BASE elements is not yet very handy. We will define a more general and elegant function later, allowing an element path as selection argument.

In the third step, we will combine the to arrows. For this we will use a new combinator ($<). The need for this new combinator is the following: We need the arrow input (the document) two times, once for computing the document base, and second for editing the whole document, and we want to compute the extra string parameter for editing of course with the above defined arrow.

The combined arrow, our main arrow, looks like this:

toAbsRefs	:: IOStateArrow s XmlTree XmlTree
toAbsRefs
    = mkAbsHRefs $< computeBaseRef        -- (1)

In (1) first the arrow input is piped into computeBaseRef, this result is used in mkAbsHRefs as extra string parameter when processing the document. Internally the ($<) combinator is defined by the basic combinators (&&&), (>>>) and app, but in a bit more complex computations, this pattern occurs rather frequently, so ($<) becomes very useful.

Programming with arrows is one style of point free programming. Point free programming often becomes unhandy when values are used more than once. One solution is the special arrow syntax supported by ghc and others, similar to the do notation for monads. But for many simple cases the ($<) combinator and its variants ($<<), ($<<<), ($<<<<), ($<$) is sufficient.

To complete the development of the example, a last step is necessary: the removal of the redundant BASE element.

toAbsRefs	:: IOStateArrow s XmlTree XmlTree
toAbsRefs
    = ( mkAbsHRefs $< computeBaseRef )
      >>>
      removeBaseElement
      
removeBaseElement	:: ArrowXml a => a XmlTree XmlTree
removeBaseElement
    = processChildren
        ( processChildren
            ( none                            -- (1)
              `when`
              ( isElem >>> hasName "base" )
            )
          `when`
          ( isElem >>> hasName "head" )
        )

In this function the children of the HEAD element are searched for a BASE element. This is removed by aplying the null arrow none to the input, returning always the empty list. none `when` ... is the pattern for deleting nodes from a tree.

The computeBaseRef function defined above contains an arrow pattern for selecting the right sub-tree that is rather common in HXT applications

isElem >>> hasName n1
>>>
getChildren
>>>
isElem >>> hasName n2
...
>>>
getChildren
>>>
isElem >>> hasName nm

For this pattern we will define a convenient function creating the arrow for selection:

getDescendents :: ArrowXml a => [String] -> a XmlTree XmlTree
getDescendents
    = foldl1 (\ x y -> x >>> getChildren >>> y)   -- (1)
      .
      map (\ n -> isElem >>> hasName n)           -- (2)

The name list is mapped to the element checking arrow (2), the resulting list of arrows is folded with getChildren into a single arrow. computeBaseRef can then be simplified and becomes more readable:

computeBaseRef	:: IOStateArrow s XmlTree String
computeBaseRef
    = ( ( ( getDescendents ["html","head","base"]  -- (1)
	    >>>
	    getAttrValue "href"                    -- (2)
	  )
      ...
      ...

An even more general and flexible technique can be found with the XPath expressions available for selection of document parts defined in the module Text.XML.HXT.Arrow.XmlNodeSet.

With XPath computeBaseRef can be simplified to:

computeBaseRef
    = ( ( ( getXPathTrees "/html/head/base"        -- (1)
            >>>
	    getAttrValue "href"                    -- (2)
	  )
      ...

Even the attribute selection can be expressed by XPath, so (1) and (2) can be combined into

computeBaseRef
    = ( ( xshow (getXPathTrees "/html/head/base@href")
      ...

The extra xshow is here required to convert the XPath result, an XmlTree, into a string.

XPath defines a full language for selecting parts of an XML document. Sometimes it is simpler to make selections of this type, but the XPath evaluation in general is more expensive in time and space than a simple combination of arrowsas we've seen it in getDescendends.

Transform external references into absolute references: Refinement

In the above example only A-HREF URLs are edited. Now we extend this to other element-attribute combinations.

mkAbsRefs	:: ArrowXml a => String -> a XmlTree XmlTree
mkAbsRefs base
    = processTopDown ( editRef "a" "href"                 -- (2)
		       >>>
		       editRef "img" "src"                -- (3)
		       >>>
		       editRef "link" "href"              -- (4)
		       >>>
		       editRef "script" "src"             -- (5)
		     )
    where
    editRef en an                                         -- (1)
	= processAttrl ( changeAttrValue (absHRef base)
			 `when`
			 hasName an
		       )
	  `when`
	  ( isElem >>> hasName en )
	where
	absHRef	:: String -> String -> String
	absHRef base url
	    = fromMaybe url . expandURIString url $ base

editRef is parameterized by the element and attribute names. The arrow applied to every element is extended to a sequence of editRef's ((2)-(5)). Notice that the document is still traversed only once. To process all possible HTML elements, this sequence should be extended by further element-attribute pairs.

This can further be simplified into

mkAbsRefs	:: ArrowXml a => String -> a XmlTree XmlTree
mkAbsRefs base
    = processTopDown editRefs
    where
    editRefs
	= foldl (>>>) this
	  .
	  map (\ (en, an) -> editRef en an)
	  $
	  [ ("a", "href")
	  , ("img", "src")
	  , ("link", "href")
	  , ("script", "src")	    -- and more
	  ]
    editRef
      = ...

The foldl (>>>) this is defined in HXT as seqA, so the above code can be simplified to

mkAbsRefs	:: ArrowXml a => String -> a XmlTree XmlTree
mkAbsRefs base
    = processTopDown editRefs
    where
    editRefs
	= seqA . map (uncurry editRef)
	  $
	  ...

More complex examples

Serialization and deserialisation to/from XML

Examples can be found in HXT/Conversion of Haskell data from/to XML

Practical examples of HXT

More complex and complete examples of HXT in action can be found in HXT/Practical

The Complete Guide To Working With HTML

Tutorial and Walkthrough: http://adit.io/posts/2012-04-14-working_with_HTML_in_haskell.html