https://wiki.haskell.org/index.php?title=Regular_expressions_for_XML_Schema&feed=atom&action=historyRegular expressions for XML Schema - Revision history2015-12-01T03:47:53ZRevision history for this page on the wikiMediaWiki 1.19.14+dfsg-1https://wiki.haskell.org/index.php?title=Regular_expressions_for_XML_Schema&diff=36933&oldid=prevUweSchmidt at 16:46, 1 October 20102010-10-01T16:46:24Z<p></p>
<table class='diff diff-contentalign-left'>
<tr valign='top'>
<td colspan='1' style="background-color: white; color:black;">← Older revision</td>
<td colspan='1' style="background-color: white; color:black;">Revision as of 16:46, 1 October 2010</td>
</tr></table>UweSchmidthttps://wiki.haskell.org/index.php?title=Regular_expressions_for_XML_Schema&diff=36920&oldid=prevUweSchmidt at 16:22, 1 October 20102010-10-01T16:22:29Z<p></p>
<table class='diff diff-contentalign-left'>
<tr valign='top'>
<td colspan='1' style="background-color: white; color:black;">← Older revision</td>
<td colspan='1' style="background-color: white; color:black;">Revision as of 16:22, 1 October 2010</td>
</tr></table>UweSchmidthttps://wiki.haskell.org/index.php?title=Regular_expressions_for_XML_Schema&diff=36919&oldid=prevUweSchmidt at 16:20, 1 October 20102010-10-01T16:20:37Z<p></p>
<table class='diff diff-contentalign-left'>
<tr valign='top'>
<td colspan='1' style="background-color: white; color:black;">← Older revision</td>
<td colspan='1' style="background-color: white; color:black;">Revision as of 16:20, 1 October 2010</td>
</tr></table>UweSchmidthttps://wiki.haskell.org/index.php?title=Regular_expressions_for_XML_Schema&diff=36915&oldid=prevUweSchmidt at 15:49, 1 October 20102010-10-01T15:49:18Z<p></p>
<table class='diff diff-contentalign-left'>
<tr valign='top'>
<td colspan='1' style="background-color: white; color:black;">← Older revision</td>
<td colspan='1' style="background-color: white; color:black;">Revision as of 15:49, 1 October 2010</td>
</tr></table>UweSchmidthttps://wiki.haskell.org/index.php?title=Regular_expressions_for_XML_Schema&diff=36913&oldid=prevUweSchmidt at 15:38, 1 October 20102010-10-01T15:38:49Z<p></p>
<table class='diff diff-contentalign-left'>
<tr valign='top'>
<td colspan='1' style="background-color: white; color:black;">← Older revision</td>
<td colspan='1' style="background-color: white; color:black;">Revision as of 15:38, 1 October 2010</td>
</tr></table>UweSchmidthttps://wiki.haskell.org/index.php?title=Regular_expressions_for_XML_Schema&diff=31547&oldid=prevUweSchmidt at 16:08, 12 November 20092009-11-12T16:08:33Z<p></p>
<table class='diff diff-contentalign-left'>
<tr valign='top'>
<td colspan='1' style="background-color: white; color:black;">← Older revision</td>
<td colspan='1' style="background-color: white; color:black;">Revision as of 16:08, 12 November 2009</td>
</tr></table>UweSchmidthttps://wiki.haskell.org/index.php?title=Regular_expressions_for_XML_Schema&diff=26226&oldid=prevUweSchmidt at 12:08, 1 February 20092009-02-01T12:08:32Z<p></p>
<table class='diff diff-contentalign-left'>
<tr valign='top'>
<td colspan='1' style="background-color: white; color:black;">← Older revision</td>
<td colspan='1' style="background-color: white; color:black;">Revision as of 12:08, 1 February 2009</td>
</tr></table>UweSchmidthttps://wiki.haskell.org/index.php?title=Regular_expressions_for_XML_Schema&diff=26225&oldid=prevUweSchmidt: interleave example added2009-02-01T11:59:01Z<p>interleave example added</p>
<table class='diff diff-contentalign-left'>
<tr valign='top'>
<td colspan='1' style="background-color: white; color:black;">← Older revision</td>
<td colspan='1' style="background-color: white; color:black;">Revision as of 11:59, 1 February 2009</td>
</tr></table>UweSchmidthttps://wiki.haskell.org/index.php?title=Regular_expressions_for_XML_Schema&diff=26050&oldid=prevUweSchmidt: Initial version of regex-xmlschema page2009-01-23T10:44:56Z<p>Initial version of regex-xmlschema page</p>
<p><b>New page</b></p><div>[[Category:Text]]<br />
[[Category:XML]]<br />
[[Category:Tools]]<br />
[[Category:Tutorials]]<br />
<br />
== Regular expressions for XML Schema ==<br />
<br />
The [http://www.w3.org/TR/xmlschema11-2/#regexs W3C XML Schema] specification defines a language for regular expressions. This language is used<br />
in in the XML Schema spec when defining the data type library part. The [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/regex-xmlschema regex-xmlschema]<br />
package contains a complete implementation of this spec. It is implemented with the technique of derivations of regular expression. Main features are<br />
full support of Unicode including all Unicode codeblocks and character properties, purely functional interface, extensions for intersection,<br />
set difference and<br />
exclusive or of regular sets (regular expressions), extensions for subexpression matches, interface for matching, stream like editing and tokenizing.<br />
<br />
__TOC__<br />
<br />
== Motivation ==<br />
<br />
When developing the RelaxNG schema validator in the [http://www.fh-wedel.de/~si/HXmlToolbox/index.html Haskell XML Toolbox (HXT)]<br />
there was the need for a complete regular expression matcher for the W3C XML Schema regular expression syntax.<br />
The string representation of basic data types in XML Schema as well as in the RelaxNG standard can be defined by regular expressions.<br />
The available Haskell libraries for processing of regular expressions where not applicable for this task, e.g. they used other r.e. grammars or<br />
did not support Unicode.<br />
<br />
When implementing the DTD validator and the RelaxNG validator in HXT, we used the rather old idea of derivations of regular expression.<br />
This is a technique to make the regular expressions operational in a direct way without the clumsy construction of a finite state machine. This worked<br />
fine for validating the content model of XML elements and the implementations were very compact and sufficiently efficient.<br />
So we could expect it to also work fine for Unicode.<br />
<br />
The XML Schema grammar is well designed, so the transformation of this grammar in a parsec parser was straight forward,<br />
the interesting part was the internal data structure and it's processing. When completing the work for HXT it showed up,<br />
that this library is generally useful, not only for XML validation. And with some rather simple extensions it became<br />
possible to not only use this for matching strings, as required in HXT, but also for sed like stream editing and for<br />
easy construction of lightweight scanners and tokenizers. This was the motivation for doing a stand alone version of<br />
that regular expression library.<br />
<br />
== Resources ==<br />
<br />
; [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/regex-xmlschema Hackage download]<br />
; [http://darcs2.fh-wedel.de/hxt/regex/ darcs repository with head revision of regex-xmlschema]<br />
<br />
== The idea of derivations of regular expressions ==<br />
<br />
The idea of derivations of regular expression was developed by Janusz A. Brzozowski, Princeton Univ. in 1964.<br />
Goal was to perform the test whether a string '''w''' is a word of a given regular set (given by a regular expression '''r''')<br />
by manipulating the (internal tree like representation of the) regular expression '''r'''.<br />
<br />
The word test is based on two functions. The first, here called '''nullable''' checks, whether the empty word &epsilon;<br />
is contained in the regular set associated with '''r'''. This test can easily be done and it can be done efficiently.<br />
<br />
Given an expression '''r''' and a single char '''x''' the second function, here called '''delta''' computes the so called<br />
derivative of '''r''' with respect to '''x'''. The derivative '''delta r x''' is again a regular expression.<br />
The following law (in Haskell like notation) must hold for '''delta r x''':<br />
<br />
<haskell><br />
match r (x:xs)<br />
<=><br />
match (delta r x) xs <br />
</haskell><br />
<br />
Brzozowski has shown that this derivative exists, and that it is simple to construct it.<br />
<br />
Let's see how regular Expression can be modelled with Haskell data types and how<br />
'''nullable''' and '''delta''' work. In the example we will work with the fixed alphabet of Haskell chars.<br />
<br />
<haskell><br />
data Regex = Zero -- {}<br />
| Unit -- {&epsilon;}<br />
| Sym Char -- {a}<br />
| Star Regex -- r*<br />
| Seq Regex Regex -- r1 . r2<br />
| Alt Regex Regex -- r1 | r2<br />
</haskell><br />
<br />
In this first version we use the minimal set of regular expressions:<br />
The empty set, the set containing &epsilon; and the single char sets<br />
form the primitive sets, '''Star''' is the repetition, '''Seq''' the concatenation<br />
and '''Alt''' the choice operator.<br />
<br />
'''nullable''' is defined like this, so it's easy and efficient to compute this predicate:<br />
<br />
<haskell><br />
nullable :: Regex -> Bool<br />
<br />
nullable Zero = False<br />
nullable Unit = True<br />
nullable (Sym a) = False<br />
nullable (Star r) = True<br />
nullable (Seq r1 r2) = nullable r1<br />
&& nullable r2<br />
nullable (Alt r1 r2) = nullable r1<br />
|| nullable r2<br />
<br />
</haskell><br />
<br />
We see for the three simple cases that only '''Unit''' is nullable,<br />
'''r*''' contains per definition the empty word. A sequence contains the empty word<br />
only in case where both parts are nullable, a union is nullable when at least one operand<br />
is nullable.<br />
<br />
For '''delta''' we've again 6 cases:<br />
<br />
<haskell><br />
delta :: Regex -> Char -> Regex<br />
<br />
delta Zero x = Zero<br />
<br />
delta Unit x = Zero<br />
<br />
delta (Sym y) x<br />
| x == y = Unit<br />
| otherwise = Zero<br />
<br />
delta (Star r) x = Seq (delta r x) (Star r)<br />
<br />
delta (Seq r1 r2) x<br />
| nullable r1 = Alt dr1 dr2<br />
| otherwise = dr1<br />
where<br />
dr1 = Seq (delta r1 x) r2<br />
dr2 = delta r2 x<br />
<br />
delta (Alt r1 r2) x = Alt (delta r1 x) (delta r2 x)<br />
</haskell><br />
<br />
'''delta''' can be viewed as a parser that can accept a single character and delivers a new parser.<br />
'''delta''' fails, when the parser (the regular expression) is '''Zero''' or '''Unit'''. A '''Sym''' parser<br />
checks the input against the required character and fails ('''Zero''') or results in '''Unit''' (the EOF parser).<br />
A '''r*''' expression is expanded into a sequence '''r&sdot;r*''', and the input character<br />
is parsed with the simple parser '''r'''.<br />
<br />
The most complicated rule is the rule for '''Seq r1 r2'''. There are two cases. The simple one is<br />
that '''r1''' is not '''nullable'''. In this case the input must be consumed by '''r1'''. In the second case<br />
('''r1''' is '''nullable''')<br />
there are to choices: The input could be consumed by '''r1''', but it also could be consumed by '''r2'''<br />
when '''r1''' only accepts the empty word.<br />
The '''Alt r1 r2''' rule is defined such that both '''r1''' and '''r2''' are run in parallel. Here is the<br />
point where the nondeterminism is implemented.<br />
<br />
<br />
'''delta''' can easily be expanded on strings<br />
<br />
<haskell><br />
deltaS :: Regex -> String -> Regex<br />
deltaS = foldl delta<br />
</haskell><br />
<br />
Combining '''deltaS''' with '''nullable''' gives us a simple matching function<br />
<br />
<haskell><br />
match :: Regex -> String -> Bool<br />
match re = nullable . deltaS re<br />
</haskell><br />
<br />
These are the essential facts from theory, but is this approach practically applicable?<br />
<br />
== Making the theory practically applicable ==<br />
<br />
The above shown code runs, but it just runs for toy examples. When checking words<br />
with this simple version of '''delta''' space and time can grow exponentially with the length of<br />
the input. A simple example is the stepwise derivation of '''a*''' with '''a'''-s as input<br />
<br />
<haskell><br />
deltaS (Star (Sym 'a')) "a" = Seq Unit (Star (Sym 'a'))<br />
deltaS (Star (Sym 'a')) "aa" = Alt (Seq Zero (Star (Sym 'a')))<br />
(Seq Unit (Star (Sym 'a')))<br />
deltaS (Star (Sym 'a')) "aaa" = Alt (Seq Zero (Star (Sym 'a')))<br />
(Alt (Seq Zero (Star (Sym 'a')))<br />
(Seq Unit (Star (Sym 'a'))))<br />
...<br />
</haskell><br />
<br />
The problem here is that within '''delta''' the derivations become more complex.<br />
This happens in two places: In the rule for '''Star''', where '''r*''' becomes '''r&sdot;r*'''<br />
and in the '''Seq''' rule, when '''r1''' is nullable. In this case a choice is introduced.<br />
<br />
The solution for this problem is, like in symbolic algebra systems, the introduction<br />
of a simplification step after derivation. We easily see that '''Seq Unit r2''' is<br />
the same as '''r2'''. This rule is applicable in the 1. derivation. Furthermore<br />
'''Seq Zero r2''' is equivalent to '''Zero''' (failure remains failure). A third effective rule<br />
is: '''Alt Zero r2''' equals '''r2'''.<br />
<br />
These and some more simplification rules can be added by introducing smart constructors:<br />
<br />
<haskell><br />
mkSeq :: Regex -> Regex -> Regex<br />
mkSeq Zero r2 = Zero<br />
mkSeq r1 Zero = Zero<br />
mkSeq Unit r2 = r2<br />
mkSeq r1 r2 = Seq r1 r2<br />
<br />
mkAlt Zero r2 = r2<br />
mkAlt r1 Zero = r1<br />
mkAlt r1 r2 = Alt r1 r2<br />
</haskell><br />
<br />
With these simplification rules the resulting regular expressions remain constant in size when<br />
successively derive an expression. Furthermore the simplification rules run in constant time.<br />
So space as well as runtime remains proportional to the length of the input.<br />
<br />
== Extension: Operators for intersection, complement, difference and exclusive or ==<br />
<br />
=== Implementation of other operators on regular sets and regular expressions ===<br />
<br />
To check, whether a word '''w''' is in the union of two regular sets represented by expressions<br />
'''r1''' and '''r2''', we apply '''delta''' to both '''r1''' and '''r2'''. The real test is then done within<br />
'''nullable''' and there it's done by a simple logical OR operation. This observation leads to the<br />
question, whether we could implement other binary operations on regular set. And indeed this can be done<br />
for all binary operations. A test whether a word '''w''' is in the intersection of two regular sets<br />
can be added just by adding the operator to the '''Regex''' data type, add a rule for '''delta''' with<br />
precisely the same structure as for '''Alt''' and add the predicate with the corresponding logical<br />
operation to '''nullable'''. Here's the example for intersection:<br />
<br />
<haskell><br />
data Regex = ...<br />
| ...<br />
| Isect Regex Regex<br />
<br />
nullable (Isect r1 r2) = nullable r1<br />
&& nullable r2<br />
<br />
delta (Isect r1 r2) x = mkIsect (delta r1 x) (delta r2 x)<br />
<br />
<br />
-- smart constructor with some simplification rules<br />
<br />
mkIsect Zero r2 = Zero<br />
mkIsect r1 Zero = Zero<br />
mkIsect Unit r2 = Unit<br />
mkIsect r1 Unit = Unit<br />
...<br />
mkIsect r1 r2 = Isect r1 r2<br />
</haskell><br />
<br />
So adding intersection, difference or other set operations are done by<br />
adding about 10 lines of code.<br />
<br />
=== Syntax extensions for new operators ===<br />
<br />
When extending the XML Schema regular expression syntax, there was one leading<br />
principle: All legal regular expressions should remain correct and their semantics<br />
should not change. In the concrete syntax the curly braces are special symbols<br />
and they are only allowed in so called quantifiers, postfix operators that specify<br />
a kind of repetition, e..g. '''a{5,7}''' stand for 5,6 or 7 a's.<br />
So operators in curly braces, e.g. '''{&}''' for intersection,<br />
is an extension that does not conflict with the standard syntax.<br />
The following new operators are added: '''{&}''' for intersection '''{\}''' for set difference and<br />
'''{^}''' for exclusive or, here enumerated with decreasing priority.<br />
<br />
Examples:<br />
<br />
; '''.*a.*{&}.*b.*''' all words containing at least one '''a''' and one '''b'''<br />
; '''[a-z]+{\}bush''' all names but not '''bush'''<br />
; '''.*a.*{^}.*b.*''' all words containing at least one '''a''' or one '''b''' but not both an '''a''' and a '''b'''<br />
<br />
A complement operator can be formulated by using the set difference operator.<br />
The first attempt '''.*{\}bush''' (everything but not '''bush''') fails. The '''.''' in XML Schema syntax<br />
is a shortcut for every character but newline (\n) and carriage return (\r). So '''.|\n|\r''' is<br />
the whole alphabet and '''(.|\n|\r)*''' all words over the alphabet. This makes the<br />
complement expression a bit clumsy: '''(.|\n|\r)*{\}bush'''. There are some character escape sequences<br />
for character sets, e..g. '''\s''' for whitespace, '''\i''' for XML name start characters, '''\d''' for<br />
digits and others. This list has been extended by '''\a''' for '''.|\n|\r''' and '''\A''' for '''\a*'''.<br />
With this new '''multiCharEsc''' spec the above complement expression becomes '''\A{\}bush'''.<br />
<br />
=== Examples using the extended syntax ===<br />
<br />
==== Substitution of none greedy operators ====<br />
<br />
In Perl and other libraries there are so called none greedy repetition operators.<br />
These are not present in the W3C XML Schema syntax. But for many real world examples these<br />
none greedy expressions can be reformulated with the use of set difference.<br />
A classical example is a regular expression for comments, which are delimited by character<br />
sequences, like in C with '''/*''' and '''*/'''. The naive approach '''/[*].*[*]/''' does<br />
not work. A word like '''/*abc*/123*/''' is not a C comment, but it matches the above given expression.<br />
<br />
The solution with this library is an expression like the following:<br />
<br />
'''/[*](\A{\}(\A[*]/\A))[*]/'''<br />
<br />
in words: The contents of a C (multi line) comment is every word, that does not contain a subsequence<br />
of '''*/'''<br />
<br />
==== Identifiers except keywords ====<br />
<br />
In most scanner specs, the regular expressions for names and keywords overlap and<br />
the sequence of the rules in the scanner spec becomes important for solving this<br />
ambiguity. With the set difference it becomes simple to exclude keywords from the regular expression for identifier.<br />
<br />
A simple example:<br />
<br />
'''[a-z][a-z0-9]*{\}(if|then|else|while|do)'''<br />
<br />
excludes the 5 keywords from the set of identifiers.<br />
<br />
==== Permutations ====<br />
<br />
With the use of the intersection operator '''{&}''' it is rather easy to formulate<br />
a regular expression for the permutations of a character set.<br />
<br />
'''.*a.*{&}.*b.*{&}.*c.*{&}.{3}'''<br />
<br />
is an expressions for all permutations of '''a''', '''b''' and '''c'''.<br />
<br />
== Extension: Matching of subexpressions ==<br />
<br />
This library supports matching of subexpressions like in Perl, but the syntax for subexpressions<br />
is different. Labeling subexpressions is not done implicitly by counting and numbering the pairs of parentheses, but <br />
the parentheses of interest can be labeled with a name. This is more flexible and less<br />
error prone when extending the regular expressions.<br />
<br />
Here's an example for searching a date pattern in YYYY-MM-DD format in a line of text and<br />
in case of success giving back the strings for the year, month and day:<br />
<br />
'''.*({y}[0-9]{4})-({m}[0-9]{2})-({d}[0-9]{2}).*'''<br />
<br />
The three pairs of parentheses are labeled '''y''', '''m''' and '''d'''. Let's see<br />
what function we call with this pattern and how the result looks like.<br />
We've already seen the '''match''' function giving a back a Boolean for the match result.<br />
This is too less information in this case. We need an extended form of match called '''matchSubex'''<br />
with the following signature:<br />
<br />
<haskell><br />
matchSubex :: String -> String -> [(String, String)]<br />
matchSubex re input = ...<br />
</haskell><br />
<br />
=== The '''matchSubex''' function ===<br />
Extracting the date from a single line of output from a Unix '''ls -l''' command can be done with the following<br />
code:<br />
<br />
<haskell><br />
getDate [(String,String)] -> Maybe (Int, Int, Int)<br />
getDate [("y",y),("m",m),("d",d)] = Just (read y, read m, read d)<br />
getDate _ = Nothing<br />
<br />
getDate . matchSubex ".*({y}[0-9]{4})-({m}[0-9]{2})-({d}[0-9]{2}).*"<br />
$ "-rw-r--r-- 1 uwe users 2264 2008-11-19 15:36 Main.hs"<br />
<br />
=> Just (2008,11,19)<br />
</haskell><br />
<br />
The '''matchSubex''' function returns a list of label-value pairs for the subexpression matches.<br />
The label-value pairs occur in the same sequence as in the regular expression. If there is no match<br />
or there is no labeled subexpression the result is the empty list.<br />
Nesting of labeled subexpressions is possible. Examples:<br />
<br />
<haskell><br />
matchSubex ".*({date}({y}[0-9]{4})-({m}[0-9]{2})-({d}[0-9]{2})).*"<br />
$ "-rw-r--r-- 1 uwe users 2264 2008-11-19 15:36 Main.hs"<br />
<br />
=> [("date","2008-11-19"),("y","2008"),("m","11"),("d",19")]<br />
</haskell><br />
<br />
=== Matching with subexpressions and nondeterminism ===<br />
<br />
When writing regular expressions with labeled subexpressions there are some traps, if this<br />
is not done carefully. When an expression is nondeterministic and contains labeled subexpressions,<br />
all matches for these subexpressions are computed.<br />
The number of matches can grow exponentially, so the runtime also<br />
can grow exponentially.<br />
<br />
Here is a simple example demonstrating this situation.<br />
<br />
<haskell><br />
matchSubex "(({l}x+))*"$ "xx"<br />
<br />
=> [("l","xx"),("l","x"),("l","x")]<br />
</haskell><br />
<br />
The expression is '''(x+)*''' and<br />
the '''x+''' is labeled with '''l'''. The match can be done in two ways.<br />
The first one is: apply the *-expression once and take '''xx''' as the matching subexpression,<br />
the second is apply the *-expression two times and take tow times '''x''' as matching subexpressions.<br />
This gives 3 matches for label '''l'''. The situation becomes much more complicated for longer input.<br />
<br />
There is one extension for resolving nondeterministic results with subexpressions.<br />
As an example, let's match a text with identifiers or keywords. This could be done as follows<br />
<br />
<haskell><br />
matchSubex "({name}[a-z][a-z0-9]*)|({keyword}if|then|else|while|do)" "abc"<br />
<br />
=> [("name","abc")]<br />
<br />
matchSubex "({name}[a-z][a-z0-9]*)|({keyword}if|then|else|while|do)" "else"<br />
<br />
=> [("name","else"),("keyword", "else")]<br />
</haskell><br />
<br />
In the 2. test we would like to detect the '''else''' as a keyword. This could be done<br />
by subtracting all keywords from the subexpressions for name. but this make the expression<br />
unnecessarily complicated. There is an operator '''{|}''' that works like set union, but for<br />
subexpressions it's not symmetric. If the left hand side matches, only these results are taken<br />
the results from the right hand are ignored. So the above example can be rewritten to prioritize<br />
the keywords as follows:<br />
<br />
<haskell><br />
matchSubex "({keyword}if|then|else|while|do){|}({name}[a-z][a-z0-9]*)" "abc"<br />
<br />
=> [("name","abc")]<br />
<br />
matchSubex "({keyword}if|then|else|while|do){|}({name}[a-z][a-z0-9]*)" "else"<br />
<br />
=> [("keyword", "else")]<br />
</haskell><br />
<br />
== Examples for editing ==<br />
<br />
'''sed''' is a function for stream editing of substrings matching a regular expressions<br />
<br />
<haskell><br />
sed :: (String -> String) -> String -> String -> String<br />
sed edit regex input = ...<br />
<br />
-- examples<br />
<br />
sed (const "b") "a" "xaxax" => "xbxbx"<br />
sed (\ x -> x ++ x) "a" "xax" => "xaax"<br />
sed jandl "l|r" "left or right" => "reft ol light"<br />
<br />
-- with<br />
<br />
jandl "l" = "r"<br />
jandl "r" = "l"<br />
<br />
</haskell><br />
<br />
== Examples for tokenizing ==<br />
<br />
The '''tokenize''' function can be used for constructing simple tokenizers.<br />
It is recommended to use regular expressions where the empty word does not match.<br />
Else there will appear a lot of probably useless empty tokens in the output.<br />
All none matching chars are discarded.<br />
<br />
Here are some test cases:<br />
<br />
<haskell><br />
tokenize :: String -> String -> [String]<br />
tokenize regex string = ...<br />
<br />
-- example runs<br />
<br />
tokenize "a" "aabba" => ["a","a","a"]<br />
tokenize "a*" "aaaba" => ["aaa","a"]<br />
tokenize "a*" "bbb" => ["","",""]<br />
tokenize "a+" "bbb" => []<br />
<br />
tokenize "[a-z]{2,}|[0-9]{2,}|[0-9]+[.][0-9]+"<br />
"ab123 456.7abc"<br />
=> ["ab","123","456.7","abc"]<br />
<br />
tokenize "[^ \t\n\r]*"<br />
"abc def\t\n\rxyz"<br />
=> ["abc","def","xyz"]<br />
<br />
tokenize ".*"<br />
"\nabc\n123\n\nxyz\n"<br />
= ["","abc","123","","xyz"]<br />
<br />
tokenize ".*" = lines<br />
<br />
tokenize "[^ \t\n\r]*" = words<br />
</haskell><br />
<br />
There are two more tokenizer functions, '''tokenize' ''' does not throw<br />
away the none matching characters, and '''tokenizeSubex''' works with labeled<br />
subexpressions, so the resulting words can be labeled and can further be processed<br />
with respect to that label.<br />
<br />
<haskell><br />
tokenizeSubex ( "({keyword}if|then|else|while|do)"<br />
++ "{|}" ++<br />
"({name}[a-z][a-z0-9]*)"<br />
++ "|" ++<br />
"({num}[0-9]+)"<br />
++ "|" ++<br />
"({op}==|/=|:=|[+])"<br />
)<br />
"if abc /= 42 then abc := 42"<br />
=><br />
[ ("keyword", "if" )<br />
, ("name", "abc" )<br />
, ("op", "/=" )<br />
, ("num", "42" )<br />
, ("keyword", "then" )<br />
, ("name", "abc" )<br />
, ("op", ":=" )<br />
, ("num", "42" )<br />
]<br />
</haskell><br />
<br />
== Performance ==<br />
<br />
A simple performance test shows that it's possible to process even large<br />
data rather efficiently. The simple test does the following: A large text file is generated,<br />
in the example run with 2^25 characters (about 33Mb). This file is copied by reading and writing the complete file<br />
to get a figure about the time spend in IO.<br />
The second test reads in the file splits it up into lines with the predefined '''line''' function<br />
and writes the lines out again. This is compared with a line function defined as<br />
'''tokenize ".*"'''. The same comparison is done with the built in '''words''' and<br />
'''tokenize "\\S+"'''.<br />
<br />
The following runtimes in second where measured:<br />
<br />
; &nbsp;1.33 copy file<br />
; &nbsp;8.48 split into lines with lines from prelude<br />
; 10.93 split into lines with regex tokenize<br />
; 20.44 split into words with words from prelude<br />
; 39.75 split into words with regex tokenize<br />
<br />
So the overhead introduced by repeated computation of derivation is<br />
not too bad. Of course this throughput can not be expected by more complex<br />
regular expressions.</div>UweSchmidt