Editing Regular expressions (section)

== Overview ==

Chris Kuklewicz has developed a regular expression library for Haskell that has been implemented with a variety of backends.  Some of these backends are native Haskell implementations, others are not and rely on external C libraries such libpcre.  New users may feel overwhelmed with the various options that are available to them.  The following table provides an overview of the various features supported by each backend.

There are also a number of alternate or complementary regular expression libs, including:

* Bryan O'Sullivan's text-icu – bindings to the [http://site.icu-project.org/ ICU library], which includes Perl compatible regexes with extended Unicode support. One of few regex libraries working with Text and not String.
* Yoshikuni Jujo's regexpr - Regular expression library like Perl and Ruby's regular expressions
* Don Stewarts's pcre-light - A small, efficient and portable regex library for Perl 5 compatible regular expressions
* Martin Sulzmann's regexpr-symbolic - Equality, containment, intersection among regular expressions via symbolic manipulation
* Matt Morrow's regexqq - A quasiquoter for PCRE regexes
* Uwe Schmidt's hxt-regex-xmlschema - supports full W3C XML Schema regular expressions inclusive all Unicode character sets and blocks. A tutorial is available at [[Regular expressions for XML Schema]].

{| border="1" cellspacing="0" cellpadding="5" 
|+ Feature Matrix of Backends
! Backend
! Grouping?
! POSIX/Perl
! Speed
! Native Impl?
! Stable?
! Lazy?
! Comments
|-
| regex-posix
| Yes
| POSIX
| very slow
| No
| Yes
| No
|
|-
| regex-parsec
| Yes
| POSIX,Perl
| slow
| Yes
| Yes
| ?
|
|-
| regex-tre
| Yes
| POSIX
| fast
| No
| No
| ?
| uses buggy libtre (v0.7.5)
|-
| regex-tdfa
| Yes
| POSIX
| fast
| Yes
| Yes
| Yes
| full Posix compliance
|-
| regex-pcre
| Yes
| Perl
| fast
| No
| Yes
| ?
|
|-
| regex-pcre-builtin
| Yes
| Perl
| fast
| No
| Yes 
| ?
|
|-
| regex-dfa
| No
| POSIX
| fast
| Yes
| Yes
| ?
| 
|-
| regexpr
| Yes
| Perl
| ?
| Yes
| Yes
| ?
| easier for newcomers from other languages; 0.5.1 leaks memory in some cases
|}

'''Note:''' speed is something that should be benchmarked by the actual user, since
the story changes so much with the task, new GHC, compiler flags, etc. 
The algorithm used may be a useful thing (backtracking vs NFA/DFA).

All support String, (Seq Char), ByteString, and (except for regex-posix) ByteString.Lazy.

All are available from [http://hackage.haskell.org/packages/archive/pkg-list.html#cat:Text Hackage] as tar.gz sources and from [http://darcs.haskell.org/packages/regex-unstable/ darcs].

=== regex-base ===

This package exports Text.Regex.Base which re-exports Text.Regex.RegexLike and Text.Regex.Context.  These do not provide the ability to do matching, but provide the type classes which all the other regex-* backends use.

The backend packages also import the utility module Text.Regex.Impl to streamline instance declarations.

The 0.71 version has a "tail" bug in one of the instances of RegexLike:
<haskell>
instance (RegexLike a b) => RegexContext a b (MatchResult b) where 
</haskell>
which I hope to be fixed in ghc 6.6.1.  Getting the unstable version of regex-base also fixes this, though you will have to get and re-compile all the other regex-* modules as well.

The versions of the regex-* backends that come with GHC 6.6 do not re-export the RegexLike classes, so the usage of regex-BACKEND is
<haskell>
import Text.Regex.Base
import Text.Regex.BACKEND
</haskell>
The versions in unstable are being upgraded to re-export RegexLike, so the usage will be simplified to
<haskell>
import Text.Regex.BACKEND
</haskell>

The 0.71 version of regex-base only has Extract instances for [Char] and ByteString.  The unstable version also provides instances of ByteString.Lazy and (Seq Char).  This Extract support must be accompanied by adding support to each regex-* backend in unstable, and that work is in progress.

The RegexMaker class in v0.71 had no way to gracefully report errors in parsing the regular expression itself.  This was a design mistake and has been extended in the unstable version of regex-base to provide monadic version which can fail gracefully.

The RegexLike class only provides support for positions in the source text indexed by Int.  I am still considering how to provide Int64 support.  The best way is probably going to be another class, which will allow for more generalized index type support.  Different backends will, by necessity, have different instances for extended index types.

=== regex-tdfa ===

Chris Kuklewicz has just released <code>regex-tdfa</code>, (Tagged Deterministic Finite Automata), a new library that works with GHC, the most recent being ghc-6.10.1. It is POSIX compliant and  tested against 
[http://www2.research.att.com/~astopen/testregex/testregex.html the AT&T tests].

This is available on hackage at [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/regex-tdfa regex-tdfa] and via [http://darcs.haskell.org/packages/regex-unstable/regex-tdfa/ darcs].

The [http://darcs.haskell.org/packages/regex-unstable/regex-tdfa/doc/html/regex-tdfa/Text-Regex-TDFA.html haddock documentation] is also on the darcs site.

This uses a tagged DFA like the TRE c-library to provide efficient Posix matching.  It also defaults to true Posix submatch capture (including ambiguous *-operator subpatterns), but this extra effort can be disabled.

The versions from 0.90 and up use mutable ST arrays to keep track of data during matching and have thus have both decent speed and decent memory performance.  Previous versions drove the memory use too high, overworking the garbage collector.

The versions from 1.0.0 and up improve the algorithm.  The search time is now O(N) for text of length N in the worst case, while still providing correct POSIX capturing and while running in bounded space.

By disabling submatch capture (see the <code>captureGroups</code> field of <code>ExecOptions</code>) this library avoids the extra work and should run faster ("non capture" case, this is also used if there are no parenthesis in the regular expression).  By running in single line mode (see the <code>CompOptions</code>) and with a leading ^ anchor this library also avoids extra work and should run faster ("front achored" case).  Doing both optimization should run faster still.

Just testing for a match stops at the shortest found match and should be fast (using matchTest or match/mathM for a Bool output), and this also tries to optimize for the "front anchored" case.

The major advantage over pcre is avoidance of exponential blowup for certain patterns: asymptotically, the time required to match a pattern against a string is always linear in length of the string.  This O(N) scaling is [http://archive.fo/LUPTs now achieved] even in the worst case and when returning the correct Posix captures.

As of version 1.1.1 the following GNU extensions are recognized, all anchors:

<pre>
\` at beginning of entire text

\' at end of entire text

\< at beginning of word

\> at end of word

\b at either beginning or end of word

\B at neither beginning nor end of word 
</pre>

The above are controlled by the 'newSyntax' Bool in 'CompOption'.

=== regex-posix ===

See [[Regex Posix]] for bug reports relating to your operating system.

This backend provides a Haskell interface for the "posix" c-library that comes with most operating systems, and is provided by '''include "regex.h"'''. This c-library probably also drives command line utilities such as sed and grep.

"Posix" is in quotes since it is often not fully Posix compliant and may be buggy (as on OS X, where the bug also affects sed).

And the c-library has what I call impossibly-slow performance, as in at least 100x slower than other regex engines.

The goal of regex-tdfa is to create a replacement for regex-posix to accompany a future version of GHC.

=== regex-compat ===

This takes regex-posix and presents a Text.Regex api that mirrors the one that came with GHC 6.4.x for compatibility.

=== regex-pcre ===

This wraps the pcre c-library from http://www.pcre.org and gives all the Perl regular expression syntax you might want.  This is especially efficient with Data.ByteString.

=== regex-pcre-builtin ===

This is the same as regex-pcre, but comes bundled with a version of the pcre C library.

=== pcre-light ===

Another FFI binding to PCRE; Don Stewart's pcre-light is intended to be "A light regular expression library, using Perl 5 compatible regexes", with support for Strings and strict ByteStrings. It is available on [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/pcre-light Hackage], or through a [http://code.haskell.org/~dons/code/pcre-light/ Darcs repo]; see also the [http://www.haskell.org/pipermail/haskell/2008-January/020120.html original announcement].

=== regex-tre ===

This wraps the TRE c-library from http://laurikari.net/tre/ which provides Posix regular expressions.  The current 0.7.5 version of TRE is buggy, however, so you should avoid this library unless you test your regular expressions to ensure you avoid the bugs.

The author of TRE is currently working to fix these bugs.

Also, the Posix compliance in 0.7.5 fails for *-operators with ambiguous captures.

=== regex-dfa ===

This is the only LGPL backend, as it is derived from [http://www.cse.unsw.edu.au/~chak/haskell/ctk/ CTKLight].  The stable version has had a bad bug on *-operators around patterns that could accept zero characters.  The unstable version will have this issue fixed.

This library provides no submatch captures, but is very fast at finding the Posix leftmost longest match.

=== regex-parsec ===

This backend can either find the left-biased match like Perl or the longest match like Posix.  It uses Parsec as a backtracking matching engine and is quite slow.

The submatches returned in longest match mode maximize the length of the captured texts instead of the subpatterns, and this is a divergence from the Posix specification.