=== regex-tdfa ===

Chris Kuklewicz has released <code>regex-tdfa</code> (Tagged Deterministic Finite Automata), a new library that works with GHC, most recently ghc-6.10.1. It is POSIX compliant and has been tested against [http://www2.research.att.com/~astopen/testregex/testregex.html the AT&T tests].

The package is available on Hackage as [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/regex-tdfa regex-tdfa] and via [http://darcs.haskell.org/packages/regex-unstable/regex-tdfa/ darcs].

The [http://darcs.haskell.org/packages/regex-unstable/regex-tdfa/doc/html/regex-tdfa/Text-Regex-TDFA.html haddock documentation] is also on the darcs site.

The library uses a tagged DFA, like the TRE C library, to provide efficient POSIX matching.  It defaults to true POSIX submatch capture (including ambiguous *-operator subpatterns), but this extra effort can be disabled.
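
As a small illustration of the POSIX semantics described above, the following sketch uses the <code>=~</code> operator from <code>Text.Regex.TDFA</code>. The names <code>wholeMatch</code> and <code>withCaptures</code> are made up for this example. The whole match is the leftmost-longest one, and the subpattern <code>(a|ab)(c|bcd)(d*)</code> is the kind of ambiguous case where POSIX capture rules differ from a backtracking matcher.

```haskell
import Text.Regex.TDFA ((=~))

-- POSIX picks the longest match starting at the leftmost position,
-- so the order of the alternatives does not decide the result.
wholeMatch :: String
wholeMatch = "abcd" =~ "a|ab|abc"          -- "abc"

-- Result shape: (text before, matched text, text after, submatches).
-- Under POSIX rules the earlier groups are made as long as possible
-- while the overall match stays the longest one.
withCaptures :: (String, String, String, [String])
withCaptures = "abcd" =~ "(a|ab)(c|bcd)(d*)"
-- ("", "abcd", "", ["ab", "c", "d"])
```

A Perl-style backtracking engine would typically stop at the first alternative that succeeds, giving <code>"a"</code> for the first pattern.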

The versions from 0.90 and up use mutable ST arrays to keep track of data during matching and thus have both decent speed and decent memory performance.  Previous versions drove the memory use too high, overworking the garbage collector.

The versions from 1.0.0 and up improve the algorithm.  The search time is now O(N) for text of length N in the worst case, while still providing correct POSIX capturing and while running in bounded space.

Disabling submatch capture (see the <code>captureGroups</code> field of <code>ExecOption</code>) lets the library skip the extra work and should run faster (the "non capture" case, which is also used when the regular expression contains no parentheses).  Running in single line mode (see <code>CompOption</code>) with a leading ^ anchor also avoids extra work and should run faster (the "front anchored" case).  Combining both optimizations should run faster still.
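
A minimal sketch of selecting both optimizations, assuming the record types are exported as <code>CompOption</code> and <code>ExecOption</code> and that "single line mode" corresponds to setting the <code>multiline</code> field to <code>False</code>. The helper name <code>compileFast</code> is hypothetical.

```haskell
import Text.Regex.TDFA

-- Hypothetical helper: compile a pattern with submatch capture
-- turned off and newline-sensitive (multiline) matching disabled.
compileFast :: String -> Regex
compileFast = makeRegexOpts comp exec
  where
    comp = defaultCompOpt { multiline = False }      -- single line mode (assumed mapping)
    exec = defaultExecOpt { captureGroups = False }  -- "non capture" case

-- A leading ^ then gives the "front anchored" case as well.
startsWithAbc :: String -> Bool
startsWithAbc = matchTest (compileFast "^ab+c")
```

For example, <code>startsWithAbc "abbbc and more"</code> is <code>True</code>, while <code>startsWithAbc "xabbbc"</code> is <code>False</code>.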

Just testing for a match stops at the shortest match found and should be fast (use <code>matchTest</code>, or <code>match</code>/<code>matchM</code> with a <code>Bool</code> result), and this also tries to optimize for the "front anchored" case.
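
The yes/no test can be written in several equivalent ways. The sketch below shows three; the function names and the patterns are only illustrative.

```haskell
import Text.Regex.TDFA

-- Compile once, then reuse the Regex for many yes/no tests.
hexLiteral :: Regex
hexLiteral = makeRegex "^0x[0-9a-fA-F]+"

isHexLiteral :: String -> Bool
isHexLiteral = matchTest hexLiteral

-- The polymorphic 'match' with Bool chosen as the result type.
containsDigit :: String -> Bool
containsDigit s = match (makeRegex "[0-9]" :: Regex) s

-- The operator form; the Bool result type comes from the signature.
containsSpace :: String -> Bool
containsSpace s = s =~ "[[:space:]]"
```

Compiling the pattern once with <code>makeRegex</code>, as in <code>hexLiteral</code>, avoids recompiling it for every input string.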

The major advantage over PCRE is the avoidance of exponential blowup for certain patterns: asymptotically, the time required to match a pattern against a string is always linear in the length of the string.  This O(N) scaling is [http://archive.fo/LUPTs now achieved] even in the worst case and when returning the correct POSIX captures.
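
To make the contrast concrete, here is a sketch with a pattern of the kind that is often cited as causing catastrophic backtracking: nested repetition followed by a character that never appears in the input. The pattern and the input length are arbitrary choices for illustration, and no timings are claimed here.

```haskell
import Text.Regex.TDFA

-- Nested repetition: a backtracking matcher may try very many ways
-- of splitting the run of 'x' characters before giving up.
nested :: Regex
nested = makeRegex "(x+x+)+y"

-- The input contains no 'y', so the match must fail.
longRunMatches :: Bool
longRunMatches = matchTest nested (replicate 2000 'x')
```

Here <code>longRunMatches</code> is <code>False</code>, since there is no <code>y</code> in the input.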

As of version 1.1.1 the following GNU extensions are recognized, all anchors:

<pre>
\` at beginning of entire text

\' at end of entire text

\< at beginning of word

\> at end of word

\b at either beginning or end of word

\B at neither beginning nor end of word 
</pre>

The above are controlled by the <code>newSyntax</code> field (a <code>Bool</code>) of <code>CompOption</code>.
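
A short sketch of the word anchors in use, assuming <code>newSyntax</code> is enabled in <code>defaultCompOpt</code> (which <code>=~</code> uses). The names <code>wholeWord</code> and <code>compileOldSyntax</code> are hypothetical. Note that the backslashes must be doubled inside a Haskell string literal.

```haskell
import Text.Regex.TDFA

-- Match "cat" only as a whole word: \< and \> anchor the match
-- to the beginning and end of a word.
wholeWord :: (String, String, String)
wholeWord = "concatenate a cat now" =~ "\\<cat\\>"
-- ("concatenate a ", "cat", " now")

-- Hypothetical helper: compile with the GNU extensions switched off.
compileOldSyntax :: String -> Regex
compileOldSyntax = makeRegexOpts (defaultCompOpt { newSyntax = False }) defaultExecOpt
```

The "cat" inside "concatenate" is skipped because it is surrounded by other word characters.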