Regular expressions

From HaskellWiki

Overview

Chris Kuklewicz has developed a regular expression library for Haskell that has been implemented with a variety of backends. Some of these backends are native Haskell implementations; others rely on external C libraries such as libpcre. New users may feel overwhelmed by the various options available to them. The following table provides an overview of the features supported by each backend.

There are also a number of alternative or complementary regular expression libraries, including:

  • Bryan O'Sullivan's text-icu – bindings to the ICU library (http://site.icu-project.org/), which includes Perl compatible regexes with extended Unicode support. One of the few regex libraries that work with Text rather than String.
  • Yoshikuni Jujo's regexpr – a regular expression library in the style of Perl's and Ruby's regular expressions
  • Don Stewart's pcre-light – a small, efficient and portable regex library for Perl 5 compatible regular expressions
  • Martin Sulzmann's regexpr-symbolic – equality, containment, and intersection of regular expressions via symbolic manipulation
  • Matt Morrow's regexqq – a quasiquoter for PCRE regexes
  • Uwe Schmidt's hxt-regex-xmlschema – supports the full W3C XML Schema regular expressions, including all Unicode character sets and blocks. A tutorial is available at Regular expressions for XML Schema.

Feature Matrix of Backends

| Backend            | Grouping? | POSIX/Perl | Speed     | Native Impl? | Stable? | Lazy? | Comments                                                                     |
|--------------------|-----------|------------|-----------|--------------|---------|-------|------------------------------------------------------------------------------|
| regex-posix        | Yes       | POSIX      | very slow | No           | Yes     | No    |                                                                              |
| regex-parsec       | Yes       | POSIX,Perl | slow      | Yes          | Yes     | ?     |                                                                              |
| regex-tre          | Yes       | POSIX      | fast      | No           | No      | ?     | uses buggy libtre (v0.7.5)                                                   |
| regex-tdfa         | Yes       | POSIX      | fast      | Yes          | Yes     | Yes   | full Posix compliance                                                        |
| regex-pcre         | Yes       | Perl       | fast      | No           | Yes     | ?     |                                                                              |
| regex-pcre-builtin | Yes       | Perl       | fast      | No           | Yes     | ?     |                                                                              |
| regex-dfa          | No        | POSIX      | fast      | Yes          | Yes     | ?     |                                                                              |
| regexpr            | Yes       | Perl       | ?         | Yes          | Yes     | ?     | easier for newcomers from other languages; 0.5.1 leaks memory in some cases |

Note: speed should be benchmarked by the actual user, since it varies greatly with the task, the GHC version, compiler flags, etc. The algorithm used (backtracking vs NFA/DFA) may also be a useful criterion.

All support String, (Seq Char), ByteString, and (except for regex-posix) ByteString.Lazy.

All are available from Hackage as tar.gz sources and from darcs.

regex-base

This package exports Text.Regex.Base which re-exports Text.Regex.RegexLike and Text.Regex.Context. These do not provide the ability to do matching, but provide the type classes which all the other regex-* backends use.

The backend packages also import the utility module Text.Regex.Impl to streamline instance declarations.

The 0.71 version has a "tail" bug in one of the instances of RegexLike:

instance (RegexLike a b) => RegexContext a b (MatchResult b) where

which I hope will be fixed in GHC 6.6.1. Getting the unstable version of regex-base also fixes this, though you will have to get and re-compile all the other regex-* modules as well.

The versions of the regex-* backends that come with GHC 6.6 do not re-export the RegexLike classes, so the usage of regex-BACKEND is

import Text.Regex.Base
import Text.Regex.BACKEND

The versions in unstable are being upgraded to re-export RegexLike, so the usage will be simplified to

import Text.Regex.BACKEND

The 0.71 version of regex-base only has Extract instances for [Char] and ByteString. The unstable version also provides instances for ByteString.Lazy and (Seq Char). This Extract support must be accompanied by adding support to each regex-* backend in unstable, and that work is in progress.

The RegexMaker class in v0.71 had no way to gracefully report errors in parsing the regular expression itself. This was a design mistake, and the class has been extended in the unstable version of regex-base to provide a monadic variant which can fail gracefully.

The RegexLike class only provides support for positions in the source text indexed by Int. I am still considering how to provide Int64 support. The best way is probably going to be another class, which will allow for more generalized index type support. Different backends will, by necessity, have different instances for extended index types.
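
To make the shared interface concrete, here is a minimal sketch (assuming the regex-tdfa backend and the usual regex-base instances) showing how the result type chosen at the call site selects the RegexContext instance:

import Text.Regex.Base
import Text.Regex.TDFA   -- any regex-* backend can be substituted here

main :: IO ()
main = do
  -- The same (=~) call is interpreted differently depending on the
  -- requested result type:
  print ("foo=123" =~ "[0-9]+" :: Bool)                      -- True
  print ("foo=123" =~ "[0-9]+" :: String)                    -- "123"
  print ("foo=123" =~ "[0-9]+" :: (String, String, String))  -- ("foo=","123","")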

regex-tdfa

Chris Kuklewicz has released regex-tdfa (Tagged Deterministic Finite Automata), a library that works with GHC, the most recent being ghc-6.10.1. It is POSIX compliant and tested against the AT&T tests.

This is available on Hackage at http://hackage.haskell.org/cgi-bin/hackage-scripts/package/regex-tdfa and via darcs at http://darcs.haskell.org/packages/regex-unstable/regex-tdfa/.

The haddock documentation (http://darcs.haskell.org/packages/regex-unstable/regex-tdfa/doc/html/regex-tdfa/Text-Regex-TDFA.html) is also on the darcs site.

This uses a tagged DFA like the TRE c-library to provide efficient Posix matching. It also defaults to true Posix submatch capture (including ambiguous *-operator subpatterns), but this extra effort can be disabled.

The versions from 0.90 and up use mutable ST arrays to keep track of data during matching and thus have both decent speed and decent memory performance. Previous versions drove the memory use too high, overworking the garbage collector.

The versions from 1.0.0 and up improve the algorithm. The search time is now O(N) for text of length N in the worst case, while still providing correct POSIX capturing and while running in bounded space.

By disabling submatch capture (see the captureGroups field of ExecOption) this library avoids the extra work and should run faster (the "non capture" case; this is also used if there are no parentheses in the regular expression). By running in single line mode (see CompOption) and with a leading ^ anchor this library also avoids extra work and should run faster (the "front anchored" case). Doing both optimizations should be faster still.
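
As a rough illustration (a sketch only, assuming the CompOption and ExecOption records of regex-tdfa described above), both speed-ups can be requested when the pattern is compiled:

import Text.Regex.TDFA

-- Compile once with custom options: single-line mode plus a leading ^ anchor,
-- and submatch capture turned off.
quickMatch :: String -> Bool
quickMatch = matchTest anchored
  where
    compOpt  = defaultCompOpt { multiline = False }      -- "single line" mode
    execOpt  = defaultExecOpt { captureGroups = False }  -- skip capture bookkeeping
    anchored :: Regex
    anchored = makeRegexOpts compOpt execOpt "^foo=[0-9]+"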

Just testing for a match stops at the shortest found match and should be fast (use matchTest, or match/matchM with a Bool result), and this also tries to optimize for the "front anchored" case.

The major advantage over pcre is the avoidance of exponential blowup for certain patterns: asymptotically, the time required to match a pattern against a string is always linear in the length of the string. This O(N) scaling is now achieved even in the worst case and when returning the correct Posix captures.

As of version 1.1.1 the following GNU extensions are recognized, all anchors:

\` at beginning of entire text

\' at end of entire text

\< at beginning of word

\> at end of word

\b at either beginning or end of word

\B at neither beginning nor end of word 

The above are controlled by the 'newSyntax' Bool in 'CompOption'.

regex-posix

See Regex Posix for bug reports relating to your operating system.

This backend provides a Haskell interface to the "posix" c-library that comes with most operating systems, exposed through the "regex.h" header. This c-library probably also drives command line utilities such as sed and grep.

"Posix" is in quotes since it is often not fully Posix compliant and may be buggy (as on OS X, where the bug also affects sed).

And the c-library has what I call impossibly-slow performance, as in at least 100x slower than other regex engines.

The goal of regex-tdfa is to create a replacement for regex-posix to accompany a future version of GHC.

regex-compat

This takes regex-posix and presents a Text.Regex API that mirrors the one that came with GHC 6.4.x, for compatibility.
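
A small sketch of that older-style interface (assuming the mkRegex/matchRegex/subRegex functions of Text.Regex):

import Text.Regex (mkRegex, matchRegex, subRegex)

main :: IO ()
main = do
  let r = mkRegex "([0-9]+)-([0-9]+)"
  -- matchRegex returns the captured subgroups on success
  print (matchRegex r "range 10-20")              -- Just ["10","20"]
  -- subRegex substitutes, with \1, \2, ... as backreferences
  putStrLn (subRegex r "range 10-20" "\\2-\\1")   -- "range 20-10"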

regex-pcre

This wraps the pcre c-library from http://www.pcre.org and gives all the Perl regular expression syntax you might want. This is especially efficient with Data.ByteString.
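
For example (a sketch, using the shared (=~) operator with strict ByteStrings, where this backend is most efficient):

import Text.Regex.PCRE ((=~))
import qualified Data.ByteString.Char8 as B

-- Test whether the input contains "foo=" followed by digits.
hasNumberedFoo :: B.ByteString -> Bool
hasNumberedFoo s = s =~ B.pack "foo=[0-9]+"

main :: IO ()
main = print (hasNumberedFoo (B.pack "Hello foo=123"))   -- True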

regex-pcre-builtin

This is the same as regex-pcre, but comes bundled with a version of the pcre C library.

pcre-light

Another FFI binding to PCRE; Don Stewart's pcre-light is intended to be "A light regular expression library, using Perl 5 compatible regexes", with support for Strings and strict ByteStrings. It is available on Hackage (http://hackage.haskell.org/cgi-bin/hackage-scripts/package/pcre-light) or through a Darcs repo (http://code.haskell.org/~dons/code/pcre-light/); see also the original announcement (http://www.haskell.org/pipermail/haskell/2008-January/020120.html).
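
Unlike the regex-* family, pcre-light has its own small API; a minimal sketch (assuming the compile and match functions of Text.Regex.PCRE.Light):

import Text.Regex.PCRE.Light (compile, match)
import qualified Data.ByteString.Char8 as B

main :: IO ()
main = do
  let re = compile (B.pack "foo=([0-9]+)") []   -- no extra PCRE options
  -- match returns the full match followed by the captured groups
  print (match re (B.pack "foo=123") [])        -- Just ["foo=123","123"]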

regex-tre

This wraps the TRE c-library from http://laurikari.net/tre/ which provides Posix regular expressions. The current 0.7.5 version of TRE is buggy, however, so you should avoid this library unless you test your regular expressions to ensure you avoid the bugs.

The author of TRE is currently working to fix these bugs.

Also, the Posix compliance in 0.7.5 fails for *-operators with ambiguous captures.

regex-dfa

This is the only LGPL backend, as it is derived from CTKLight. The stable version has had a bad bug on *-operators around patterns that could accept zero characters. The unstable version will have this issue fixed.

This library provides no submatch captures, but is very fast at finding the Posix leftmost longest match.

regex-parsec

This backend can either find the left-biased match like Perl or the longest match like Posix. It uses Parsec as a backtracking matching engine and is quite slow.

The submatches returned in longest match mode maximize the length of the captured texts instead of the subpatterns, and this is a divergence from the Posix specification.

Documentation

Coming soonish. There is also a great tutorial on using (=~) and (=~~) at http://www.serpentine.com/blog/2007/02/27/a-haskell-regular-expression-tutorial/.
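
In short (a sketch, assuming the regex-tdfa backend): (=~) returns an "empty" value when there is no match, while (=~~) runs in any Monad and fails there instead.

import Text.Regex.TDFA ((=~), (=~~))

main :: IO ()
main = do
  print ("foo=123"   =~  "[0-9]+" :: String)        -- "123"
  print ("foo=123"   =~~ "[0-9]+" :: Maybe String)  -- Just "123"
  print ("no digits" =~~ "[0-9]+" :: Maybe String)  -- Nothing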

A commentary on regex-posix bugs has been started at Regex_Posix in support of the regex-posix-unittest package.

A commentary on the design and internals of such Posix engines has been started at Regex_TDFA mainly describing the regex-tdfa package.

Link to article on DFA

There is an article by Russ Cox on Thompson NFAs (http://swtch.com/~rsc/regexp/regexp1.html) which presents the automaton-based algorithm and shows it to be faster than backtracking (the method used in Perl, Ruby, and Python).

The original version of the above article is slightly incomplete advocacy. I will explain in the (apple|orange) section below. ChrisKuklewicz 15:56, 30 January 2007 (UTC)

(apple|orange)

As Haskell is all about making data more strongly typed, I want to make the point that there are two popular Types of regular expressions in existence. Just as not all Strings are the same Type (some may be escaped or encoded versions, or hold data printed in different formats), a regular expression like "a?a" means two different things depending on its actual Type.

And this difference is very close to another thing that Haskell and functional programming emphasize: declarative programming (what) as opposed to imperative programming (how).

I will call the two Types of regular expressions Posix and Perl.

Posix Regular Expressions
This is the declarative approach to regular expressions. The correct match of a regexp to a string of text starts at the leftmost possible position; there are no valid matches that can start before it. Of the possible matches that start at this leftmost position, the one that matches the longest substring is the correct match.
How this match is found is immaterial to this definition of the correct match.
Here and for the rest of this page I mean Posix to be modern "Posix Extended Regular Expressions" and never the older "Posix Basic Regular Expressions".
Perl Regular Expressions
This is the imperative approach to regular expressions. The correct match of a regexp to a string of text starts at the leftmost possible position; there are no valid matches that can start before it. To choose the correct match that starts at this position, match parts of the regexp against the text until the first match is found. Specifically, you must try left branches of '|' before right branches, and treat '?', '*' and '+' as greedy. Greedy means matching as many iterations as possible, backing off the number of repetitions only if no complete match is possible.
The first match found may not be the longest, and it may not be the shortest. It is the left-biased choice.
This definition of a correct match is identical to a description of how a backtracking implementation would operate.

To find a Perl match you usually do what Perl (and Python, and Java, etc.) do, which is to try the left branches and greedy repetition first, then backtrack until the first answer is found. The number of paths is the product of the alternatives, so "a?" repeated n times has 2^n paths.

To find a Posix match you must try all possible matches from a starting point and pick the longest, moving on to the next starting point if there was no match. If you must backtrack and check each path separately this will take exponential time. To avoid this you construct an automaton (NFA or DFA), and then you can check "a?" repeated n times in polynomial time (I think it is O(n^2) for an NFA, O(n) for a DFA). You don't have to use an automaton; you can write a (slow) Posix engine using backtracking.


Perl has an obvious and easy to implement "right-bias" variant that is the mirror image. And Posix has an obvious and easy to implement "shortest match".

In Posix, declaring you want to match "a|b" is always exactly the same as "b|a", whereas in Perl these can be very different instructions.
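
To illustrate (a sketch, assuming the (=~) operators of regex-tdfa for Posix semantics and regex-pcre for Perl semantics):

import Text.Regex.TDFA ((=~))

-- Posix (regex-tdfa): both alternation orders give the leftmost-longest match.
posix1, posix2 :: String
posix1 = "ab" =~ "a|ab"   -- "ab"
posix2 = "ab" =~ "ab|a"   -- "ab"

-- With regex-pcre's (=~) the order matters: "a|ab" gives "a" (the left branch
-- is tried first), while "ab|a" gives "ab".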

When writing a Perl regexp you may want more imperative control. So there are lazy variants '??' and '+?' and '*?' which are non-greedy and try the fewest number of repetitions first. And at least Java adds possessive variants '?+' and '++' and '*+' that try the largest number of iterations that locally match, but will not backtrack to fewer repetitions, thus removing many possible paths from the search space. And Perl introduced lookahead assertions, where the engine checks whether the future text at a point matches or fails to match a nested regular expression. After a while these extensions comprise a whole imperative parsing language expressed in the form of a single complicated string. So, to make it easier to read, Perl introduced comments and a whitespace-ignoring mode that let you write better documented regular expressions. And it lets you (especially in Perl 6) embed fragments of Perl into the regular expressions. So Perl regexps give you a tremendous amount of power in how to find the match, which translates into power (for the expert) in defining what the match will be.

With Perl there is a unique first match, so knowing what the parenthesized subgroups matched is easy. In Posix there can be several ambiguous ways to match the same longest answer. Two examples:

  • "12" =~ "(..)|(.)(.)" could match
    • \0 = "12", \1 = "12", \1 = no match, \2 = no match
    • \0 = "12", \1 = no match, \1="1", \2 = "2"
  • "12" =~ "(.?)*" could match
    • \0 = "12", \1 = "" (empty match)
    • \0 = "12", \1 = "2"

And the Posix standard requires a left bias in the first example and the non-empty final iteration in the second example. So "a|b" and "b|a" are distinguishable if there are ambiguous ways to match and these ambiguities affect the captured subgroups.

About the paper in the above section: It never mentions the difference between longest versus leftmost meaning of regular expressions. Replacing the Perl engine with a Posix DFA will simply break many carefully crafted regular expressions. I do not know if the engine Russ Cox touts that was written 20 years ago by Pike implements Posix or Perl semantics. If it finds the longest match (Posix) then it is no help in replacing the default engine in a language like Perl. If it does efficiently find the left-biased match then it would be possible. The comparison to Ville Laurikari's work is not encouraging in this regard since Ville's system finds the longest match and not the left-biased one.

Bounded space Posix proposal

For further discussion, Chris has posted his bounded space proposal for Posix algorithms. This is also part of a discussion thread on Lambda The Ultimate (http://lambda-the-ultimate.org/node/2064).

Sample benchmark

The benchmark consists of matching "Hello world foo=123 whereas bar=456 Goodbye" against ".*foo=([0-9]+).*bar=([0-9]+).*" (as a lazy ByteString) one million times (along the lines of length . filter matchesPat . replicate 1000000). It was performed on a 2x2GHz P4 machine, compiled with -O2, GHC 6.10.1.
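
A rough reconstruction of the benchmark's shape (a sketch only; the exact driver behind the numbers below is not published, and regex-tdfa is used here, but any backend that supports lazy ByteStrings fits the same pattern):

import Text.Regex.TDFA
import qualified Data.ByteString.Lazy.Char8 as L

main :: IO ()
main = do
  let pat :: Regex
      pat   = makeRegex ".*foo=([0-9]+).*bar=([0-9]+).*"   -- compiled only once
      input = L.pack "Hello world foo=123 whereas bar=456 Goodbye"
  print (length (filter (matchTest pat) (replicate 1000000 input)))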

regex-pcre: 0.02s

pcre-light: 0.02s

regex-tdfa 1.0.0 (no compiler bug): 0.03s

regex-dfa 0.91: 5.4s

regex-posix: 20s

regex-tdfa 1.0.0 (most probably compiler bug): 89s

In all cases, the pattern is compiled only once.

WARNING: The 89s result may be due to a compiler bug that can cause the pattern to be compiled at every match invocation; without that bug, regex-tdfa performs extremely well. Please see ticket http://hackage.haskell.org/trac/ghc/ticket/3059

UNWARNING: The code for regex-tdfa 1.0.0 is being improved. In particular, the slow 89s result is probably real, but it need not be so: the upcoming version runs in less than 0.2 seconds.

Thus, for patterns as simple as this one, it is appropriate to use pcre. For patterns where the Perl-style strategy of backtracking causes exponential backtracking blowup, it is appropriate to use regex-dfa or regex-tdfa.