Difference between revisions of "Regular expressions"

Revision as of 13:48, 21 February 2007

Release of regex-tdfa

Chris Kuklewicz has just released regex-tdfa, (Tagged Deterministic Finite Automata), a new library that works with GHC 6.6. It is POSIX compliant and tested against the ATT tests.

DARCS

Get it at: Version 0.56, "stable" location at

darcs get --partial darcs.haskell.org:/home/darcs/packages/regex-tdfa

The version that will get updated and broken more often is "unstable" at

darcs get --partial darcs.haskell.org:/home/darcs/packages/regex-unstable/regex-tdfa

Original proposal content of this article

I just came across this article on Thompson Non-Finite Automata which presents an alternative implementation (to the one supposedly used in perl, ruby, python) which is MAAANY times faster.

Seing how badly we did in the shootout initially, wouldn't it be grand if "we" implemented this algorithm and could then outrun every other language WRT regular expressions, with a native haskell library...

The above article is slightly incomplete advocacy. I will explain in the (apples|oranges) section below ChrisKuklewicz 15:56, 30 January 2007 (UTC)

(apple|orange)

As Haskell is all about making data more strongly typed, I want to make the point that there are two popular Types of regular expressions in existence. Just as all String's are not the same Type, such as some may be escaped or encoded versions or hold data printed in different formats, a regular expressions like "a?a" means two different things depending on its actual Type.

And this difference is very close to another thing that Haskell and functional programming emphasize : declarative programming (what) as opposed to imperative programming (how).

I will call the two Types of regular expressions Posix and Perl.

Posix Regular Expressions: This is the declarative approach to regular expressions. The correct match of a regexp to a string of text starts at the leftmost possible position; there are no valid matches that can start before it. Of the possible matches that start at this leftmost position, the one that matches the longest substring is the correct match.; How this match is found is immaterial to this definition of the correct match.; Here and for the rest for the rest of this page I mean Posix to be modern "Posix Extended Regular Expressions" and never the older "Posix Basic Regular Expressions".

Perl Regular Expressions: This is the imperative approach to regular expressions. The correct match of a regexp to a string of text starts at the leftmost possible position; there are no valid matches that can start before it. To choose the correct match that starts at this position match parts of the regexp against the text until the first match is found. Specifically you must try left branches of '|' before right branches, and treat '?' '*' and '+' as greedy. Greedy means to match as many iterations, and only backing off the number of repetitions if no complete match is possible.; The first match found may not be the longest, and it may not be the shortest. It is the left-biased choice.; This definition of a correct match is identical description of how a backtracking implementation would operate.

To find a Perl match you usually do what Perl (and Python, and Java, etc.) do, which is try the left branches and greedy repetition first, then backtrack until the first answer is found. The number of path is the multiplication of alternatives. So "a?" repeated 'n' times has 2ⁿ paths.

To find a Posix match you must try all possible matches from a starting point and pick the longest, moving on the next starting point if there was not match. If you must backtrack and check each path separately this will take exponential time. To avoid this you construct an automata (NFA or DFA) and then you can check "a?" repeated n times in polynomial time (I think it is O(n²) for NFA, O(n) for DFA). You don't have to use an automata; you can write a (slow) Posix engine using backtracking.

Perl has an obvious and easy to implement "right-bias" variant that is the mirror image. And Posix has an obvious and easy to implement "shortest match".

In Posix, declaring you want to match "a|b" is always exactly the same as "b|a". Where in Perl these can be very different instructions.

When writing a Perl regexp you may want more imperative control. So there are lazy variants '??' and '+?' and '*?' which are non-greedy and try the fewest number of repetitions first. And at least Java add posessive variants '?+' and '++' and '*+' that try the largest number of iterations that locally match, but will not backtrack to fewer repetitions; thus removing many possible paths to try from the search space. And Perl introduced lookahead assertions, where the engine checks if the future text a point matches or fails to match a nested regular expression. After a while these extensions comprise a whole imperative parsing language expressed in the form a single complicated string. So to make it easier to read Perl introduced comments and a whitespace ignoring mode that let you write better documented regular expressions. And it lets you (especially in Perl 6) embed fragments of Perl into the regular expressions. So Perl regexps give you a tremendous amount of power in how to find the match, which translates into power (for the expert) in defining what the match will be.

With Perl there is a unique first match, so knowing what the parenthesized subgroups matched is easy. In Posix there can several ambiguous ways to match the same longest answer. Two examples:

"12" =~ "(..)|(.)(.)" could match
- \0 = "12", \1 = "12", \1 = no match, \2 = no match
- \0 = "12", \1 = no match, \1="1", \2 = "2"
"12" =~ "(.?)*" could match
- \0 = "12", \1 = "" (empty match)
- \0 = "12", \1 = "2"

And the Posix standard requires a left bias in the first example and the non-empty final iteration in the second example. So "a|b" and "b|a" are distinguishable if there are ambiguous ways to match and these ambiguities affect the captured subgroups.

About the paper in the above section: It never mentions the difference between longest versus leftmost meaning of regular expressions. Replacing the Perl engine with a Posix DFA will simply break many carefully crafted regular expressions. I do not know if the engine Russ Cox touts that was written 20 years ago by Pike implements Posix or Perl semantics. If it finds the longest match (Posix) then it is no help in replacing the default engine in a language like Perl. If it does efficiently find the left-biased match then it would be possible. The comparison to Ville Laurikari's work is not encouraging in this regard since Ville's system finds the longest match and not the left-biased one.

Bounded Space Posix proposal

For further discussions, I am posting my bounded space proposal for Posix algorithms at [RegexpDesign]. This is part of a discussion at Lambda The Ultimate.

@@ Line 63: / Line 63: @@
 About the paper in the above section: It never mentions the difference between longest versus leftmost meaning of regular expressions.  Replacing the Perl engine with a Posix DFA will simply break many carefully crafted regular expressions.  I do not know if the engine Russ Cox touts that was written 20 years ago by Pike implements Posix or Perl semantics.  If it finds the longest match (Posix) then it is no help in replacing the default engine in a language like Perl.  If it does efficiently find the left-biased match then it would be possible.  The comparison to Ville Laurikari's work is not encouraging in this regard since Ville's system finds the longest match and not the left-biased one.
+== Bounded Space Posix proposal ==
+For further discussions, I am posting my bounded space proposal for Posix algorithms at [RegexpDesign].  This is part of a discussion at [http://lambda-the-ultimate.org/node/2064 Lambda The Ultimate].