== (apple|orange) ==

As Haskell is all about making data more strongly typed, I want to make the point that there are two popular Types of regular expressions in existence. Just as not all Strings are the same Type (some may be escaped or encoded versions, or hold data printed in different formats), a regular expression like "a?a" means two different things depending on its actual Type. And this difference is very close to another thing that Haskell and functional programming emphasize: declarative programming (what) as opposed to imperative programming (how). I will call the two Types of regular expressions ''Posix'' and ''Perl''.

;Posix Regular Expressions
:This is the declarative approach to regular expressions. The correct match of a regexp to a string of text starts at the leftmost possible position; there are no valid matches that can start before it. Of the possible matches that start at this leftmost position, the one that matches the longest substring is the correct match.
:How this match is found is immaterial to this definition of the correct match.
:Here and for the rest of this page I mean Posix to be the modern "Posix Extended Regular Expressions" and never the older "Posix Basic Regular Expressions".

;Perl Regular Expressions
:This is the imperative approach to regular expressions. The correct match of a regexp to a string of text starts at the leftmost possible position; there are no valid matches that can start before it. To choose the correct match that starts at this position, match parts of the regexp against the text until the first match is found. Specifically, you must try left branches of '|' before right branches, and treat '?', '*' and '+' as greedy. Greedy means matching as many iterations as possible, backing off to fewer repetitions only if no complete match is possible.
:The first match found may not be the longest, and it may not be the shortest. It is the left-biased choice.
:This definition of the correct match is identical to a description of how a backtracking implementation operates.

To find a Perl match you usually do what Perl (and Python, and Java, etc.) do, which is to try the left branches and greedy repetition first, then backtrack until the first answer is found. The number of paths is the product of the number of alternatives at each choice point, so "a?" repeated n times has 2<sup>n</sup> paths.

To find a Posix match you must try all possible matches from a starting point and pick the longest, moving on to the next starting point if there was no match. If you backtrack and check each path separately this takes exponential time. To avoid this you construct an automaton (NFA or DFA), and then you can check "a?" repeated n times in polynomial time (I think it is O(n<sup>2</sup>) for an NFA, O(n) for a DFA). You don't have to use an automaton; you can write a (slow) Posix engine using backtracking.

Perl has an obvious and easy to implement "right-bias" variant that is the mirror image. And Posix has an obvious and easy to implement "shortest match" variant.

In Posix, declaring that you want to match "a|b" is always exactly the same as declaring "b|a", whereas in Perl these can be very different instructions, as the sketch below illustrates.
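To make the contrast concrete, here is a minimal sketch using the regex-posix and regex-pcre packages (both expose the same (=~) interface through regex-base). The choice of packages and the annotated results are my assumptions about how the two kinds of engine behave on this input, not output quoted from a run.

<haskell>
-- Sketch: the same pattern under the two semantics.
-- Assumes the regex-posix and regex-pcre packages are installed.
import qualified Text.Regex.Posix as Posix
import qualified Text.Regex.PCRE  as PCRE

main :: IO ()
main = do
  -- Posix: leftmost-longest, so the longer branch wins
  -- even though it is written second.
  print (("ab" Posix.=~ "a|ab") :: String)  -- expected "ab"
  -- Perl-style (PCRE): left-biased, the first branch that matches wins.
  print (("ab" PCRE.=~  "a|ab") :: String)  -- expected "a"
</haskell>

If you have only one of the two packages installed, running just the corresponding line is still enough to see which semantics that engine follows.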
When writing a Perl regexp you may want more imperative control. So there are ''lazy'' variants '??', '+?' and '*?' which are non-greedy and try the fewest number of repetitions first. And at least Java adds ''possessive'' variants '?+', '++' and '*+' that try the largest number of iterations that locally match, but will not backtrack to fewer repetitions, thus removing many possible paths from the search space. And Perl introduced ''lookahead'' assertions, where the engine checks whether the text ahead of a point matches, or fails to match, a nested regular expression. After a while these extensions comprise a whole imperative parsing language expressed in the form of a single complicated string. So to make it easier to read, Perl introduced comments and a whitespace-ignoring mode that let you write better documented regular expressions. And it lets you (especially in Perl 6) embed fragments of Perl into the regular expressions. So Perl regexps give you a tremendous amount of power in how to find the match, which translates into power (for the expert) in defining what the match will be.

With Perl there is a unique first match, so knowing what the parenthesized subgroups matched is easy. In Posix there can be several ambiguous ways to match the same longest answer. Two examples:
* "12" =~ "(..)|(.)(.)" could match
** \0 = "12", \1 = "12", \2 = no match, \3 = no match
** \0 = "12", \1 = no match, \2 = "1", \3 = "2"
* "12" =~ "(.?)*" could match
** \0 = "12", \1 = "" (empty match)
** \0 = "12", \1 = "2"
The Posix standard requires a left bias in the first example and the non-empty final iteration in the second example. So "a|b" and "b|a" are distinguishable if there are ambiguous ways to match and these ambiguities affect the captured subgroups (the sketch at the end of this section shows one way to check the first example).

About the paper in the above section: it never mentions the difference between the longest versus leftmost meaning of regular expressions. Replacing the Perl engine with a Posix DFA will simply break many carefully crafted regular expressions. I do not know whether the engine Russ Cox touts, which was written 20 years ago by Pike, implements Posix or Perl semantics. If it finds the longest match (Posix) then it is no help in replacing the default engine in a language like Perl. If it does efficiently find the left-biased match then it would be possible. The comparison to Ville Laurikari's work is not encouraging in this regard, since Ville's system finds the longest match and not the left-biased one.
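Returning to the first ambiguous example above, here is a small sketch of how to inspect which of the two capture assignments a Posix engine reports, using getAllTextSubmatches from regex-base (re-exported by Text.Regex.Posix). The expected output is my reading of the left-bias rule described above, not a verified result.

<haskell>
-- Sketch: inspect the captured subgroups that regex-posix reports
-- for the ambiguous pattern from the first example above.
import Text.Regex.Posix

main :: IO ()
main =
  -- Index 0 is the whole match, indices 1..3 are the three groups.
  -- Left bias should give ["12","12","",""]: group 1 takes "12",
  -- groups 2 and 3 are unmatched and come back as empty strings.
  print (getAllTextSubmatches ("12" =~ "(..)|(.)(.)") :: [String])
</haskell>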