Regex Posix

From HaskellWiki
Revision as of 15:53, 2 February 2009 by ChrisKuklewicz (talk | contribs)
Jump to navigation Jump to search
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.

regex-posix bugs

The regular expessions provided by the GHC bundle (up to 6.10.1) are through a wrapping of the operating system's native C library. The C library calls in "regex.h" have been imported via FFI and wrapped by regex-posix, either the older 0.72.* version or the newer version (> 0.96.*).

Unfortunately the native platforms provide Posix regular experssions support that contains bugs and/or violates the specification. This is expecially true for the GNU C library (GLIBC) used by Linux distributions and the BSD C library used by OS X and FreeBSD and NetBSD (XBSD). More discussion of the standard is available from Glenn Fowler at AT&T.

regex-posix-unittest

Many of these unittests come from testregex by AT&T, the rest from the author of regex-tdfa (ChrisKuklewicz).

These problems are documented here, but examples of the bugs are testable on your system by way of the regex-posix-unittest package on Hackage. This package can be installed either as "--user" or "--global", and provides three things:

  1. The "regex-posix-unittest" binary executable
  2. The "manifest.txt" file which lists files to load test from
  3. A set of ".txt" files containing one unit test per line

Running the regex-posix-unittest program produces output for the passing or failing of each test in each file listed in "manifest.txt", which is shipped listing all the test files. This is followed a summary list of failing test id numbers.

Each unit test line has four fields, sepatated by whitespace (one tab in the above):

  1. The test identification Int, negative if this is expected to fail
  2. The regular expression pattern (extended regular expression)
  3. The text to search (no whitespace) # The expected result
    1. NOMATCH if the matching should fail ## "(n,m)(..)" if the match succeeds
      • Each pair of numbers denotes a substring of the text to search as two 0-based indices
      • The length of the substring is (m-n)
      • The first pair is the whole match, futher pairs are for parenthesized subexpression captures
      • Subexpressions that do not match are lists as "(?,?)" instead of with numbers

You can add and delete but please do not change existing tests.

If your platform is not mentioned in this wiki page (see below) then please either add it and your summary results or email the results to "TestRegexLazy -at- mightyreason -dot- com".

Types of failure

There are four classes of failures:

  1. Critical failures that find the wrong whole match (XBSD as of January 2009)
  2. Serious failures that find impossible subexpressions (XBSD)
  3. Serious failures that violate well-formed subexpressions (GLIBC)
  4. Failure to choose Posix captures when the answer is ambiguous (GLIBC and XBSD)

Example of an ambigious situation: Matching "a" with the the pattern "(.)?(a)?" or the pattern "(.)|(a)". Each pattern can use either "(.)" or "(a)" to match the "a', but not both; thus only one of the two subexpressions must succeed and the other must fail. In this case the right answer is that "(.)" succeeds in both cases, so the full answer is (0,1)(0,1)(?,?). The first pattern uses the right associativity rule for concatenation and the second pattern uses the leftmost subpattern bias.

Results and Bugs

OS X, FreeBSD, NetBSD

On a G4 version of OS X 10.5.6 I see:

("/Users/chrisk/local/share/regex-posix-unittest-1.0/basic3.txt",[])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/class.txt",[3,9])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/left-assoc.txt",[])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/right-assoc.txt",[])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/forced-assoc.txt",[5,6,7,8,15,16,21,22])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/nullsub3.txt",[50,51])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/repetition2.txt",[101,102,103,104,105,106,107,
110,111,112,113,114,115,116,117,260,261,262,263,266,267,268,270,271])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/totest.txt",[5,30,33,209])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/osx-bsd-critical.txt",[1,-1,2,-2,3,-3,14,-14])

The "osx-bsd-critical.txt" unexpected failures are:

############################# Unexpected Fail # 1 #############################

Searched text: "ab"
Regex pattern: "(()|.)(b)"
Expected output: "(0,2)(0,1)(-1,-1)(1,2)"
Actual result  : "(1,2)(1,1)(1,1)(1,2)"

############################# Unexpected Fail # 2 #############################

Searched text: "ab"
Regex pattern: "(()|[ab])(b)"
Expected output: "(0,2)(0,1)(-1,-1)(1,2)"
Actual result  : "(1,2)(1,1)(1,1)(1,2)"

############################# Unexpected Fail # 3 #############################

Searched text: "aaab"
Regex pattern: "(()|[ab])+b"
Expected output: "(0,4)(2,3)(-1,-1)"
Actual result  : "(3,4)(3,3)(3,3)"

############################# Unexpected Fail # 14 #############################

Searched text: "aaab"
Regex pattern: "([ab]|())+b"
Expected output: "(0,4)(2,3)(-1,-1)"
Actual result  : "(0,4)(3,3)(3,3)"

Tests 1 and 2 should match the whole (0,2) or "ab" but only match the (1,2) of "b". Test 3 should match the (0,3) of "aabb" but only matches the (3,4) of "b". These are critical failures to match the whole text properly. This problem was found by QuickCheck and reported to OS X and FreeBSD and NetBSD (the latter two checked over IRC). OpenBSD reportedly did not show this bug (anecdote on IRC).

This also infects sed, and can be seen as

prompt$ echo "ab" | sed -En 's/(()|.)(b)/[&][\1]/p'
a[b][]

instead of the correct "[ab][a]" output.

The failure in Test 14 is different. Instead of the whole match being wrong the pattern being repeated by the "+" operator is matched 3 times against "a" and "a" and "a", followed by a fourth empty match at (3,3) before the final "b". Once the repeated pattern "[ab]|()" has matched a non-empty string it should not be used to match an empty-string.

OS X also gets this wrong in "totext.txt" #209:

Searched text: "yyyyyy"
Regex pattern: "(yyy|(x?)){2,4}"
Expected output: "(0,6)(3,6)(-1,-1)"
Actual result  : "(0,6)(6,6)(6,6)"

where the (x?) matches the empty string at (6,6) instead of stopping after 2 repetitions of "yyy".

But OS X is getting this right on numerous tests in "nullsub3.txt" such as

Expected Pass #8
text and pattern: "a"
Regex pattern: "(a*)*"
Outputs agree: "(0,1)(0,1)"

Expected Pass #47
text and pattern: "ax"
Regex pattern: "(a*)*(x)"
Outputs agree: "(0,2)(0,1)(1,2)"

Where (a*) matches "a" and is reported as (0,1) and the no additional repetition matches the empty string at (1,1).

The OS X failure in "totest.txt" #33 illustrates the "impossible" capture:

Searched text: "ababa"
Regex pattern: "(aba|ab|a)*"
Expected output: "(0,5)(2,5)"
Actual result  : "(0,5)(0,3)"

The whole match is (0,5) and the correct repetition has "ab" then "aba" matching, so the correct last repetition is the "aba" between indexes (2,5). The OS X code recognizes the whole match then does a second pass of greedy repetitons, so the first repetition is "aba" between indexes (0,3). The OS X code then fails to match the "ba" in this second pass but not care. It is logically impossible that the last iteration of the above subexpession ends before the whole match, but OS X reports the last repetition ended at index 3 even though the whole match ends at index 5.

This is also shown in numerous failed tests in "repetition2.txt" such as #260:

Searched text: "ababcd"
Regex pattern: "(a|ab|c|bcd){0,}(d*)"
Expected output: "(0,6)(3,6)(6,6)"
Actual result  : "(0,6)(4,5)(6,6)"

The correct repetitions are "ab" then "a" then "bcd", reported as (3,6). The incorrect greedy pass uses by OS X matches "ab" then "c" as (4,5) and then gives up even though (5,6) is still not matched in the second pass. As a sed test:

prompt$ echo "ababcd" | sed -En 's/(a|ab|c|bcd)*/[&][\1]/p'
[ababcd][c]

The above output should be impossible; the correct output is [ababcd][bcd].

The "forced-assoc.txt" failures show that one cannout use parenthesis to force the naturally right associative concatenation to be left associative. This is a violation of the Posix standard, and of the "man 7 re_format" page form OS X 10.5.6: "Subexpressions also match the longest possible substrings, subject to the constraint that the whole match be as long as possible, with subexpressions starting earlier in the RE taking priority over ones starting later. Note that higher-level subexpressions thus take priority over their lower-level component subexpressions". Take #15 as an example:

Searched text: "abc"
Regex pattern: "((a*)(b|abc))(c*)"
Expected output: "(0,3)(0,3)(0,0)(0,3)(3,3)"
Actual result  : "(0,3)(0,2)(0,1)(1,2)(2,3)"

Here ((a*)(b|abc))(c*) should be parsed as ((a*)(b|abc)) followed by (c*). As ((a*)(b|abc)) started earlier it should be as long as possible, at the expense of (c*). The longest match for this is the correct (0,3) not the incorrect (0,2) returned by OS X. And ((a*)(b|abc)) is a higher-level than its lower-level component (a*), so (a*) should match with the side constraint that ((a*)(b|abc)) matches (0,3), which means (a*) should match (0,0) instead of the longer (0,1). OS X gets this wrong.

GLIBC

to be posted