Regular expressions/Bounded space proposal

From HaskellWiki
< Regular expressions
Revision as of 18:23, 21 February 2007 by BrettGiles (talk | contribs) (RegexpDesign moved to Regular expressions/Bounded space proposal)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

I will try to summarize an abstract and unoptimized O(orbits) algorithm below. I have not written it down this way before.

Definitions and Properties:

(*)-operators have a tag before and after their whole usage. The leading tag is is of the Minimize kind and the trailing is of the Maximize kind.

(*)-operators have a unique numeric identifier that serves as a tag of the Orbit kind.

(*)-operators have subpatterns which may or may not contain submatches and other *-operators.

The tags values are compared in Minimize,Maximize,Orbit order before any tags in the subpattern.

The subpattern will be iterated 0 or more times. Each iteration is called an orbit.

We will take it that the subpattern may only match 0 characters if that is the first and only orbit.

The data for a *-operator is either Nothing or consists of a boolean flag "inOrbit" and a sequence of positions.

Initially the *-operator data is Nothing.

When starting each orbit/iteration:

   If the data is Nothing or the inOrbit flag is False
     then set the data to inOrbit True and
          a singleton list of the current offset.
   Otherwise the data has inOrbit flag of True and
     set the data to inOrbit True and append the
     current offset to the list of positions.

When leaving the last orbit/iteration:

   (I assert the data is not Nothing and inOrbit is True)
   Append the offset to the list of positions.
   Set the inOrbit flag to False.

When comparing two histories at the tag position of Orbit kind:

   (I assert that they are either both Nothing or the inOrbit flags will agree)
   compare each offset in their sequences in ascending order:
     If equal, go on to the next offsets.
     If unequal, then the one with the larger offset is the winner.
     If one sequence runs of out offsets before the other
       then the one with the shorter sequence is the winner.
     If the sequences run out simultaneously,
       then the orbit data is equal.

If the two histories compare as equal, then the ambiguity has to be resolved by further tag data which presumably will break the tie depending on what happened in the last iteration of the subpattern. Questions about previous iterations of the subpattern cannot affect the returned submatches.

Simple Optimizations (used in my code):

The first entry in the orbit offset sequences will always be the same offset as the Minimize tag. The Orbit tag will only be compared if the earlier Minimize tags agree. Thus storing and comparing this first offset again in the orbital data is duplication. Thus we can change the above instruction to create "a singleton list of the current offset" to create "an empty list".

When inOrbit is False it means the *-operator is finished and the Maximize tag for this whole group has also been set. The Maximize tag is compared before the orbit data, so we already know that the final offsets put into the sequence of offsets in the orbit history match. Thus when leaving the last orbit there is no need to "Append the offset to the list of positions" since this will not never help disambiguate the histories. WARNING: If using O(compressed) algorithm below than I think this optimization cannot be performed.

When starting each orbit any previous potential submatches from the subpattern will not be returned. Thus there is no longer any need to disambiguate the previous traversal of *-operators contained in the subpattern. So we can eagerly perform a (ResetOrbit) operation to set the data of all contained *-operators to Nothing at the start of each orbit. This frees space, and means we will not waste any time comparing irrelevant data.

Now to consider compressing the data to save space. Invariants during matching:

There will be one orbit history for a given *-operator for each NFA state.

The orbit data is kept so that in the future it can be compared to determine the preferred history in the event that two current NFA histories might eventually transition into the same NFA state.

If and when orbit histories are compared in the future then it is certain that the histories started at identical offsets.

All orbit histories with starting offsets in the past have already been started, so any future comparison of orbit histories with starting offsets in the past will be comparing direct descendants of the current histories. NOTE: This is less obvious with my tagged NFA than with yours dues to design differences I won't go into here.

The O(compressed) algorithm:

Change the orbit data to augment each orbital history with a natural number called Order which is initialized to 0.

Between steps of the NFA/DFA it is possible to pre-compare the orbit histories:

 For each *-operator:
   Find all the orbit data for a *-operator.
   Group these be their starting offset.
   For each group with a common starting offset:
     (I assert all members of the group have Order of 0 or all have non-zero Order)
     (I do NOT assert than inOrbit values will all agree)
     Sort the histories by Order and history:
        First sort by Order values, smaller values before larger ones.
        For identical Order values, sort by comparing the histories as they exist
          where winning entries are sorted before losing entries.
        Re-assign the Order values of the sorted data in ascending order starting with 1
        Ties are allowed in the sorting and must be assigned identical order values.
        Erase the recorded sequence of orbit offsets, leaving them empty.

This pre-comparison need not be done between every step of the NFA/DFA. This allows for a controlled time/space trade-off.

Once two histories have been assigned different Order numbers their descendants will always be sorted into this relative order.

If *-operator data at a given initial offset only exist in one NFA state then the compression procedure merely assigns Order to 1 and erases the orbit data.

Consider when two Orbit histories are compared when two NFA states transition to the same NFA state. The Order value can be compared first, with the smaller value the winner. If the Order values are identical then any orbit data that was not compressed can then be compared. If the compression procedure is correct then the resulting winner (or tie) is the same as if there had been no compression step.

In my head it is clear that the compression is correct in this sense. And it is clear that after compression that the orbit history contain an amount of data bounded by the size of the boolean inOrbit and the natural number Order and the empty list of positions. And it is clear that the number of orbit histories possible is the product of the number of *-operators and the number of NFA states.

The design above is an attempt at clarity, but I have implemented the compression, and I expect better designs exist.

If O(orbits) fails to produce Posix semantics then this failing might take a few forms:

 (a) Enough data is being collected but I discard old data when it needs to be retained
 (b) Enough data is being collected and retained but I am not using the data correctly
 (c) More data than the orbit offsets and the inOrbit flag is needed.

If O(orbits) is correct but then O(compressed) fails then there is some mistake:

 (i) New orbit data is later created with Order 0 and then sorted with non-zero Order data
 (ii) Future orbit offsets can do more than break ties in the current sorted histories
 (iii) Comparing future offsets requires not erasing the current list of offsets
 (iv) <insert other logical error here>

Now I assume that I have reinvented these algorithms. There are 22 Posix implementations in the comparison at Glen Fowler's site at AT&T and 18 of them are listed as getting the Posix *-operator disambiguation semantics correct. But as this is my hobby I decided that I preferred to try and reinvent this wheel.

ChrisKuklewicz 13:50, 21 February 2007 (UTC)