Shootout/Regex DNA

This is a Shootout Entry for http://shootout.alioth.debian.org/benchmark.php?test=regexdna&lang=all.

Todo: this entry should be replaced by the new PCRE Text.Regex.ByteString module.

About the regex-dna (new) benchmark

Each program should:

  • read all of a redirected FASTA format file from stdin, and record the sequence length
  • use the same simple regex pattern match-replace to remove FASTA sequence descriptions and all linefeed characters, and record the sequence length
  • use the same simple regex patterns, representing DNA 8-mers and their reverse complement (with a wildcard in one position), and (one pattern at a time) count matches in the redirected file
  • write the regex pattern and count
  • use the same simple regex patterns to make IUB code alternatives explicit, and (one pattern at a time) match-replace the pattern in the redirected file, and record the sequence length
  • write the 3 recorded sequence lengths

We use FASTA files generated by the fasta benchmark as input for this benchmark. Note: the file may include both lowercase and uppercase codes.

Correct output for this 100KB input file (generated with the fasta program, N = 10000) is:

   agggtaaa|tttaccct 0
   [cgt]gggtaaa|tttaccc[acg] 3
   a[act]ggtaaa|tttacc[agt]t 9
   ag[act]gtaaa|tttac[agt]ct 8
   agg[act]taaa|ttta[agt]cct 10
   aggg[acg]aaa|ttt[cgt]ccct 3
   agggt[cgt]aa|tt[acg]accct 4
   agggta[cgt]a|t[acg]taccct 3
   agggtaa[cgt]|[acg]ttaccct 5
   101745
   100000
   133640
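
For orientation, the whole task can be sketched naively with the plain String Text.Regex API (regex-compat). This is far too slow to compete, and only the first variant and two IUB codes are shown, but it makes the required input/output shape concrete:

import Text.Regex (Regex, mkRegex, matchRegexAll, subRegex)

-- Count non-overlapping matches by re-matching on the remainder of the input.
countMatches :: Regex -> String -> Int
countMatches re = go 0
  where
    go n s = case matchRegexAll re s of
               Nothing              -> n
               Just (_, _, rest, _) -> let n' = n + 1 in n' `seq` go n' rest

main :: IO ()
main = do
  file <- getContents
  let l0    = length file
      clean = subRegex (mkRegex ">[^\n]*\n|\n") file ""   -- drop descriptions and newlines
      l1    = length clean
      pats  = ["agggtaaa|tttaccct"]                       -- ...plus the other eight variants
  mapM_ (\p -> putStrLn (p ++ " " ++ show (countMatches (mkRegex p) clean))) pats
  let expanded = foldl (\s (code,new) -> subRegex (mkRegex code) s new) clean
                       [("B","(c|g|t)"), ("D","(a|g|t)")] -- ...plus the rest of the IUB table
  putStrLn ""
  mapM_ print [l0, l1, length expanded]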

Discussion

Don's Idiomatic Entry seems pretty well baked as the idiomatic GHC entry. For the high-speed GHC #2, we appear to be heading towards a new, full implementation of regular expressions in this entry. I would vote that we step back and implement enough to complete the entry because we've already bent the rules in this entry. One of the aspects that I like about my original non-Parsec entry was that it was fairly short and, IMO, fairly readable. In adding Parsec, I think that some of the comprehensibilitinessous has been lost.

I do think that we've improved the entry a lot! -- AlsonKemp

Is there any way to shorten the fastest entry? I still think that we've lost a lot of comprehensibility. -- AlsonKemp

Note that the Idiomatic Entry has already been submitted -- DonStewart

errmmm... Does the work that's done here suggest that Text.Regex should be replaced? Not sure how to replace it or with which replacement... See also RegexSyntax. -- AlsonKemp

Well, if not replaced, we can at least provide alternatives. I propose:

 * we add a Word8 interface to the existing Text.Regex, so we can avoid marshalling from packed strings and buffers
 * we add the lazy regex combinators to the Text.ParserCombinators suite. I use them all the time, and they're more than a decade old, so they should be standardised. I'll begin by cabalising the library. -- DonStewart
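
To make the marshalling cost behind the first point concrete: with today's String-only Text.Regex, ByteString users have to unpack and repack around every call. A minimal sketch against the current Text.Regex API; subRegexBS is a hypothetical helper name, not an existing function:

import qualified Data.ByteString.Char8 as BC
import Text.Regex (mkRegex, subRegex)

-- Hypothetical wrapper: the BC.unpack/BC.pack round trip copies the whole
-- buffer on every call, which is the overhead a Word8 interface would remove.
subRegexBS :: String -> String -> BC.ByteString -> BC.ByteString
subRegexBS pat repl = BC.pack . (\s -> subRegex (mkRegex pat) s repl) . BC.unpack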

Benchmarks

N=5,000,000, Linux/x86

Old spec

   Name                           Time
   Original                       Timeout
   Functional 1                   Timeout
   Functional 2                   Timeout
   Combination                    Timeout
   Packed entry                   1200s
   Lazy lexers 1                  33.950
   Alson Kemp's parsec            30.682s
   Lazy lexers 2                  28.000
   Modified Alson                 19.950
   Lazy lexers 5 + regex syntax   12.106
   Lazy lexers 4                  11.192
   Lexer 6                        11.000
   Lazy lexers 3 (illegal)        6.300

   perl entry                     4.005s

New spec

   Name       Time
   Parsec 1   126s
   CTK 1      60s

N.B. the syntax highlighting breaks on many of the following programs, resulting in bogus output.

Chris' Bytestring Proposal

I have modified the old entry to use Data.ByteString and modified the matching engine to produce (start,stop) offsets. This runs really fast for me. Oddly, it does not use the 'Sparse' variant of 'Cont', but erasing 'Sparse' from the code makes it perform much, much worse.

I expect there is more performance to be tweaked out, but memory consumption is good, since I see nearly perfect productivity on the full data set (WinXP/Core Duo 2):

agggtaaa|tttaccct 36
[cgt]gggtaaa|tttaccc[acg] 125
a[act]ggtaaa|tttacc[agt]t 426
ag[act]gtaaa|tttac[agt]ct 290
agg[act]taaa|ttta[agt]cct 536
aggg[acg]aaa|ttt[cgt]ccct 153
agggt[cgt]aa|tt[acg]accct 143
agggta[cgt]a|t[acg]taccct 160
agggtaa[cgt]|[acg]ttaccct 219

5083411
5000000
6678892
5,742,128,636 bytes allocated in the heap
 12,691,824 bytes copied during GC (scavenged)
 30,713,012 bytes copied during GC (not scavenged)
  5,248,108 bytes maximum residency (5 sample(s))

      10944 collections in generation 0 (  0.06s)
          5 collections in generation 1 (  0.02s)

         11 Mb total memory in use

  INIT  time    0.02s  (  0.00s elapsed)
  MUT   time    8.36s  (  8.58s elapsed)
  GC    time    0.08s  (  0.09s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    8.45s  (  8.67s elapsed)

  %GC time       0.9%  (1.1% elapsed)

  Alloc rate    685,627,299 bytes per MUT second

  Productivity  98.9% of total user, 96.4% of total elapsed

The code

{-# OPTIONS -funbox-strict-fields -fbang-patterns #-}
--
-- This never uses 'Sparse' but removing it killed performance -- ghc glitch?
--
-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
--
-- http://haskell.org/haskellwiki/Shootout/Regex_DNA
-- Contributed by Don Stewart, Chris Kuklewicz and Alson Kemp.
-- Updated for ByteString by Chris Kuklewicz February, 2007
--
-- Compile with: -O2 -package parsec
--
-- An idiomatic Haskell entry using lazy regex combinators, described in the paper:
--
--  Manuel  M. T. Chakravarty, Lazy Lexing is Fast.
--  In A. Middeldorp and T. Sato, editors, Proceedings of Fourth Fuji
--  International Symposium on Functional and Logic Programming,
--  Springer-Verlag, LNCS 1722, 1999.

module Main(main) where

import Control.Monad (liftM)
import Data.Array (Array, assocs, accumArray,bounds,listArray)
import Data.Array.Base (unsafeAt)
import qualified Data.ByteString as B (ByteString,length,take,drop,index,concat,hGetContents)
import Data.ByteString.Base (c2w)
import qualified Data.ByteString.Char8  as BC (pack)
import qualified Data.ByteString.Lazy as L (length,fromChunks)
import Data.Word (Word8)
import Data.List (sort,nub,(\\))
import qualified Data.Map as M (Map,toList,fromList,findWithDefault)
import System.IO (stdin)
import qualified Text.ParserCombinators.Parsec as P
import Text.ParserCombinators.Parsec ((<|>),(<?>),pzero)

(!) a b = unsafeAt a (fromEnum $ (b - fst (bounds a)))

main = do
  wholeFile <- B.hGetContents stdin
  putStr (work wholeFile)

work fileIn =
    let l0     = show $ B.length fileIn in l0 `seq`
    let clean  = B.concat . delete "\n|>[^\n]+\n" $ fileIn
        l1     = show $ B.length clean in l1 `seq`
    let counts = [ stringRegex++" "++show (count stringRegex clean) |
                   stringRegex <- variants ]  -- Count the variants one at a time.
        l2     = show $ L.length (iubsExpand clean)
    in unlines $ counts ++ [[],l0, l1, l2]

iubsExpand = L.fromChunks
             . foldr1 (.) (map (\snew input -> replace snew input) iubs)
             . (:[])

variants = [    "agggtaaa|tttaccct"    ,"[cgt]gggtaaa|tttaccc[acg]","a[act]ggtaaa|tttacc[agt]t"
           ,"ag[act]gtaaa|tttac[agt]ct","agg[act]taaa|ttta[agt]cct","aggg[acg]aaa|ttt[cgt]ccct"
           ,"agggt[cgt]aa|tt[acg]accct","agggta[cgt]a|t[acg]taccct","agggtaa[cgt]|[acg]ttaccct"]

iubs = map (\(s,new) -> (regex s,BC.pack new)) $
  [("B","(c|g|t)"),("D","(a|g|t)"),("H","(a|c|t)"),("K","(g|t)") ,("M","(a|c)")
  ,("N","(a|c|g|t)"),("R","(a|g)") ,("S","(c|g)"),("V","(a|c|g)"),("W","(a|t)") ,("Y","(c|t)")]

-- EXTERNAL INTERFACE TO REGULAR EXPRESSIONS

regex = (either (error.show) accept) . (\x -> P.parse p_regex x x)
  where accept re = re (Lexer True Done) -- Close a regular expression into a Lexer.

delete s = del where
  r = regex s
  del b = pieces 0 (run r b 0) where
    pieces end [] = B.drop end b : []
    pieces end ((start,stop):rest) = B.take (start-end) (B.drop end b) : pieces stop rest

count s b = runOverlappingCount r b 0
  where r = regex s

replace (r,new) = rep where
  rep [] = []
  rep (b:bs) = pieces 0 (run r b 0) where
    pieces 0 [] = b : rep bs
    pieces end [] | B.length b > end = B.drop end b : rep bs
                  | otherwise = rep bs
    pieces end ((start,stop):rest) | start > end = B.take (start-end) (B.drop end b) : new : pieces stop rest
                                   | otherwise = new : pieces stop rest

run :: Lexer -> B.ByteString -> Int -> [(Int,Int)]
run lexerIn b offsetIn = loop offsetIn
  where
    end = B.length b
    loop offset | offset == end = []
                | otherwise =
      let t@(start,stop) = lexOne b lexerIn offset
      in if start == -1 then  []
           else t : loop stop

runOverlappingCount :: Lexer -> B.ByteString -> Int -> Int
runOverlappingCount lexerIn b offsetIn = loop offsetIn 0
  where
    end = B.length b
    loop !offset !c | offset == end = c
                    | otherwise =
      let start = fst $ lexOne b lexerIn offset
      in if start == -1 then c
           else loop (succ start) (succ c)

----------------------------------------------------------------
-- Construct a regex combinator from a string regex (use Parsec)
-- Designed to follow "man re_format" (Mac OS X 10.4.4)
--
-- The regular expressions accepted by the program include those using
-- |, empty group (), grouping with ( and ), wildcard '.', backslash
-- escaped characters "\.", greedy modifiers ? + and *, bracketed
-- alternatives including ranges such as [a-z0-9] and inverted
-- brackets such as [^]\n-].  Only 7-bit ASCII is accepted.

-- 'p_regex' is the only "exported" function, used by 'regex' above

p_regex = liftM (foldr1 (>|<)) (P.sepBy1 p_branch (P.char '|'))

p_branch = liftM (($ epsilon).(foldr (.) id)) (P.many1 (p_atom >>= p_post_atom))

p_atom = P.try (P.string "()" >> return epsilon)
     <|> P.between (P.char '(') (P.char ')') p_regex
     <|> p_bracket <|> p_dot <|> p_escaped_char <|> p_other_char
     <|> (pzero <?> "cannot parse regexp atom")

p_post_atom atom = (P.char '?' >> return (atom `quest`))
               <|> (P.char '+' >> return (atom `plus`))
               <|> (P.char '*' >> return (atom `star`))
               <|> (return (atom +>))

p_bracket = (P.char '[') >> ( (P.char '^' >> p_set True) <|> (p_set False) )

p_set invert = do initial <- (P.option "" ((P.char ']' >> return "]") <|> (P.char '-' >> return "-")))
                  middle <- P.manyTill P.anyChar (P.char ']')
                  let expand [] = []
                      expand ('-':[]) = "-"
                      expand (a:'-':b:rest) | a /= '-' = (enumFromTo a b)++(expand rest)
                      expand (x:xs) | x /= '-'  = x:(expand xs)
                                    | otherwise = error "A dash is in the wrong place in a p_set"
                      characters = nub ( sort (initial ++ (expand middle)) )
                  return $ if invert then alt ( ['\0'..'\127'] \\ characters )
                                     else alt characters

p_dot = P.char '.' >> return (alt ['\0'..'\127'])

p_escaped_char = P.char '\\' >> liftM char P.anyChar

p_other_char = liftM char (P.noneOf specials) where specials = "^.[$()|*+?\\"

------------------------------------------------------------------------
--  And everything else is the modified CTK library.
--
--  Compiler Toolkit: Self-optimizing lexers
--
--  Author : Manuel M. T. Chakravarty
--  Created: 24 February 95, 2 March 99
--  Copyright (c) [1995..2000] Manuel M. T. Chakravarty
--  Copyright (c) 2004-6 Don Stewart
--
--  Self-optimizing lexer combinators.
--
--  For detailed information, see ``Lazy Lexing is Fast'', Manuel
--  M. T. Chakravarty, in A. Middeldorp and T. Sato, editors, Proceedings of
--  Fourth Fuji International Symposium on Functional and Logic Programming,
--  Springer-Verlag, LNCS 1722, 1999.  (See my Web page for details.)
--
--             http://www.cse.unsw.edu.au/~chak/papers/Cha99.html
--
--  Thanks to Simon L. Peyton Jones and Roman Leshchinskiy for their
--  helpful suggestions that improved the design of this library.

-- INTERNAL IMPLEMENTATION OF REGULAR EXPRESSIONS

-- tree structure used to represent the lexer table
data Lexer = Lexer !Bool !Cont

-- represent the continuation of a lexer
            -- on top of the tree, where entries are dense, we use arrays
data Cont = Dense !BoundsNum !(Array Word8 Lexer)
          -- further down, where the valid entries are sparse, we
          -- use association lists, to save memory
          | Sparse !BoundsNum !(M.Map Word8 Lexer)
          -- end of an automaton
          | Done

type Regexp = Lexer -> Lexer

infixr 4 `quest`, `star`, `plus`
infixl 3 +>

-- Empty lexeme (noop)
epsilon = id :: Regexp

-- One character regexp
char c = (\l -> Lexer False (Dense (B 1 w w) (listArray (w,w) [l])))
  where w = c2w c

-- accepts a non-empty set of alternative characters
-- Equiv. to `(foldr1 (>|<) . map char) cs', but much faster
alt cs  = \l -> let bnds = B (length ws) (minimum ws) (maximum ws)
                in Lexer False (aggregate bnds [(w, l) | w <- ws])
  where ws = map c2w cs

-- accept a character sequence
string cs = (foldr1 (+>) . map char) cs

-- Concatenation of regexps is just concatenation of functions
(+>)  = (.) :: Regexp -> Regexp -> Regexp

-- disjunctive combination of two regexps, corresponding to x|y
re1 >|< re2  = \l -> re1 l >||< re2 l

-- x `quest` y corresponds to the regular expression x?y
quest re1 re2  = (re1 +> re2) >|< re2

-- x `plus` y corresponds to the regular expression x+y
plus re1 re2  = re1 +> (re1 `star` re2)

-- x `star` y corresponds to the regular expression x*y
star re1 re2  = \l -> let self = re1 self >||< re2 l in self

-- Scan forwards searching for a match anywhere at or after
-- startOffset.  Return offsets of first matching character and after
-- last matching character or (-1,-1) for failure
lexOne b lexerIn !startOffset = let stop = oneLexeme lexerIn startOffset (-1)
                                in if stop == -1
                                     then if startOffset < end
                                            then lexOne b lexerIn (succ startOffset)
                                            else (-1,-1)
                                     else (startOffset,stop)
  where
    end = B.length b
    oneLexeme (Lexer win cont) !offset !last =
      let last' = if win then offset else last
      in if offset == end
           then last' -- at end, has to be this action
           else oneChar cont (B.index b offset) (succ offset) last'   -- keep looking

    oneChar tbl !c !offset' !last = case peek tbl c of
      (Lexer win Done)   -> if win then offset' else last
      l'                 -> oneLexeme l' offset' last

    peek (Dense bn arr)  !c | c `inBounds` bn = arr ! c
    peek (Sparse bn cls) !c | c `inBounds` bn = M.findWithDefault (Lexer False Done) c cls
    peek _ _ = (Lexer False Done)

-- disjunctive combination of two lexers (longest match, right biased)
(Lexer win c) >||< (Lexer win' c')  = Lexer (win || win') (joinConts c c')

-- represents the number of (non-error) elements and the bounds of a
-- DFA transition table
data BoundsNum = B !Int !Word8 !Word8

-- combine two bounds
addBoundsNum (B n lc hc) (B n' lc' hc')  = B (n + n') (min lc lc') (max hc hc')

-- check whether a character is in the bounds
inBounds c (B _ lc hc) = c >= lc && c <= hc

-- combine two disjunctive continuations
joinConts Done c'   = c'
joinConts c    Done = c
joinConts c    c'   = let (bn , cls ) = listify c
                          (bn', cls') = listify c'
                      -- note: `addBoundsNum' can, at this point, only
                      --       approx. the number of *non-overlapping* cases;
                      --       however, the bounds are correct
                      in aggregate (addBoundsNum bn bn') (cls ++ cls')
  where listify (Dense  n arr) = (n, assocs arr)
        listify (Sparse n cls) = (n, M.toList cls)

-- we use the dense representation if a table has at least the given
-- number of (non-error) elements
denseMin  = 1 :: Int

-- Note: `n' is only an upper bound of the number of non-overlapping cases
aggregate bn@(B n lc hc) cls
  | n >= denseMin = Dense  bn (accumArray (>||<) (Lexer False Done) (lc, hc) cls)
  | otherwise     = Sparse bn (M.fromList (accum (>||<) cls))

-- combine the elements in the association list that have the same key
accum _ []           = []
accum f ((c, el):ces) = let (ce, ces') = gather c el ces in ce : accum f ces'
  where gather k e []                             = ((k, e), [])
        gather k e (ke'@(k', e'):kes) | k == k'   = gather k (f e e') kes
                                      | otherwise = let (ke'', kes') = gather k e kes
                                                    in (ke'', ke':kes')


Proposed Legal entry : CTK 1

I have hacked and cleaned up the Lazy Lexer 6 to run one-at-a-time to be legal under the new rules. I will submit it after it gets benchmarked here. (Runs about 6 times slower on a G4; still three times faster than the all-parsec entry below) -- ChrisKuklewicz

{-# OPTIONS -funbox-strict-fields #-}
--
-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
--
-- http://haskell.org/hawiki/ShootoutEntry
-- Contributed by Don Stewart, Chris Kuklewicz and Alson Kemp.
--
-- Compile with: -O2 -package parsec
--
-- An idiomatic Haskell entry using lazy regex combinators, described in the paper:
--
--  Manuel  M. T. Chakravarty, Lazy Lexing is Fast.  
--  In A. Middeldorp and T. Sato, editors, Proceedings of Fourth Fuji
--  International Symposium on Functional and Logic Programming,
--  Springer-Verlag, LNCS 1722, 1999.
--
-- For more about higher-order combinator-based lexing and parsing in Haskell consult:
--
--  Graham Hutton. Higher-order functions for parsing. (1992)
--  Journal of Functional Programming 2: 232-343. 
--
--  Jeroen Fokker. Functional Parsers. (1995) 
--  Lecture Notes of the Baastad Spring school on Functional Programming.
--
--  Graham Hutton and Erik Meijer. Monadic Parser Combinators. (1996)
--  Technical report NOTTCS-TR-96-4. Department of Computer Science, University of Nottingham.
--
--  Steve Hill. Combinators for parsing expressions. (1996)
--  Journal of Functional Programming 6(3): 445-463. 
--
--  Andrew Partridge and David Wright. 
--  Predictive parser combinators need four values to report errors. (1996)
--  Journal of Functional Programming 6(2): 355-364.
--
--  Doaitse Swierstra and Luc Duponcheel. 
--  Deterministic, Error-Correcting Combinator Parsers. (1996)
--  Advanced Functional Programming. LNCS 1129: 185-207.
--
--  Pieter Koopman and Rinus Plasmeijer. Efficient Combinator Parsers. (1999)
--  Implementation of Functional Languages. Springer Verlag, LNCS 1595: 122-138.
--
--  Doaitse Swierstra and Pablo Azero. 
--  Fast, Error Correcting Parser Combinators: A Short Tutorial. (1999)
--  SOFSEM'99 Theory and Practice of Informatics. LNCS 1725: 111-129.
--
--  Arthur Baars, Andres Loh, and Doaitse Swierstra. Parsing Permutation Phrases. (2001) 
--  Proceedings of the ACM SIGPLAN Haskell Workshop, 171-183.
--
-- And many other sources.
--
import List (sort,nub,(\\))
import Data.Array (Array, (!), assocs, accumArray)
import Control.Monad (liftM)
import qualified Data.Map as M
import qualified Text.ParserCombinators.Parsec as P
import Text.ParserCombinators.Parsec ((<|>),(<?>),pzero)

main = interact $ \stdin ->
    let l0     = show $ length stdin in l0 `seq` 
    let clean  = fst $ run (reDelete "\n|>[^\n]+\n") stdin 0
        l1     = show $ length clean in l1 `seq` 
    let counts = [ stringRegex++" "++(show . snd . snd $ (run (reCount stringRegex) clean 0)) | 
                   stringRegex <- variants ]  -- Count the variants one at a time.
        l2     = show $ length (iubsExpand clean)
    in unlines $ counts ++ [[],l0, l1, l2]

-- Substitute certain chars for patterns. The replacements are running one at a time.
iubsExpand = foldl1 (.) (map (\snew input -> concat . fst $ run (reReplace snew) input 0) iubs)

variants = [    "agggtaaa|tttaccct"    ,"[cgt]gggtaaa|tttaccc[acg]","a[act]ggtaaa|tttacc[agt]t"
           ,"ag[act]gtaaa|tttac[agt]ct","agg[act]taaa|ttta[agt]cct","aggg[acg]aaa|ttt[cgt]ccct"
           ,"agggt[cgt]aa|tt[acg]accct","agggta[cgt]a|t[acg]taccct","agggtaa[cgt]|[acg]ttaccct"]

iubs = [("B","(c|g|t)"),("D","(a|g|t)"),("H","(a|c|t)"),("K","(g|t)") ,("M","(a|c)")
 ,("N","(a|c|g|t)"),("R","(a|g)") ,("S","(c|g)"),("V","(a|c|g)"),("W","(a|t)") ,("Y","(c|t)")]

----------------------------------------------------------------
-- Construct a regex combinator from a string regex (use Parsec)
-- Designed to follow "man re_format" (Mac OS X 10.4.4)
--
-- regex is the only "exported" function from this section
regex = (either (error.show) id) . (\x -> P.parse p_regex x x)

p_regex = liftM (foldr1 (>|<)) (P.sepBy1 p_branch (P.char '|'))

p_branch = liftM (($ epsilon).(foldr (.) id)) (P.many1 (p_atom >>= p_post_atom))

p_atom = P.try (P.string "()" >> return epsilon)
     <|> P.between (P.char '(') (P.char ')') p_regex
     <|> p_bracket <|> p_dot <|> p_escaped_char <|> p_other_char 
     <|> (pzero <?> "cannot parse regexp atom")

p_post_atom atom = (P.char '?' >> return (atom `quest`))
               <|> (P.char '+' >> return (atom `plus`))
               <|> (P.char '*' >> return (atom `star`)) 
               <|> (return (atom +>))

p_bracket = (P.char '[') >> ( (P.char '^' >> p_set True) <|> (p_set False) )

p_set invert = do initial <- (P.option "" ((P.char ']' >> return "]") <|> (P.char '-' >> return "-")))
                  middle <- P.manyTill P.anyChar (P.char ']')
                  let expand [] = []
                      expand ('-':[]) = "-"
                      expand (a:'-':b:rest) | a /= '-' = (enumFromTo a b)++(expand rest)
                      expand (x:xs) | x /= '-'  = x:(expand xs)
                                    | otherwise = error "A dash is in the wrong place in a p_set"
                      characters = nub ( sort (initial ++ (expand middle)) )
                  return $ if invert then alt ( ['\0'..'\127'] \\ characters )
                                     else alt characters

p_dot = P.char '.' >> return (alt ['\0'..'\127'])

p_escaped_char = P.char '\\' >> liftM char P.anyChar

p_other_char = liftM char (P.noneOf specials) where specials = "^.[$()|*+?\\"

------------------------------------------------------------------------
--  And everything else is the CTK library.
--
--  Compiler Toolkit: Self-optimizing lexers
--
--  Author : Manuel M. T. Chakravarty
--  Created: 24 February 95, 2 March 99
--  Copyright (c) [1995..2000] Manuel M. T. Chakravarty
--  Copyright (c) 2004-6 Don Stewart
--
--  Self-optimizing lexer combinators.
--
--  For detailed information, see ``Lazy Lexing is Fast'', Manuel
--  M. T. Chakravarty, in A. Middeldorp and T. Sato, editors, Proceedings of
--  Fourth Fuji International Symposium on Functional and Logic Programming,
--  Springer-Verlag, LNCS 1722, 1999.  (See my Web page for details.)
--
--             http://www.cse.unsw.edu.au/~chak/papers/Cha99.html
--
--  Thanks to Simon L. Peyton Jones and Roman Leshchinskiy for their
--  helpful suggestions that improved the design of this library.

-- EXTERNAL INTERFACE

infixr 4 `quest`, `star`, `plus`
infixl 3 +>, `action`, `meta`
infixl 2 >|<, >||<

reReplace (s,new) = (regex "." `action` Just) >||< (regex s `action` const (Just new))
reDelete s = (regex "." `action` \[c] -> Just c) >||< (regex s `action` const Nothing)
reCount s = regex s `meta1` \_ lexer m -> R Nothing (succ m) lexer

-- Empty lexeme (noop)
epsilon = id :: Regexp t

-- One character regexp
char c = (\l -> Lexer NoAction (Sparse (B 1 c c) (M.singleton c l))) :: Regexp t

-- accepts a non-empty set of alternative characters
-- Equiv. to `(foldr1 (>|<) . map char) cs', but much faster
alt cs  = \l -> let bnds = B (length cs) (minimum cs) (maximum cs)
                in Lexer NoAction (aggregate bnds [(c, l) | c <- cs])

-- accept a character sequence
string cs = (foldr1 (+>) . map char) cs

-- Concatenation of regexps is just concatenation of functions
(+>)  = (.) :: Regexp t -> Regexp t -> Regexp t

-- disjunctive combination of two regexps, corresponding to x|y
re1 >|< re2  = \l -> re1 l >||< re2 l

-- x `quest` y corresponds to the regular expression x?y
quest re1 re2  = (re1 +> re2) >|< re2

-- x `plus` y corresponds to the regular expression x+y
plus re1 re2  = re1 +> (re1 `star` re2)

-- x `star` y corresponds to the regular expression x*y
--
-- The definition used below can be obtained by equational reasoning from this
-- one (which is much easier to understand): 
--
--   star re1 re2 = let self = (re1 +> self >|< epsilon) in self +> re2
--
-- However, in the above, `self' is of type `Regexp s t' (ie, a functional),
-- whereas below it is of type `Lexer s t'.  Thus, below we have a graphical
-- body (finite representation of an infinite structure), which doesn't grow
-- with the size of the accepted lexeme - in contrast to the definition using
-- the functional recursion.
star re1 re2  = \l -> let self = re1 self >||< re2 l in self

-- Close a regular expression into a Lexer with an action that
-- converts the lexeme into a token.  action and meta advance by the
-- length of the lexeme while action1 and meta1 advance a single
-- character.
action re a  = re `meta` a' where a' lexeme lexer s = R (a lexeme) s lexer
{-# INLINE action #-}
action1 re a  = re `meta1` a' where a' lexeme lexer s = R (a lexeme) s lexer
{-# INLINE action1 #-}
meta re a  = re (Lexer (Action a) Done)
{-# INLINE meta #-}
meta1 re a  = re (Lexer (Action1 a) Done)
{-# INLINE meta1 #-}

-- disjunctive combination of two lexers (longest match, right biased)
(Lexer a c) >||< (Lexer a' c')  = Lexer (joinActions a a') (joinConts c c')

-- Take a Lexer, input string, and initial state
-- Returns ([token],(remaining input,final state))
run _ [] state = ([], ([],state))
run l csIn state = case lexOne l csIn state of
  (Nothing , _ , cs', state') -> run l cs' state'
  (Just t  , l', cs', state') -> let (ts, final) = run l' cs' state'; ts' = (t:ts)
                                 in ts' `seq` (ts',final)
  where
    -- accept a single lexeme
    lexOne l0 cs0 state0 = oneLexeme l0 cs0 state0 id lexErr

      where
        lexErr = (Just undefined, l, tail cs0, state0)

        -- we implement the "principle of the longest match" by taking
        -- a potential result quad down (in the last argument); the
        -- potential result quad is updated whenever we pass by an
        -- action (different from `NoAction'); initially it is lexErr
        oneLexeme (Lexer a cont) cs state csDL last =
          let last' = case a of NoAction -> last; _ -> doaction a (csDL []) cs state last
          in case cs of
               []      -> last'    -- at end, has to be this action
               (c:cs') -> last' `seq` oneChar cont c cs' state csDL last'   -- keep looking

        -- There are more chars. Look at the next one. Now, if the
        -- next tbl is Done, then there is no next transition, so
        -- immediately execute the matching action.
        oneChar tbl c cs state csDL last = case peek tbl c of
          (Lexer action Done)   -> doaction action (csDL [c]) cs state last
          l'                    -> oneLexeme l' cs state (csDL . (c:)) last

        -- Do the lookup of the current character in the DFA transition table.
        peek (Dense bn arr)  c | c `inBounds` bn = arr ! c
        peek (Sparse bn cls) c | c `inBounds` bn = M.findWithDefault (Lexer NoAction Done) c cls
        peek _ _ = (Lexer NoAction Done)

        -- execute the action if present and finalise the current lexeme
        doaction (Action f) csDL cs state _ =  case f (csDL) l0 state of
          (R Nothing s' l') | not . null $ cs -> s' `seq` lexOne l' cs s'
          (R res     s' l')                   -> s' `seq` (res, l', cs, s')
        doaction (Action1 f) csDL cs state _ = case f (csDL) l0 state of
          (R Nothing s' l') | not . null $ cs -> s' `seq` lexOne l' (tail csIn) s'
          (R res     s' l')                   -> s' `seq` (res, l', (tail csIn), s')
        doaction NoAction _ _ _ last = last

-- INTERNAL IMPLEMENTATION

-- represents the number of (non-error) elements and the bounds of a
-- DFA transition table
data BoundsNum = B !Int !Char !Char

-- we use the dense representation if a table has at least the given
-- number of (non-error) elements
denseMin  = 20 :: Int

-- combine two bounds
addBoundsNum (B n lc hc) (B n' lc' hc')  = B (n + n') (min lc lc') (max hc hc')

-- check whether a character is in the bounds
inBounds c (B _ lc hc) = c >= lc && c <= hc

-- Lexical actions take a lexeme with its position and may return a
-- token; in a variant, an error can be returned
--
-- * if there is no token returned, the current lexeme is discarded
--   lexing continues looking for a token
type Action t = String -> Lexer t -> Maybe t

-- Meta actions transform the lexeme, the old lexer, and a
-- user-defined state; they may return the old or a new lexer, which
-- is then used for accepting the next token (this is important to
-- implement non-regular behaviour like nested comments)
type Meta t = String -> Lexer t -> S -> Result t

data Result t = R (Maybe t) !S !(Lexer t)

-- threaded top-down during lexing (with current input)
type S = Int

-- tree structure used to represent the lexer table
--
-- * each node in the tree corresponds to a state of the lexer; the
--   associated actions are those that apply when the corresponding
--   state is reached
data Lexer t = Lexer (LexAction t) (Cont t)

-- represent the continuation of a lexer
            -- on top of the tree, where entries are dense, we use arrays
data Cont t = Dense BoundsNum (Array Char (Lexer t))
            -- further down, where the valid entries are sparse, we
            -- use association lists, to save memory
            | Sparse BoundsNum (M.Map Char (Lexer t))
            -- end of an automaton
            | Done

-- lexical action
data LexAction t = Action !(Meta t) | Action1 !(Meta t) | NoAction  -- need new LexAction for advance by 1

-- a regular expression
type Regexp t = Lexer t -> Lexer t

-- combine two disjunctive continuations
joinConts Done c'   = c'
joinConts c    Done = c
joinConts c    c'   = let (bn , cls ) = listify c
                          (bn', cls') = listify c'
                      -- note: `addBoundsNum' can, at this point, only
                      --       approx. the number of *non-overlapping* cases;
                      --       however, the bounds are correct 
                      in aggregate (addBoundsNum bn bn') (cls ++ cls')
  where listify (Dense  n arr) = (n, assocs arr)
        listify (Sparse n cls) = (n, M.toList cls)

-- combine two actions. Use the latter in case of overlap (right biased!)
joinActions NoAction a'       = a'
joinActions a        NoAction = a
joinActions _        a'       = a' -- error "Lexers.>||<: Overlapping actions!"

-- Note: `n' is only an upper bound of the number of non-overlapping cases
aggregate bn@(B n lc hc) cls
  | n >= denseMin = Dense  bn (accumArray (>||<) (Lexer NoAction Done) (lc, hc) cls)
  | otherwise     = Sparse bn (M.fromList (accum (>||<) cls))

-- combine the elements in the association list that have the same key
accum _ []           = []
accum f ((c, el):ces) = let (ce, ces') = gather c el ces in ce : accum f ces'
  where gather k e []                             = ((k, e), [])
        gather k e (ke'@(k', e'):kes) | k == k'   = gather k (f e e') kes
                                      | otherwise = let (ke'', kes') = gather k e kes
                                                    in (ke'', ke':kes')

Proposed Legal entry : Parsec

Parsec 1

I (ChrisKuklewicz) have resurrected the "All parsec" proposal and crafted this entry. It uses parsec to transform a string regular expression into a new parsec parser that accepts matching strings (as a prefix). This is used in reCount and in (reReplace, reReplace1, reReplaceMap) to perform the requisite operations.

Compile with -O2. On a powerbook G4 it uses 177 MB RSIZE and runs in 390 seconds (aka 6.5 minutes).

Note: The reReplaceMap used by super to perform the substitution of the IUBS is a possibly questionable optimization.

THIS HAS BEEN WITHDRAWN

The parser it created was "possessive" where it should have been "greedy".

The offending code was

type T s = CharParser s ()
type CT s = CharParser () (T s)

p_post_atom :: T s -> CT s
p_post_atom atom = (char '?' >> return (optional atom))
               <|> (char '+' >> return (skipMany1 atom))
               <|> (char '*' >> return (skipMany atom))
               <|> return atom
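
To see the failure concretely, consider matching a*a against "aaa": the possessive form commits skipMany to all three a's and then fails on the trailing atom, while a greedy form with backtracking points succeeds. A small illustration, assuming only standard Parsec combinators:

import Text.ParserCombinators.Parsec

-- Possessive: skipMany consumes every 'a', so the trailing char 'a' always
-- fails on "aaa", even though a*a matches it.
possessive :: Parser ()
possessive = skipMany (char 'a') >> char 'a' >> return ()

-- Greedy, in the style of the Parsec 2 fix below: try the longer match
-- first and fall back to the continuation on failure.
greedy :: Parser ()
greedy = star (\cont -> char 'a' >> cont) (char 'a' >> return ())
  where star x cont = let this = try (x this) <|> cont in this

-- parse possessive "" "aaa"  ==>  Left ... (unexpected end of input)
-- parse greedy     "" "aaa"  ==>  Right ()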

Parsec 2

This is a "fix"ed version of Parsec 1. It uses "star x cont = fix (\this -> try (x this) <|> cont)" to create all the backtacking points that are needed. To make this work, the parsers return CPS-style functions instead of parsers.

The fixed code is

-- February 25th, 2006

type T s = CharParser s ()
type TT s = T s -> T s
type CTT s = CharParser () (TT s)

p_post_atom :: TT s -> CTT s
p_post_atom atom = (char '?' >> return (\cont -> try (atom cont) <|> cont))
               <|> (char '+' >> return (atom . star atom))
               <|> (char '*' >> return (star atom))
               <|> return atom
  where star :: TT s -> T s -> T s
        star x cont = fix (\this -> try (x this) <|> cont)

All other changes were needed to handle the conversion to CPS-style, but they did not alter the process of matching the regex.

-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
-- For the regex-dna benchmark
--
-- http://haskell.org/hawiki/ShootoutEntry
--
-- March 3rd, 2006
-- 
-- Pure parsec solution
-- by Chris Kuklewicz and Don Stewart
--
-- This satisfies the new stringent specification
--
import Control.Monad.Fix (fix)
import Control.Monad (liftM,liftM2,msum,sequence)
import Data.Either (either)
import Data.List -- (intersperse,nub,sort,foldl')
import Text.ParserCombinators.Parsec
import Debug.Trace

main = interact $ \input ->
  let length1 = show $ length input in length1 `seq`
  let clean = reRemove ">[^\n]*\n|\n" input
      length2 = show $ length clean in length2 `seq`
  let table = map (\regexp -> regexp ++ (' ':show (reCount regexp clean))) variants
   -- length3 = show $ length (super clean)
   -- The foldl' below is more orthodox, but slower
      length3 = show $ length $ foldl' reReplace3 clean iubs
  in unlines $ table ++ ["",length1,length2,length3]

variants = ["agggtaaa|tttaccct","[cgt]gggtaaa|tttaccc[acg]","a[act]ggtaaa|tttacc[agt]t"
  ,"ag[act]gtaaa|tttac[agt]ct","agg[act]taaa|ttta[agt]cct","aggg[acg]aaa|ttt[cgt]ccct"
  ,"agggt[cgt]aa|tt[acg]accct","agggta[cgt]a|t[acg]taccct","agggtaa[cgt]|[acg]ttaccct"]

iubs = map (\ (c,new) -> (parseRegexp c, c, new) ) $
  [("B","(c|g|t)"),("D","(a|g|t)"),("H","(a|c|t)"),("K","(g|t)"),("M","(a|c)")
  ,("N","(a|c|g|t)"),("R","(a|g)"),("S","(c|g)"),("V","(a|c|g)"),("W","(a|t)"),("Y","(c|t)")]

-- super is a bit more clever, searching for each replacement in turn.
-- Each reReplaceMap takes a continuation to use for input that does
-- not match, each continuation leads to the next reReplaceMap (ending
-- in 'id' ).  This is not searching in parallel!
super = foldr1 (.) (map reReplaceMap iubs) id

-- Regexp INTERFACE

-- Replace each match of regexp with new in the input, processing everything before returning
-- (Used to clean in the input string)
reReplace regexp new input = 
  let regexp' = parseRegexp regexp
      parser = sepBy (many (notFollowedBy (try (regexp' >> return '\0')) >> anyChar)) regexp'
      result = either (error.show) id (parse parser regexp input)
  in concat $ intersperse new result

-- Replace each match of regexp with new in the input, returning the matching or non-matching blocks lazily
-- (Currently unused)
reReplace1 regexp new s =
  let regexp' = parseRegexp regexp
      parser = do value <- (try regexp' >> return new) 
                           <|> do c <- anyChar
                                  cs <- many (notFollowedBy (try (regexp' >> return '\0')) >> anyChar)
                                  return (c:cs)
                  rest <- getInput
                  return (value,rest)
      helper "" = ""
      helper input = let (this,next) = either (error.show) id (parse parser regexp input)
                     in this ++ helper next
  in helper s

takeAppend :: String -> Int -> String -> String
takeAppend s n e = loop s n
  where loop _ 0 = e
        loop (x:xs) n = x : loop xs (pred n)
{-# INLINE takeAppend #-}

takeS :: String -> Int -> ShowS -> ShowS
takeS s n e x = loop s n -- trace ("\n>"++show n++"<\n") loop s n
  where loop _ 0 = e x
        loop (x:xs) n = x : loop xs (pred n)

reReplace2 regexp new s =
  let newS = shows new
      regexp' = parseRegexp regexp
      parser = do start <- getInput
                  let loop n | n `seq` True = (try regexp' >> (return (takeAppend start n new)))
                                              <|> (anyChar >> loop (succ n)) <|> (eof >> return start)
                  liftM2 (,) (loop (0::Int)) getInput
      helper "" = ""
      helper input = let (this,next) = either (error.show) id (parse parser regexp input)
                     in this ++ helper next
  in helper s

reRemove regexp s = reReplace3 s (parseRegexp regexp,regexp,"")

reReplace3 s (regexp',regexp,new) =
  let newS = (new++)
      parser = do start <- getInput
                  let loop n | n `seq` True = 
                        (try regexp' >> updateState (. (takeS start n newS)) >> parser)
                        <|> (anyChar >> loop (succ n)) <|> (eof >> getState >>= \f -> return (f . (const start)))
                  loop 0
      result = either (error.show) id (runParser parser id regexp s) $ ""
  in result

-- Count the number of non-overlapping matches of regexp in the input
-- (Used to count each of the sequences)
reCount regexp input =
  let regexp' = parseRegexp regexp
      parser = (try regexp' >> updateState (+1) >> parser) <|> (anyChar >> parser) <|> (eof >> getState)
  in either (error.show) id (runParser parser (0::Int) regexp input)
                       
reCount' regexp input =
  let regexp' = parseRegexp regexp
      parser = (eof >> getState) <|> ( ( (try regexp' >> updateState (+1)) <|> (anyChar >> return ()) ) >> parser )
  in either (error.show) id (runParser parser (0::Int) regexp input)

-- When regexp' matches replace it with new otherwise pass the block the non-matching block to cont
-- (Used by clean to replace the IUBS codes)
reReplaceMap (regexp',regexp,new) cont input =
  let parser = do value <- (regexp' >> return new)
                           <|> do c <- anyChar
                                  cs <- many (notFollowedBy (try (regexp' >> return '\0')) >> anyChar)
                                  return (cont (c:cs))
                  rest <- getInput
                  return (value,rest)

      helper "" = ""
      helper rest = let (this,next) = either (error.show) id (parse parser regexp rest)
                    in this ++ helper next
  in helper input

-- Turn a string regexp into a Parsec parser
type T s = CharParser s ()
type TT s = T s -> T s
type CTT s = CharParser () (TT s)

parseRegexp :: String -> T s
parseRegexp = (either (error.show) ($ return ())) . (\x -> parse p_regexp x x)

-- Regexp IMPLEMENTATION (exporting p_regexp to Regexp INTERFACE)

p_regexp :: CTT s
p_regexp = do branches <- sepBy1 p_branch (char '|')
              return (\cont -> let branches' = map (\branch -> try (branch cont)) branches
                               in msum branches')

p_branch :: CTT s                                   
p_branch = return . (foldr1 (.)) =<< many1 p_piece

p_piece :: CTT s
p_piece = p_atom >>= p_post_atom

p_atom :: CTT s
p_atom = try (string "()" >> return id) <|> 
         between (char '(') (char ')') p_regexp <|>
         p_bracket <|> p_dot <|> p_escaped_char <|> p_other_char

p_post_atom :: TT s -> CTT s
p_post_atom atom = (char '?' >> return (\cont -> try (atom cont) <|> cont))
               <|> (char '+' >> return (atom . star atom))
               <|> (char '*' >> return (star atom))
               <|> return atom
  where star :: TT s -> T s -> T s
        star x cont = fix (\this -> try (x this) <|> cont)

p_bracket :: CTT s
p_bracket = char '[' >> ( (char '^' >> p_set True) <|> (p_set False) )

p_set :: Bool -> CTT s
p_set invert = do initial <- option "" ((char ']' >> return "]") <|> (char '-' >> return "-"))
                  middle <- manyTill anyChar (char ']')
                  let expand [] = []
                      expand ('-':[]) = "-"
                      expand (a:'-':b:rest) | a /= '-' = (enumFromTo a b)++(expand rest)
                      expand (x:xs) | x /= '-' = x:(expand xs)
                                    | otherwise = error "A dash is in the wrong place in a p_set"
                      characters = nub ( sort (initial ++ (expand middle)) ) -- can't be empty
                  if invert then use (noneOf characters)
                            else use (oneOf characters)

p_dot :: CTT s
p_dot = char '.' >> use anyChar

p_escaped_char :: CTT s
p_escaped_char = char '\\' >> anyChar >>= use . char

p_other_char :: CTT s
p_other_char = label (noneOf specials) ("none of "++show specials) >>= use . char 
    where specials = "^.[$()|*+?\\"

-- use is needed to alter the returned type of character parsers to (TT s)
use :: CharParser s ignore -> CTT s
use x = return (x >>)

Alternatives

Speedup Idea - precalculate prefixes

Precalculate regexp prefixes to drop more than one character on pattern failure.

At the moment, when a pattern match "try" fails, one character is dropped and the pattern is tried again. This leads to roughly "input length" trials for each of the patterns and each occurring "|", making it (9 patterns * 2 ORs * input length = 9*2*100000 =) 1,800,000 applications of "try" even for the small input file.

{-
  Find the maximal number of characters to skip when a regular expression parser
  just failed.

  When we know after how many characters a parser failed and where the first
  pattern may appear in a valid pattern, the minimum of these numbers should be
  the number of characters we can skip, if I'm not mistaken.

  NOTE: code is not fully tested and substring is not implemented for strings of
  different lengths.

  Contributed by: Johannes Ahlmann
-}
import Data.List
import Text.ParserCombinators.Parsec

-- split input regular expression into [[Char]] to be later expanded by sequence
-- NOTE: only supports non-negated brackets and "other" chars, yet
p_regexp = sepBy1 p_branch (char '|')
p_bracket = between (char '[') (char ']') (many1 $ noneOf "]")
p_branch  = many1 ((try p_bracket) <|> (noneOf specials >>= return . (:[])))
    where specials = "^.[$()|*+?\\"

-- substring function that also returns at what position "a" is contained in "b"
containsAt a b = if length b >= length a then length b else error "not implemented"

-- function to determine if and where "a" or one of its prefixes is the postfix of "b"
tailsAt a b = if null tl then lb else lb - (maximum tl)
  where tl = [length y | x <- (drop 1 . inits) a, y <- tails b, x == y]
        lb = length b
allowsSkip a b = min (tailsAt a b) (containsAt a b)

-- calculate the minimum number of characters to skip for all expanded variants of a regexp
skipper vars = if null skipLengths then minimum (map length vars) else minimum skipLengths
  where skipLengths = map (uncurry allowsSkip) [(a,b) | a <- vars, b <- vars, a /= b]

myparse = either (error.show) id . parse p_regexp ""
expand  = concatMap sequence . myparse

-- calculate the number of characters to skip, such that no other pattern could fit into them.
skipLength regexp = skipper $ expand regexp
</haskell>
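As a usage illustration (hypothetical, since the sketch above is untested): for the first Shootout pattern neither alternative overlaps the other, so the full pattern length should be skippable.

<haskell>
-- Hypothetical usage of the sketch above, e.g. in GHCi:
--
--   expand "agggtaaa|tttaccct" == ["agggtaaa","tttaccct"]
--
-- No prefix of either alternative is a suffix of the other, so after a
-- failed match we should be able to skip the whole pattern length:
example :: Int
example = skipLength "agggtaaa|tttaccct"   -- expected: 8
</haskell>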

== How to gauge where the error occurred ==

Catching errors inside a parser is impossible: you have to call runParser or parse to catch the error, and the "try" combinator hides the error position (it resets it to the first column). So I have invented a trick:

<haskell>
-- by ChrisKuklewicz
-- Proposed, untested trick to replace msum for p_regexp definition

-- Returns the Left Column for failure
--             Right Column for success
sub p = do i <- getInput
           let e = parse p "sub" i
           case e of
             Left pe -> do let c = sourceColumn (errorPos pe)
                           return (Left c)
             Right _ -> do c <- liftM sourceColumn getPosition
                           return (Right c)

-- Returns the Left Column for failures (minimum of two failure Columns)
--             Right Column for a success (use first success, left biased)
orSub p1 p2 = do ec1 <- p1
                 case ec1 of
                   Left c1 -> do ec2 <- p2
                                 case ec2 of
                                   Left c2 -> return (Left (min c1 c2))
                                   right -> return right
                   right -> return right

msumSub [x] = (sub x)
msumSub xs = foldr1 orSub (map sub xs)
</haskell>

At a higher level, we want to advance 1 character on success (Right _) or use the Speedup Idea from Johannes to decide what to skip on failure.
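A self-contained sketch of that outer loop (hypothetical names, not part of any entry; {{{matchAt}}} stands in for the parsec matcher and {{{skip}}} for the precomputed skip distance):

<haskell>
-- Minimal sketch of the skip-on-failure scan. Assumes skip >= 1 so
-- the scan always makes progress.
countSkipping :: (String -> Bool) -> Int -> String -> Int
countSkipping matchAt skip = go 0
  where
    go n [] = n
    go n s@(_:rest)
      | matchAt s = go (n + 1) rest     -- success: advance one character
      | otherwise = go n (drop skip s)  -- failure: skip ahead
</haskell>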

Another strategy is to modify {{{use}}} and replace {{{sequence_}}} in {{{p_branch}}} by {{{series}}} and {{{msum}}} in {{{p_regexp}}} by {{{msumSub}}}:

<haskell>
-- Never tested:

-- Don't let the parser fail, capture the error column in the return value explicitly
-- Note that x is a parser for exactly one character
use x = return ( (x >> liftM Right getPosition) <|> (liftM Left getPosition) )

-- series replaces the (>>) that sequence_ folds into the list with different abort semantics
series [x] = x
series (x:xs) = do ec <- x
                   case ec of
                     Right _ -> series xs
                     left -> return left
</haskell>

Hmmm...I like this one better, but {{{p_post_atom}}} will need similar fixes {{{(myUnit,myOptional,mySkipMany1,mySkipMany)}}}.

<haskell>
-- All I can say is it compiles; it has never been tested

-- replace return () for "()" atoms
myUnit = liftM Right getPosition

-- Replace optional for '?' modifier
myOptional p = do missed <- liftM Right getPosition
                  ec <- p
                  case ec of 
                    Left _ -> return missed
                    right -> return right

-- mySkip is a helper for mySkipMany and mySkipMany1: keep running p,
-- remembering the last success, until it fails
mySkip p prev = do ec <- p
                   case ec of
                     Left _ -> return prev
                     right  -> mySkip p right

-- Replace skipMany for '*' modifier
mySkipMany p = do start <- liftM Right getPosition
                  mySkip p start

-- Replace skipMany1 for '+' modifier
mySkipMany1 p = do ec <- p
                   case ec of
                     right@(Right _) -> mySkip p right
                     left -> return left
</haskell>

Re-inventing all the parsec combinators is tricky, but it just may work...


= REs using TCL library =

As yet without cleaning or substitution.

On my computer it takes 13s to count the regular-expression matches on the big data set (N=500000), where the Ruby version takes 7s.

Obviously there are many things to improve or change. At the moment the input is treated as one large CString, which is not a valid approach for the not-yet-implemented substitution. There are also problems with depending on c2hs being installed, and potential issues of legality.

TCL appears to use "Henry Spencer's RE library", which would most likely be easier and better to bind directly, because the TCL wrapper is pretty much made to measure for TCL and not meant for external use.

Profiling showed that 92% of the time is spent in "reExec", so all attempts at optimizing the Haskell portion of the code are rather useless. Maybe there are some tweaks to get more performance out of the FFI, though.

<haskell>
{- | Regular Expressions with the TCL RE library.

  TCL development headers have to be installed.

  To compile:
    c2hs /usr/include/tcl8.4/tcl-private/generic/regcustom.h T.chs
    ghc -ffi -ltcl8.4 --make T.hs

  Contributed by Johannes Ahlmann    
-}
module Main where
import Foreign
import Foreign.C



-- | find the relative start and end of a match. assumes at least one match was found
getMatchExtent re = do
  info <- mallocBytes {#sizeof Tcl_RegExpInfo#} -- little speedup in extracting
  reInfo re info
  start <- {#get Tcl_RegExpInfo.matches->start#} info
  end   <- {#get Tcl_RegExpInfo.matches->end#}   info
  free info
  return (start,end)
  
  
-- | count occurrences of regular expression in string
-- FIXME: maybe use "end" instead of "start+1" for validity (overlapping matches). speedwise it makes no difference.
countOccs re cString n = do 
  m <- reExec nullPtr re cString cString -- real start if different!??
  if m /= 1
    then return n 
    else do
      (start,end) <- getMatchExtent re
      countOccs re (advancePtr cString (fromIntegral $ start + 1)) (n+1)

  
---
    
    
rExpressions = ["agggtaaa|tttaccct"   
               ,"[cgt]gggtaaa|tttaccc[acg]" 
               ,"a[act]ggtaaa|tttacc[agt]t"
               ,"ag[act]gtaaa|tttac[agt]ct"
               ,"agg[act]taaa|ttta[agt]cct"
               ,"aggg[acg]aaa|ttt[cgt]ccct"
               ,"agggt[cgt]aa|tt[acg]accct"
               ,"agggta[cgt]a|t[acg]taccct"
               ,"agggtaa[cgt]|[acg]ttaccct" 
               ]


-- | given a string and a regular expression, count the RE's number of occurrences
counter s rx = withCString rx (\p -> do 
                  re <- reCompile nullPtr p
                  n  <- countOccs re s 0
                  return (rx, n))

main = do
  input  <- getContents
  counts <- withCString input (\s -> mapM (counter s) rExpressions)
  mapM (\(e,c) -> putStrLn (e ++ " " ++ show c)) counts 



reCompile = {#call unsafe Tcl_RegExpCompile as ^ #}
reExec    = {#call unsafe Tcl_RegExpExec    as ^ #}
reInfo    = {#call unsafe Tcl_RegExpGetInfo as ^ #}
</haskell>

= Older illegal entries =

== Possible backup entry: lazy regex 5 ==

Lazy regex 4, but with the parsec frontend to construct combinator expressions from strings dynamically (I guess we could use TH too).

See this thread for people complaining about combinator syntax for regexes.

It would be nice to see a Text.ParserCombinators.Regex made from this interface.

<haskell>
{-# OPTIONS -funbox-strict-fields #-}
--
-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
--
-- http://haskell.org/hawiki/ShootoutEntry
-- Contributed by Don Stewart, Chris Kuklewicz and Alson Kemp.
--
-- Compile with: -O2 -package parsec
--
-- An idiomatic Haskell entry using lazy regex combinators, described in the paper:
--
--  Manuel  M. T. Chakravarty, Lazy Lexing is Fast.  
--  In A. Middeldorp and T. Sato, editors, Proceedings of Fourth Fuji
--  International Symposium on Functional and Logic Programming,
--  Springer-Verlag, LNCS 1722, 1999.
--
-- For more about higher-order combinator-based lexing and parsing in Haskell consult:
--
--  Graham Hutton. Higher-order functions for parsing. (1992)
--  Journal of Functional Programming 2(3): 323-343. 
--
--  Jeroen Fokker. Functional Parsers. (1995) 
--  Lecture Notes of the Baastad Spring school on Functional Programming.
--
--  Graham Hutton and Erik Meijer. Monadic Parser Combinators. (1996)
--  Technical report NOTTCS-TR-96-4. Department of Computer Science, University of Nottingham.
--
--  Steve Hill. Combinators for parsing expressions. (1996)
--  Journal of Functional Programming 6(3): 445-463. 
--
--  Andrew Partridge and David Wright. 
--  Predictive parser combinators need four values to report errors. (1996)
--  Journal of Functional Programming 6(2): 355-364.
--
--  Doaitse Swierstra and Luc Duponcheel. 
--  Deterministic, Error-Correcting Combinator Parsers. (1996)
--  Advanced Functional Programming. LNCS 1129: 185-207.
--
--  Pieter Koopman and Rinus Plasmeijer. Efficient Combinator Parsers. (1999)
--  Implementation of Functional Languages. Springer Verlag, LNCS 1595: 122-138.
--
--  Doaitse Swierstra and Pablo Azero. 
--  Fast, Error Correcting Parser Combinators: A Short Tutorial. (1999)
--  SOFSEM'99 Theory and Practice of Informatics. LNCS 1725: 111-129.
--
--  Arthur Baars, Andres Loh, and Doaitse Swierstra. Parsing Permutation Phrases. (2001) 
--  Proceedings of the ACM SIGPLAN Haskell Workshop, 171-183.
--
-- And many other sources.
--

import Prelude hiding   (last)
import List
import Maybe
import Data.Array            (Array, (!), assocs, accumArray)
import qualified Data.Map    as M
import qualified Data.IntMap as I

import Control.Monad
import qualified Text.ParserCombinators.Parsec as P
import Text.ParserCombinators.Parsec ((<|>),GenParser)

main = interact $ \s0 ->
    let l0 = length s0 in l0 `seq` 
    let s1       = fst $ run clean (LS s0 I.empty)
        l1       = length s1 in l1 `seq` 
    let (LS _ m) = snd $ run count  (LS s1 I.empty)
        (LS _ n) = snd $ run count' (LS s1 I.empty)
        counts   = map (\(i,p) -> p++" "++show (I.findWithDefault 0 i m)) $ (init variants)
        counts'  = (snd.last) variants++" "++show (I.findWithDefault 0 0 n)
        l2       = length . concat . fst $ run replace  (LS s1 I.empty)
    in unlines $ (counts++[counts']) ++ [[]] ++ map show [l0, l1, l2]

-- remove "\n" and lines with ">"
clean = (regex "."            `action` \[c] -> Just c)
   >||< (regex "\n|>[^\n]+\n" `action` const Nothing)

-- count all variants, accumulating count in state threaded through regex matcher
count  = foldl1 (>||<) $ [ match (regex s, n) | (n,s) <- variants ]
count' = match (regex ((snd.last) variants), 0) -- obscure issue with overlapping patterns

-- each pattern keeps track of its own count, so we can perform a single pass
match (p,i) = p `meta` \_ m -> R Nothing (I.insertWith (+) i 1 m) Nothing

variants = zip [0..]
  ["agggtaaa|tttaccct","[cgt]gggtaaa|tttaccc[acg]","a[act]ggtaaa|tttacc[agt]t"
  ,"ag[act]gtaaa|tttac[agt]ct","agg[act]taaa|ttta[agt]cct","aggg[acg]aaa|ttt[cgt]ccct"
  ,"agggt[cgt]aa|tt[acg]accct","agggta[cgt]a|t[acg]taccct","agggtaa[cgt]|[acg]ttaccct"]

-- substitute certain chars for patterns
replace = (regex "." `action` \c -> Just c)
     >||< foldl1 (>||<) (map (\(c,p) -> regex [c] `action` const (Just p)) pairs)

pairs = [('B',"(c|g|t)"),('D',"(a|g|t)"),  ('H',"(a|c|t)"),('K',"(g|t)")
        ,('M',"(a|c)"),  ('N',"(a|c|g|t)"),('R',"(a|g)"),  ('S',"(c|g)")
        ,('V',"(a|c|g)"),('W',"(a|t)"),    ('Y',"(c|t)") ]

------------------------------------------------------------------------
--
-- Construct a regex combinator expression from a string regex
--
-- I know, I know, in Haskell we'd just write the combinator expression
-- directly, but people complained that they're not `real' regular
-- expressions. That's clearly wrong in my opinion -- combinators
-- accurately describe and implement regular expressions (they're
-- semantically identical). The syntax is an embedded domain-specific
-- language in Haskell, not an untyped string. Sigh.
--

regex = (either (error.show) id) . (\x -> P.parse p_regex x x)

p_regex = liftM (foldr1 (>|<)) (P.sepBy1 p_branch (P.char '|'))

p_branch = liftM (($ epsilon).(foldr (.) id)) (P.many1 (p_atom >>= p_post_atom))

p_atom = P.try (P.string "()" >> return epsilon) -- must not consume '(' yet
     <|> P.between (P.char '(') (P.char ')') p_regex
     <|> p_bracket <|> p_dot <|> p_escaped_char <|> p_other_char

p_post_atom atom = (P.char '?' >> return (atom `quest`))
               <|> (P.char '+' >> return (atom `plus`))
               <|> (P.char '*' >> return (atom `star`)) 
               <|> (return (atom +>))

p_bracket = (P.char '[') >> ( (P.char '^' >> p_set True) <|> (p_set False) )

p_set invert = do initial <- (P.option "" ((P.char ']' >> return "]") <|> (P.char '-' >> return "-")))
                  middle <- P.manyTill P.anyChar (P.char ']')
                  let expand [] = []
                      expand ('-':[]) = "-"
                      expand (a:'-':b:rest) | a /= '-' = (enumFromTo a b)++(expand rest)
                      expand (x:xs) | x /= '-'  = x:(expand xs)
                                    | otherwise = error "A dash is in the wrong place in a p_set"
                      characters = nub ( sort (initial ++ (expand middle)) )
                  return $ if invert then alt ( ['\0'..'\127'] \\ characters )
                                     else alt characters

p_dot = P.char '.' >> return (alt ['\0'..'\127'])
p_escaped_char = P.char '\\' >> liftM char P.anyChar
p_other_char = liftM char (P.noneOf specials) where specials = "^.[$()|*+?\\"

------------------------------------------------------------------------
-- And everything else is the CTK library.

--  Compiler Toolkit: Self-optimizing lexers
--
--  Author : Manuel M. T. Chakravarty
--  Created: 24 February 95, 2 March 99
--  Copyright (c) [1995..2000] Manuel M. T. Chakravarty
--  Copyright (c) 2004-6 Don Stewart
--
--  Self-optimizing lexer combinators.
--
--  For detailed information, see ``Lazy Lexing is Fast'', Manuel
--  M. T. Chakravarty, in A. Middeldorp and T. Sato, editors, Proceedings of
--  Fourth Fuji International Symposium on Functional and Logic Programming,
--  Springer-Verlag, LNCS 1722, 1999.  (See my Web page for details.)
--
--             http://www.cse.unsw.edu.au/~chak/papers/Cha99.html
--
--  Thanks to Simon L. Peyton Jones and Roman Leshchinskiy for their
--  helpful suggestions that improved the design of this library.
--

infixr 4 `quest`, `star`, `plus`
infixl 3 +>, `action`, `meta`
infixl 2 >|<, >||<

-- we use the dense representation if a table has at least the given number of 
-- (non-error) elements
denseMin :: Int
denseMin  = 20

-- represents the number of (non-error) elements and the bounds of a table
data BoundsNum = B !Int !Char !Char

-- combine two bounds
addBoundsNum (B n lc hc) (B n' lc' hc')  = B (n + n') (min lc lc') (max hc hc')

-- check whether a character is in the bounds
inBounds c (B _ lc hc) = c >= lc && c <= hc

-- Lexical actions take a lexeme with its position and may return a token; in
-- a variant, an error can be returned
--
-- * if there is no token returned, the current lexeme is discarded and
--   lexing continues looking for a token
type Action t = String -> Maybe t

-- Meta actions transform the lexeme, and a user-defined state; they
-- may return a lexer, which is then used for accepting the next token
-- (this is important to implement non-regular behaviour like nested
-- comments)
type Meta t = String -> S -> Result t

data Result t = R (Maybe t) !S (Maybe (Lexer t))

-- threaded top-down during lexing (current input, meta state)
data LexerState = LS !String !S

type S = I.IntMap Int

-- tree structure used to represent the lexer table
--
-- * each node in the tree corresponds to a state of the lexer; the associated 
--   actions are those that apply when the corresponding state is reached
data Lexer t = Lexer (LexAction t) (Cont t)

-- represent the continuation of a lexer
data Cont t = -- on top of the tree, where entries are dense, we use arrays
                Dense BoundsNum (Array Char (Lexer t))
                
                -- further down, where the valid entries are sparse, we
                -- use association lists, to save memory
              | Sparse BoundsNum (M.Map Char (Lexer t))

              | Done -- end of an automaton

-- lexical action
data LexAction t = Action !(Meta t) | NoAction

-- a regular expression
type Regexp t = Lexer t -> Lexer t

-- Empty lexeme
epsilon :: Regexp t
epsilon = id

-- One character regexp
char :: Char -> Regexp t
char c = \l -> Lexer NoAction (Sparse (B 1 c c) (M.singleton c l))

-- Concatenation of regexps
(+>) :: Regexp t -> Regexp t -> Regexp t
(+>)  = (.)

-- Close a regular expression with an action that converts the lexeme into a
-- token
--
-- * Note: After the application of the action, the position is advanced
--         according to the length of the lexeme.  This implies that normal
--         actions should not be used in the case where a lexeme might contain 
--         control characters that imply non-standard changes of the position, 
--         such as newlines or tabs.
--
action re a  = re `meta` a' where a' lexeme s = R (a lexeme) s Nothing
{-# INLINE action #-}

-- Close a regular expression with a meta action
--
-- * Note: Meta actions have to advance the position in dependence of the
--         lexeme by themselves.
--
meta re a  = re (Lexer (Action a) Done)
{-# INLINE meta #-}

-- disjunctive combination of two regexps
re >|< re'  = \l -> re l >||< re' l

-- disjunctive combination of two lexers
(Lexer a c) >||< (Lexer a' c')  = Lexer (joinActions a a') (joinConts c c')

-- combine two disjunctive continuations
--
joinConts :: Cont t -> Cont t -> Cont t
joinConts Done c'   = c'
joinConts c    Done = c
joinConts c    c'   = let (bn , cls ) = listify c
                          (bn', cls') = listify c'
                      -- note: `addBoundsNum' can, at this point, only
                      --       approx. the number of *non-overlapping* cases;
                      --       however, the bounds are correct 
                      in aggregate (addBoundsNum bn bn') (cls ++ cls')
  where listify (Dense  n arr) = (n, assocs arr)
        listify (Sparse n cls) = (n, M.toList cls)

-- combine two actions. Use the latter in case of overlap (!)
joinActions NoAction a'       = a'
joinActions a        NoAction = a
joinActions _        a'       = a' -- error "Lexers.>||<: Overlapping actions!"

-- Note: `n' is only an upper bound of the number of non-overlapping cases
aggregate bn@(B n lc hc) cls
  | n >= denseMin = Dense  bn (accumArray (>||<) (Lexer NoAction Done) (lc, hc) cls)
  | otherwise     = Sparse bn (M.fromList (accum (>||<) cls))

-- combine the elements in the association list that have the same key
accum _ []           = []
accum f ((c, el):ces) = let (ce, ces') = gather c el ces in ce : accum f ces'
  where gather k e []                             = ((k, e), [])
        gather k e (ke'@(k', e'):kes) | k == k'   = gather k (f e e') kes
                                      | otherwise = let (ke'', kes') = gather k e kes
                                                    in (ke'', ke':kes')

-- x `star` y corresponds to the regular expression x*y
--
-- The definition used below can be obtained by equational reasoning from this
-- one (which is much easier to understand): 
--
--   star re1 re2 = let self = (re1 +> self >|< epsilon) in self +> re2
--
-- However, in the above, `self' is of type `Regexp s t' (ie, a functional),
-- whereas below it is of type `Lexer s t'.  Thus, below we have a graphical
-- body (finite representation of an infinite structure), which doesn't grow
-- with the size of the accepted lexeme - in contrast to the definition using
-- the functional recursion.
star re1 re2  = \l -> let self = re1 self >||< re2 l in self

-- x `plus` y corresponds to the regular expression x+y
plus re1 re2  = re1 +> (re1 `star` re2)

-- x `quest` y corresponds to the regular expression x?y
quest re1 re2  = (re1 +> re2) >|< re2

-- accepts a non-empty set of alternative characters
--  Equiv. to `(foldr1 (>|<) . map char) cs', but much faster
alt cs  = \l -> let bnds = B (length cs) (minimum cs) (maximum cs)
                in Lexer NoAction (aggregate bnds [(c, l) | c <- cs])

-- accept a character sequence
string cs = (foldr1 (+>) . map char) cs

-- apply a lexer, yielding a token sequence and a list of errors
--
-- * Currently, all errors are fatal; thus, the result is undefined in case of 
--   an error (this changes when error correction is added).
-- * The final lexer state is returned.
-- * The order of the error messages is undefined.
-- * the following is moderately tuned
--
run _ st@(LS [] _) = ([], st)
run l st = case lexOne l st of
    (Nothing , _ , st') -> run l st'
    (Just t, l', st')   -> let (ts, final) = run l' st'; ts' = (t:ts)
                           in ts' `seq` (ts',final)
  where
    -- accept a single lexeme
    lexOne l0 st'@(LS cs s) = oneLexeme l0 st' id lexErr

      where
        lexErr = (Just undefined, l, (LS (tail cs) s))

        -- we take an open list of characters down, where we accumulate the
        -- lexeme; this function returns maybe a token, the next lexer to use
        -- (can be altered by a meta action), the new lexer state, and a list
        -- of errors
        --
        -- we implement the "principle of the longest match" by taking a
        -- potential result quadruple down (in the last argument); the
        -- potential result quadruple is updated whenever we pass by an action 
        -- (different from `NoAction'); initially it is an error result
        --
        oneLexeme (Lexer a cont) state@(LS cs s) csDL last =
            let last' = doaction a csDL state last
            in case cs of
                []      -> last'    -- at end, has to be this action
                (c:cs') -> oneChar cont c (LS cs' s) csDL last'   -- keep looking

        -- There are more chars. Look at the next one
        -- Now, if the next tbl is Done, then there is no more
        -- transition, so immediately execute our action
        oneChar tbl c state csDL last = case peek tbl c of
                Nothing              -> last
                Just (Lexer a Done)  -> doaction a (\l -> csDL (c:l)) state last
                Just l'              -> oneLexeme l' state (\l -> csDL (c:l)) last

        -- Do the lookup.
        peek (Dense bn arr)  c | c `inBounds` bn = Just $ arr ! c
        peek (Sparse bn cls) c | c `inBounds` bn = M.lookup c cls
        peek _ _    = Nothing

        -- execute the action if present and finalise the current lexeme
        doaction (Action f) csDL (LS cs s) _ = case f (($[]) csDL) s of
                (R Nothing s' l') | not . null $ cs -> lexOne (fromMaybe l0 l') (LS cs s')
                (R res     s' l')  -> (res, (fromMaybe l0 l'), (LS cs s'))

        doaction NoAction _ _ last = last
</haskell>

== Current entry: lazy regex 4 ==

Uses regex-only transformations, and is thus a legal version of "Lazy regex 3". Also works around an as-yet-unidentified bug in the folded version of countMatches. Look for the old-skool `interact' :) This has been submitted. Space usage could be better, though.

<haskell>
{-# OPTIONS -funbox-strict-fields #-}
--
-- The Computer Language Shootout - http://shootout.alioth.debian.org/
-- Haskell Wiki page for Shootout entries - http://haskell.org/hawiki/ShootoutEntry
--
-- Contributed by Don Stewart
-- Haskell RegEx C bindings are slow (see the other GHC entry), so this shows 
-- that Haskell can compete with a purely functional entry based on lazy 
-- regex combinators as detailed in the RegEx section.

import Prelude hiding   (last)
import List
import Maybe
import Data.Array            (Array, (!), assocs, accumArray)
import qualified Data.Map    as M
import qualified Data.IntMap as I

main = interact $ \s0 ->
    let l0 = length s0 in l0 `seq` 
    let s1       = fst $ run clean (LS s0 I.empty)
        l1       = length s1 in l1 `seq` 
    let (LS _ m) = snd $ run count  (LS s1 I.empty)
        (LS _ n) = snd $ run count' (LS s1 I.empty)
        counts   = map (\(i,p) -> p++" "++show (I.findWithDefault 0 i m)) $ (init variants)
        counts'  = (snd.last) variants++" "++show (I.findWithDefault 0 0 n) -- work around for count'
        l2       = length . concat . fst $ run replace  (LS s1 I.empty)
    in unlines $ (counts++[counts']) ++ [[]] ++ map show [l0, l1, l2]

-- remove "\n" and lines with ">"
clean = (dot                                   `action` \[c] -> Just c)
   >||< (char '\n'                             `action` const Nothing)
   >||< (char '>' +> anyButNL `star` char '\n' `action` const Nothing)

-- count all variants, accumulating count in state threaded through regex matcher
-- note: count' is a workaround for an obscure problem with overlapping patterns(?)
count = (dot `action` const Nothing)
   >||< (foldl1 (>||<) $ map match (zip pats [0..]))
  where
    pats = [(string "agggtaaa"                            >|<  string "tttaccct")
           ,((alt "cgt" +> string "gggtaaa")              >|< (string "tttaccc" +> alt "acg"))
           ,((char 'a'   +> alt "act" +> string "ggtaaa") >|< (string "tttacc" +> alt "agt" +> char 't'))
           ,((string "ag" +> alt "act" +> string "gtaaa") >|< (string "tttac" +> alt "agt" +> string "ct"))
           ,((string "agg" +> alt "act" +> string "taaa") >|< (string "ttta" +> alt "agt" +> string "cct"))
           ,((string "aggg" +> alt "acg" +> string "aaa") >|< (string "ttt" +> alt "cgt" +> string "ccct"))
           ,((string "agggt" +> alt "cgt" +> string "aa") >|< (string "tt" +> alt "acg" +> string "accct"))
           ,((string "agggta" +> alt "cgt" +> char 'a')   >|< (char 't'   +> alt "acg" +> string "taccct"))]
count' = match (((string "agggtaa" +> alt "cgt") >|< (alt "acg" +> string "ttaccct")),0)

-- substitute certain chars for patterns
replace = (alt (['a'..'z'] ++ ['A'..'Z']) `action` \c -> Just c)
     >||< foldl1 (>||<) (map (\(c,p) -> char c `action` const (Just p)) pairs)

-- Utility functions and constants
-- each pattern keeps track of its own count, so we can perform a single pass
match (p,i) = p `meta` \_ m -> R Nothing (I.insertWith (+) i 1 m) Nothing

dot      = alt   ['\32' .. '\122']  -- defines the "." RegEx of any character
anyButNL = alt $ ['\32' .. '\122'] \\ ['\n'] -- same as "." but excludes newline
variants = zip [0..]
  ["agggtaaa|tttaccct","[cgt]gggtaaa|tttaccc[acg]","a[act]ggtaaa|tttacc[agt]t"
  ,"ag[act]gtaaa|tttac[agt]ct","agg[act]taaa|ttta[agt]cct","aggg[acg]aaa|ttt[cgt]ccct"
  ,"agggt[cgt]aa|tt[acg]accct","agggta[cgt]a|t[acg]taccct","agggtaa[cgt]|[acg]ttaccct"]
pairs = [('B',"(c|g|t)"),('D',"(a|g|t)"),  ('H',"(a|c|t)"),('K',"(g|t)")
        ,('M',"(a|c)"),  ('N',"(a|c|g|t)"),('R',"(a|g)"),  ('S',"(c|g)")
        ,('V',"(a|c|g)"),('W',"(a|t)"),    ('Y',"(c|t)") ]

------------------------------------------------------------------------
-- And now the regex library
------------------------------------------------------------------------
--  Compiler Toolkit: Self-optimizing lexers
--
--  Author : Manuel M. T. Chakravarty
--  Created: 24 February 95, 2 March 99
--  Copyright (c) [1995..2000] Manuel M. T. Chakravarty
--  Copyright (c) 2004-6 Don Stewart
--
--  Self-optimizing lexer combinators.
--
--  For detailed information, see ``Lazy Lexing is Fast'', Manuel
--  M. T. Chakravarty, in A. Middeldorp and T. Sato, editors, Proceedings of
--  Fourth Fuji International Symposium on Functional and Logic Programming,
--  Springer-Verlag, LNCS 1722, 1999.  (See my Web page for details.)
--
--             http://www.cse.unsw.edu.au/~chak/papers/Cha99.html
--
--  Thanks to Simon L. Peyton Jones and Roman Leshchinskiy for their
--  helpful suggestions that improved the design of this library.
--

infixr 4 `quest`, `star`, `plus`
infixl 3 +>, `action`, `meta`
infixl 2 >|<, >||<

-- we use the dense representation if a table has at least the given number of 
-- (non-error) elements
denseMin :: Int
denseMin  = 20

-- represents the number of (non-error) elements and the bounds of a table
data BoundsNum = B !Int !Char !Char

-- combine two bounds
addBoundsNum (B n lc hc) (B n' lc' hc')  = B (n + n') (min lc lc') (max hc hc')

-- check whether a character is in the bounds
inBounds c (B _ lc hc) = c >= lc && c <= hc

-- Lexical actions take a lexeme with its position and may return a token; in
-- a variant, an error can be returned
--
-- * if there is no token returned, the current lexeme is discarded and
--   lexing continues looking for a token
type Action t = String -> Maybe t

-- Meta actions transform the lexeme, and a user-defined state; they
-- may return a lexer, which is then used for accepting the next token
-- (this is important to implement non-regular behaviour like nested
-- comments)
type Meta t = String -> S -> Result t

data Result t = R (Maybe t) !S (Maybe (Lexer t))

-- threaded top-down during lexing (current input, meta state)
data LexerState = LS !String !S

type S = I.IntMap Int

-- tree structure used to represent the lexer table
--
-- * each node in the tree corresponds to a state of the lexer; the associated 
--   actions are those that apply when the corresponding state is reached
data Lexer t = Lexer (LexAction t) (Cont t)

-- represent the continuation of a lexer
data Cont t = -- on top of the tree, where entries are dense, we use arrays
                Dense BoundsNum (Array Char (Lexer t))
                
                -- further down, where the valid entries are sparse, we
                -- use association lists, to save memory
              | Sparse BoundsNum (M.Map Char (Lexer t))

              | Done -- end of an automaton

-- lexical action
data LexAction t = Action !(Meta t) | NoAction

-- a regular expression
type Regexp t = Lexer t -> Lexer t

-- Empty lexeme
epsilon :: Regexp t
epsilon = id

-- One character regexp
char :: Char -> Regexp t
char c = \l -> Lexer NoAction (Sparse (B 1 c c) (M.singleton c l))

-- Concatenation of regexps
(+>) :: Regexp t -> Regexp t -> Regexp t
(+>)  = (.)

-- Close a regular expression with an action that converts the lexeme into a
-- token
--
-- * Note: After the application of the action, the position is advanced
--         according to the length of the lexeme.  This implies that normal
--         actions should not be used in the case where a lexeme might contain 
--         control characters that imply non-standard changes of the position, 
--         such as newlines or tabs.
--
action re a  = re `meta` a' where a' lexeme s = R (a lexeme) s Nothing
{-# INLINE action #-}

-- Close a regular expression with a meta action
--
-- * Note: Meta actions have to advance the position in dependence of the
--         lexeme by themselves.
--
meta re a  = re (Lexer (Action a) Done)
{-# INLINE meta #-}

-- disjunctive combination of two regexps
re >|< re'  = \l -> re l >||< re' l

-- disjunctive combination of two lexers
(Lexer a c) >||< (Lexer a' c')  = Lexer (joinActions a a') (joinConts c c')

-- combine two disjunctive continuations
--
joinConts :: Cont t -> Cont t -> Cont t
joinConts Done c'   = c'
joinConts c    Done = c
joinConts c    c'   = let (bn , cls ) = listify c
                          (bn', cls') = listify c'
                      -- note: `addBoundsNum' can, at this point, only
                      --       approx. the number of *non-overlapping* cases;
                      --       however, the bounds are correct 
                      in aggregate (addBoundsNum bn bn') (cls ++ cls')
  where listify (Dense  n arr) = (n, assocs arr)
        listify (Sparse n cls) = (n, M.toList cls)

-- combine two actions. Use the latter in case of overlap (!)
joinActions NoAction a'       = a'
joinActions a        NoAction = a
joinActions _        a'       = a' -- error "Lexers.>||<: Overlapping actions!"

-- Note: `n' is only an upper bound of the number of non-overlapping cases
aggregate bn@(B n lc hc) cls
  | n >= denseMin = Dense  bn (accumArray (>||<) (Lexer NoAction Done) (lc, hc) cls)
  | otherwise     = Sparse bn (M.fromList (accum (>||<) cls))

-- combine the elements in the association list that have the same key
accum _ []           = []
accum f ((c, el):ces) = let (ce, ces') = gather c el ces in ce : accum f ces'
  where gather k e []                             = ((k, e), [])
        gather k e (ke'@(k', e'):kes) | k == k'   = gather k (f e e') kes
                                      | otherwise = let (ke'', kes') = gather k e kes
                                                    in (ke'', ke':kes')

-- x `star` y corresponds to the regular expression x*y
--
-- The definition used below can be obtained by equational reasoning from this
-- one (which is much easier to understand): 
--
--   star re1 re2 = let self = (re1 +> self >|< epsilon) in self +> re2
--
-- However, in the above, `self' is of type `Regexp s t' (ie, a functional),
-- whereas below it is of type `Lexer s t'.  Thus, below we have a graphical
-- body (finite representation of an infinite structure), which doesn't grow
-- with the size of the accepted lexeme - in contrast to the definition using
-- the functional recursion.
star re1 re2  = \l -> let self = re1 self >||< re2 l in self

-- x `plus` y corresponds to the regular expression x+y
plus re1 re2  = re1 +> (re1 `star` re2)

-- x `quest` y corresponds to the regular expression x?y
quest re1 re2  = (re1 +> re2) >|< re2

-- accepts a non-empty set of alternative characters
--  Equiv. to `(foldr1 (>|<) . map char) cs', but much faster
alt cs  = \l -> let bnds = B (length cs) (minimum cs) (maximum cs)
                in Lexer NoAction (aggregate bnds [(c, l) | c <- cs])

-- accept a character sequence
string cs = (foldr1 (+>) . map char) cs

-- apply a lexer, yielding a token sequence and a list of errors
--
-- * Currently, all errors are fatal; thus, the result is undefined in case of 
--   an error (this changes when error correction is added).
-- * The final lexer state is returned.
-- * The order of the error messages is undefined.
-- * the following is moderately tuned
--
run _ st@(LS [] _) = ([], st)
run l st = case lexOne l st of
    (Nothing , _ , st') -> run l st'
    (Just t, l', st')   -> let (ts, final) = run l' st'; ts' = (t:ts)
                           in ts' `seq` (ts',final)
  where
    -- accept a single lexeme
    lexOne l0 st'@(LS cs s) = oneLexeme l0 st' id lexErr

      where
        lexErr = (Just undefined, l, (LS (tail cs) s))

        -- we take an open list of characters down, where we accumulate the
        -- lexeme; this function returns maybe a token, the next lexer to use
        -- (can be altered by a meta action), the new lexer state, and a list
        -- of errors
        --
        -- we implement the "principle of the longest match" by taking a
        -- potential result quadruple down (in the last argument); the
        -- potential result quadruple is updated whenever we pass by an action 
        -- (different from `NoAction'); initially it is an error result
        --
        oneLexeme (Lexer a cont) state@(LS cs s) csDL last =
            let last' = doaction a csDL state last
            in case cs of
                []      -> last'    -- at end, has to be this action
                (c:cs') -> oneChar cont c (LS cs' s) csDL last'   -- keep looking

        -- There are more chars. Look at the next one
        -- Now, if the next tbl is Done, then there is no more
        -- transition, so immediately execute our action
        oneChar tbl c state csDL last = case peek tbl c of
                Nothing              -> last
                Just (Lexer a Done)  -> doaction a (\l -> csDL (c:l)) state last
                Just l'              -> oneLexeme l' state (\l -> csDL (c:l)) last

        -- Do the lookup.
        peek (Dense bn arr)  c | c `inBounds` bn = Just $ arr ! c
        peek (Sparse bn cls) c | c `inBounds` bn = M.lookup c cls
        peek _ _    = Nothing

        -- execute the action if present and finalise the current lexeme
        doaction (Action f) csDL (LS cs s) _ = case f (($[]) csDL) s of
                (R Nothing s' l') | not . null $ cs -> lexOne (fromMaybe l0 l') (LS cs s')
                (R res     s' l')  -> (res, (fromMaybe l0 l'), (LS cs s'))

        doaction NoAction _ _ last = last
</haskell>

== Lazy regex 3 ==

We build up a clever countMatches pass, that walks the input only once. Each variant keeps track of its own count. Lovely, functional, expressive, and, as Chris says, "smoking!". Not yet a final entry though.

<haskell>
{-# OPTIONS -funbox-strict-fields #-}
--
-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
--
-- http://haskell.org/hawiki/ShootoutEntry
--
-- Contributed by Don Stewart
-- A regex-dna entry based on lazy regex combinators, described in the paper:
--
--  ``Lazy Lexing is Fast'', Manuel  M. T. Chakravarty, in A.
--  Middeldorp and T. Sato, editors, Proceedings of Fourth Fuji
--  International Symposium on Functional and Logic Programming,
--  Springer-Verlag, LNCS 1722, 1999.
--

import Prelude hiding   (last)
import List
import Maybe
import Data.Array            (Array, (!), assocs, accumArray)
import qualified Data.IntMap as I

main = getContents >>= putStr . output

output s0 = unlines $ counts ++ [[]] ++ map show [length s0, length s1, length s2]
  where s1 = clean False s0
        s2 = replaceIubs s1
        (_,(LS _ m)) = execLexer patterns (LS s1 I.empty) 
        counts = map (\(i,p) -> p++" "++show (I.findWithDefault 0 i m)) $ variants

-- match variants
patterns = foldl1 (>||<) $ map (uncurry match) (zip pats [0..])
  where
    pats = [(string "agggtaaa"                      >|< string "tttaccct")
           ,((cgt +> string "gggtaaa")              >|< (string "tttaccc" +> acg))
           ,((a +> act +> string "ggtaaa")          >|< (string "tttacc" +> agt +> t))
           ,((string "ag" +> act +> string "gtaaa") >|< (string "tttac" +> agt +> string "ct"))
           ,((string "agg" +> act +> string "taaa") >|< (string "ttta" +> agt +> string "cct"))
           ,((string "aggg" +> acg +> string "aaa") >|< (string "ttt" +> cgt +> string "ccct"))
           ,((string "agggt" +> cgt +> string "aa") >|< (string "tt" +> acg +> string "accct"))
           ,((string "agggta" +> cgt +> a)          >|< (t +> acg +> string "taccct"))
           ,((string "agggtaa" +> cgt)              >|< (acg +> string "ttaccct"))]

    cgt = alt "cgt"; act = alt "act"; acg = alt "acg"; agt = alt "agt"
    a = char 'a'; t = char 't'

    -- each pattern keeps track of its own count, so we can perform a single pass
    match p i = p `meta` \_ m -> R Nothing (I.insertWith (+) i 1 m) Nothing

variants = zip [0..]
  ["agggtaaa|tttaccct","[cgt]gggtaaa|tttaccc[acg]","a[act]ggtaaa|tttacc[agt]t"
  ,"ag[act]gtaaa|tttac[agt]ct","agg[act]taaa|ttta[agt]cct","aggg[acg]aaa|ttt[cgt]ccct"
  ,"agggt[cgt]aa|tt[acg]accct","agggta[cgt]a|t[acg]taccct","agggtaa[cgt]|[acg]ttaccct"]

------------------------------------------------------------------------

-- remove "\n" and lines with ">"
-- these should be a regex match. I don't think 'haskell pattern match' is legal :)
clean n []            = []
clean _     ('>':cs)  =    clean True  cs
clean _     ('\n':cs) =    clean False cs
clean True  (c:cs)    =    clean True  cs
clean False (c:cs)    = c:(clean False cs)

-- replace iubs, should also use regex matching
replaceIubs s = concatMap pairs s
pairs 'B' = "(c|g|t)"
pairs 'D' = "(a|g|t)"
pairs 'H' = "(a|c|t)"
pairs 'K' = "(g|t)"
pairs 'M' = "(a|c)"
pairs 'N' = "(a|c|g|t)"
pairs 'R' = "(a|g)"
pairs 'S' = "(c|g)"
pairs 'V' = "(a|c|g)"
pairs 'W' = "(a|t)"
pairs 'Y' = "(c|t)"
pairs a   = [a]

------------------------------------------------------------------------
-- And now the regex library

--  Compiler Toolkit: Self-optimizing lexers
--
--  Author : Manuel M. T. Chakravarty
--  Created: 24 February 95, 2 March 99
--  Copyright (c) [1995..2000] Manuel M. T. Chakravarty
--  Copyright (c) 2004-5 Don Stewart
--
--  Self-optimizing lexer combinators.
--
--  For detailed information, see ``Lazy Lexing is Fast'', Manuel
--  M. T. Chakravarty, in A. Middeldorp and T. Sato, editors, Proceedings of
--  Fourth Fuji International Symposium on Functional and Logic Programming,
--  Springer-Verlag, LNCS 1722, 1999.  (See my Web page for details.)
--
--             http://www.cse.unsw.edu.au/~chak/papers/Cha99.html
--
--  Thanks to Simon L. Peyton Jones <simonpj@microsoft.com> and Roman
--  Lechtchinsky <wolfro@cs.tu-berlin.de> for their helpful suggestions that
--  improved the design of this library.
--

infixr 4 `quest`, `star`, `plus`
infixl 3 +>, `action`, `meta`
infixl 2 >|<, >||<

-- we use the dense representation if a table has at least the given number of 
-- (non-error) elements
--
denseMin :: Int
denseMin  = 20

-- represents the number of (non-error) elements and the bounds of a table
--
data BoundsNum = B !Int !Char !Char

-- combine two bounds
--
addBoundsNum (B n lc hc) (B n' lc' hc')  = B (n + n') (min lc lc') (max hc hc')

-- check whether a character is in the bounds
inBounds c (B _ lc hc) = c >= lc && c <= hc

-- Lexical actions take a lexeme with its position and may return a token; in
-- a variant, an error can be returned (EXPORTED)
--
-- * if there is no token returned, the current lexeme is discarded and
--   lexing continues looking for a token
--
type Action    t = String -> Maybe t

-- Meta actions transform the lexeme, and a user-defined state; they
-- may return a lexer, which is then used for accepting the next token
-- (this is important to implement non-regular behaviour like nested
-- comments) (EXPORTED) 
--
type Meta t = String -> S -> Result t

data Result t = R (Maybe t) !S (Maybe (Lexer t))

-- threaded top-down during lexing (current input, meta state)
data LexerState = LS !String !S

type S = I.IntMap Int

-- tree structure used to represent the lexer table (EXPORTED ABSTRACTLY) 
--
-- * each node in the tree corresponds to a state of the lexer; the associated 
--   actions are those that apply when the corresponding state is reached
--
data Lexer t = Lexer (LexAction t) (Cont t)

-- represent the continuation of a lexer
--
data Cont t = -- on top of the tree, where entries are dense, we use arrays
                Dense BoundsNum (Array Char (Lexer t))
                
                -- further down, where the valid entries are sparse, we
                -- use association lists, to save memory
              | Sparse BoundsNum [(Char, Lexer t)]

                -- end of an automaton
              | Done

-- lexical action
data LexAction t = Action !(Meta t) | NoAction

-- a regular expression
type Regexp t = Lexer t -> Lexer t

-- Empty lexeme
epsilon :: Regexp t
epsilon = id

-- One character regexp
char :: Char -> Regexp t
char c = \l -> Lexer NoAction (Sparse (B 1 c c) [(c, l)])

-- Concatenation of regexps
(+>) :: Regexp t -> Regexp t -> Regexp t
(+>)  = (.)

-- Close a regular expression with an action that converts the lexeme into a
-- token
--
-- * Note: After the application of the action, the position is advanced
--         according to the length of the lexeme.  This implies that normal
--         actions should not be used in the case where a lexeme might contain 
--         control characters that imply non-standard changes of the position, 
--         such as newlines or tabs.
--
action re a  = re `meta` a'
  where a' lexeme s = R (a lexeme) s Nothing
{-# INLINE action #-}

-- Close a regular expression with a meta action (EXPORTED)
--
-- * Note: Meta actions have to advance the position in dependence of the
--         lexeme by themselves.
--
meta re a  = re (Lexer (Action a) Done)
{-# INLINE meta #-}

-- disjunctive combination of two regexps
re >|< re'  = \l -> re l >||< re' l

-- disjunctive combination of two lexers
(Lexer a c) >||< (Lexer a' c')  = Lexer (joinActions a a') (joinConts c c')

-- combine two disjunctive continuations
--
joinConts :: Cont t -> Cont t -> Cont t
joinConts Done c'   = c'
joinConts c    Done = c
joinConts c    c'   = let (bn , cls ) = listify c
                          (bn', cls') = listify c'
                      -- note: `addBoundsNum' can, at this point, only
                      --       approx. the number of *non-overlapping* cases;
                      --       however, the bounds are correct 
                      in aggregate (addBoundsNum bn bn') (cls ++ cls')
  where listify (Dense  n arr) = (n, assocs arr)
        listify (Sparse n cls) = (n, cls)

-- combine two actions. Use the latter in case of overlap (!)
joinActions NoAction a'       = a'
joinActions a        NoAction = a
joinActions _        a'       = a' -- error "Lexers.>||<: Overlapping actions!"

-- Note: `n' is only an upper bound of the number of non-overlapping cases
aggregate bn@(B n lc hc) cls
  | n >= denseMin = Dense  bn (accumArray (>||<) (Lexer NoAction Done) (lc, hc) cls)
  | otherwise     = Sparse bn (accum (>||<) cls)

-- combine the elements in the association list that have the same key
accum _ []           = []
accum f ((c, el):ces) = let (ce, ces') = gather c el ces in ce : accum f ces'
  where gather k e []                             = ((k, e), [])
        gather k e (ke'@(k', e'):kes) | k == k'   = gather k (f e e') kes
                                      | otherwise = let (ke'', kes') = gather k e kes
                                                    in (ke'', ke':kes')

-- x `star` y corresponds to the regular expression x*y (EXPORTED)
--
-- The definition used below can be obtained by equational reasoning from this
-- one (which is much easier to understand): 
--
--   star re1 re2 = let self = (re1 +> self >|< epsilon) in self +> re2
--
-- However, in the above, `self' is of type `Regexp s t' (ie, a functional),
-- whereas below it is of type `Lexer s t'.  Thus, below we have a graphical
-- body (finite representation of an infinite structure), which doesn't grow
-- with the size of the accepted lexeme - in contrast to the definition using
-- the functional recursion.

star re1 re2  = \l -> let self = re1 self >||< re2 l in self

-- x `plus` y corresponds to the regular expression x+y
plus re1 re2  = re1 +> (re1 `star` re2)

-- x `quest` y corresponds to the regular expression x?y
quest re1 re2  = (re1 +> re2) >|< re2

-- accepts a non-empty set of alternative characters
--  Equiv. to `(foldr1 (>|<) . map char) cs', but much faster
alt cs  = \l -> let bnds = B (length cs) (minimum cs) (maximum cs)
                in Lexer NoAction (aggregate bnds [(c, l) | c <- cs])

-- accept a character sequence
string cs = (foldr1 (+>) . map char) cs

-- apply a lexer, yielding a token sequence and a list of errors
--
-- * Currently, all errors are fatal; thus, the result is undefined in case of 
--   an error (this changes when error correction is added).
-- * The final lexer state is returned.
-- * The order of the error messages is undefined.
-- * the following is moderately tuned
--
execLexer _ st@(LS [] _) = ([], st)
execLexer l st = case lexOne l st of
    (Nothing , _ , st') -> execLexer l st'
    (Just t, l', st')   -> let (ts, final) = execLexer l' st' in (t:ts, final)
  where
    -- accept a single lexeme
    lexOne l0 st'@(LS cs s) = oneLexeme l0 st' id lexErr

      where
        lexErr = (Just undefined, l, (LS (tail cs) s))

        -- we take an open list of characters down, where we accumulate the
        -- lexeme; this function returns maybe a token, the next lexer to use
        -- (can be altered by a meta action), the new lexer state, and a list
        -- of errors
        --
        -- we implement the "principle of the longest match" by taking a
        -- potential result quadruple down (in the last argument); the
        -- potential result quadruple is updated whenever we pass by an action 
        -- (different from `NoAction'); initially it is an error result
        --
        oneLexeme (Lexer a cont) state@(LS cs s) csDL last =
            let last' = doaction a csDL state last
            in case cs of
                []      -> last'    -- at end, has to be this action
                (c:cs') -> oneChar cont c (LS cs' s) csDL last'   -- keep looking

        -- There are more chars. Look at the next one
        -- Now, if the next tbl is Done, then there is no more
        -- transition, so immediately execute our action
        oneChar tbl c state csDL last = case peek tbl c of
                Nothing              -> last
                Just (Lexer a Done)  -> doaction a (\l -> csDL (c:l)) state last
                Just l'              -> oneLexeme l' state (\l -> csDL (c:l)) last

        -- Do the lookup.
        peek (Dense bn arr)  c | c `inBounds` bn = Just $ arr ! c
        peek (Sparse bn cls) c | c `inBounds` bn = lookup c cls
        peek _ _    = Nothing

        -- execute the action if present and finalise the current lexeme
        doaction (Action f) csDL (LS cs s) _ = case f (($[]) csDL) s of
                (R Nothing s' l') | not . null $ cs -> lexOne (fromMaybe l0 l') (LS cs s')
                (R res     s' l')  -> (res, (fromMaybe l0 l'), (LS cs s'))

        doaction NoAction _ _ last = last
</haskell>

== Front end ==

This may plug into the above to allow it to parse normal regular expressions.


<haskell>
-- By Chris Kuklewicz, using Pattern idea from Alson
-- Add this code block to parse normal string regexp to (Regexp Char)
-- This code has not been space optimized, and need not be speed optimized

import qualified Text.ParserCombinators.Parsec as P
import Text.ParserCombinators.Parsec ((<|>),GenParser)
import Control.Monad (liftM,liftM2)

-- The exported function from this code block

convert :: String -> Regexp Char
convert = reflect . parseRegex

-- The string regexp format, exported from this code block

variants = 
  ["agggtaaa|tttaccct","[cgt]gggtaaa|tttaccc[acg]","a[act]ggtaaa|tttacc[agt]t"
  ,"ag[act]gtaaa|tttac[agt]ct","agg[act]taaa|ttta[agt]cct","aggg[acg]aaa|ttt[cgt]ccct"
  ,"agggt[cgt]aa|tt[acg]accct","agggta[cgt]a|t[acg]taccct","agggtaa[cgt]|[acg]ttaccct"]

-- Make Regexp Char from Pattern

reflect :: Pattern -> Regexp Char
reflect PEnd         = epsilon
reflect (POr ps)     = foldl1 (>|<) (map reflect ps)
reflect (PConcat ps) = stitch ps
reflect (PQuest p)   = reflect p `quest` epsilon
reflect (PPlus p)    = reflect p `plus` epsilon
reflect (PStar p)    = reflect p `star` epsilon
reflect (PAny set)   = alt set
reflect (PAny' _)    = error "There is no [^abc] support"
reflect (PChar c)    = char c
reflect (PDot)       = error "There is no . support in Lexer"
reflect (PString s)  = string s

-- Help reflect (PConcat ps)
stitch :: [Pattern] -> Regexp Char
stitch []              = epsilon
stitch (x:[])          = reflect x
stitch ((PQuest p):xs) = reflect p `quest` stitch xs
stitch ((PPlus p) :xs) = reflect p `plus`  stitch xs
stitch ((PStar p) :xs) = reflect p `star`  stitch xs
stitch (x:xs) = (reflect x) +> stitch xs

-- Make p_regex to parse strings to Pattern

parseRegex = simplify.(either (error.show) id).(\x -> P.parse p_regex x x)

-- tr was for debugging
--tr = (\result -> do unsafePerformIO (print result >> return (return result)))
tr = return

p_regex = liftM POr (P.sepBy1 p_branch (P.char '|')) >>= tr

p_branch = liftM PConcat (P.many1 p_piece) >>= tr

p_piece = p_atom >>= p_post_atom >>= tr
             
p_atom = P.try (P.string "()" >> return PEnd)      -- must not consume '(' yet
         <|> P.between (P.char '(') (P.char ')') p_regex
         <|> p_bracket
         <|> p_dot
--         <|> p_caret
--         <|> p_dollar
         <|> p_escaped_char
         <|> p_other_char
--         <|> unsafePerformIO (print "no atom" >> return pzero)

-- p_post_atom takes the previous atom as a parameter
p_post_atom PEnd = return PEnd
p_post_atom atom = (P.char '?' >> return (PQuest atom))
                   <|> (P.char '+' >> return (PPlus atom))
                   <|> (P.char '*' >> return (PStar atom))
                   -- no {n,m} for p_bound yet
                   <|> (return atom)

p_bracket = (P.char '[') >> ( (P.char '^' >> p_set True) <|> (p_set False) ) >>= tr

-- p_set does not support [.ch.] or [=y=] or [:foo:]
p_set :: Bool -> GenParser Char st Pattern
p_set invert = do initial <- (P.option "" ((P.char ']' >> return "]") <|> (P.char '-' >> return "-")))
                  middle <- P.manyTill P.anyChar (P.char ']')
                  let expand [] = []
                      expand ('-':[]) = "-"
                      expand (a:'-':b:rest) | a /= '-' = (enumFromTo a b)++(expand rest)
                      expand (x:xs) | x /= '-'  = x:(expand xs)
                                    | otherwise = error "A dash is in the wrong place in a p_set"
                      characters = nub ( sort (initial ++ (expand middle)))
                      result = if invert then PAny' characters 
                                         else PAny characters
                  return result

p_dot = P.char '.' >> return PDot

p_escaped_char = P.char '\\' >> liftM PChar P.anyChar

p_other_char = liftM PChar (P.noneOf specials)
  where specials  = "^.[$()|*+?\\"  -- full set is "^.[$()|*+?{\\" but no '{' yet for bounds

notPEnd = filter (/=PEnd)

simplify x =
  case x of
    POr ps -> case notPEnd (map simplify ps) of
                [] -> PEnd
                [p] -> p
                ps' -> POr ps'
    PConcat ps -> case merge (notPEnd (map simplify ps)) of
                    [] -> PEnd
                    [p] -> p
                    ps' -> PConcat ps'
    PQuest p -> case (simplify p) of PEnd -> PEnd; p' -> PQuest p'
    PPlus p -> case (simplify p) of PEnd -> PEnd; p' -> PPlus p'
    PStar p -> case (simplify p) of PEnd -> PEnd; p' -> PStar p'
    x -> x

-- merge helps simplify PConcat
merge ((PChar c1):(PChar c2):xs) = merge $ (PString [c1,c2]):xs
merge ((PString s):(PChar c):xs) = merge $ (PString (s++[c])):xs
merge (x:xs) = x:merge xs
merge [] = []
</haskell>
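As a hedged sketch of how this front end might plug into the entry above (untested; assumes the front-end block and the lexer library compile together, and note that the nine variants use neither '.' nor negated sets, so {{{reflect}}} never hits its error cases):

<haskell>
-- Hypothetical glue: the single-pass counting lexer of "Lazy regex 3",
-- built from the string patterns via convert instead of by hand.
patterns' :: Lexer Char
patterns' = foldl1 (>||<) [ convert p `meta` bump i | (i,p) <- zip [0..] variants ]
  where bump i _ m = R Nothing (I.insertWith (+) i 1 m) Nothing
</haskell>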


== Use parsec as front end ==

In order to avoid the Text.Regex library, the entry can be coded up as follows -- using Parsec!

I only had time to add some regex-parsing code, and escaping in show (PAny). Postfix operators '+' and '*' can easily be added to the ops table. It can now create "patterns" from "variants" -- ChrisKuklewicz

This is definitely the way to do it; however, this isn't yet a legal entry, in my opinion -- it is close, though. The spec requires that the cleaning phase uses regular expressions too, so all string transformations must be regex based. However, no one says we have to use Text.Regex. The breakthrough here is to use Parsec-style regular expressions. A beautiful advertisement for Haskell, I think. Also, make sure lines wrap roughly at 80 chars if possible. Another clever idea would be to write a Read instance for Patterns, so that you simply {{{read "a.*b"}}} and get a Pattern back (int-e on #haskell's idea). Perhaps we could write a pure parsec version, and dispense with the translation of strings to parsec combinators altogether? Not sure if non-haskellers would understand, though. Depends on whether the result looks like regexes. -- dons

To dons' question "Is this legal?", I think that it's in the grey area. Per my notes on the ShootoutEntry page, I vote that we submit Functional 2 as the idiomatic GHC entry, and this entry (in its finished form) as the fast GHC #2 entry, noting that we didn't use the RegEx library because it was slow and it was just about as easy to code around it. Not all of the other entries are purely RegEx based (for the IUBs), so I don't feel too bad. Anyway, I think this represents one of the neat aspects of Haskell: more so than in C, C++, etc., it seems that there are always multiple, very different ways of doing things in Haskell. -- AlsonKemp

Response: OK, then is this the shortest, fastest entry? Or can we profile and improve it? -- dons
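A sketch of the Read-instance idea mentioned above, in terms of {{{parseRegex}}} and the Pattern type from the front-end block (hypothetical, untested):

<haskell>
-- Hypothetical Read instance, so that read "a.*b" :: Pattern works
-- (assumes parseRegex from the front-end block is in scope):
instance Read Pattern where
  readsPrec _ s = [(parseRegex s, "")]
</haskell>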

<haskell>
--
-- The Great Computer Language Shootout   - http://shootout.alioth.debian.org/
-- Haskell Wiki page for Shootout entries - http://haskell.org/hawiki/ShootoutEntry
--
-- Contributed by Alson Kemp
-- Faster program that doesn't use the RegEx library because of the
-- slowness of Haskell's bindings to the library 
--

import Text.Regex
import System.IO
import Data.List
-- Needed for Parser code
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Expr
import Control.Monad
import Data.Either

variants = 
  ["agggtaaa|tttaccct","[cgt]gggtaaa|tttaccc[acg]","a[act]ggtaaa|tttacc[agt]t"
  ,"ag[act]gtaaa|tttac[agt]ct","agg[act]taaa|ttta[agt]cct","aggg[acg]aaa|ttt[cgt]ccct"
  ,"agggt[cgt]aa|tt[acg]accct","agggta[cgt]a|t[acg]taccct","agggtaa[cgt]|[acg]ttaccct"]

pairsList = 
  [("B","(c|g|t)"),("D","(a|g|t)"),("H","(a|c|t)"),("K","(g|t)"),("M","(a|c)")
  ,("N","(a|c|g|t)"),("R","(a|g)"),("S","(c|g)"),("V","(a|c|g)"),("W","(a|t)")
  ,("Y","(c|t)") ]

pep = buildExpressionParser ops p_factor

ops = [ [ Infix (char '|' >> return POr) AssocRight ] ]  
     -- [ Postfix (char '+' >> return PPlus), Postfix (char '*' >> return PStar) ] -- put before '|' for precedence

special = "[()]|*+\\"  -- no parenthesis grouping yet

s_normal esc = many1 (c_escaped <|> noneOf esc)  -- String
c_escaped = char '\\' >> anyChar                 -- Char

p_factor = p_any <|> p_string <|> (return PEnd)
p_string = liftM2 PString (s_normal special) p_factor
p_any = do s <- between (char '[') (char ']') (s_normal "]")
           liftM (PAny (nub (sort s))) p_factor

-- "parsed_variants" is identical to "patterns"
parsed_variants = map ((either (error.show) id).(\x -> runParser pep () x x)) variants

test_parsing = (parsed_variants == patterns) -- is True

main = getContents >>= putStr . output

output s = unlines $ countMatches ++ [[]] ++ map show [length s, length sclean, length sreplace]
  where sclean           = cleangene False s
        countMatches     = map (\p -> (show p) ++ ' ': show (numHits 0 p sclean)) patterns
        sreplace         = replaceIubs sclean

-- remove "\n" and lines with ">"
cleangene n []            = []   
cleangene _     ('>':cs)  =    cleangene True  cs
cleangene _     ('\n':cs) =    cleangene False cs
cleangene True  (c:cs)    =    cleangene True  cs
cleangene False (c:cs)    = c:(cleangene False cs)

-- count hits
numHits n _ []     = n
numHits n p (s:ss) = case pmatch (s:ss) p of
                       False -> numHits n     p ss
                       True  -> numHits (n+1) p ss

data Pattern = POr     Pattern Pattern
             | PString String Pattern
             | PAny    String Pattern
             | PEnd
             deriving Eq  -- Eq is needed by test_parsing; Show is the custom escaping instance mentioned above
</haskell>

= Lexer 6: Bug fixed and optimized version =

This fixes the bug that had required the annoying {{{count'}}} work-around: it uses {{{meta1}}} to indicate that the input should be advanced by only a single character after the lexeme-matching action. -- ChrisKuklewicz
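
To see why advancing by a single character matters, here is a throwaway illustration ({{{countOverlapping}}} is a hypothetical helper, not part of the entry): examining every starting offset, as {{{meta1}}} arranges, counts overlapping occurrences that whole-lexeme advancement would skip.

<haskell>
import Data.List (isPrefixOf, tails)

-- countOverlapping "aa" "aaa" == 2: every starting offset is examined,
-- so overlapping occurrences are all counted; skipping past a whole
-- matched lexeme would find only one here.
countOverlapping :: String -> String -> Int
countOverlapping pat s = length [ t | t <- tails s, pat `isPrefixOf` t ]
</haskell>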

Various "optimizations": the {{{LS}}} data type has been removed, and the use of {{{Maybe}}} has been reduced, in {{{peek}}} and in the {{{Result}}} type.

Note that some of the tricks we play here may be illegal under the new, tightened spec. -- Don


<haskell>
{-# OPTIONS -funbox-strict-fields #-}
--
-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
--
-- http://haskell.org/hawiki/ShootoutEntry
-- Contributed by Don Stewart, Chris Kuklewicz and Alson Kemp.
--
-- Compile with: -O2 -package parsec
--
-- An idiomatic Haskell entry using lazy regex combinators, described in the paper:
--
--  Manuel  M. T. Chakravarty, Lazy Lexing is Fast.  
--  In A. Middeldorp and T. Sato, editors, Proceedings of Fourth Fuji
--  International Symposium on Functional and Logic Programming,
--  Springer-Verlag, LNCS 1722, 1999.
--
-- For more about higher-order combinator-based lexing and parsing in Haskell consult:
--
--  Graham Hutton. Higher-order functions for parsing. (1992)
--  Journal of Functional Programming 2(3): 323-343.
--
--  Jeroen Fokker. Functional Parsers. (1995) 
--  Lecture Notes of the Baastad Spring school on Functional Programming.
--
--  Graham Hutton and Erik Meijer. Monadic Parser Combinators. (1996)
--  Technical report NOTTCS-TR-96-4. Department of Computer Science, University of Nottingham.
--
--  Steve Hill. Combinators for parsing expressions. (1996)
--  Journal of Functional Programming 6(3): 445-463. 
--
--  Andrew Partridge and David Wright. 
--  Predictive parser combinators need four values to report errors. (1996)
--  Journal of Functional Programming 6(2): 355-364.
--
--  Doaitse Swierstra and Luc Duponcheel. 
--  Deterministic, Error-Correcting Combinator Parsers. (1996)
--  Advanced Functional Programming. LNCS 1129: 185-207.
--
--  Pieter Koopman and Rinus Plasmeijer. Efficient Combinator Parsers. (1999)
--  Implementation of Functional Languages. Springer Verlag, LNCS 1595: 122-138.
--
--  Doaitse Swierstra and Pablo Azero. 
--  Fast, Error Correcting Parser Combinators: A Short Tutorial. (1999)
--  SOFSEM'99 Theory and Practice of Informatics. LNCS 1725: 111-129.
--
--  Arthur Baars, Andres Loh, and Doaitse Swierstra. Parsing Permutation Phrases. (2001) 
--  Proceedings of the ACM SIGPLAN Haskell Workshop, 171-183.
--
-- And many other sources.
--

import Data.List (sort,nub,(\\))
import Data.Array (Array, (!), assocs, accumArray)
import qualified Data.Map    as M
import qualified Data.IntMap as I
import Control.Monad (liftM)
import qualified Text.ParserCombinators.Parsec as P
import Text.ParserCombinators.Parsec ((<|>))

main = interact $ \stdin ->
    let l0     = length stdin in l0 `seq` 
    let s1     = fst $ run clean stdin I.empty
        l1     = length s1 in l1 `seq` 
    let iMap   = snd.snd $ run count s1 I.empty
        counts = map (\(i,p) -> p++" "++show (I.findWithDefault 0 i iMap)) variants
        l2     = length . concat . fst $ run replace s1 I.empty
    in unlines $ counts ++ ([] : map show [l0, l1, l2])

-- remove "\n" and lines with ">"
clean = (regex "."            `action` \[c] -> Just c)
   >||< (regex "\n|>[^\n]+\n" `action` const Nothing)    -- Right biased
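-- right bias matters when both alternatives accept the same lexeme:
-- a lone '\n' matches both "." and "\n", and the right-hand
-- (discarding) action wins, so newlines never reach the output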

-- count all variants, accumulating count in state threaded through regex matcher
count  = foldl1 (>||<) [ match (regex s, n) | (n,s) <- variants ]

-- each pattern keeps track of its own count, so we can perform a single pass
match (p,i) = p `meta1` \_ lexer m -> R Nothing (I.insertWith (+) i 1 m) lexer

variants = zip [0..]
  ["agggtaaa|tttaccct","[cgt]gggtaaa|tttaccc[acg]","a[act]ggtaaa|tttacc[agt]t"
  ,"ag[act]gtaaa|tttac[agt]ct","agg[act]taaa|ttta[agt]cct","aggg[acg]aaa|ttt[cgt]ccct"
  ,"agggt[cgt]aa|tt[acg]accct","agggta[cgt]a|t[acg]taccct","agggtaa[cgt]|[acg]ttaccct"]

-- substitute certain chars for patterns
replace = (regex "." `action` \s -> Just s)
     >||< foldl1 (>||<) (map (\(c,p) -> regex [c] `action` const (Just p)) pairs) -- Right biased

pairs = [('B',"(c|g|t)"),('D',"(a|g|t)"),  ('H',"(a|c|t)"),('K',"(g|t)")
        ,('M',"(a|c)"),  ('N',"(a|c|g|t)"),('R',"(a|g)"),  ('S',"(c|g)")
        ,('V',"(a|c|g)"),('W',"(a|t)"),    ('Y',"(c|t)") ]

----------------------------------------------------------------
--
-- Construct a regex combinator from a string regex (use Parsec)
--
regex = (either (error.show) id) . (\x -> P.parse p_regex x x)

p_regex = liftM (foldr1 (>|<)) (P.sepBy1 p_branch (P.char '|'))

p_branch = liftM (($ epsilon).(foldr (.) id)) (P.many1 (p_atom >>= p_post_atom))

p_atom = P.try (P.string "()" >> return epsilon)      -- must not consume '(' yet
     <|> P.between (P.char '(') (P.char ')') p_regex
     <|> p_bracket <|> p_dot <|> p_escaped_char <|> p_other_char

p_post_atom atom = (P.char '?' >> return (atom `quest`))
               <|> (P.char '+' >> return (atom `plus`))
               <|> (P.char '*' >> return (atom `star`)) 
               <|> (return (atom +>))

p_bracket = (P.char '[') >> ( (P.char '^' >> p_set True) <|> (p_set False) )

p_set invert = do initial <- (P.option "" ((P.char ']' >> return "]") <|> (P.char '-' >> return "-")))
                  middle <- P.manyTill P.anyChar (P.char ']')
                  let expand [] = []
                      expand ('-':[]) = "-"
                      expand (a:'-':b:rest) | a /= '-' = (enumFromTo a b)++(expand rest)
                      expand (x:xs) | x /= '-'  = x:(expand xs)
                                    | otherwise = error "A dash is in the wrong place in a p_set"
                      characters = nub ( sort (initial ++ (expand middle)) )
                  return $ if invert then alt ( ['\0'..'\127'] \\ characters )
                                     else alt characters
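-- e.g. expand "b-dx" == "bcdx" inside p_set, so "[b-dx]" accepts any
-- of 'b', 'c', 'd' or 'x', and "[^b-dx]" the rest of 7-bit ASCII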

p_dot = P.char '.' >> return (alt ['\0'..'\127'])

p_escaped_char = P.char '\\' >> liftM char P.anyChar

p_other_char = liftM char (P.noneOf specials) where specials = "^.[$()|*+?\\"

------------------------------------------------------------------------
-- And everything else is the CTK library.

--  Compiler Toolkit: Self-optimizing lexers
--
--  Author : Manuel M. T. Chakravarty
--  Created: 24 February 95, 2 March 99
--  Copyright (c) [1995..2000] Manuel M. T. Chakravarty
--  Copyright (c) 2004-6 Don Stewart
--
--  Self-optimizing lexer combinators.
--
--  For detailed information, see ``Lazy Lexing is Fast'', Manuel
--  M. T. Chakravarty, in A. Middeldorp and T. Sato, editors, Proceedings of
--  Fourth Fuji International Symposium on Functional and Logic Programming,
--  Springer-Verlag, LNCS 1722, 1999.  (See my Web page for details.)
--
--             http://www.cse.unsw.edu.au/~chak/papers/Cha99.html
--
--  Thanks to Simon L. Peyton Jones and Roman Leshchinskiy for their
--  helpful suggestions that improved the design of this library.
--

-- EXTERNAL INTERFACE

infixr 4 `quest`, `star`, `plus`
infixl 3 +>, `action`, `meta`
infixl 2 >|<, >||<

-- Empty lexeme
epsilon :: Regexp t
epsilon = id

-- One character regexp
char :: Char -> Regexp t
char c = \l -> Lexer NoAction (Sparse (B 1 c c) (M.singleton c l))

-- Concatenation of regexps
(+>) :: Regexp t -> Regexp t -> Regexp t
(+>)  = (.)
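-- e.g. string "ab" = char 'a' +> char 'b' = char 'a' . char 'b';
-- since a Regexp is a Lexer-to-Lexer functional, concatenating two
-- regexps is just composing their transition builders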

-- x `star` y corresponds to the regular expression x*y
--
-- The definition used below can be obtained by equational reasoning from this
-- one (which is much easier to understand): 
--
--   star re1 re2 = let self = (re1 +> self >|< epsilon) in self +> re2
--
-- However, in the above, `self' is of type `Regexp s t' (ie, a functional),
-- whereas below it is of type `Lexer s t'.  Thus, below we have a graphical
-- body (finite representation of an infinite structure), which doesn't grow
-- with the size of the accepted lexeme - in contrast to the definition using
-- the functional recursion.
star re1 re2  = \l -> let self = re1 self >||< re2 l in self

-- x `plus` y corresponds to the regular expression x+y
plus re1 re2  = re1 +> (re1 `star` re2)

-- x `quest` y corresponds to the regular expression x?y
quest re1 re2  = (re1 +> re2) >|< re2

-- accepts a non-empty set of alternative characters
--  Equiv. to `(foldr1 (>|<) . map char) cs', but much faster
alt cs  = \l -> let bnds = B (length cs) (minimum cs) (maximum cs)
                in Lexer NoAction (aggregate bnds [(c, l) | c <- cs])

-- accept a character sequence
string cs = (foldr1 (+>) . map char) cs

-- disjunctive combination of two regexps
re >|< re'  = \l -> re l >||< re' l

-- disjunctive combination of two lexers
(Lexer a c) >||< (Lexer a' c')  = Lexer (joinActions a a') (joinConts c c')

-- Close a regular expression with an action that converts the lexeme into a
-- token
--
-- * Note: After the application of the action, the position is advanced
--         according to the length of the lexeme.  This implies that normal
--         actions should not be used in the case where a lexeme might contain 
--         control characters that imply non-standard changes of the position, 
--         such as newlines or tabs.
--
action re a  = re `meta` a' where a' lexeme lexer s = R (a lexeme) s lexer
{-# INLINE action #-}
action1 re a  = re `meta1` a' where a' lexeme lexer s = R (a lexeme) s lexer
{-# INLINE action1 #-}

-- Close a regular expression with a meta action
--
-- * Note: Meta actions have to advance the position in dependence of the
--         lexeme by themselves.
--
meta re a  = re (Lexer (Action a) Done)
{-# INLINE meta #-}
meta1 re a  = re (Lexer (Action1 a) Done)
{-# INLINE meta1 #-}

-- INTERNAL IMPLEMENTATION

-- we use the dense representation if a table has at least the given
-- number of (non-error) elements
denseMin :: Int
denseMin  = 20

-- represents the number of (non-error) elements and the bounds of a table
data BoundsNum = B !Int !Char !Char

-- combine two bounds
addBoundsNum (B n lc hc) (B n' lc' hc')  = B (n + n') (min lc lc') (max hc hc')

-- check whether a character is in the bounds
inBounds c (B _ lc hc) = c >= lc && c <= hc

-- Lexical actions take a lexeme with its position and may return a
-- token; in a variant, an error can be returned
--
-- * if there is no token returned, the current lexeme is discarded
--   lexing continues looking for a token
type Action t = String -> Lexer t -> Maybe t

-- Meta actions transform the lexeme, the old lexer, and a
-- user-defined state; they may return the old or a new lexer, which
-- is then used for accepting the next token (this is important to
-- implement non-regular behaviour like nested comments)
type Meta t = String -> Lexer t -> S -> Result t

data Result t = R (Maybe t) !S !(Lexer t)

-- threaded top-down during lexing (with current input)
type S = I.IntMap Int

-- tree structure used to represent the lexer table
--
-- * each node in the tree corresponds to a state of the lexer; the
--   associated actions are those that apply when the corresponding
--   state is reached
data Lexer t = Lexer (LexAction t) (Cont t)

-- represent the continuation of a lexer
data Cont t
              -- on top of the tree, where entries are dense, we use arrays
            = Dense BoundsNum (Array Char (Lexer t))
              -- further down, where the valid entries are sparse, we
              -- use association lists, to save memory
            | Sparse BoundsNum (M.Map Char (Lexer t))
              -- end of the automaton
            | Done

-- lexical action
data LexAction t = Action !(Meta t) | Action1 !(Meta t) | NoAction  -- need new LexAction for advance by 1

-- a regular expression
type Regexp t = Lexer t -> Lexer t

-- combine two disjunctive continuations
joinConts :: Cont t -> Cont t -> Cont t
joinConts Done c'   = c'
joinConts c    Done = c
joinConts c    c'   = let (bn , cls ) = listify c
                          (bn', cls') = listify c'
                      -- note: `addBoundsNum' can, at this point, only
                      --       approximate the number of *non-overlapping* cases;
                      --       however, the bounds are correct
                      in aggregate (addBoundsNum bn bn') (cls ++ cls')
  where listify (Dense  n arr) = (n, assocs arr)
        listify (Sparse n cls) = (n, M.toList cls)

-- combine two actions. Use the latter in case of overlap (right biased!)
joinActions NoAction a'       = a'
joinActions a        NoAction = a
joinActions _        a'       = a' -- error "Lexers.>||<: Overlapping actions!"

-- Note: `n' is only an upper bound of the number of non-overlapping cases
aggregate bn@(B n lc hc) cls
  | n >= denseMin = Dense  bn (accumArray (>||<) (Lexer NoAction Done) (lc, hc) cls)
  | otherwise     = Sparse bn (M.fromList (accum (>||<) cls))

-- combine the elements in the association list that have the same key
accum _ []           = []
accum f ((c, el):ces) = let (ce, ces') = gather c el ces in ce : accum f ces'
  where gather k e []                             = ((k, e), [])
        gather k e (ke'@(k', e'):kes) | k == k'   = gather k (f e e') kes
                                      | otherwise = let (ke'', kes') = gather k e kes
                                                    in (ke'', ke':kes')

-- apply a lexer, yielding a token sequence and a list of errors
--
-- * Currently, all errors are fatal; thus, the result is undefined in case of 
--   an error (this changes when error correction is added).
-- * The final lexer state is returned.
-- * The order of the error messages is undefined.
-- * the following is moderately tuned
run _ [] state = ([], ([],state))
run l csIn state = case lexOne l csIn state of
    (Nothing , _ , cs', state') -> run l cs' state'
    (Just t  , l', cs', state') -> let (ts, final) = run l' cs' state'; ts' = (t:ts)
                                   in ts' `seq` (ts',final)
  where
    -- accept a single lexeme
    lexOne l0 cs0 state0 = oneLexeme l0 cs0 state0 id lexErr

      where
        lexErr = (Just undefined, l, tail cs0, state0)

        -- we take an open list of characters down, where we
        -- accumulate the lexeme; this function returns maybe a token,
        -- the next lexer to use (can be altered by a meta action),
        -- the new lexer state, and a list of errors
        --
        -- we implement the "principle of the longest match" by taking
        -- a potential result quad down (in the last argument); the
        -- potential result quad is updated whenever we pass by an
        -- action (different from `NoAction'); initially it is an
        -- error result
        oneLexeme (Lexer a cont) cs state csDL last =
            let last' = case a of NoAction -> last; _ -> doaction a (csDL []) cs state last
            in case cs of
                []      -> last'    -- at end, has to be this action
                (c:cs') -> last' `seq` oneChar cont c cs' state csDL last'   -- keep looking

        -- There are more chars. Look at the next one. Now, if the
        -- next tbl is Done, then there is no more transition, so
        -- immediately execute our action
        oneChar tbl c cs state csDL last = case peek tbl c of
                (Lexer action Done)   -> doaction action (csDL [c]) cs state last
                l'                    -> oneLexeme l' cs state (csDL . (c:)) last

        -- Do the lookup.
        peek (Dense bn arr)  c | c `inBounds` bn = arr ! c
        peek (Sparse bn cls) c | c `inBounds` bn = M.findWithDefault (Lexer NoAction Done) c cls
        peek _ _ = (Lexer NoAction Done)

        -- execute the action if present and finalise the current lexeme
        doaction (Action f) lexeme cs state _ = case f lexeme l0 state of
                (R Nothing s' l') | not . null $ cs -> s' `seq` lexOne l' cs s'
                (R res     s' l')                   -> s' `seq` (res, l', cs, s')
        doaction (Action1 f) lexeme cs state _ = case f lexeme l0 state of
                (R Nothing s' l') | not . null $ cs -> s' `seq` lexOne l' (tail csIn) s'
                (R res     s' l')                   -> s' `seq` (res, l', tail csIn, s')
        doaction NoAction _ _ _ last = last
</haskell>