Shootout/Regex DNA
This is a Shootout Entry for http://shootout.alioth.debian.org/benchmark.php?test=regexdna&lang=all.
Todo: this entry should be replaced by the new PCRE Text.Regex.ByteString module.
1 About the regex-dna (new) benchmark
Each program should:
- read all of a redirected FASTA format file from stdin, and record the sequence length
- use the same simple regex pattern match-replace to remove FASTA sequence descriptions and all linefeed characters, and record the sequence length
- use the same simple regex patterns, representing DNA 8-mers and their reverse complement (with a wildcard in one position), and (one pattern at a time) count matches in the redirected file
- write the regex pattern and count
- use the same simple regex patterns to make IUB code alternatives explicit, and (one pattern at a time) match-replace the pattern in the redirected file, and record the sequence length
- write the 3 recorded sequence lengths
(A minimal sketch of these steps, using the plain Text.Regex API, follows the sample output below.)
We use FASTA files generated by the fasta benchmark as input for this benchmark. Note: the file may include both lowercase and uppercase codes.
Correct output for this 100KB input file (generated with the fasta program, N = 10000) is:
agggtaaa|tttaccct 0
[cgt]gggtaaa|tttaccc[acg] 3
a[act]ggtaaa|tttacc[agt]t 9
ag[act]gtaaa|tttac[agt]ct 8
agg[act]taaa|ttta[agt]cct 10
aggg[acg]aaa|ttt[cgt]ccct 3
agggt[cgt]aa|tt[acg]accct 4
agggta[cgt]a|t[acg]taccct 3
agggtaa[cgt]|[acg]ttaccct 5

101745
100000
133640
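For orientation, the required steps can be sketched with the plain String-based Text.Regex API. This is a minimal, illustrative sketch only: it is far too slow to submit, and countMatches is a helper invented here, not part of any entry below.

import Text.Regex (Regex, mkRegex, matchRegexAll, subRegex)

-- count non-overlapping matches by repeatedly matching the remainder
countMatches :: Regex -> String -> Int
countMatches re s = case matchRegexAll re s of
  Nothing              -> 0
  Just (_, _, rest, _) -> 1 + countMatches re rest

main :: IO ()
main = do
  file <- getContents                                      -- read the redirected FASTA file
  let clean = subRegex (mkRegex ">[^\n]*\n|\n") file ""    -- strip descriptions and newlines
  mapM_ (\v -> putStrLn (v ++ " " ++ show (countMatches (mkRegex v) clean)))
        ["agggtaaa|tttaccct"]                              -- ... plus the other eight variants
  let expanded = subRegex (mkRegex "B") clean "(c|g|t)"    -- only one IUB substitution shown
  putStrLn ""
  mapM_ print [length file, length clean, length expanded] -- the three recorded lengths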
2 Discussion
Don's Idiomatic Entry seems pretty well baked as the idiomatic GHC entry. For the high-speed GHC #2, we appear to be heading towards a new, full implementation of regular expressions. I would vote that we step back and implement just enough to complete the entry, because we've already bent the rules here. One of the aspects I liked about my original non-Parsec entry was that it was fairly short and, IMO, fairly readable. In adding Parsec, I think some of that comprehensibility has been lost.
I do think that we've improved the entry a lot! -- AlsonKemp
Is there any way to shorten the fastest entry? I still think that we've lost a lot of comprehensibility. -- AlsonKemp
Note that the Idiomatic Entry has already been submitted -- DonStewart
errmmm... Does the work that's done here suggest that the Text.Regex should be replaced? Not sure how to replace it or with which replacement... See also RegexSyntax. -- AlsonKemp
Well, if not replaced, we can at least provide alternatives. I propose:
* we add a Word8 interface to the existing Text.Regex, so we can avoid marshalling from packed strings and buffers
* we add the lazy regex combinators to the Text.ParserCombinators suite. I use them all the time, and they're more than a decade old, so they should be standardised. I'll begin by cabalising the library. -- DonStewart
3 Benchmarks
N=5,000,000, Linux/x86
Old spec
||Name || Time ||
||Original || Timeout ||
||Functional 1 || Timeout ||
||Functional 2 || Timeout ||
||Combination || Timeout ||
||Packed entry || 1200s ||
||Lazy lexers 1 || 33.950s ||
||Alson Kemp's parsec || 30.682s ||
||Lazy lexers 2 || 28.000s ||
||Modified Alson || 19.950s ||
||Lazy lexers 5 + regex syntax || 12.106s ||
||Lazy lexers 4 || 11.192s ||
||Lexer 6 || 11.000s ||
||Lazy lexers 3 (illegal) || 6.300s ||
||||||||
||perl entry || 4.005s ||
New spec
||Name || Time ||
||Parsec 1 || 126s ||
||CTK 1 || 60s ||
N.B. the syntax highlighting breaks on many of the following programs, resulting in bogus output.
4 Proposed entry
A tuned version of Chris' submitted ByteString proposal (below).
{-# OPTIONS -funbox-strict-fields -fbang-patterns #-}
--
-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
--
-- Contributed by Don Stewart, Chris Kuklewicz and Alson Kemp.
-- Updated for ByteString by Chris Kuklewicz February, 2007
--
-- Compile with: -O2 -package parsec
--
-- An idiomatic Haskell entry using lazy regex combinators, described in the paper:
--
--   Manuel M. T. Chakravarty, Lazy Lexing is Fast.
--   In A. Middeldorp and T. Sato, editors, Proceedings of Fourth Fuji
--   International Symposium on Functional and Logic Programming,
--   Springer-Verlag, LNCS 1722, 1999.

import Monad
import Data.Array (Array, assocs, accumArray, bounds, listArray)
import Data.Array.Base (unsafeAt)
import Data.ByteString.Base
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Char8 as BC
import Word
import List hiding (delete)
import qualified Data.Map as M
import System
import qualified Text.ParserCombinators.Parsec as P
import Text.ParserCombinators.Parsec ((<|>),(<?>),pzero)

------------------------------------------------------------------------

main = putStr . work =<< B.getContents

work !fileIn = unlines $ counts ++ [[], l0, l1, l2]
  where
    l0 = show $ B.length fileIn
    clean = B.concat . delete "\n|>[^\n]+\n" $ fileIn
    l1 = show $ B.length clean
    counts = [ re++" "++show (count re clean) | re <- variants ]
    l2 = show $ L.length (iubsExpand clean)

iubsExpand = L.fromChunks . foldr1 (.) (map replace iubs) . return

variants = [ "agggtaaa|tttaccct"
           , "[cgt]gggtaaa|tttaccc[acg]", "a[act]ggtaaa|tttacc[agt]t"
           , "ag[act]gtaaa|tttac[agt]ct", "agg[act]taaa|ttta[agt]cct", "aggg[acg]aaa|ttt[cgt]ccct"
           , "agggt[cgt]aa|tt[acg]accct", "agggta[cgt]a|t[acg]taccct", "agggtaa[cgt]|[acg]ttaccct" ]

iubs = map (\(s,new) -> (regex s, BC.pack new)) $
  [("B","(c|g|t)"),("D","(a|g|t)"),("H","(a|c|t)"),("K","(g|t)")
  ,("M","(a|c)"),("N","(a|c|g|t)"),("R","(a|g)")
  ,("S","(c|g)"),("V","(a|c|g)"),("W","(a|t)"),("Y","(c|t)")]

-- And that's it!

------------------------------------------------------------------------
-- external interface to regular expressions

regex = (either (error.show) accept) . (\x -> P.parse p_regex x x)
  where accept re = re (Lexer True Done)  -- Close a regular expression into a Lexer.

(!) !a b = unsafeAt a (fromEnum $ (b - fst (bounds a)))

delete s = del
  where
    r = regex s
    del b = pieces 0 (run r b 0)
      where
        pieces end [] = unsafeDrop end b : []
        pieces !end ((start,stop):rest) =
          unsafeTake (start-end) (unsafeDrop end b) : pieces stop rest

count s b = runOverlappingCount r b 0 where r = regex s

replace (r,new) = rep
  where
    rep [] = []
    rep (!b:bs) = pieces 0 (run r b 0)
      where
        pieces 0 [] = b : rep bs
        pieces end [] | B.length b > end = unsafeDrop end b : rep bs
                      | otherwise        = rep bs
        pieces end ((start,stop):rest)
          | start > end = unsafeTake (start-end) (unsafeDrop end b) : new : pieces stop rest
          | otherwise   = new : pieces stop rest

run :: Lexer -> B.ByteString -> Int -> [(Int,Int)]
run lexerIn !b offsetIn = loop offsetIn
  where
    end = B.length b
    loop offset
      | offset == end = []
      | otherwise =
          let t@(start,stop) = lexOne b lexerIn offset
          in if start == -1 then [] else t : loop stop

runOverlappingCount :: Lexer -> B.ByteString -> Int -> Int
runOverlappingCount lexerIn !b offsetIn = loop offsetIn 0
  where
    end = B.length b
    loop !offset !c
      | offset == end = c
      | otherwise =
          let start = fst $ lexOne b lexerIn offset
          in if start == -1 then c else loop (succ start) (succ c)

--
-- Construct a regex combinator from a string regex (use Parsec)
-- Designed to follow "man re_format" (Mac OS X 10.4.4)
--
-- The regular expressions accepted by the program include those using
-- |, empty group (), grouping with ( and ), wildcard '.', backslash
-- escaped characters "\.", greedy modifiers ? + and *, bracketed
-- alternatives including ranges such as [a-z0-9] and inverted
-- brackets such as [^]\n-]. Only 7-bit ASCII is accepted.

-- 'p_regex' is the only "exported" function, used by 'regex' above
p_regex = liftM (foldr1 (>|<)) (P.sepBy1 p_branch (P.char '|'))

p_branch = liftM (($ epsilon).(foldr (.) id)) (P.many1 (p_atom >>= p_post_atom))

p_atom = P.try (P.string "()" >> return epsilon)
     <|> P.between (P.char '(') (P.char ')') p_regex
     <|> p_bracket <|> p_dot <|> p_escaped_char <|> p_other_char
     <|> (pzero <?> "cannot parse regexp atom")

p_post_atom atom = (P.char '?' >> return (atom `quest`))
               <|> (P.char '+' >> return (atom `plus`))
               <|> (P.char '*' >> return (atom `star`))
               <|> (return (atom +>))

p_bracket = (P.char '[') >> ( (P.char '^' >> p_set True) <|> (p_set False) )

p_set invert = do
  initial <- (P.option "" ((P.char ']' >> return "]") <|> (P.char '-' >> return "-")))
  middle <- P.manyTill P.anyChar (P.char ']')
  let expand [] = []
      expand ('-':[]) = "-"
      expand (a:'-':b:rest) | a /= '-' = (enumFromTo a b)++(expand rest)
      expand (x:xs) | x /= '-' = x:(expand xs)
                    | otherwise = error "A dash is in the wrong place in a p_set"
      characters = nub ( sort (initial ++ (expand middle)) )
  return $ if invert then alt ( ['\0'..'\127'] \\ characters )
                     else alt characters

p_dot = P.char '.' >> return (alt ['\0'..'\127'])

p_escaped_char = P.char '\\' >> liftM char P.anyChar

p_other_char = liftM char (P.noneOf specials) where specials = "^.[$()|*+?\\"

------------------------------------------------------------------------
-- And everything else is the modified CTK library.
--
-- Compiler Toolkit: Self-optimizing lexers
-- Author : Manuel M. T. Chakravarty

-- tree structure used to represent the lexer table
data Lexer = Lexer !Bool !Cont

-- represent the continuation of a lexer
data Cont = -- on top of the tree, where entries are dense, we use arrays
            Dense !BoundsNum !(Array Word8 Lexer)
            -- further down, where the valid entries are sparse, we
            -- use association lists, to save memory
          | Sparse !BoundsNum !(M.Map Word8 Lexer)
            -- end of an automaton
          | Done

type Regexp = Lexer -> Lexer

infixr 4 `quest`, `star`, `plus`
infixl 3 +>

-- Empty lexeme (noop)
epsilon = id :: Regexp

-- One character regexp
char c = (\l -> Lexer False (Dense (B 1 w w) (listArray (w,w) [l]))) where w = c2w c

-- accepts a non-empty set of alternative characters
-- Equiv. to `(foldr1 (>|<) . map char) cs', but much faster
alt cs = \l -> let bnds = B (length ws) (minimum ws) (maximum ws)
               in Lexer False (aggregate bnds [(w, l) | w <- ws])
  where ws = map c2w cs

-- accept a character sequence
string cs = (foldr1 (+>) . map char) cs

-- Concatenation of regexps is just concatenation of functions
(+>) = (.) :: Regexp -> Regexp -> Regexp

-- disjunctive combination of two regexps, corresponding to x|y
re1 >|< re2 = \l -> re1 l >||< re2 l

-- x `quest` y corresponds to the regular expression x?y
quest re1 re2 = (re1 +> re2) >|< re2

-- x `plus` y corresponds to the regular expression x+y
plus re1 re2 = re1 +> (re1 `star` re2)

-- x `star` y corresponds to the regular expression x*y
star re1 re2 = \l -> let self = re1 self >||< re2 l in self

-- Scan forwards searching for a match anywhere at or after
-- startOffset. Return offsets of first matching character and after
-- last matching character or (-1,-1) for failure
lexOne !b lexerIn !startOffset =
    let stop = oneLexeme lexerIn startOffset (-1)
    in if stop == -1
         then if startOffset < end
                then lexOne b lexerIn (succ startOffset)
                else (-1,-1)
         else (startOffset,stop)
  where
    end = B.length b
    oneLexeme (Lexer win cont) !offset !last =
      let last' = if win then offset else last
      in if offset == end
           then last'  -- at end, has to be this action
           else oneChar cont (unsafeIndex b offset) (succ offset) last'  -- keep looking
    oneChar tbl !c !offset' !last = case peek tbl c of
      (Lexer win Done) -> if win then offset' else last
      l'               -> oneLexeme l' offset' last
    peek (Dense bn arr) !c  | c `inBounds` bn = arr ! c
    peek (Sparse bn cls) !c | c `inBounds` bn = M.findWithDefault (Lexer False Done) c cls
    peek _ _ = (Lexer False Done)

-- disjunctive combination of two lexers (longest match, right biased)
(Lexer win c) >||< (Lexer win' c') = Lexer (win || win') (joinConts c c')

-- represents the number of (non-error) elements and the bounds of a
-- DFA transition table
data BoundsNum = B !Int !Word8 !Word8

-- combine two bounds
addBoundsNum (B n lc hc) (B n' lc' hc') = B (n + n') (min lc lc') (max hc hc')

-- check whether a character is in the bounds
inBounds c (B _ lc hc) = c >= lc && c <= hc

-- combine two disjunctive continuations
joinConts Done c' = c'
joinConts c Done  = c
joinConts c c'    = let (bn , cls ) = listify c
                        (bn', cls') = listify c'
                    -- note: `addBoundsNum' can, at this point, only
                    -- approx. the number of *non-overlapping* cases;
                    -- however, the bounds are correct
                    in aggregate (addBoundsNum bn bn') (cls ++ cls')
  where listify (Dense  n arr) = (n, assocs arr)
        listify (Sparse n cls) = (n, M.toList cls)

-- we use the dense representation if a table has at least the given
-- number of (non-error) elements
denseMin = 1 :: Int

-- Note: `n' is only an upper bound of the number of non-overlapping cases
aggregate bn@(B n lc hc) cls
  | n >= denseMin = Dense  bn (accumArray (>||<) (Lexer False Done) (lc, hc) cls)
  | otherwise     = Sparse bn (M.fromList (accum (>||<) cls))

-- combine the elements in the association list that have the same key
accum _ [] = []
accum f ((c, el):ces) =
    let (ce, ces') = gather c el ces in ce : accum f ces'
  where
    gather k e [] = ((k, e), [])
    gather k e (ke'@(k', e'):kes)
      | k == k'   = gather k (f e e') kes
      | otherwise = let (ke'', kes') = gather k e kes in (ke'', ke':kes')
5 Chris' ByteString Proposal
I have modified the old entry to use Data.ByteString and changed the matching engine to produce (start,stop) offsets. This runs really fast for me. Oddly, it never uses the 'Sparse' variant of 'Cont', yet erasing 'Sparse' from the code makes it perform much, much worse.
I expect there is more performance to be tweaked out, but memory consumption is good, since I see nearly perfect productivity on the full data set (WinXP/Core Duo 2):
agggtaaa|tttaccct 36
[cgt]gggtaaa|tttaccc[acg] 125
a[act]ggtaaa|tttacc[agt]t 426
ag[act]gtaaa|tttac[agt]ct 290
agg[act]taaa|ttta[agt]cct 536
aggg[acg]aaa|ttt[cgt]ccct 153
agggt[cgt]aa|tt[acg]accct 143
agggta[cgt]a|t[acg]taccct 160
agggtaa[cgt]|[acg]ttaccct 219

5083411
5000000
6678892

  5,742,128,636 bytes allocated in the heap
     12,691,824 bytes copied during GC (scavenged)
     30,713,012 bytes copied during GC (not scavenged)
      5,248,108 bytes maximum residency (5 sample(s))

          10944 collections in generation 0 (  0.06s)
              5 collections in generation 1 (  0.02s)
             11 Mb total memory in use

  INIT  time    0.02s  (  0.00s elapsed)
  MUT   time    8.36s  (  8.58s elapsed)
  GC    time    0.08s  (  0.09s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    8.45s  (  8.67s elapsed)

  %GC time       0.9%  (1.1% elapsed)

  Alloc rate    685,627,299 bytes per MUT second

  Productivity  98.9% of total user, 96.4% of total elapsed
5.1 The code
{-# OPTIONS -funbox-strict-fields -fbang-patterns #-}
--
-- This never uses 'Sparse' but removing it killed performance -- ghc glitch?
--
-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
--
-- http://haskell.org/haskellwiki/Shootout/Regex_DNA
-- Contributed by Don Stewart, Chris Kuklewicz and Alson Kemp.
-- Updated for ByteString by Chris Kuklewicz February, 2007
--
-- Compile with: -O2 -package parsec
--
-- An idiomatic Haskell entry using lazy regex combinators, described in the paper:
--
--   Manuel M. T. Chakravarty, Lazy Lexing is Fast.
--   In A. Middeldorp and T. Sato, editors, Proceedings of Fourth Fuji
--   International Symposium on Functional and Logic Programming,
--   Springer-Verlag, LNCS 1722, 1999.

module Main(main) where

import Control.Monad (liftM)
import Data.Array (Array, assocs, accumArray, bounds, listArray)
import Data.Array.Base (unsafeAt)
import qualified Data.ByteString as B (ByteString,length,take,drop,index,concat,hGetContents)
import Data.ByteString.Base (c2w)
import qualified Data.ByteString.Char8 as BC (pack)
import qualified Data.ByteString.Lazy as L (length,fromChunks)
import Data.Word (Word8)
import Data.List (sort,nub,(\\))
import qualified Data.Map as M (Map,toList,fromList,findWithDefault)
import System.IO (stdin)
import qualified Text.ParserCombinators.Parsec as P
import Text.ParserCombinators.Parsec ((<|>),(<?>),pzero)

(!) a b = unsafeAt a (fromEnum $ (b - fst (bounds a)))

main = do wholeFile <- B.hGetContents stdin
          putStr (work wholeFile)

work fileIn =
  let l0 = show $ B.length fileIn in l0 `seq`
  let clean = B.concat . delete "\n|>[^\n]+\n" $ fileIn
      l1 = show $ B.length clean in l1 `seq`
  let counts = [ stringRegex++" "++show (count stringRegex clean)
               | stringRegex <- variants ]  -- Count the variants one at a time.
      l2 = show $ L.length (iubsExpand clean)
  in unlines $ counts ++ [[], l0, l1, l2]

iubsExpand = L.fromChunks . foldr1 (.) (map (\snew input -> replace snew input) iubs) . (:[])

variants = [ "agggtaaa|tttaccct"
           , "[cgt]gggtaaa|tttaccc[acg]", "a[act]ggtaaa|tttacc[agt]t"
           , "ag[act]gtaaa|tttac[agt]ct", "agg[act]taaa|ttta[agt]cct", "aggg[acg]aaa|ttt[cgt]ccct"
           , "agggt[cgt]aa|tt[acg]accct", "agggta[cgt]a|t[acg]taccct", "agggtaa[cgt]|[acg]ttaccct" ]

iubs = map (\(s,new) -> (regex s, BC.pack new)) $
  [("B","(c|g|t)"),("D","(a|g|t)"),("H","(a|c|t)"),("K","(g|t)")
  ,("M","(a|c)"),("N","(a|c|g|t)"),("R","(a|g)")
  ,("S","(c|g)"),("V","(a|c|g)"),("W","(a|t)"),("Y","(c|t)")]

-- EXTERNAL INTERFACE TO REGULAR EXPRESSIONS

regex = (either (error.show) accept) . (\x -> P.parse p_regex x x)
  where accept re = re (Lexer True Done)  -- Close a regular expression into a Lexer.

delete s = del
  where
    r = regex s
    del b = pieces 0 (run r b 0)
      where
        pieces end [] = B.drop end b : []
        pieces end ((start,stop):rest) =
          B.take (start-end) (B.drop end b) : pieces stop rest

count s b = runOverlappingCount r b 0 where r = regex s

replace (r,new) = rep
  where
    rep [] = []
    rep (b:bs) = pieces 0 (run r b 0)
      where
        pieces 0 [] = b : rep bs
        pieces end [] | B.length b > end = B.drop end b : rep bs
                      | otherwise        = rep bs
        pieces end ((start,stop):rest)
          | start > end = B.take (start-end) (B.drop end b) : new : pieces stop rest
          | otherwise   = new : pieces stop rest

run :: Lexer -> B.ByteString -> Int -> [(Int,Int)]
run lexerIn b offsetIn = loop offsetIn
  where
    end = B.length b
    loop offset
      | offset == end = []
      | otherwise =
          let t@(start,stop) = lexOne b lexerIn offset
          in if start == -1 then [] else t : loop stop

runOverlappingCount :: Lexer -> B.ByteString -> Int -> Int
runOverlappingCount lexerIn b offsetIn = loop offsetIn 0
  where
    end = B.length b
    loop !offset !c
      | offset == end = c
      | otherwise =
          let start = fst $ lexOne b lexerIn offset
          in if start == -1 then c else loop (succ start) (succ c)

----------------------------------------------------------------
-- Construct a regex combinator from a string regex (use Parsec)
-- Designed to follow "man re_format" (Mac OS X 10.4.4)
--
-- The regular expressions accepted by the program include those using
-- |, empty group (), grouping with ( and ), wildcard '.', backslash
-- escaped characters "\.", greedy modifiers ? + and *, bracketed
-- alternatives including ranges such as [a-z0-9] and inverted
-- brackets such as [^]\n-]. Only 7-bit ASCII is accepted.

-- 'p_regex' is the only "exported" function, used by 'regex' above
p_regex = liftM (foldr1 (>|<)) (P.sepBy1 p_branch (P.char '|'))

p_branch = liftM (($ epsilon).(foldr (.) id)) (P.many1 (p_atom >>= p_post_atom))

p_atom = P.try (P.string "()" >> return epsilon)
     <|> P.between (P.char '(') (P.char ')') p_regex
     <|> p_bracket <|> p_dot <|> p_escaped_char <|> p_other_char
     <|> (pzero <?> "cannot parse regexp atom")

p_post_atom atom = (P.char '?' >> return (atom `quest`))
               <|> (P.char '+' >> return (atom `plus`))
               <|> (P.char '*' >> return (atom `star`))
               <|> (return (atom +>))

p_bracket = (P.char '[') >> ( (P.char '^' >> p_set True) <|> (p_set False) )

p_set invert = do
  initial <- (P.option "" ((P.char ']' >> return "]") <|> (P.char '-' >> return "-")))
  middle <- P.manyTill P.anyChar (P.char ']')
  let expand [] = []
      expand ('-':[]) = "-"
      expand (a:'-':b:rest) | a /= '-' = (enumFromTo a b)++(expand rest)
      expand (x:xs) | x /= '-' = x:(expand xs)
                    | otherwise = error "A dash is in the wrong place in a p_set"
      characters = nub ( sort (initial ++ (expand middle)) )
  return $ if invert then alt ( ['\0'..'\127'] \\ characters )
                     else alt characters

p_dot = P.char '.' >> return (alt ['\0'..'\127'])

p_escaped_char = P.char '\\' >> liftM char P.anyChar

p_other_char = liftM char (P.noneOf specials) where specials = "^.[$()|*+?\\"

------------------------------------------------------------------------
-- And everything else is the modified CTK library.
--
-- Compiler Toolkit: Self-optimizing lexers
--
-- Author : Manuel M. T. Chakravarty
-- Created: 24 February 95, 2 March 99
-- Copyright (c) [1995..2000] Manuel M. T. Chakravarty
-- Copyright (c) 2004-6 Don Stewart
--
-- Self-optimizing lexer combinators.
--
-- For detailed information, see ``Lazy Lexing is Fast'', Manuel
-- M. T. Chakravarty, in A. Middeldorp and T. Sato, editors, Proceedings of
-- Fourth Fuji International Symposium on Functional and Logic Programming,
-- Springer-Verlag, LNCS 1722, 1999. (See my Web page for details.)
--
-- http://www.cse.unsw.edu.au/~chak/papers/Cha99.html
--
-- Thanks to Simon L. Peyton Jones and Roman Leshchinskiy for their
-- helpful suggestions that improved the design of this library.

-- INTERNAL IMPLEMENTATION OF REGULAR EXPRESSIONS

-- tree structure used to represent the lexer table
data Lexer = Lexer !Bool !Cont

-- represent the continuation of a lexer
data Cont = -- on top of the tree, where entries are dense, we use arrays
            Dense !BoundsNum !(Array Word8 Lexer)
            -- further down, where the valid entries are sparse, we
            -- use association lists, to save memory
          | Sparse !BoundsNum !(M.Map Word8 Lexer)
            -- end of an automaton
          | Done

type Regexp = Lexer -> Lexer

infixr 4 `quest`, `star`, `plus`
infixl 3 +>

-- Empty lexeme (noop)
epsilon = id :: Regexp

-- One character regexp
char c = (\l -> Lexer False (Dense (B 1 w w) (listArray (w,w) [l]))) where w = c2w c

-- accepts a non-empty set of alternative characters
-- Equiv. to `(foldr1 (>|<) . map char) cs', but much faster
alt cs = \l -> let bnds = B (length ws) (minimum ws) (maximum ws)
               in Lexer False (aggregate bnds [(w, l) | w <- ws])
  where ws = map c2w cs

-- accept a character sequence
string cs = (foldr1 (+>) . map char) cs

-- Concatenation of regexps is just concatenation of functions
(+>) = (.) :: Regexp -> Regexp -> Regexp

-- disjunctive combination of two regexps, corresponding to x|y
re1 >|< re2 = \l -> re1 l >||< re2 l

-- x `quest` y corresponds to the regular expression x?y
quest re1 re2 = (re1 +> re2) >|< re2

-- x `plus` y corresponds to the regular expression x+y
plus re1 re2 = re1 +> (re1 `star` re2)

-- x `star` y corresponds to the regular expression x*y
star re1 re2 = \l -> let self = re1 self >||< re2 l in self

-- Scan forwards searching for a match anywhere at or after
-- startOffset. Return offsets of first matching character and after
-- last matching character or (-1,-1) for failure
lexOne b lexerIn !startOffset =
    let stop = oneLexeme lexerIn startOffset (-1)
    in if stop == -1
         then if startOffset < end
                then lexOne b lexerIn (succ startOffset)
                else (-1,-1)
         else (startOffset,stop)
  where
    end = B.length b
    oneLexeme (Lexer win cont) !offset !last =
      let last' = if win then offset else last
      in if offset == end
           then last'  -- at end, has to be this action
           else oneChar cont (B.index b offset) (succ offset) last'  -- keep looking
    oneChar tbl !c !offset' !last = case peek tbl c of
      (Lexer win Done) -> if win then offset' else last
      l'               -> oneLexeme l' offset' last
    peek (Dense bn arr) !c  | c `inBounds` bn = arr ! c
    peek (Sparse bn cls) !c | c `inBounds` bn = M.findWithDefault (Lexer False Done) c cls
    peek _ _ = (Lexer False Done)

-- disjunctive combination of two lexers (longest match, right biased)
(Lexer win c) >||< (Lexer win' c') = Lexer (win || win') (joinConts c c')

-- represents the number of (non-error) elements and the bounds of a
-- DFA transition table
data BoundsNum = B !Int !Word8 !Word8

-- combine two bounds
addBoundsNum (B n lc hc) (B n' lc' hc') = B (n + n') (min lc lc') (max hc hc')

-- check whether a character is in the bounds
inBounds c (B _ lc hc) = c >= lc && c <= hc

-- combine two disjunctive continuations
joinConts Done c' = c'
joinConts c Done  = c
joinConts c c'    = let (bn , cls ) = listify c
                        (bn', cls') = listify c'
                    -- note: `addBoundsNum' can, at this point, only
                    -- approx. the number of *non-overlapping* cases;
                    -- however, the bounds are correct
                    in aggregate (addBoundsNum bn bn') (cls ++ cls')
  where listify (Dense  n arr) = (n, assocs arr)
        listify (Sparse n cls) = (n, M.toList cls)

-- we use the dense representation if a table has at least the given
-- number of (non-error) elements
denseMin = 1 :: Int

-- Note: `n' is only an upper bound of the number of non-overlapping cases
aggregate bn@(B n lc hc) cls
  | n >= denseMin = Dense  bn (accumArray (>||<) (Lexer False Done) (lc, hc) cls)
  | otherwise     = Sparse bn (M.fromList (accum (>||<) cls))

-- combine the elements in the association list that have the same key
accum _ [] = []
accum f ((c, el):ces) =
    let (ce, ces') = gather c el ces in ce : accum f ces'
  where
    gather k e [] = ((k, e), [])
    gather k e (ke'@(k', e'):kes)
      | k == k'   = gather k (f e e') kes
      | otherwise = let (ke'', kes') = gather k e kes in (ke'', ke':kes')
6 Proposed Legal entry : CTK 1
I have hacked and cleaned up the Lazy Lexer 6 to run one-at-a-time, to be legal under the new rules. I will submit it after it gets benchmarked here. (Runs about 6 times slower on a G4; still three times faster than the all-parsec entry below.) -- ChrisKuklewicz
{-# OPTIONS -funbox-strict-fields #-}
--
-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
--
-- http://haskell.org/hawiki/ShootoutEntry
-- Contributed by Don Stewart, Chris Kuklewicz and Alson Kemp.
--
-- Compile with: -O2 -package parsec
--
-- An idiomatic Haskell entry using lazy regex combinators, described in the paper:
--
--   Manuel M. T. Chakravarty, Lazy Lexing is Fast.
--   In A. Middeldorp and T. Sato, editors, Proceedings of Fourth Fuji
--   International Symposium on Functional and Logic Programming,
--   Springer-Verlag, LNCS 1722, 1999.
--
-- For more about higher-order combinator-based lexing and parsing in Haskell consult:
--
--   Graham Hutton. Higher-order functions for parsing. (1992)
--   Journal of Functional Programming 2: 232-343.
--
--   Jeroen Fokker. Functional Parsers. (1995)
--   Lecture Notes of the Baastad Spring school on Functional Programming.
--
--   Graham Hutton and Erik Meijer. Monadic Parser Combinators. (1996)
--   Technical report NOTTCS-TR-96-4. Department of Computer Science, University of Nottingham.
--
--   Steve Hill. Combinators for parsing expressions. (1996)
--   Journal of Functional Programming 6(3): 445-463.
--
--   Andrew Partridge and David Wright.
--   Predictive parser combinators need four values to report errors. (1996)
--   Journal of Functional Programming 6(2): 355-364.
--
--   Doaitse Swierstra and Luc Duponcheel.
--   Deterministic, Error-Correcting Combinator Parsers. (1996)
--   Advanced Functional Programming. LNCS 1129: 185-207.
--
--   Pieter Koopman and Rinus Plasmeijer. Efficient Combinator Parsers. (1999)
--   Implementation of Functional Languages. Springer Verlag, LNCS 1595: 122-138.
--
--   Doaitse Swierstra and Pablo Azero.
--   Fast, Error Correcting Parser Combinators: A Short Tutorial. (1999)
--   SOFSEM'99 Theory and Practice of Informatics. LNCS 1725: 111-129.
--
--   Arthur Baars, Andres Loh, and Doaitse Swierstra. Parsing Permutation Phrases. (2001)
--   Proceedings of the ACM SIGPLAN Haskell Workshop, 171-183.
--
-- And many other sources.
--

import List (sort,nub,(\\))
import Data.Array (Array, (!), assocs, accumArray)
import Control.Monad (liftM)
import qualified Data.Map as M
import qualified Text.ParserCombinators.Parsec as P
import Text.ParserCombinators.Parsec ((<|>),(<?>),pzero)

main = interact $ \stdin ->
  let l0 = show $ length stdin in l0 `seq`
  let clean = fst $ run (reDelete "\n|>[^\n]+\n") stdin 0
      l1 = show $ length clean in l1 `seq`
  let counts = [ stringRegex++" "++(show . snd . snd $ (run (reCount stringRegex) clean 0))
               | stringRegex <- variants ]  -- Count the variants one at a time.
      l2 = show $ length (iubsExpand clean)
  in unlines $ counts ++ [[], l0, l1, l2]

-- Substitute certain chars for patterns. The replacements are run one at a time.
iubsExpand = foldl1 (.) (map (\snew input -> concat . fst $ run (reReplace snew) input 0) iubs)

variants = [ "agggtaaa|tttaccct"
           , "[cgt]gggtaaa|tttaccc[acg]", "a[act]ggtaaa|tttacc[agt]t"
           , "ag[act]gtaaa|tttac[agt]ct", "agg[act]taaa|ttta[agt]cct", "aggg[acg]aaa|ttt[cgt]ccct"
           , "agggt[cgt]aa|tt[acg]accct", "agggta[cgt]a|t[acg]taccct", "agggtaa[cgt]|[acg]ttaccct" ]

iubs = [("B","(c|g|t)"),("D","(a|g|t)"),("H","(a|c|t)"),("K","(g|t)")
       ,("M","(a|c)"),("N","(a|c|g|t)"),("R","(a|g)")
       ,("S","(c|g)"),("V","(a|c|g)"),("W","(a|t)"),("Y","(c|t)")]

----------------------------------------------------------------
-- Construct a regex combinator from a string regex (use Parsec)
-- Designed to follow "man re_format" (Mac OS X 10.4.4)
--
-- regex is the only "exported" function from this section
regex = (either (error.show) id) . (\x -> P.parse p_regex x x)

p_regex = liftM (foldr1 (>|<)) (P.sepBy1 p_branch (P.char '|'))

p_branch = liftM (($ epsilon).(foldr (.) id)) (P.many1 (p_atom >>= p_post_atom))

p_atom = P.try (P.string "()" >> return epsilon)
     <|> P.between (P.char '(') (P.char ')') p_regex
     <|> p_bracket <|> p_dot <|> p_escaped_char <|> p_other_char
     <|> (pzero <?> "cannot parse regexp atom")

p_post_atom atom = (P.char '?' >> return (atom `quest`))
               <|> (P.char '+' >> return (atom `plus`))
               <|> (P.char '*' >> return (atom `star`))
               <|> (return (atom +>))

p_bracket = (P.char '[') >> ( (P.char '^' >> p_set True) <|> (p_set False) )

p_set invert = do
  initial <- (P.option "" ((P.char ']' >> return "]") <|> (P.char '-' >> return "-")))
  middle <- P.manyTill P.anyChar (P.char ']')
  let expand [] = []
      expand ('-':[]) = "-"
      expand (a:'-':b:rest) | a /= '-' = (enumFromTo a b)++(expand rest)
      expand (x:xs) | x /= '-' = x:(expand xs)
                    | otherwise = error "A dash is in the wrong place in a p_set"
      characters = nub ( sort (initial ++ (expand middle)) )
  return $ if invert then alt ( ['\0'..'\127'] \\ characters )
                     else alt characters

p_dot = P.char '.' >> return (alt ['\0'..'\127'])

p_escaped_char = P.char '\\' >> liftM char P.anyChar

p_other_char = liftM char (P.noneOf specials) where specials = "^.[$()|*+?\\"

------------------------------------------------------------------------
-- And everything else is the CTK library.
--
-- Compiler Toolkit: Self-optimizing lexers
--
-- Author : Manuel M. T. Chakravarty
-- Created: 24 February 95, 2 March 99
-- Copyright (c) [1995..2000] Manuel M. T. Chakravarty
-- Copyright (c) 2004-6 Don Stewart
--
-- Self-optimizing lexer combinators.
--
-- For detailed information, see ``Lazy Lexing is Fast'', Manuel
-- M. T. Chakravarty, in A. Middeldorp and T. Sato, editors, Proceedings of
-- Fourth Fuji International Symposium on Functional and Logic Programming,
-- Springer-Verlag, LNCS 1722, 1999. (See my Web page for details.)
--
-- http://www.cse.unsw.edu.au/~chak/papers/Cha99.html
--
-- Thanks to Simon L. Peyton Jones and Roman Leshchinskiy for their
-- helpful suggestions that improved the design of this library.

-- EXTERNAL INTERFACE

infixr 4 `quest`, `star`, `plus`
infixl 3 +>, `action`, `meta`
infixl 2 >|<, >||<

reReplace (s,new) = (regex "." `action` Just) >||< (regex s `action` const (Just new))
reDelete s = (regex "." `action` \[c] -> Just c) >||< (regex s `action` const Nothing)
reCount s = regex s `meta1` \_ lexer m -> R Nothing (succ m) lexer

-- Empty lexeme (noop)
epsilon = id :: Regexp t

-- One character regexp
char c = (\l -> Lexer NoAction (Sparse (B 1 c c) (M.singleton c l))) :: Regexp t

-- accepts a non-empty set of alternative characters
-- Equiv. to `(foldr1 (>|<) . map char) cs', but much faster
alt cs = \l -> let bnds = B (length cs) (minimum cs) (maximum cs)
               in Lexer NoAction (aggregate bnds [(c, l) | c <- cs])

-- accept a character sequence
string cs = (foldr1 (+>) . map char) cs

-- Concatenation of regexps is just concatenation of functions
(+>) = (.) :: Regexp t -> Regexp t -> Regexp t

-- disjunctive combination of two regexps, corresponding to x|y
re1 >|< re2 = \l -> re1 l >||< re2 l

-- x `quest` y corresponds to the regular expression x?y
quest re1 re2 = (re1 +> re2) >|< re2

-- x `plus` y corresponds to the regular expression x+y
plus re1 re2 = re1 +> (re1 `star` re2)

-- x `star` y corresponds to the regular expression x*y
--
-- The definition used below can be obtained by equational reasoning from this
-- one (which is much easier to understand):
--
--   star re1 re2 = let self = (re1 +> self >|< epsilon) in self +> re2
--
-- However, in the above, `self' is of type `Regexp s t' (ie, a functional),
-- whereas below it is of type `Lexer s t'. Thus, below we have a graphical
-- body (finite representation of an infinite structure), which doesn't grow
-- with the size of the accepted lexeme - in contrast to the definition using
-- the functional recursion.
star re1 re2 = \l -> let self = re1 self >||< re2 l in self

-- Close a regular expression into a Lexer with an action that
-- converts the lexeme into a token. action and meta advance by the
-- length of the lexeme while action1 and meta1 advance a single
-- character.
action re a = re `meta` a' where a' lexeme lexer s = R (a lexeme) s lexer
{-# INLINE action #-}
action1 re a = re `meta1` a' where a' lexeme lexer s = R (a lexeme) s lexer
{-# INLINE action1 #-}
meta re a = re (Lexer (Action a) Done)
{-# INLINE meta #-}
meta1 re a = re (Lexer (Action1 a) Done)
{-# INLINE meta1 #-}

-- disjunctive combination of two lexers (longest match, right biased)
(Lexer a c) >||< (Lexer a' c') = Lexer (joinActions a a') (joinConts c c')

-- Take a Lexer, input string, and initial state
-- Returns ([token],(remaining input,final state))
run _ [] state = ([], ([],state))
run l csIn state = case lexOne l csIn state of
    (Nothing, _ , cs', state') -> run l cs' state'
    (Just t , l', cs', state') -> let (ts, final) = run l' cs' state'
                                      ts' = (t:ts)
                                  in ts' `seq` (ts',final)
  where
    -- accept a single lexeme
    lexOne l0 cs0 state0 = oneLexeme l0 cs0 state0 id lexErr
      where
        lexErr = (Just undefined, l, tail cs0, state0)
        -- we implement the "principle of the longest match" by taking
        -- a potential result quad down (in the last argument); the
        -- potential result quad is updated whenever we pass by an
        -- action (different from `NoAction'); initially it is lexErr
        oneLexeme (Lexer a cont) cs state csDL last =
          let last' = case a of NoAction -> last
                                _        -> doaction a (csDL []) cs state last
          in case cs of
               []      -> last'  -- at end, has to be this action
               (c:cs') -> last' `seq` oneChar cont c cs' state csDL last'  -- keep looking
        -- There are more chars. Look at the next one. Now, if the
        -- next tbl is Done, then there is no next transition, so
        -- immediately execute the matching action.
        oneChar tbl c cs state csDL last = case peek tbl c of
          (Lexer action Done) -> doaction action (csDL [c]) cs state last
          l'                  -> oneLexeme l' cs state (csDL . (c:)) last
        -- Do the lookup of the current character in the DFA transition table.
        peek (Dense bn arr) c  | c `inBounds` bn = arr ! c
        peek (Sparse bn cls) c | c `inBounds` bn = M.findWithDefault (Lexer NoAction Done) c cls
        peek _ _ = (Lexer NoAction Done)
        -- execute the action if present and finalise the current lexeme
        doaction (Action f) csDL cs state _ = case f (csDL) l0 state of
          (R Nothing s' l') | not . null $ cs -> s' `seq` lexOne l' cs s'
          (R res s' l')                       -> s' `seq` (res, l', cs, s')
        doaction (Action1 f) csDL cs state _ = case f (csDL) l0 state of
          (R Nothing s' l') | not . null $ cs -> s' `seq` lexOne l' (tail csIn) s'
          (R res s' l')                       -> s' `seq` (res, l', (tail csIn), s')
        doaction NoAction _ _ _ last = last

-- INTERNAL IMPLEMENTATION

-- represents the number of (non-error) elements and the bounds of a
-- DFA transition table
data BoundsNum = B !Int !Char !Char

-- we use the dense representation if a table has at least the given
-- number of (non-error) elements
denseMin = 20 :: Int

-- combine two bounds
addBoundsNum (B n lc hc) (B n' lc' hc') = B (n + n') (min lc lc') (max hc hc')

-- check whether a character is in the bounds
inBounds c (B _ lc hc) = c >= lc && c <= hc

-- Lexical actions take a lexeme with its position and may return a
-- token; in a variant, an error can be returned
--
-- * if there is no token returned, the current lexeme is discarded
--   and lexing continues looking for a token
type Action t = String -> Lexer t -> Maybe t

-- Meta actions transform the lexeme, the old lexer, and a
-- user-defined state; they may return the old or a new lexer, which
-- is then used for accepting the next token (this is important to
-- implement non-regular behaviour like nested comments)
type Meta t = String -> Lexer t -> S -> Result t

data Result t = R (Maybe t) !S !(Lexer t)

-- threaded top-down during lexing (with current input)
type S = Int

-- tree structure used to represent the lexer table
--
-- * each node in the tree corresponds to a state of the lexer; the
--   associated actions are those that apply when the corresponding
--   state is reached
data Lexer t = Lexer (LexAction t) (Cont t)

-- represent the continuation of a lexer
data Cont t = -- on top of the tree, where entries are dense, we use arrays
              Dense BoundsNum (Array Char (Lexer t))
              -- further down, where the valid entries are sparse, we
              -- use association lists, to save memory
            | Sparse BoundsNum (M.Map Char (Lexer t))
              -- end of an automaton
            | Done

-- lexical action
data LexAction t = Action !(Meta t) | Action1 !(Meta t) | NoAction  -- need new LexAction for advance by 1

-- a regular expression
type Regexp t = Lexer t -> Lexer t

-- combine two disjunctive continuations
joinConts Done c' = c'
joinConts c Done  = c
joinConts c c'    = let (bn , cls ) = listify c
                        (bn', cls') = listify c'
                    -- note: `addBoundsNum' can, at this point, only
                    -- approx. the number of *non-overlapping* cases;
                    -- however, the bounds are correct
                    in aggregate (addBoundsNum bn bn') (cls ++ cls')
  where listify (Dense  n arr) = (n, assocs arr)
        listify (Sparse n cls) = (n, M.toList cls)

-- combine two actions. Use the latter in case of overlap (right biased!)
joinActions NoAction a' = a'
joinActions a NoAction  = a
joinActions _ a'        = a'  -- error "Lexers.>||<: Overlapping actions!"

-- Note: `n' is only an upper bound of the number of non-overlapping cases
aggregate bn@(B n lc hc) cls
  | n >= denseMin = Dense  bn (accumArray (>||<) (Lexer NoAction Done) (lc, hc) cls)
  | otherwise     = Sparse bn (M.fromList (accum (>||<) cls))

-- combine the elements in the association list that have the same key
accum _ [] = []
accum f ((c, el):ces) =
    let (ce, ces') = gather c el ces in ce : accum f ces'
  where
    gather k e [] = ((k, e), [])
    gather k e (ke'@(k', e'):kes)
      | k == k'   = gather k (f e e') kes
      | otherwise = let (ke'', kes') = gather k e kes in (ke'', ke':kes')
7 Proposed Legal entry : Parsec
7.1 Parsec 1
I (ChrisKuklewicz) have resurrected the "All parsec" proposal and crafted this entry. It uses parsec to transform a string regular expression into a new parsec parser that accepts matching strings (as a prefix). This is used in reCount and in (reReplace, reReplace1, reReplaceMap) to perform the requisite operations.
Compile with -O2. On a powerbook G4 it uses 177 MB RSIZE and runs in 390 seconds (aka 6.5 minutes).
Note: The reReplaceMap used by super to perform the substitution of the IUBS is a possibly questionable optimization.
THIS HAS BEEN WITHDRAWN. The parser it created was "possessive" where it should have been "greedy". The offending code was:

type T s = CharParser s ()
type CT s = CharParser () (T s)

p_post_atom :: T s -> CT s
p_post_atom atom = (char '?' >> return (optional atom))
               <|> (char '+' >> return (skipMany1 atom))
               <|> (char '*' >> return (skipMany atom))
               <|> return atom
7.2 Parsec 2
This is a "fix"ed version of Parsec 1. It uses "star x cont = fix (\this -> try (x this) <|> cont)" to create all the backtracking points that are needed. To make this work, the parsers return CPS-style functions instead of parsers.
The fixed code is:
-- February 25th, 2006
type T s = CharParser s ()
type TT s = T s -> T s
type CTT s = CharParser () (TT s)

p_post_atom :: TT s -> CTT s
p_post_atom atom = (char '?' >> return (\cont -> try (atom cont) <|> cont))
               <|> (char '+' >> return (atom . star atom))
               <|> (char '*' >> return (star atom))
               <|> return atom
  where star :: TT s -> T s -> T s
        star x cont = fix (\this -> try (x this) <|> cont)
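To see why the distinction matters, here is a toy example of my own (not from the entry): a possessive star consumes everything it can and never gives anything back, so the regex a*a can never match, while the CPS-style star above backtracks and succeeds:

import Text.ParserCombinators.Parsec
import Control.Monad.Fix (fix)

-- "a*a" with a possessive star: skipMany eats every 'a', so the
-- trailing (char 'a') always fails.
possessive :: Parser ()
possessive = skipMany (char 'a') >> char 'a' >> return ()

-- "a*a" with the backtracking star used in Parsec 2 above.
greedy :: Parser ()
greedy = star (\cont -> char 'a' >> cont) (char 'a' >> return ())
  where star x cont = fix (\this -> try (x this) <|> cont)

-- parse possessive "" "aaa"  ==>  Left ... (unexpected end of input)
-- parse greedy     "" "aaa"  ==>  Right ()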
All other changes were needed to handle the conversion to CPS style, but they did not alter the process of matching the regex.
-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
-- For the regex-dna benchmark
--
-- http://haskell.org/hawiki/ShootoutEntry
--
-- March 3rd, 2006
--
-- Pure parsec solution by Chris Kuklewicz and Don Stewart
--
-- This satisfies the new stringent specification
--

import Control.Monad.Fix (fix)
import Control.Monad (liftM,liftM2,msum,sequence)
import Data.Either (either)
import Data.List -- (intersperse,nub,sort,foldl')
import Text.ParserCombinators.Parsec
import Debug.Trace

main = interact $ \input ->
  let length1 = show $ length input in length1 `seq`
  let clean = reRemove ">[^\n]*\n|\n" input
      length2 = show $ length clean in length2 `seq`
  let table = map (\regexp -> regexp ++ (' ':show (reCount regexp clean))) variants
      -- length3 = show $ length (super clean)
      -- The foldl' below is more orthodox, but slower
      length3 = show $ length $ foldl' reReplace3 clean iubs
  in unlines $ table ++ ["",length1,length2,length3]

variants = ["agggtaaa|tttaccct","[cgt]gggtaaa|tttaccc[acg]","a[act]ggtaaa|tttacc[agt]t"
           ,"ag[act]gtaaa|tttac[agt]ct","agg[act]taaa|ttta[agt]cct","aggg[acg]aaa|ttt[cgt]ccct"
           ,"agggt[cgt]aa|tt[acg]accct","agggta[cgt]a|t[acg]taccct","agggtaa[cgt]|[acg]ttaccct"]

iubs = map (\ (c,new) -> (parseRegexp c, c, new) ) $
  [("B","(c|g|t)"),("D","(a|g|t)"),("H","(a|c|t)"),("K","(g|t)"),("M","(a|c)")
  ,("N","(a|c|g|t)"),("R","(a|g)"),("S","(c|g)"),("V","(a|c|g)"),("W","(a|t)"),("Y","(c|t)")]

-- clean is a bit more clever, searching for each replacement in turn.
-- Each reReplaceMap takes a continuation to use for input that does
-- not match, each continuation leads to the next reReplaceMap (ending
-- in 'id' ). This is not searching in parallel!
super = foldr1 (.) (map reReplaceMap iubs) id

-- Regexp INTERFACE

-- Replace each match of regexp with new in the input, processing everything before returning
-- (Used to clean the input string)
reReplace regexp new input =
  let regexp' = parseRegexp regexp
      parser = sepBy (many (notFollowedBy (try (regexp' >> return '\0')) >> anyChar)) regexp'
      result = either (error.show) id (parse parser regexp input)
  in concat $ intersperse new result

-- Replace each match of regexp with new in the input, returning the matching or non-matching blocks lazily
-- (Currently unused)
reReplace1 regexp new s =
  let regexp' = parseRegexp regexp
      parser = do
        value <- (try regexp' >> return new)
                 <|> do c <- anyChar
                        cs <- many (notFollowedBy (try (regexp' >> return '\0')) >> anyChar)
                        return (c:cs)
        rest <- getInput
        return (value,rest)
      helper "" = ""
      helper input = let (this,next) = either (error.show) id (parse parser regexp input)
                     in this ++ helper next
  in helper s

takeAppend :: String -> Int -> String -> String
takeAppend s n e = loop s n
  where loop _ 0 = e
        loop (x:xs) n = x : loop xs (pred n)
{-# INLINE takeAppend #-}

takeS :: String -> Int -> ShowS -> ShowS
takeS s n e x = loop s n -- trace ("\n>"++show n++"<\n") loop s n
  where loop _ 0 = e x
        loop (x:xs) n = x : loop xs (pred n)

reReplace2 regexp new s =
  let newS = shows new
      regexp' = parseRegexp regexp
      parser = do
        start <- getInput
        let loop n | n `seq` True =
              (try regexp' >> (return (takeAppend start n new)))
              <|> (anyChar >> loop (succ n))
              <|> (eof >> return start)
        liftM2 (,) (loop (0::Int)) getInput
      helper "" = ""
      helper input = let (this,next) = either (error.show) id (parse parser regexp input)
                     in this ++ helper next
  in helper s

reRemove regexp s = reReplace3 s (parseRegexp regexp,regexp,"")

reReplace3 s (regexp',regexp,new) =
  let newS = (new++)
      parser = do
        start <- getInput
        let loop n | n `seq` True =
              (try regexp' >> updateState (. (takeS start n newS)) >> parser)
              <|> (anyChar >> loop (succ n))
              <|> (eof >> getState >>= \f -> return (f . (const start)))
        loop 0
      result = either (error.show) id (runParser parser id regexp s) $ ""
  in result

-- Count the number of non-overlapping matches of regexp in the input
-- (Used to count each of the sequences)
reCount regexp input =
  let regexp' = parseRegexp regexp
      parser = (try regexp' >> updateState (+1) >> parser)
               <|> (anyChar >> parser)
               <|> (eof >> getState)
  in either (error.show) id (runParser parser (0::Int) regexp input)

reCount' regexp input =
  let regexp' = parseRegexp regexp
      parser = (eof >> getState)
               <|> ( ( (try regexp' >> updateState (+1))
                       <|> (anyChar >> return ()) ) >> parser )
  in either (error.show) id (runParser parser (0::Int) regexp input)

-- When regexp' matches, replace it with new; otherwise pass the non-matching block to cont
-- (Used by clean to replace the IUBS codes)
reReplaceMap (regexp',regexp,new) cont input =
  let parser = do
        value <- (regexp' >> return new)
                 <|> do c <- anyChar
                        cs <- many (notFollowedBy (try (regexp' >> return '\0')) >> anyChar)
                        return (cont (c:cs))
        rest <- getInput
        return (value,rest)
      helper "" = ""
      helper rest = let (this,next) = either (error.show) id (parse parser regexp rest)
                    in this ++ helper next
  in helper input

-- Turn a string regexp into a Parsec parser
type T s = CharParser s ()
type TT s = T s -> T s
type CTT s = CharParser () (TT s)

parseRegexp :: String -> T s
parseRegexp = (either (error.show) ($ return ())) . (\x -> parse p_regexp x x)

-- Regexp IMPLEMENTATION (exporting p_regexp to Regexp INTERFACE)

p_regexp :: CTT s
p_regexp = do
  branches <- sepBy1 p_branch (char '|')
  return (\cont -> let branches' = map (\branch -> try (branch cont)) branches
                   in msum branches')

p_branch :: CTT s
p_branch = return . (foldr1 (.)) =<< many1 p_piece

p_piece :: CTT s
p_piece = p_atom >>= p_post_atom

p_atom :: CTT s
p_atom = try (string "()" >> return id)
     <|> between (char '(') (char ')') p_regexp
     <|> p_bracket <|> p_dot <|> p_escaped_char <|> p_other_char

p_post_atom :: TT s -> CTT s
p_post_atom atom = (char '?' >> return (\cont -> try (atom cont) <|> cont))
               <|> (char '+' >> return (atom . star atom))
               <|> (char '*' >> return (star atom))
               <|> return atom
  where star :: TT s -> T s -> T s
        star x cont = fix (\this -> try (x this) <|> cont)

p_bracket :: CTT s
p_bracket = char '[' >> ( (char '^' >> p_set True) <|> (p_set False) )

p_set :: Bool -> CTT s
p_set invert = do
  initial <- option "" ((char ']' >> return "]") <|> (char '-' >> return "-"))
  middle <- manyTill anyChar (char ']')
  let expand [] = []
      expand ('-':[]) = "-"
      expand (a:'-':b:rest) | a /= '-' = (enumFromTo a b)++(expand rest)
      expand (x:xs) | x /= '-' = x:(expand xs)
                    | otherwise = error "A dash is in the wrong place in a p_set"
      characters = nub ( sort (initial ++ (expand middle)) )  -- can't be empty
  if invert then use (noneOf characters)
            else use (oneOf characters)

p_dot :: CTT s
p_dot = char '.' >> use anyChar

p_escaped_char :: CTT s
p_escaped_char = char '\\' >> anyChar >>= use . char

p_other_char :: CTT s
p_other_char = label (noneOf specials) ("none of "++show specials) >>= use . char
  where specials = "^.[$()|*+?\\"

-- use is needed to alter the returned type of character parsers to (TT s)
use :: CharParser s ignore -> CTT s
use x = return (x >>)
8 Alternatives
8.1 Speedup Idea - precalculate prefixes
Precalculate regexp prefixes to drop more than one character on pattern failure.
At the moment, when a pattern match "try" fails, one character is dropped and the pattern is tried again. This leads to roughly "input length" trials for each of the patterns and each occurring "|", making it (9 patterns * 2 ORs * input length = 9*2*100000 =) 1,800,000 applications of "try" even for the small input file.
{- find the maximal number of characters to skip when a regular
   expression parser just failed. when we know after how many
   characters a parser failed and where the first pattern may appear
   in a valid pattern, the minimum of these numbers should be the
   number of characters we can skip, if i'm not mistaken.

   NOTE: code is not fully tested and substring is not implemented
   for strings of different lengths.

   Contributed by: Johannes Ahlmann -}
import Data.List
import Text.ParserCombinators.Parsec

-- split input regular expression into [[Char]] to be later expanded by sequence
-- NOTE: only supports non-negated brackets and "other" chars, yet
p_regexp = sepBy1 p_branch (char '|')
p_bracket = between (char '[') (char ']') (many1 $ noneOf "]")
p_branch = many1 ((try p_bracket) <|> (noneOf specials >>= return . (:[])))
  where specials = "^.[$()|*+?\\"

-- substring function that also returns at what position "a" is contained in "b"
containsAt a b = if length b >= length a then length b else error "not implemented"

-- function to determine if and where "a" or one of its prefixes is the postfix of "b"
tailsAt a b = if null tl then lb else lb - (maximum tl)
  where tl = [length y | x <- (drop 1 . inits) a, y <- tails b, x == y]
        lb = length b

allowsSkip a b = min (tailsAt a b) (containsAt a b)

-- calculate the minimum number of characters to skip for all expanded variants of a regexp
skipper vars = if null skipLengths then minimum (map length vars) else minimum skipLengths
  where skipLengths = map (uncurry allowsSkip) [(a,b) | a <- vars, b <- vars, a /= b]

myparse = either (error.show) id . parse p_regexp ""
expand = concatMap sequence . myparse

-- calculate the number of characters to skip, such that no other pattern could fit into them.
skipLength regexp = skipper $ expand regexp
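A quick hand-run sanity check (my own, not from the original note): for the first variant the two 8-mers neither overlap each other nor fit inside one another, so the full pattern length comes back.

-- assumed GHCi session over the definitions above:
--   > expand "agggtaaa|tttaccct"
--   ["agggtaaa","tttaccct"]
--   > skipLength "agggtaaa|tttaccct"
--   8
-- i.e. after a failed match the scanner could skip 8 characters instead of 1.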
8.2 How to gauge where the error occurred
Catching errors inside a parser is impossible: you have to call runParser or parse to catch the error. And the "try" combinator hides the error position (it resets to the 1st column). So I have invented a trick:
-- by ChrisKuklewicz
-- Proposed, untested trick to replace msum for p_regexp definition
-- Returns the Left Column for failure
--         Right Column for success
sub p = do
  i <- getInput
  let e = parse p "sub" i
  case e of
    Left pe -> do let c = sourceColumn (errorPos pe)
                  return (Left c)
    Right _ -> do c <- liftM sourceColumn getPosition
                  return (Right c)

-- Returns the Left Column for failures (minimum of two failure Columns)
--         Right Column for a success (use first success, left biased)
orSub p1 p2 = do
  ec1 <- p1
  case ec1 of
    Left c1 -> do ec2 <- p2
                  case ec2 of
                    Left c2 -> return (Left (min c1 c2))
                    right   -> return right
    right -> return right

msumSub [x] = (sub x)
msumSub xs  = foldr1 orSub (map sub xs)
At a higher level, we want to advance 1 character on success (Right _) or use the Speedup Idea from Johannes to decide what to skip on failure.
Another strategy is to modify {{{use}}} and replace {{{sequence_}}} in {{{p_branch}}} by {{{series}}} and {{{msum}}} in {{{p_regexp}}} by {{{msumSub}}}:
-- Never tested:
-- Don't let the parser fail, capture the error column in the return value explicitly
-- Note that x is a parser for exactly one character
use x = return ( (x >> liftM Right getPosition) <|> (liftM Left getPosition) )

-- series replaces the (>>) that sequence_ folds into the list with different abort semantics
series [x] = x
series (x:xs) = do
  ec <- x
  case ec of
    Right _ -> series xs
    left    -> return left
Hmmm... I like this one better, but {{{p_post_atom}}} will need similar fixing ({{{myUnit, myOptional, mySkipMany1, mySkipMany}}}).
-- All I can say is it compiles; it has never been tested

-- replace return () for "()" atoms
myUnit = liftM Right getPosition

-- Replace optional for '?' modifier
myOptional p = do
  missed <- liftM Right getPosition
  ec <- p
  case ec of
    Left _ -> return missed
    right  -> return right

-- mySkip is Helper for mySkipMany and mySkipMany1
mySkip prev = do
  ec <- p
  case ec of
    Left _ -> return prev
    right  -> loop right

-- Replace skipMany for '*' modifier
mySkipMany p = liftM (mySkip.Right) getPosition

-- Replace skipMany1 for '+' modifier
mySkipMany1 p = do
  ec <- p
  case ec of
    right@(Right _) -> mySkip right
    left            -> return left
Re-inventing all the parsec combinators is tricky, but it just may work...
8.3 REs using TCL library
As yet, this is without cleaning or substitution.
On my computer it takes 13s to count the regular expression matches on the big data set (N=500000), whereas the Ruby version takes 7s.
Obviously there are many things to improve/change. At the moment the input is treated as one large CString, which is not a valid approach for the not yet implemented substitution. Problems with dependence on C2Hs being installed and potential issues of legality also exist.
It seems that TCL uses "Henry Spencer's RE library", which would most likely be easier and better to bind directly, because the TCL library is pretty much made to measure for TCL and is not intended for external use.
Profiling showed that 92% of the time is spent in "reExec", therefore all attempts at optimizing the Haskell portion of code are rather useless. Maybe there are some tweaks to get more performance out of the FFI, though.
{- | Regular Expressions with the TCL RE library.
     TCL development headers have to be installed.
     To compile:
       c2hs /usr/include/tcl8.4/tcl-private/generic/regcustom.h T.chs
       ghc -ffi -ltcl8.4 --make T.hs
     Contributed by Johannes Ahlmann -}
module Main where

import Foreign
import Foreign.C

-- | find the relative start and end of a match. assumes at least one match was found
getMatchExtent re = do
  info <- mallocBytes {#sizeof Tcl_RegExpInfo#} -- little speedup in extracting
  reInfo re info
  start <- {#get Tcl_RegExpInfo.matches->start#} info
  end   <- {#get Tcl_RegExpInfo.matches->end#} info
  free info
  return (start,end)

-- | count occurrences of regular expression in string
-- FIXME: maybe use "end" instead of "start+1" for validity (overlapping matches).
--        speedwise it makes no difference.
countOccs re cString n = do
  m <- reExec nullPtr re cString cString -- real start if different!??
  if m /= 1
    then return n
    else do (start,end) <- getMatchExtent re
            countOccs re (advancePtr cString (fromIntegral $ start + 1)) (n+1)

---

rExpressions = ["agggtaaa|tttaccct"
               ,"[cgt]gggtaaa|tttaccc[acg]"
               ,"a[act]ggtaaa|tttacc[agt]t"
               ,"ag[act]gtaaa|tttac[agt]ct"
               ,"agg[act]taaa|ttta[agt]cct"
               ,"aggg[acg]aaa|ttt[cgt]ccct"
               ,"agggt[cgt]aa|tt[acg]accct"
               ,"agggta[cgt]a|t[acg]taccct"
               ,"agggtaa[cgt]|[acg]ttaccct"
               ]

-- | given a string and a regular expression, count the RE's number of occurrences
counter s rx = withCString rx (\p -> do
  re <- reCompile nullPtr p
  n  <- countOccs re s 0
  return (rx, n))

main = do
  input  <- getContents
  counts <- withCString input (\s -> mapM (counter s) rExpressions)
  mapM (\(e,c) -> putStrLn (e ++ " " ++ show c)) counts

reCompile = {#call unsafe Tcl_RegExpCompile as ^ #}
reExec    = {#call unsafe Tcl_RegExpExec as ^ #}
reInfo    = {#call unsafe Tcl_RegExpGetInfo as ^ #}
9 Older illegal entries
9.1 Possible backup entry : lazy regex 5
Lazy regex 4, but with the parsec frontend to construct combinator expressions from strings dynamically (I guess we could use TH too).
See this thread for people complaining about combinator syntax for regexes.
It would be nice to see a Text.ParserCombinators.Regex made from this interface.
{-# OPTIONS -funbox-strict-fields #-}
--
-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
--
-- http://haskell.org/hawiki/ShootoutEntry
-- Contributed by Don Stewart, Chris Kuklewicz and Alson Kemp.
--
-- Compile with: -O2 -package parsec
--
-- An idiomatic Haskell entry using lazy regex combinators, described in the paper:
--
--   Manuel M. T. Chakravarty, Lazy Lexing is Fast.
--   In A. Middeldorp and T. Sato, editors, Proceedings of Fourth Fuji
--   International Symposium on Functional and Logic Programming,
--   Springer-Verlag, LNCS 1722, 1999.
--
-- For more about higher-order combinator-based lexing and parsing in Haskell consult:
--
--   Graham Hutton. Higher-order functions for parsing. (1992)
--   Journal of Functional Programming 2: 232-343.
--
--   Jeroen Fokker. Functional Parsers. (1995)
--   Lecture Notes of the Baastad Spring school on Functional Programming.
--
--   Graham Hutton and Erik Meijer. Monadic Parser Combinators. (1996)
--   Technical report NOTTCS-TR-96-4. Department of Computer Science, University of Nottingham.
--
--   Steve Hill. Combinators for parsing expressions. (1996)
--   Journal of Functional Programming 6(3): 445-463.
--
--   Andrew Partridge and David Wright.
--   Predictive parser combinators need four values to report errors. (1996)
--   Journal of Functional Programming 6(2): 355-364.
--
--   Doaitse Swierstra and Luc Duponcheel.
--   Deterministic, Error-Correcting Combinator Parsers. (1996)
--   Advanced Functional Programming. LNCS 1129: 185-207.
--
--   Pieter Koopman and Rinus Plasmeijer. Efficient Combinator Parsers. (1999)
--   Implementation of Functional Languages. Springer Verlag, LNCS 1595: 122-138.
--
--   Doaitse Swierstra and Pablo Azero.
--   Fast, Error Correcting Parser Combinators: A Short Tutorial. (1999)
--   SOFSEM'99 Theory and Practice of Informatics. LNCS 1725: 111-129.
--
--   Arthur Baars, Andres Loh, and Doaitse Swierstra. Parsing Permutation Phrases. (2001)
--   Proceedings of the ACM SIGPLAN Haskell Workshop, 171-183.
--
-- And many other sources.
--

import Prelude hiding (last)
import List
import Maybe
import Data.Array (Array, (!), assocs, accumArray)
import qualified Data.Map as M
import qualified Data.IntMap as I
import Control.Monad
import qualified Text.ParserCombinators.Parsec as P
import Text.ParserCombinators.Parsec ((<|>),GenParser)

main = interact $ \s0 ->
  let l0 = length s0 in l0 `seq`
  let s1 = fst $ run clean (LS s0 I.empty)
      l1 = length s1 in l1 `seq`
  let (LS _ m) = snd $ run count  (LS s1 I.empty)
      (LS _ n) = snd $ run count' (LS s1 I.empty)
      counts  = map (\(i,p) -> p++" "++show (I.findWithDefault 0 i m)) $ (init variants)
      counts' = (snd.last) variants++" "++show (I.findWithDefault 0 0 n)
      l2 = length . concat . fst $ run replace (LS s1 I.empty)
  in unlines $ (counts++[counts']) ++ [[]] ++ map show [l0, l1, l2]

-- remove "\n" and lines with ">"
clean = (regex "." `action` \[c] -> Just c)
   >||< (regex "\n|>[^\n]+\n" `action` const Nothing)

-- count all variants, accumulating count in state threaded through regex matcher
count  = foldl1 (>||<) $ [ match (regex s, n) | (n,s) <- variants ]
count' = match (regex ((snd.last) variants), 0)  -- obscure issue with overlapping patterns

-- each pattern keeps track of its own count, so we can perform a single pass
match (p,i) = p `meta` \_ m -> R Nothing (I.insertWith (+) i 1 m) Nothing

variants = zip [0..]
  ["agggtaaa|tttaccct","[cgt]gggtaaa|tttaccc[acg]","a[act]ggtaaa|tttacc[agt]t"
  ,"ag[act]gtaaa|tttac[agt]ct","agg[act]taaa|ttta[agt]cct","aggg[acg]aaa|ttt[cgt]ccct"
  ,"agggt[cgt]aa|tt[acg]accct","agggta[cgt]a|t[acg]taccct","agggtaa[cgt]|[acg]ttaccct"]

-- substitute certain chars for patterns
replace = (regex "." `action` \c -> Just c)
     >||< foldl1 (>||<) (map (\(c,p) -> regex [c] `action` const (Just p)) pairs)

pairs = [('B',"(c|g|t)"),('D',"(a|g|t)"), ('H',"(a|c|t)"),('K',"(g|t)")
        ,('M',"(a|c)"), ('N',"(a|c|g|t)"),('R',"(a|g)"), ('S',"(c|g)")
        ,('V',"(a|c|g)"),('W',"(a|t)"), ('Y',"(c|t)") ]

------------------------------------------------------------------------
--
-- Construct a regex combinator expression from a string regex
--
-- I know, I know, in Haskell we'd just write the combinator expression
-- directly, but people complained that they're not `real' regular
-- expressions. That's clearly wrong in my opinion -- combinators
-- accurately describe and implement regular expressions (they're
-- semantically identical). The syntax is an embedded domain-specific
-- language in Haskell, not an untyped string. Sigh.
--

regex = (either (error.show) id) . (\x -> P.parse p_regex x x)

p_regex = liftM (foldr1 (>|<)) (P.sepBy1 p_branch (P.char '|'))

p_branch = liftM (($ epsilon).(foldr (.) id)) (P.many1 (p_atom >>= p_post_atom))

p_atom = P.try (P.string "()" >> return epsilon)  -- must not consume '(' yet
     <|> P.between (P.char '(') (P.char ')') p_regex
     <|> p_bracket <|> p_dot <|> p_escaped_char <|> p_other_char

p_post_atom atom = (P.char '?' >> return (atom `quest`))
               <|> (P.char '+' >> return (atom `plus`))
               <|> (P.char '*' >> return (atom `star`))
               <|> (return (atom +>))

p_bracket = (P.char '[') >> ( (P.char '^' >> p_set True) <|> (p_set False) )

p_set invert = do
  initial <- (P.option "" ((P.char ']' >> return "]") <|> (P.char '-' >> return "-")))
  middle <- P.manyTill P.anyChar (P.char ']')
  let expand [] = []
      expand ('-':[]) = "-"
      expand (a:'-':b:rest) | a /= '-' = (enumFromTo a b)++(expand rest)
      expand (x:xs) | x /= '-' = x:(expand xs)
                    | otherwise = error "A dash is in the wrong place in a p_set"
      characters = nub ( sort (initial ++ (expand middle)) )
  return $ if invert then alt ( ['\0'..'\127'] \\ characters )
                     else alt characters

p_dot = P.char '.' >> return (alt ['\0'..'\127'])

p_escaped_char = P.char '\\' >> liftM char P.anyChar

p_other_char = liftM char (P.noneOf specials) where specials = "^.[$()|*+?\\"

------------------------------------------------------------------------
-- And everything else is the CTK library.
--
-- Compiler Toolkit: Self-optimizing lexers
--
-- Author : Manuel M. T. Chakravarty
-- Created: 24 February 95, 2 March 99
-- Copyright (c) [1995..2000] Manuel M. T. Chakravarty
-- Copyright (c) 2004-6 Don Stewart
--
-- Self-optimizing lexer combinators.
--
-- For detailed information, see ``Lazy Lexing is Fast'', Manuel
-- M. T. Chakravarty, in A. Middeldorp and T. Sato, editors, Proceedings of
-- Fourth Fuji International Symposium on Functional and Logic Programming,
-- Springer-Verlag, LNCS 1722, 1999. (See my Web page for details.)
--
-- http://www.cse.unsw.edu.au/~chak/papers/Cha99.html
--
-- Thanks to Simon L. Peyton Jones and Roman Leshchinskiy for their
-- helpful suggestions that improved the design of this library.
--

infixr 4 `quest`, `star`, `plus`
infixl 3 +>, `action`, `meta`
infixl 2 >|<, >||<

-- we use the dense representation if a table has at least the given number of
-- (non-error) elements
denseMin :: Int
denseMin = 20

-- represents the number of (non-error) elements and the bounds of a table
data BoundsNum = B !Int !Char !Char

-- combine two bounds
addBoundsNum (B n lc hc) (B n' lc' hc') = B (n + n') (min lc lc') (max hc hc')

-- check whether a character is in the bounds
inBounds c (B _ lc hc) = c >= lc && c <= hc

-- Lexical actions take a lexeme with its position and may return a token; in
-- a variant, an error can be returned
--
-- * if there is no token returned, the current lexeme is discarded and lexing
--   continues looking for a token
type Action t = String -> Maybe t

-- Meta actions transform the lexeme, and a user-defined state; they
-- may return a lexer, which is then used for accepting the next token
-- (this is important to implement non-regular behaviour like nested
-- comments)
type Meta t = String -> S -> Result t

data Result t = R (Maybe t) !S (Maybe (Lexer t))

-- threaded top-down during lexing (current input, meta state)
data LexerState = LS !String !S
type S = I.IntMap Int

-- tree structure used to represent the lexer table
--
-- * each node in the tree corresponds to a state of the lexer; the associated
--   actions are those that apply when the corresponding state is reached
data Lexer t = Lexer (LexAction t) (Cont t)

-- represent the continuation of a lexer
data Cont t = -- on top of the tree, where entries are dense, we use arrays
              Dense BoundsNum (Array Char (Lexer t))
              -- further down, where the valid entries are sparse, we
              -- use association lists, to save memory
            | Sparse BoundsNum (M.Map Char (Lexer t))
            | Done  -- end of an automaton

-- lexical action
data LexAction t = Action !(Meta t) | NoAction

-- a regular expression
type Regexp t = Lexer t -> Lexer t

-- Empty lexeme
epsilon :: Regexp t
epsilon = id

-- One character regexp
char :: Char -> Regexp t
char c = \l -> Lexer NoAction (Sparse (B 1 c c) (M.singleton c l))

-- Concatenation of regexps
(+>) :: Regexp t -> Regexp t -> Regexp t
(+>) = (.)

-- Close a regular expression with an action that converts the lexeme into a
-- token
--
-- * Note: After the application of the action, the position is advanced
--   according to the length of the lexeme. This implies that normal
--   actions should not be used in the case where a lexeme might contain
--   control characters that imply non-standard changes of the position,
--   such as newlines or tabs.
--
action re a = re `meta` a' where a' lexeme s = R (a lexeme) s Nothing
{-# INLINE action #-}

-- Close a regular expression with a meta action
--
-- * Note: Meta actions have to advance the position in dependence of the
--   lexeme by themselves.
--
meta re a = re (Lexer (Action a) Done)
{-# INLINE meta #-}

-- disjunctive combination of two regexps
re >|< re' = \l -> re l >||< re' l

-- disjunctive combination of two lexers
(Lexer a c) >||< (Lexer a' c') = Lexer (joinActions a a') (joinConts c c')

-- combine two disjunctive continuations
-- joinConts :: Cont t -> Cont t -> Cont t
joinConts Done c' = c'
joinConts c Done  = c
joinConts c c'    = let (bn , cls ) = listify c
                        (bn', cls') = listify c'
                    -- note: `addBoundsNum' can, at this point, only
                    -- approx. the number of *non-overlapping* cases;
                    -- however, the bounds are correct
                    in aggregate (addBoundsNum bn bn') (cls ++ cls')
  where listify (Dense  n arr) = (n, assocs arr)
        listify (Sparse n cls) = (n, M.toList cls)

-- combine two actions.
Use the latter in case of overlap (!) joinActions NoAction a' = a' joinActions a NoAction = a joinActions _ a' = a' -- error "Lexers.>||<: Overlapping actions!" -- Note: `n' is only an upper bound of the number of non-overlapping cases aggregate bn@(B n lc hc) cls | n >= denseMin = Dense bn (accumArray (>||<) (Lexer NoAction Done) (lc, hc) cls) | otherwise = Sparse bn (M.fromList (accum (>||<) cls)) -- combine the elements in the association list that have the same key accum _ [] = [] accum f ((c, el):ces) = let (ce, ces') = gather c el ces in ce : accum f ces' where gather k e [] = ((k, e), []) gather k e (ke'@(k', e'):kes) | k == k' = gather k (f e e') kes | otherwise = let (ke'', kes') = gather k e kes in (ke'', ke':kes') -- x `star` y corresponds to the regular expression x*y -- -- The definition used below can be obtained by equational reasoning from this -- one (which is much easier to understand): -- -- star re1 re2 = let self = (re1 +> self >|< epsilon) in self +> re2 -- -- However, in the above, `self' is of type `Regexp s t' (ie, a functional), -- whereas below it is of type `Lexer s t'. Thus, below we have a graphical -- body (finite representation of an infinite structure), which doesn't grow -- with the size of the accepted lexeme - in contrast to the definition using -- the functional recursion. star re1 re2 = \l -> let self = re1 self >||< re2 l in self -- x `plus` y corresponds to the regular expression x+y plus re1 re2 = re1 +> (re1 `star` re2) -- x `quest` y corresponds to the regular expression x?y quest re1 re2 = (re1 +> re2) >|< re2 -- accepts a non-empty set of alternative characters -- Equiv. to `(foldr1 (>|<) . map char) cs', but much faster alt cs = \l -> let bnds = B (length cs) (minimum cs) (maximum cs) in Lexer NoAction (aggregate bnds [(c, l) | c <- cs]) -- accept a character sequence string cs = (foldr1 (+>) . map char) cs -- apply a lexer, yielding a token sequence and a list of errors -- -- * Currently, all errors are fatal; thus, the result is undefined in case of -- an error (this changes when error correction is added). -- * The final lexer state is returned. -- * The order of the error messages is undefined. -- * the following is moderately tuned -- run _ st@(LS [] _) = ([], st) run l st = case lexOne l st of (Nothing , _ , st') -> run l st' (Just t, l', st') -> let (ts, final) = run l' st'; ts' = (t:ts) in ts' `seq` (ts',final) where -- accept a single lexeme lexOne l0 st'@(LS cs s) = oneLexeme l0 st' id lexErr where lexErr = (Just undefined, l, (LS (tail cs) s)) -- we take an open list of characters down, where we accumulate the -- lexeme; this function returns maybe a token, the next lexer to use -- (can be altered by a meta action), the new lexer state, and a list -- of errors -- -- we implement the "principle of the longest match" by taking a -- potential result quadruple down (in the last argument); the -- potential result quadruple is updated whenever we pass by an action -- (different from `NoAction'); initially it is an error result -- oneLexeme (Lexer a cont) state@(LS cs s) csDL last = let last' = doaction a csDL state last in case cs of [] -> last' -- at end, has to be this action (c:cs') -> oneChar cont c (LS cs' s) csDL last' -- keep looking -- There are more chars. 
Look at the next one -- Now, if the next tbl is Done, then there is no more -- transition, so immediately execute our action oneChar tbl c state csDL last = case peek tbl c of Nothing -> last Just (Lexer a Done) -> doaction a (\l -> csDL (c:l)) state last Just l' -> oneLexeme l' state (\l -> csDL (c:l)) last -- Do the lookup. peek (Dense bn arr) c | c `inBounds` bn = Just $ arr ! c peek (Sparse bn cls) c | c `inBounds` bn = M.lookup c cls peek _ _ = Nothing -- execute the action if present and finalise the current lexeme doaction (Action f) csDL (LS cs s) _ = case f (($[]) csDL) s of (R Nothing s' l') | not . null $ cs -> lexOne (fromMaybe l0 l') (LS cs s') (R res s' l') -> (res, (fromMaybe l0 l'), (LS cs s')) doaction NoAction _ _ last = last
9.2 Current entry: lazy regex 4
Uses regex-only transformations, and is thus a legal version of "Lazy regex 3". It also works around an as-yet-unidentified bug in the folded version of countMatches. Look for the old-skool {{{interact}}} :) This has been submitted. Space usage could be better, though.
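The skeleton of that {{{interact}}} pipeline, in miniature (a sketch of the structure only, with a trivial filter standing in for the regex passes; no space-usage guarantee implied): each recorded length is forced with {{{seq}}} before the next string is computed, mirroring the entry's main.

main :: IO ()
main = interact $ \s0 ->
    -- force l0 before building s1, as the entry does
    let l0 = length s0 in l0 `seq`
    let s1 = filter (/= '\n') s0
        l1 = length s1 in l1 `seq`
    unlines (map show [l0, l1])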
{-# OPTIONS -funbox-strict-fields #-} -- -- The Computer Language Shootout - http://shootout.alioth.debian.org/ -- Haskell Wiki page for Shootout entries - http://haskell.org/hawiki/ShootoutEntry -- -- Contributed by Don Stewart -- Haskell RegEx C bindings are slow (see the other GHC entry), so this shows -- that Haskell can compete with a purely functional entry based on lazy -- regex combinators as detailed in the RegEx section. import Prelude hiding (last) import List import Maybe import Data.Array (Array, (!), assocs, accumArray) import qualified Data.Map as M import qualified Data.IntMap as I main = interact $ \s0 -> let l0 = length s0 in l0 `seq` let s1 = fst $ run clean (LS s0 I.empty) l1 = length s1 in l1 `seq` let (LS _ m) = snd $ run count (LS s1 I.empty) (LS _ n) = snd $ run count' (LS s1 I.empty) counts = map (\(i,p) -> p++" "++show (I.findWithDefault 0 i m)) $ (init variants) counts' = (snd.last) variants++" "++show (I.findWithDefault 0 0 n) -- work around for count' l2 = length . concat . fst $ run replace (LS s1 I.empty) in unlines $ (counts++[counts']) ++ [[]] ++ map show [l0, l1, l2] -- remove "\n" and lines with ">" clean = (dot `action` \[c] -> Just c) >||< (char '\n' `action` const Nothing) >||< (char '>' +> anyButNL `star` char '\n' `action` const Nothing) -- count all variants, accumulating count in state threaded through regex matcher -- note: count' is a workaround for an obscure problem with overlapping patterns(?) count = (dot `action` const Nothing) >||< (foldl1 (>||<) $ map match (zip pats [0..])) where pats = [(string "agggtaaa" >|< string "tttaccct") ,((alt "cgt" +> string "gggtaaa") >|< (string "tttaccc" +> alt "acg")) ,((char 'a' +> alt "act" +> string "ggtaaa") >|< (string "tttacc" +> alt "agt" +> t)) ,((string "ag" +> alt "act" +> string "gtaaa") >|< (string "tttac" +> alt "agt" +> string "ct")) ,((string "agg" +> alt "act" +> string "taaa") >|< (string "ttta" +> alt "agt" +> string "cct")) ,((string "aggg" +> alt "acg" +> string "aaa") >|< (string "ttt" +> alt "cgt" +> string "ccct")) ,((string "agggt" +> alt "cgt" +> string "aa") >|< (string "tt" +> alt "acg" +> string "accct")) ,((string "agggta" +> alt "cgt" +> a) >|< (char 't' +> alt "acg" +> string "taccct"))] count' = match (((string "agggtaa" +> alt "cgt") >|< (alt "acg" +> string "ttaccct")),0) -- substitute certain chars for patterns replace = (alt (['a'..'z'] ++ ['A'..'Z']) `action` \c -> Just c) >||< foldl1 (>||<) (map (\(c,p) -> char c `action` const (Just p)) pairs) -- Utility functions and constants -- each pattern keeps track of its own count, so we can perform a single pass match (p,i) = p `meta` \_ m -> R Nothing (I.insertWith (+) i 1 m) Nothing dot = alt ['\32' .. '\122'] -- defines the "." RegEx of any character anyButNL = alt $ ['\32' .. '\122'] \\ ['\n'] -- same as "." but excludes newline variants = zip [0..] ["agggtaaa|tttaccct","[cgt]gggtaaa|tttaccc[acg]","a[act]ggtaaa|tttacc[agt]t" ,"ag[act]gtaaa|tttac[agt]ct","agg[act]taaa|ttta[agt]cct","aggg[acg]aaa|ttt[cgt]ccct" ,"agggt[cgt]aa|tt[acg]accct","agggta[cgt]a|t[acg]taccct","agggtaa[cgt]|[acg]ttaccct"] pairs = [('B',"(c|g|t)"),('D',"(a|g|t)"), ('H',"(a|c|t)"),('K',"(g|t)") ,('M',"(a|c)"), ('N',"(a|c|g|t)"),('R',"(a|g)"), ('S',"(c|g)") ,('V',"(a|c|g)"),('W',"(a|t)"), ('Y',"(c|t)") ] ------------------------------------------------------------------------ -- And now the regex library ------------------------------------------------------------------------ -- Compiler Toolkit: Self-optimizing lexers -- -- Author : Manuel M. T. 
Chakravarty -- Created: 24 February 95, 2 March 99 -- Copyright (c) [1995..2000] Manuel M. T. Chakravarty -- Copyright (c) 2004-6 Don Stewart -- -- Self-optimizing lexer combinators. -- -- For detailed information, see ``Lazy Lexing is Fast'', Manuel -- M. T. Chakravarty, in A. Middeldorp and T. Sato, editors, Proceedings of -- Fourth Fuji International Symposium on Functional and Logic Programming, -- Springer-Verlag, LNCS 1722, 1999. (See my Web page for details.) -- -- http://www.cse.unsw.edu.au/~chak/papers/Cha99.html -- -- Thanks to Simon L. Peyton Jones and Roman Leshchinskiy for their -- helpful suggestions that improved the design of this library. -- infixr 4 `quest`, `star`, `plus` infixl 3 +>, `action`, `meta` infixl 2 >|<, >||< -- we use the dense representation if a table has at least the given number of -- (non-error) elements denseMin :: Int denseMin = 20 -- represents the number of (non-error) elements and the bounds of a table data BoundsNum = B !Int !Char !Char -- combine two bounds addBoundsNum (B n lc hc) (B n' lc' hc') = B (n + n') (min lc lc') (max hc hc') -- check whether a character is in the bounds inBounds c (B _ lc hc) = c >= lc && c <= hc -- Lexical actions take a lexeme with its position and may return a token; in -- a variant, an error can be returned -- -- * if there is no token returned, the current lexeme is discarded lexing -- continues looking for a token type Action t = String -> Maybe t -- Meta actions transform the lexeme, and a user-defined state; they -- may return a lexer, which is then used for accepting the next token -- (this is important to implement non-regular behaviour like nested -- comments) type Meta t = String -> S -> Result t data Result t = R (Maybe t) !S (Maybe (Lexer t)) -- threaded top-down during lexing (current input, meta state) data LexerState = LS !String !S type S = I.IntMap Int -- tree structure used to represent the lexer table -- -- * each node in the tree corresponds to a state of the lexer; the associated -- actions are those that apply when the corresponding state is reached data Lexer t = Lexer (LexAction t) (Cont t) -- represent the continuation of a lexer data Cont t = -- on top of the tree, where entries are dense, we use arrays Dense BoundsNum (Array Char (Lexer t)) -- further down, where the valid entries are sparse, we -- use association lists, to save memory | Sparse BoundsNum (M.Map Char (Lexer t)) | Done -- end of a automaton -- lexical action data LexAction t = Action !(Meta t) | NoAction -- a regular expression type Regexp t = Lexer t -> Lexer t -- Empty lexeme epsilon :: Regexp t epsilon = id -- One character regexp char :: Char -> Regexp t char c = \l -> Lexer NoAction (Sparse (B 1 c c) (M.singleton c l)) -- Concatenation of regexps (+>) :: Regexp t -> Regexp t -> Regexp t (+>) = (.) -- Close a regular expression with an action that converts the lexeme into a -- token -- -- * Note: After the application of the action, the position is advanced -- according to the length of the lexeme. This implies that normal -- actions should not be used in the case where a lexeme might contain -- control characters that imply non-standard changes of the position, -- such as newlines or tabs. -- action re a = re `meta` a' where a' lexeme s = R (a lexeme) s Nothing {-# INLINE action #-} -- Close a regular expression with a meta action -- -- * Note: Meta actions have to advance the position in dependence of the -- lexeme by themselves. 
-- meta re a = re (Lexer (Action a) Done) {-# INLINE meta #-} -- disjunctive combination of two regexps re >|< re' = \l -> re l >||< re' l -- disjunctive combination of two lexers (Lexer a c) >||< (Lexer a' c') = Lexer (joinActions a a') (joinConts c c') -- combine two disjunctive continuations -- joinConts :: Cont t -> Cont t -> Cont t joinConts Done c' = c' joinConts c Done = c joinConts c c' = let (bn , cls ) = listify c (bn', cls') = listify c' -- note: `addsBoundsNum' can, at this point, only -- approx. the number of *non-overlapping* cases; -- however, the bounds are correct in aggregate (addBoundsNum bn bn') (cls ++ cls') where listify (Dense n arr) = (n, assocs arr) listify (Sparse n cls) = (n, M.toList cls) -- combine two actions. Use the latter in case of overlap (!) joinActions NoAction a' = a' joinActions a NoAction = a joinActions _ a' = a' -- error "Lexers.>||<: Overlapping actions!" -- Note: `n' is only an upper bound of the number of non-overlapping cases aggregate bn@(B n lc hc) cls | n >= denseMin = Dense bn (accumArray (>||<) (Lexer NoAction Done) (lc, hc) cls) | otherwise = Sparse bn (M.fromList (accum (>||<) cls)) -- combine the elements in the association list that have the same key accum _ [] = [] accum f ((c, el):ces) = let (ce, ces') = gather c el ces in ce : accum f ces' where gather k e [] = ((k, e), []) gather k e (ke'@(k', e'):kes) | k == k' = gather k (f e e') kes | otherwise = let (ke'', kes') = gather k e kes in (ke'', ke':kes') -- x `star` y corresponds to the regular expression x*y -- -- The definition used below can be obtained by equational reasoning from this -- one (which is much easier to understand): -- -- star re1 re2 = let self = (re1 +> self >|< epsilon) in self +> re2 -- -- However, in the above, `self' is of type `Regexp s t' (ie, a functional), -- whereas below it is of type `Lexer s t'. Thus, below we have a graphical -- body (finite representation of an infinite structure), which doesn't grow -- with the size of the accepted lexeme - in contrast to the definition using -- the functional recursion. star re1 re2 = \l -> let self = re1 self >||< re2 l in self -- x `plus` y corresponds to the regular expression x+y plus re1 re2 = re1 +> (re1 `star` re2) -- x `quest` y corresponds to the regular expression x?y quest re1 re2 = (re1 +> re2) >|< re2 -- accepts a non-empty set of alternative characters -- Equiv. to `(foldr1 (>|<) . map char) cs', but much faster alt cs = \l -> let bnds = B (length cs) (minimum cs) (maximum cs) in Lexer NoAction (aggregate bnds [(c, l) | c <- cs]) -- accept a character sequence string cs = (foldr1 (+>) . map char) cs -- apply a lexer, yielding a token sequence and a list of errors -- -- * Currently, all errors are fatal; thus, the result is undefined in case of -- an error (this changes when error correction is added). -- * The final lexer state is returned. -- * The order of the error messages is undefined. 
-- * the following is moderately tuned -- run _ st@(LS [] _) = ([], st) run l st = case lexOne l st of (Nothing , _ , st') -> run l st' (Just t, l', st') -> let (ts, final) = run l' st'; ts' = (t:ts) in ts' `seq` (ts',final) where -- accept a single lexeme lexOne l0 st'@(LS cs s) = oneLexeme l0 st' id lexErr where lexErr = (Just undefined, l, (LS (tail cs) s)) -- we take an open list of characters down, where we accumulate the -- lexeme; this function returns maybe a token, the next lexer to use -- (can be altered by a meta action), the new lexer state, and a list -- of errors -- -- we implement the "principle of the longest match" by taking a -- potential result quadruple down (in the last argument); the -- potential result quadruple is updated whenever we pass by an action -- (different from `NoAction'); initially it is an error result -- oneLexeme (Lexer a cont) state@(LS cs s) csDL last = let last' = doaction a csDL state last in case cs of [] -> last' -- at end, has to be this action (c:cs') -> oneChar cont c (LS cs' s) csDL last' -- keep looking -- There are more chars. Look at the next one -- Now, if the next tbl is Done, then there is no more -- transition, so immediately execute our action oneChar tbl c state csDL last = case peek tbl c of Nothing -> last Just (Lexer a Done) -> doaction a (\l -> csDL (c:l)) state last Just l' -> oneLexeme l' state (\l -> csDL (c:l)) last -- Do the lookup. peek (Dense bn arr) c | c `inBounds` bn = Just $ arr ! c peek (Sparse bn cls) c | c `inBounds` bn = M.lookup c cls peek _ _ = Nothing -- execute the action if present and finalise the current lexeme doaction (Action f) csDL (LS cs s) _ = case f (($[]) csDL) s of (R Nothing s' l') | not . null $ cs -> lexOne (fromMaybe l0 l') (LS cs s') (R res s' l') -> (res, (fromMaybe l0 l'), (LS cs s')) doaction NoAction _ _ last = last
9.3 Lazy regex 3
We build up a clever countMatches pass that walks the input only once: each variant keeps track of its own count. Lovely, functional, expressive, and, as Chris says, "smoking!". Not yet a final entry, though.
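The single-pass trick in isolation (a self-contained toy, not the entry's code: pats and countAll are invented names, and naive isPrefixOf matching stands in for the lexer): every pattern owns an Int key, and one left-to-right scan threads a shared IntMap of counts, just as the entry's match meta-action does.

import qualified Data.IntMap as I
import Data.List (isPrefixOf, tails)

-- hypothetical mini-patterns; the entry uses the nine DNA variants
pats :: [(Int, String)]
pats = zip [0..] ["agggtaaa", "tttaccct"]

-- one walk over the input; each pattern that matches at a position
-- bumps its own counter in the shared map
countAll :: String -> I.IntMap Int
countAll = foldl step I.empty . tails
  where
    step m suf = foldl bump m pats
      where bump m' (i, p)
              | p `isPrefixOf` suf = I.insertWith (+) i 1 m'
              | otherwise          = m'

main :: IO ()
main = print (I.toList (countAll "agggtaaacagggtaaa"))  -- [(0,2)]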
{-# OPTIONS -funbox-strict-fields #-} -- -- The Computer Language Shootout -- http://shootout.alioth.debian.org/ -- -- http://haskell.org/hawiki/ShootoutEntry -- -- Contributed by Don Stewart -- A regex-dna entry based on lazy regex combinators, described in the paper: -- -- ``Lazy Lexing is Fast'', Manuel M. T. Chakravarty, in A. -- Middeldorp and T. Sato, editors, Proceedings of Fourth Fuji -- International Symposium on Functional and Logic Programming, -- Springer-Verlag, LNCS 1722, 1999. -- import Prelude hiding (last) import List import Maybe import Data.Array (Array, (!), assocs, accumArray) import qualified Data.IntMap as I main = getContents >>= putStr . output output s0 = unlines $ counts ++ [[]] ++ map show [length s0, length s1, length s2] where s1 = clean False s0 s2 = replaceIubs s1 (_,(LS _ m)) = execLexer patterns (LS s1 I.empty) counts = map (\(i,p) -> p++" "++show (I.findWithDefault 0 i m)) $ variants -- match variants patterns = foldl1 (>||<) $ map (uncurry match) (zip pats [0..]) where pats = [(string "agggtaaa" >|< string "tttaccct") ,((cgt +> string "gggtaaa") >|< (string "tttaccc" +> acg)) ,((a +> act +> string "ggtaaa") >|< (string "tttacc" +> agt +> t)) ,((string "ag" +> act +> string "gtaaa") >|< (string "tttac" +> agt +> string "ct")) ,((string "agg" +> act +> string "taaa") >|< (string "ttta" +> agt +> string "cct")) ,((string "aggg" +> acg +> string "aaa") >|< (string "ttt" +> cgt +> string "ccct")) ,((string "agggt" +> cgt +> string "aa") >|< (string "tt" +> acg +> string "accct")) ,((string "agggta" +> cgt +> a) >|< (t +> acg +> string "taccct")) ,((string "agggtaa" +> cgt) >|< (acg +> string "ttaccct"))] cgt = alt "cgt"; act = alt "act"; acg = alt "acg"; agt = alt "agt" a = char 'a'; t = char 't' -- each pattern keeps track of its own count, so we can perform a single pass match p i = p `meta` \_ m -> R Nothing (I.insertWith (+) i 1 m) Nothing variants = zip [0..] ["agggtaaa|tttaccct","[cgt]gggtaaa|tttaccc[acg]","a[act]ggtaaa|tttacc[agt]t" ,"ag[act]gtaaa|tttac[agt]ct","agg[act]taaa|ttta[agt]cct","aggg[acg]aaa|ttt[cgt]ccct" ,"agggt[cgt]aa|tt[acg]accct","agggta[cgt]a|t[acg]taccct","agggtaa[cgt]|[acg]ttaccct"] ------------------------------------------------------------------------ -- remove "\n" and lines with ">" -- these should be a regex match. I don't think 'haskell pattern match' is legal :) clean n [] = [] clean _ ('>':cs) = clean True cs clean _ ('\n':cs) = clean False cs clean True (c:cs) = clean True cs clean False (c:cs) = c:(clean False cs) -- replace iubs, should also use regex matching replaceIubs s = concatMap pairs s pairs 'B' = "(c|g|t)" pairs 'D' = "(a|g|t)" pairs 'H' = "(a|c|t)" pairs 'K' = "(g|t)" pairs 'M' = "(a|c)" pairs 'N' = "(a|c|g|t)" pairs 'R' = "(a|g)" pairs 'S' = "(c|g)" pairs 'V' = "(a|c|g)" pairs 'W' = "(a|t)" pairs 'Y' = "(c|t)" pairs a = [a] ------------------------------------------------------------------------ -- And now the regex library -- Compiler Toolkit: Self-optimizing lexers -- -- Author : Manuel M. T. Chakravarty -- Created: 24 February 95, 2 March 99 -- Copyright (c) [1995..2000] Manuel M. T. Chakravarty -- Copyright (c) 2004-5 Don Stewart -- -- Self-optimizing lexer combinators. -- -- For detailed information, see ``Lazy Lexing is Fast'', Manuel -- M. T. Chakravarty, in A. Middeldorp and T. Sato, editors, Proceedings of -- Fourth Fuji International Symposium on Functional and Logic Programming, -- Springer-Verlag, LNCS 1722, 1999. (See my Web page for details.) 
-- -- http://www.cse.unsw.edu.au/~chak/papers/Cha99.html -- -- Thanks to Simon L. Peyton Jones <simonpj@microsoft.com> and Roman -- Lechtchinsky <wolfro@cs.tu-berlin.de> for their helpful suggestions that -- improved the design of this library. -- infixr 4 `quest`, `star`, `plus` infixl 3 +>, `action`, `meta` infixl 2 >|<, >||< -- we use the dense representation if a table has at least the given number of -- (non-error) elements -- denseMin :: Int denseMin = 20 -- represents the number of (non-error) elements and the bounds of a table -- data BoundsNum = B !Int !Char !Char -- combine two bounds -- addBoundsNum (B n lc hc) (B n' lc' hc') = B (n + n') (min lc lc') (max hc hc') -- check whether a character is in the bounds inBounds c (B _ lc hc) = c >= lc && c <= hc -- Lexical actions take a lexeme with its position and may return a token; in -- a variant, an error can be returned (EXPORTED) -- -- * if there is no token returned, the current lexeme is discarded lexing -- continues looking for a token -- type Action t = String -> Maybe t -- Meta actions transform the lexeme, and a user-defined state; they -- may return a lexer, which is then used for accepting the next token -- (this is important to implement non-regular behaviour like nested -- comments) (EXPORTED) -- type Meta t = String -> S -> Result t data Result t = R (Maybe t) !S (Maybe (Lexer t)) -- threaded top-down during lexing (current input, meta state) data LexerState = LS !String !S type S = I.IntMap Int -- tree structure used to represent the lexer table (EXPORTED ABSTRACTLY) -- -- * each node in the tree corresponds to a state of the lexer; the associated -- actions are those that apply when the corresponding state is reached -- data Lexer t = Lexer (LexAction t) (Cont t) -- represent the continuation of a lexer -- data Cont t = -- on top of the tree, where entries are dense, we use arrays Dense BoundsNum (Array Char (Lexer t)) -- further down, where the valid entries are sparse, we -- use association lists, to save memory | Sparse BoundsNum [(Char, Lexer t)] -- end of a automaton | Done -- lexical action data LexAction t = Action !(Meta t) | NoAction -- a regular expression type Regexp t = Lexer t -> Lexer t -- Empty lexeme epsilon :: Regexp t epsilon = id -- One character regexp char :: Char -> Regexp t char c = \l -> Lexer NoAction (Sparse (B 1 c c) [(c, l)]) -- Concatenation of regexps (+>) :: Regexp t -> Regexp t -> Regexp t (+>) = (.) -- Close a regular expression with an action that converts the lexeme into a -- token -- -- * Note: After the application of the action, the position is advanced -- according to the length of the lexeme. This implies that normal -- actions should not be used in the case where a lexeme might contain -- control characters that imply non-standard changes of the position, -- such as newlines or tabs. -- action re a = re `meta` a' where a' lexeme s = R (a lexeme) s Nothing {-# INLINE action #-} -- Close a regular expression with a meta action (EXPORTED) -- -- * Note: Meta actions have to advance the position in dependence of the -- lexeme by themselves. 
-- meta re a = re (Lexer (Action a) Done) {-# INLINE meta #-} -- disjunctive combination of two regexps re >|< re' = \l -> re l >||< re' l -- disjunctive combination of two lexers (Lexer a c) >||< (Lexer a' c') = Lexer (joinActions a a') (joinConts c c') -- combine two disjunctive continuations -- joinConts :: Cont t -> Cont t -> Cont t joinConts Done c' = c' joinConts c Done = c joinConts c c' = let (bn , cls ) = listify c (bn', cls') = listify c' -- note: `addsBoundsNum' can, at this point, only -- approx. the number of *non-overlapping* cases; -- however, the bounds are correct in aggregate (addBoundsNum bn bn') (cls ++ cls') where listify (Dense n arr) = (n, assocs arr) listify (Sparse n cls) = (n, cls) -- combine two actions. Use the latter in case of overlap (!) joinActions NoAction a' = a' joinActions a NoAction = a joinActions _ a' = a' -- error "Lexers.>||<: Overlapping actions!" -- Note: `n' is only an upper bound of the number of non-overlapping cases aggregate bn@(B n lc hc) cls | n >= denseMin = Dense bn (accumArray (>||<) (Lexer NoAction Done) (lc, hc) cls) | otherwise = Sparse bn (accum (>||<) cls) -- combine the elements in the association list that have the same key accum _ [] = [] accum f ((c, el):ces) = let (ce, ces') = gather c el ces in ce : accum f ces' where gather k e [] = ((k, e), []) gather k e (ke'@(k', e'):kes) | k == k' = gather k (f e e') kes | otherwise = let (ke'', kes') = gather k e kes in (ke'', ke':kes') -- x `star` y corresponds to the regular expression x*y (EXPORTED) -- -- The definition used below can be obtained by equational reasoning from this -- one (which is much easier to understand): -- -- star re1 re2 = let self = (re1 +> self >|< epsilon) in self +> re2 -- -- However, in the above, `self' is of type `Regexp s t' (ie, a functional), -- whereas below it is of type `Lexer s t'. Thus, below we have a graphical -- body (finite representation of an infinite structure), which doesn't grow -- with the size of the accepted lexeme - in contrast to the definition using -- the functional recursion. star re1 re2 = \l -> let self = re1 self >||< re2 l in self -- x `plus` y corresponds to the regular expression x+y plus re1 re2 = re1 +> (re1 `star` re2) -- x `quest` y corresponds to the regular expression x?y quest re1 re2 = (re1 +> re2) >|< re2 -- accepts a non-empty set of alternative characters -- Equiv. to `(foldr1 (>|<) . map char) cs', but much faster alt cs = \l -> let bnds = B (length cs) (minimum cs) (maximum cs) in Lexer NoAction (aggregate bnds [(c, l) | c <- cs]) -- accept a character sequence string cs = (foldr1 (+>) . map char) cs -- apply a lexer, yielding a token sequence and a list of errors -- -- * Currently, all errors are fatal; thus, the result is undefined in case of -- an error (this changes when error correction is added). -- * The final lexer state is returned. -- * The order of the error messages is undefined. 
-- * the following is moderately tuned -- execLexer _ st@(LS [] _) = ([], st) execLexer l st = case lexOne l st of (Nothing , _ , st') -> execLexer l st' (Just t, l', st') -> let (ts, final) = execLexer l' st' in (t:ts, final) where -- accept a single lexeme lexOne l0 st'@(LS cs s) = oneLexeme l0 st' id lexErr where lexErr = (Just undefined, l, (LS (tail cs) s)) -- we take an open list of characters down, where we accumulate the -- lexeme; this function returns maybe a token, the next lexer to use -- (can be altered by a meta action), the new lexer state, and a list -- of errors -- -- we implement the "principle of the longest match" by taking a -- potential result quadruple down (in the last argument); the -- potential result quadruple is updated whenever we pass by an action -- (different from `NoAction'); initially it is an error result -- oneLexeme (Lexer a cont) state@(LS cs s) csDL last = let last' = doaction a csDL state last in case cs of [] -> last' -- at end, has to be this action (c:cs') -> oneChar cont c (LS cs' s) csDL last' -- keep looking -- There are more chars. Look at the next one -- Now, if the next tbl is Done, then there is no more -- transition, so immediately execute our action oneChar tbl c state csDL last = case peek tbl c of Nothing -> last Just (Lexer a Done) -> doaction a (\l -> csDL (c:l)) state last Just l' -> oneLexeme l' state (\l -> csDL (c:l)) last -- Do the lookup. peek (Dense bn arr) c | c `inBounds` bn = Just $ arr ! c peek (Sparse bn cls) c | c `inBounds` bn = lookup c cls peek _ _ = Nothing -- execute the action if present and finalise the current lexeme doaction (Action f) csDL (LS cs s) _ = case f (($[]) csDL) s of (R Nothing s' l') | not . null $ cs -> lexOne (fromMaybe l0 l') (LS cs s') (R res s' l') -> (res, (fromMaybe l0 l'), (LS cs s')) doaction NoAction _ _ last = last
9.3.1 Front end
This may plug into the above to allow it to parse normal regular expressions.
-- By Chris Kuklewicz, using Pattern idea from Alson -- Add this code block to parse normal string regexp to (Regexp Char) -- This code has not been space optimized, and need not be speed optimized import qualified Text.ParserCombinators.Parsec as P import Text.ParserCombinators.Parsec ((<|>),GenParser) import Control.Monad (liftM,liftM2) -- The exported function from this code block convert :: String -> Regexp Char convert = reflect . parseRegex -- The string regexp format, exported from this code block variants = ["agggtaaa|tttaccct","[cgt]gggtaaa|tttaccc[acg]","a[act]ggtaaa|tttacc[agt]t" ,"ag[act]gtaaa|tttac[agt]ct","agg[act]taaa|ttta[agt]cct","aggg[acg]aaa|ttt[cgt]ccct" ,"agggt[cgt]aa|tt[acg]accct","agggta[cgt]a|t[acg]taccct","agggtaa[cgt]|[acg]ttaccct"] -- Make Regexp Char from Pattern reflect :: Pattern -> Regexp Char reflect PEnd = epsilon reflect (POr ps) = foldl1 (>|<) (map reflect ps) reflect (PConcat ps) = stitch ps reflect (PQuest p) = reflect p `quest` epsilon reflect (PPlus p) = reflect p `plus` epsilon reflect (PStar p) = reflect p `star` epsilon reflect (PAny set) = alt set reflect (PAny' _) = error "There is no [^abc] support" reflect (PChar c) = char c reflect (PDot) = error "There is no . support in Lexer" reflect (PString s) = string s -- Help reflect (PConcat ps) stitch :: [Pattern] -> Regexp Char stitch [] = epsilon stitch (x:[]) = reflect x stitch ((PQuest p):xs) = reflect p `quest` stitch xs stitch ((PPlus p) :xs) = reflect p `plus` stitch xs stitch ((PStar p) :xs) = reflect p `star` stitch xs stitch (x:xs) = (reflect x) +> stitch xs -- Make p_regex to parse strings to Pattern parseRegex = simplify.(either (error.show) id).(\x -> P.parse p_regex x x) -- tr was for debugging --tr = (\result -> do unsafePerformIO (print result >> return (return result))) tr = return p_regex = liftM POr (P.sepBy1 p_branch (P.char '|')) >>= tr p_branch = liftM PConcat (P.many1 p_piece) >>= tr p_piece = p_atom >>= p_post_atom >>= tr p_atom = P.try (P.string "()" >> return PEnd) -- must not consume '(' yet <|> P.between (P.char '(') (P.char ')') p_regex <|> p_bracket <|> p_dot -- <|> p_caret -- <|> p_dollar <|> p_escaped_char <|> p_other_char -- <|> unsafePerformIO (print "no atom" >> return pzero) -- p_post_atom takes the previous atom as a parameter p_post_atom PEnd = return PEnd p_post_atom atom = (P.char '?' >> return (PQuest atom)) <|> (P.char '+' >> return (PPlus atom)) <|> (P.char '*' >> return (PStar atom)) -- no {n,m} for p_bound yet <|> (return atom) p_bracket = (P.char '[') >> ( (P.char '^' >> p_set True) <|> (p_set False) ) >>= tr -- p_set does not support [.ch.] or [=y=] or [:foo:] p_set :: Bool -> GenParser Char st Pattern p_set invert = do initial <- (P.option "" ((P.char ']' >> return "]") <|> (P.char '-' >> return "-"))) middle <- P.manyTill P.anyChar (P.char ']') let expand [] = [] expand ('-':[]) = "-" expand (a:'-':b:rest) | a /= '-' = (enumFromTo a b)++(expand rest) expand (x:xs) | x /= '-' = x:(expand xs) | otherwise = error "A dash is in the wrong place in a p_set" characters = nub ( sort (initial ++ (expand middle))) result = if invert then PAny' characters else PAny characters return result p_dot = P.char '.' 
>> return PDot p_escaped_char = P.char '\\' >> liftM PChar P.anyChar p_other_char = liftM PChar (P.noneOf specials) where specials = "^.[$()|*+?\\" -- full set is "^.[$()|*+?{\\" but no '{' yet for bounds notPEnd = filter (/=PEnd) simplify x = case x of POr ps -> case notPEnd (map simplify ps) of [] -> PEnd [p] -> p ps' -> POr ps' PConcat ps -> case merge (notPEnd (map simplify ps)) of [] -> PEnd [p] -> p ps' -> PConcat ps' PQuest p -> case (simplify p) of PEnd -> PEnd; p' -> PQuest p' PPlus p -> case (simplify p) of PEnd -> PEnd; p' -> PPlus p' PStar p -> case (simplify p) of PEnd -> PEnd; p' -> PStar p' x -> x -- merge helps simplify PConcat merge ((PChar c1):(PChar c2):xs) = merge $ (PString [c1,c2]):xs merge ((PString s):(PChar c):xs) = merge $ (PString (s++[c])):xs merge (x:xs) = x:merge xs merge [] = []
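Hypothetical wiring (assuming convert, match, variants, and the combinators above are all in scope; not part of the submitted code): the hand-built pats table in "Lazy regex 3" could presumably be replaced by converted string patterns.

-- replaces the entry's `patterns`; same fold, but the Regexps come
-- from parsing the string variants instead of being written by hand
patterns' = foldl1 (>||<) (zipWith match (map convert variants) [0..])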
9.4 Use parsec as front end
In order to avoid the Text.Regex library, the entry can be coded up as follows -- using Parsec!
I only had time to add some regex parsing code, and escaping in show (PAny). Postfix operators '+' and '*' can easily be added to the ops table. It can now create "patterns" from "variants". -- ChrisKuklewicz
This is definitely the way to do it; however, this isn't yet a legal entry, in my opinion -- it is close, though. The spec requires that the cleaning phase uses regular expressions too, so all string transformations must be regex based. However, no one says we have to use Text.Regex. The breakthrough here is to use Parsec-style regular expressions. A beautiful advertisement for Haskell, I think. Also, make sure lines wrap at roughly 80 chars if possible. Another clever idea would be to write a Read instance for Patterns, so that you simply {{{read "a.*b"}}} and get a Pattern back (int-e on #haskell's idea; a sketch of this appears after this discussion). Perhaps we could write a pure parsec version, and dispense with the translation of strings to parsec combinators altogether? Not sure if non-haskellers would understand, though. Depends on whether the result looks like regexes. -- dons
To dons's question "Is this legal?": I think that it's in the grey area. Per my notes on the ShootoutEntry page, I vote that we submit Functional 2 as the idiomatic GHC entry, and this entry (in its finished form) as the fast GHC #2 entry, noting that we didn't use the RegEx library because it was slow and it was just about as easy to code around it. Not all of the other entries are purely RegEx based (for the IUBs), so I don't feel too bad. Anyway, I think that this represents one of the neat aspects of Haskell: more so than in C, C++, etc., it seems that there are always multiple, very different ways of doing things. -- AlsonKemp
Response: ok, then is this the shortest, fastest entry? Or can we profile and improve it? -- dons
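Here is a self-contained sketch of int-e's read-instance idea, for a toy grammar of single characters separated by '|' (the names Pat, PCh, PAlt are invented for this sketch, not the entry's Pattern type):

-- read "a|b|c" :: Pat parses the string into a pattern directly
data Pat = PCh Char | PAlt Pat Pat deriving Show

instance Read Pat where
  readsPrec _ s = [go s]
    where
      go (c:'|':rest) = let (p, rest') = go rest in (PAlt (PCh c) p, rest')
      go (c:rest)     = (PCh c, rest)
      go []           = error "read: empty pattern"

main :: IO ()
main = print (read "a|b|c" :: Pat)
-- PAlt (PCh 'a') (PAlt (PCh 'b') (PCh 'c'))

The full regex grammar would need a proper readsPrec built from the Parsec parser above, but the interface would be just as pleasant.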
-- -- The Great Computer Language Shootout - http://shootout.alioth.debian.org/ -- Haskell Wiki page for Shootout entries - http://haskell.org/hawiki/ShootoutEntry -- -- Contributed by Alson Kemp -- Faster program that doesn't use the RegEx library because of the -- slowness of Haskell's bindings to the library -- import Text.Regex import System.IO import Data.List -- Needed for Parser code import Text.ParserCombinators.Parsec import Text.ParserCombinators.Parsec.Expr import Control.Monad import Data.Either variants = ["agggtaaa|tttaccct","[cgt]gggtaaa|tttaccc[acg]","a[act]ggtaaa|tttacc[agt]t" ,"ag[act]gtaaa|tttac[agt]ct","agg[act]taaa|ttta[agt]cct","aggg[acg]aaa|ttt[cgt]ccct" ,"agggt[cgt]aa|tt[acg]accct","agggta[cgt]a|t[acg]taccct","agggtaa[cgt]|[acg]ttaccct"] pairsList = [("B","(c|g|t)"),("D","(a|g|t)"),("H","(a|c|t)"),("K","(g|t)"),("M","(a|c)") ,("N","(a|c|g|t)"),("R","(a|g)"),("S","(c|g)"),("V","(a|c|g)"),("W","(a|t)") ,("Y","(c|t)") ] pep = buildExpressionParser ops p_factor ops = [ [ Infix (char '|' >> return POr) AssocRight ] ] -- [ Postfix (char '+' >> return PPlus), Postfix (char '*' >> return PStar) ] -- put before '|' for precedence special = "[()]|*+\\" -- no parenthesis grouping yet s_normal esc = many1 (c_escaped <|> noneOf esc) -- String c_escaped = char '\\' >> anyChar -- Char p_factor = p_any <|> p_string <|> (return PEnd) p_string = liftM2 PString (s_normal special) p_factor p_any = do s <- between (char '[') (char ']') (s_normal "]") liftM (PAny (nub (sort s))) p_factor -- "parsed_variants" is identical to "patterns" parsed_variants = map ((either (error.show) id).(\x -> runParser pep () x x)) variants test_parsing = (parsed_variants == patterns) -- is True main = getContents >>= putStr . output output s = unlines $ countMatches ++ [[]] ++ map show [length s, length sclean, length sreplace] where sclean = cleangene False s countMatches = map (\p -> (show p) ++ ' ': show (numHits 0 p sclean)) patterns sreplace = replaceIubs sclean -- remove "\n" and lines with ">" cleangene n [] = [] cleangene _ ('>':cs) = cleangene True cs cleangene _ ('\n':cs) = cleangene False cs cleangene True (c:cs) = cleangene True cs cleangene False (c:cs) = c:(cleangene False cs) -- count hits numHits n _ [] = n numHits n p (s:ss) = case (pmatch (s:ss) p) of False -> numHits n p ss otherwise -> numHits (n+1) p ss data Pattern = POr Pattern Pattern | PString String Pattern | PAny String Pattern | PEnd deriving
9.5 Lexer 6: bug-fixed and optimized version
This has fixed the bug that had required the annoying {{{count'}}} work-around. It uses {{{meta1}}} to indicate that the input should be advanced by only a single character after the lexeme-matching action. -- ChrisKuklewicz
Various "optimizations": the {{{LS}}} data type has been removed, and the usage of {{{Maybe}}} has been reduced in {{{peek}}} and in the {{{Result}}} type.
Note that some of the tricks we play here may be illegal under the newly tightened spec -- Don
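Why advancing by one character matters, in miniature (a self-contained toy, not the entry's mechanism): advancing past the whole lexeme misses occurrences that overlap a previous match, while restarting one character later counts them all, which is what {{{meta1}}} arranges.

import Data.List (isPrefixOf, tails)

-- count occurrences starting at every position, i.e. including overlaps
overlapping :: String -> String -> Int
overlapping pat = length . filter (pat `isPrefixOf`) . tails

main :: IO ()
main = print (overlapping "aa" "aaaa")  -- 3 overlapping hits, not 2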
{-# OPTIONS -funbox-strict-fields #-} -- -- The Computer Language Shootout -- http://shootout.alioth.debian.org/ -- -- http://haskell.org/hawiki/ShootoutEntry -- Contributed by Don Stewart, Chris Kuklewicz and Alson Kemp. -- -- Compile with: -O2 -package parsec -- -- An idiomatic Haskell entry using lazy regex combinators, described in the paper: -- -- Manuel M. T. Chakravarty, Lazy Lexing is Fast. -- In A. Middeldorp and T. Sato, editors, Proceedings of Fourth Fuji -- International Symposium on Functional and Logic Programming, -- Springer-Verlag, LNCS 1722, 1999. -- -- For more about higher-order combinator-based lexing and parsing in Haskell consult: -- -- Graham Hutton. Higher-order functions for parsing. (1992) -- Journal of Functional Programming 2: 232-343. -- -- Jeroen Fokker. Functional Parsers. (1995) -- Lecture Notes of the Baastad Spring school on Functional Programming. -- -- Graham Hutton and Erik Meijer. Monadic Parser Combinators. (1996) -- Technical report NOTTCS-TR-96-4. Department of Computer Science, University of Nottingham. -- -- Steve Hill. Combinators for parsing expressions. (1996) -- Journal of Functional Programming 6(3): 445-463. -- -- Andrew Partridge and David Wright. -- Predictive parser combinators need four values to report errors. (1996) -- Journal of Functional Programming 6(2): 355-364. -- -- Doaitse Swierstra and Luc Duponcheel. -- Deterministic, Error-Correcting Combinator Parsers. (1996) -- Advanced Functional Programming. LNCS 1129: 185-207. -- -- Pieter Koopman and Rinus Plasmeijer. Efficient Combinator Parsers. (1999) -- Implementation of Functional Languages. Springer Verlag, LNCS 1595: 122-138. -- -- Doaitse Swierstra and Pablo Azero. -- Fast, Error Correcting Parser Combinators: A Short Tutorial. (1999) -- SOFSEM'99 Theory and Practice of Informatics. LNCS 1725: 111-129. -- -- Arthur Baars, Andres Loh, and Doaitse Swierstra. Parsing Permutation Phrases. (2001) -- Proceedings of the ACM SIGPLAN Haskell Workshop, 171?183. -- -- And many other sources. -- import List (sort,nub,(\\)) import Data.Array (Array, (!), assocs, accumArray) import qualified Data.Map as M import qualified Data.IntMap as I import Control.Monad (liftM) import qualified Text.ParserCombinators.Parsec as P import Text.ParserCombinators.Parsec ((<|>)) main = interact $ \stdin -> let l0 = length stdin in l0 `seq` let s1 = fst $ run clean stdin I.empty l1 = length s1 in l1 `seq` let iMap = snd.snd $ run count s1 I.empty counts = map (\(i,p) -> p++" "++show (I.findWithDefault 0 i iMap)) variants l2 = length . concat . fst $ run replace s1 I.empty in unlines $ counts ++ ([] : map show [l0, l1, l2]) -- remove "\n" and lines with ">" clean = (regex "." `action` \[c] -> Just c) >||< (regex "\n|>[^\n]+\n" `action` const Nothing) -- Right biased -- count all variants, accumulating count in state threaded through regex matcher count = foldl1 (>||<) [ match (regex s, n) | (n,s) <- variants ] -- each pattern keeps track of its own count, so we can perform a single pass match (p,i) = p `meta1` \_ lexer m -> R Nothing (I.insertWith (+) i 1 m) lexer variants = zip [0..] ["agggtaaa|tttaccct","[cgt]gggtaaa|tttaccc[acg]","a[act]ggtaaa|tttacc[agt]t" ,"ag[act]gtaaa|tttac[agt]ct","agg[act]taaa|ttta[agt]cct","aggg[acg]aaa|ttt[cgt]ccct" ,"agggt[cgt]aa|tt[acg]accct","agggta[cgt]a|t[acg]taccct","agggtaa[cgt]|[acg]ttaccct"] -- substitute certain chars for patterns replace = (regex "." 
`action` \s -> Just s) >||< foldl1 (>||<) (map (\(c,p) -> regex [c] `action` const (Just p)) pairs) -- Right biased pairs = [('B',"(c|g|t)"),('D',"(a|g|t)"), ('H',"(a|c|t)"),('K',"(g|t)") ,('M',"(a|c)"), ('N',"(a|c|g|t)"),('R',"(a|g)"), ('S',"(c|g)") ,('V',"(a|c|g)"),('W',"(a|t)"), ('Y',"(c|t)") ] ---------------------------------------------------------------- -- -- Construct a regex combinator from a string regex (use Parsec) -- regex = (either (error.show) id) . (\x -> P.parse p_regex x x) p_regex = liftM (foldr1 (>|<)) (P.sepBy1 p_branch (P.char '|')) p_branch = liftM (($ epsilon).(foldr (.) id)) (P.many1 (p_atom >>= p_post_atom)) p_atom = P.try (P.string "()" >> return epsilon) -- must not consume '(' yet <|> P.between (P.char '(') (P.char ')') p_regex <|> p_bracket <|> p_dot <|> p_escaped_char <|> p_other_char p_post_atom atom = (P.char '?' >> return (atom `quest`)) <|> (P.char '+' >> return (atom `plus`)) <|> (P.char '*' >> return (atom `star`)) <|> (return (atom +>)) p_bracket = (P.char '[') >> ( (P.char '^' >> p_set True) <|> (p_set False) ) p_set invert = do initial <- (P.option "" ((P.char ']' >> return "]") <|> (P.char '-' >> return "-"))) middle <- P.manyTill P.anyChar (P.char ']') let expand [] = [] expand ('-':[]) = "-" expand (a:'-':b:rest) | a /= '-' = (enumFromTo a b)++(expand rest) expand (x:xs) | x /= '-' = x:(expand xs) | otherwise = error "A dash is in the wrong place in a p_set" characters = nub ( sort (initial ++ (expand middle)) ) return $ if invert then alt ( ['\0'..'\127'] \\ characters ) else alt characters p_dot = P.char '.' >> return (alt ['\0'..'\127']) p_escaped_char = P.char '\\' >> liftM char P.anyChar p_other_char = liftM char (P.noneOf specials) where specials = "^.[$()|*+?\\" ------------------------------------------------------------------------ -- And everything else is the CTK library. -- Compiler Toolkit: Self-optimizing lexers -- -- Author : Manuel M. T. Chakravarty -- Created: 24 February 95, 2 March 99 -- Copyright (c) [1995..2000] Manuel M. T. Chakravarty -- Copyright (c) 2004-6 Don Stewart -- -- Self-optimizing lexer combinators. -- -- For detailed information, see ``Lazy Lexing is Fast'', Manuel -- M. T. Chakravarty, in A. Middeldorp and T. Sato, editors, Proceedings of -- Fourth Fuji International Symposium on Functional and Logic Programming, -- Springer-Verlag, LNCS 1722, 1999. (See my Web page for details.) -- -- http://www.cse.unsw.edu.au/~chak/papers/Cha99.html -- -- Thanks to Simon L. Peyton Jones and Roman Leshchinskiy for their -- helpful suggestions that improved the design of this library. -- -- EXTERNAL INTERFACE infixr 4 `quest`, `star`, `plus` infixl 3 +>, `action`, `meta` infixl 2 >|<, >||< -- Empty lexeme epsilon :: Regexp t epsilon = id -- One character regexp char :: Char -> Regexp t char c = \l -> Lexer NoAction (Sparse (B 1 c c) (M.singleton c l)) -- Concatenation of regexps (+>) :: Regexp t -> Regexp t -> Regexp t (+>) = (.) -- x `star` y corresponds to the regular expression x*y -- -- The definition used below can be obtained by equational reasoning from this -- one (which is much easier to understand): -- -- star re1 re2 = let self = (re1 +> self >|< epsilon) in self +> re2 -- -- However, in the above, `self' is of type `Regexp s t' (ie, a functional), -- whereas below it is of type `Lexer s t'. Thus, below we have a graphical -- body (finite representation of an infinite structure), which doesn't grow -- with the size of the accepted lexeme - in contrast to the definition using -- the functional recursion. 
star re1 re2 = \l -> let self = re1 self >||< re2 l in self -- x `plus` y corresponds to the regular expression x+y plus re1 re2 = re1 +> (re1 `star` re2) -- x `quest` y corresponds to the regular expression x?y quest re1 re2 = (re1 +> re2) >|< re2 -- accepts a non-empty set of alternative characters -- Equiv. to `(foldr1 (>|<) . map char) cs', but much faster alt cs = \l -> let bnds = B (length cs) (minimum cs) (maximum cs) in Lexer NoAction (aggregate bnds [(c, l) | c <- cs]) -- accept a character sequence string cs = (foldr1 (+>) . map char) cs -- disjunctive combination of two regexps re >|< re' = \l -> re l >||< re' l -- disjunctive combination of two lexers (Lexer a c) >||< (Lexer a' c') = Lexer (joinActions a a') (joinConts c c') -- Close a regular expression with an action that converts the lexeme into a -- token -- -- * Note: After the application of the action, the position is advanced -- according to the length of the lexeme. This implies that normal -- actions should not be used in the case where a lexeme might contain -- control characters that imply non-standard changes of the position, -- such as newlines or tabs. -- action re a = re `meta` a' where a' lexeme lexer s = R (a lexeme) s lexer {-# INLINE action #-} action1 re a = re `meta1` a' where a' lexeme lexer s = R (a lexeme) s lexer {-# INLINE action1 #-} -- Close a regular expression with a meta action -- -- * Note: Meta actions have to advance the position in dependence of the -- lexeme by themselves. -- meta re a = re (Lexer (Action a) Done) {-# INLINE meta #-} meta1 re a = re (Lexer (Action1 a) Done) {-# INLINE meta1 #-} -- INTERNAL IMPLEMENTATION -- we use the dense representation if a table has at least the given -- number of (non-error) elements denseMin :: Int denseMin = 20 -- represents the number of (non-error) elements and the bounds of a table data BoundsNum = B !Int !Char !Char -- combine two bounds addBoundsNum (B n lc hc) (B n' lc' hc') = B (n + n') (min lc lc') (max hc hc') -- check whether a character is in the bounds inBounds c (B _ lc hc) = c >= lc && c <= hc -- Lexical actions take a lexeme with its position and may return a -- token; in a variant, an error can be returned -- -- * if there is no token returned, the current lexeme is discarded -- lexing continues looking for a token type Action t = String -> Lexer t -> Maybe t -- Meta actions transform the lexeme, the old lexer, and a -- user-defined state; they may return the old or a new lexer, which -- is then used for accepting the next token (this is important to -- implement non-regular behaviour like nested comments) type Meta t = String -> Lexer t -> S -> Result t data Result t = R (Maybe t) !S !(Lexer t) -- threaded top-down during lexing (with current input) type S = I.IntMap Int -- tree structure used to represent the lexer table -- -- * each node in the tree corresponds to a state of the lexer; the -- associated actions are those that apply when the corresponding -- state is reached data Lexer t = Lexer (LexAction t) (Cont t) -- represent the continuation of a lexer -- on top of the tree, where entries are dense, we use arrays data Cont t = Dense BoundsNum (Array Char (Lexer t)) -- further down, where the valid entries are sparse, we -- use association lists, to save memory | Sparse BoundsNum (M.Map Char (Lexer t)) -- end of a automaton | Done -- lexical action data LexAction t = Action !(Meta t) | Action1 !(Meta t) | NoAction -- need new LexAction for advance by 1 -- a regular expression type Regexp t = Lexer t -> Lexer t -- combine two 
disjunctive continuations joinConts :: Cont t -> Cont t -> Cont t joinConts Done c' = c' joinConts c Done = c joinConts c c' = let (bn , cls ) = listify c (bn', cls') = listify c' -- note: `addsBoundsNum' can, at this point, only -- approx. the number of *non-overlapping* cases; -- however, the bounds are correct in aggregate (addBoundsNum bn bn') (cls ++ cls') where listify (Dense n arr) = (n, assocs arr) listify (Sparse n cls) = (n, M.toList cls) -- combine two actions. Use the latter in case of overlap (right biased!) joinActions NoAction a' = a' joinActions a NoAction = a joinActions _ a' = a' -- error "Lexers.>||<: Overlapping actions!" -- Note: `n' is only an upper bound of the number of non-overlapping cases aggregate bn@(B n lc hc) cls | n >= denseMin = Dense bn (accumArray (>||<) (Lexer NoAction Done) (lc, hc) cls) | otherwise = Sparse bn (M.fromList (accum (>||<) cls)) -- combine the elements in the association list that have the same key accum _ [] = [] accum f ((c, el):ces) = let (ce, ces') = gather c el ces in ce : accum f ces' where gather k e [] = ((k, e), []) gather k e (ke'@(k', e'):kes) | k == k' = gather k (f e e') kes | otherwise = let (ke'', kes') = gather k e kes in (ke'', ke':kes') -- apply a lexer, yielding a token sequence and a list of errors -- -- * Currently, all errors are fatal; thus, the result is undefined in case of -- an error (this changes when error correction is added). -- * The final lexer state is returned. -- * The order of the error messages is undefined. -- * the following is moderately tuned run _ [] state = ([], ([],state)) run l csIn state = case lexOne l csIn state of (Nothing , _ , cs', state') -> run l cs' state' (Just t , l', cs', state') -> let (ts, final) = run l' cs' state'; ts' = (t:ts) in ts' `seq` (ts',final) where -- accept a single lexeme lexOne l0 cs0 state0 = oneLexeme l0 cs0 state0 id lexErr where lexErr = (Just undefined, l, tail cs0, state0) -- we take an open list of characters down, where we -- accumulate the lexeme; this function returns maybe a token, -- the next lexer to use (can be altered by a meta action), -- the new lexer state, and a list of errors -- -- we implement the "principle of the longest match" by taking -- a potential result quad down (in the last argument); the -- potential result quad is updated whenever we pass by an -- action (different from `NoAction'); initially it is an -- error result oneLexeme (Lexer a cont) cs state csDL last = let last' = case a of NoAction -> last; _ -> doaction a (csDL []) cs state last in case cs of [] -> last' -- at end, has to be this action (c:cs') -> last' `seq` oneChar cont c cs' state csDL last' -- keep looking -- There are more chars. Look at the next one. Now, if the -- next tbl is Done, then there is no more transition, so -- immediately execute our action oneChar tbl c cs state csDL last = case peek tbl c of (Lexer action Done) -> doaction action (csDL [c]) cs state last l' -> oneLexeme l' cs state (csDL . (c:)) last -- Do the lookup. peek (Dense bn arr) c | c `inBounds` bn = arr ! c peek (Sparse bn cls) c | c `inBounds` bn = M.findWithDefault (Lexer NoAction Done) c cls peek _ _ = (Lexer NoAction Done) -- execute the action if present and finalise the current lexeme doaction (Action f) csDL cs state _ = case f (csDL) l0 state of (R Nothing s' l') | not . null $ cs -> s' `seq` lexOne l' cs s' (R res s' l') -> s' `seq` (res, l', cs, s') doaction (Action1 f) csDL cs state _ = case f (csDL) l0 state of (R Nothing s' l') | not . 
null $ cs -> s' `seq` lexOne l' (tail csIn) s' (R res s' l') -> s' `seq` (res, l', (tail csIn), s') doaction NoAction _ _ _ last = last