Difference between revisions of "Benchmarks Game/Parallel/RegexDNA"

Revision as of 17:58, 22 September 2008

Regex-DNA

The old submission is:

http://shootout.alioth.debian.org/u64q/benchmark.php?test=regexdna&lang=ghc&id=2

Running:

ghc --make -O2 -fglasgow-exts -package regex-posix -optc-O3 -threaded regexdna3.hs
./regexdna3 +RTS -N2 < big.txt

Code:

This is almost identical to the old code, except that a parallel map performs the first phase of the benchmark (counting the matches of each variant). I had trouble compiling the original with Text.Regex.Posix, so I used Text.Regex.PCRE instead (which is probably faster; are Hackage packages fair game?). Running this version with -N2 takes 33 seconds, down from 46 seconds for the original (also built with PCRE) run unthreaded.
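To make the idea of the parallel map concrete, here is a minimal sketch that uses only `par` and `pseq` from GHC.Conc in base. It is not the library's implementation: `parMapSketch`, `forceString` and `showCount` are hypothetical names introduced for illustration, and `showCount` is a toy stand-in for the regex match count (it counts occurrences of a single character).

```haskell
import GHC.Conc (par, pseq)

-- Hypothetical helper: fully evaluate a String (its spine and every character).
forceString :: String -> ()
forceString = foldr seq ()

-- Hypothetical stand-in for parMap: spark the full evaluation of each result,
-- then return the results in their original order.
parMapSketch :: (b -> ()) -> (a -> b) -> [a] -> [b]
parMapSketch _ _ [] = []
parMapSketch force f (x:xs) =
  let y  = f x
      ys = parMapSketch force f xs
  in force y `par` (ys `pseq` (y : ys))

-- Toy stand-in for showVars: label a character with its number of occurrences.
showCount :: String -> Char -> String
showCount text c = c : ' ' : show (length (filter (== c) text))
```

For example, `mapM_ putStrLn (parMapSketch forceString (showCount "gattaca") "acgt")` prints one line per character, in list order. The sparks only run in parallel when the program is built with -threaded and run with +RTS -N2 (or higher); otherwise the result is the same but computed sequentially. Note also that recent versions of the parallel package spell the fully-evaluating strategy `rdeepseq`, so there the call would read `parMap rdeepseq showVars variants`.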

The numbers on the shootout page look odd to me. They list Haskell at 70 seconds (with the Posix regex library) and Python at 25 seconds. My system is roughly comparable, and when I run the Python test case I get 46 seconds, nearly identical to the Haskell PCRE version. This probably means that their machine has better memory bandwidth and that moving to PCRE is a big win (I believe Python uses PCRE), but I'm not sure.

-- The Computer Language Benchmarks Game
-- http://shootout.alioth.debian.org/
-- Contributed by: Sergei Matusevich 2007

import Control.Parallel.Strategies
import List
import Text.Regex.PCRE -- requires regex-pcre-builtin
import qualified Data.ByteString.Char8 as B

variants = [
  "agggtaaa|tttaccct",
  "[cgt]gggtaaa|tttaccc[acg]",
  "a[act]ggtaaa|tttacc[agt]t",
  "ag[act]gtaaa|tttac[agt]ct",
  "agg[act]taaa|ttta[agt]cct",
  "aggg[acg]aaa|ttt[cgt]ccct",
  "agggt[cgt]aa|tt[acg]accct",
  "agggta[cgt]a|t[acg]taccct",
  "agggtaa[cgt]|[acg]ttaccct" ]

main = do
  file <- B.getContents
  let [s1,s2,s3] = map (B.concat . tail) $ groupBy notHeader $ B.split '\n' file
      showVars r = r ++ ' ' : show ((s2 =~ r :: Int) + (s3 =~ r :: Int))
  mapM_ putStrLn $ parMap rnf showVars  variants
  putChar '\n'
  print (B.length file)
  print (B.length s1 + B.length s2 + B.length s3)
  print (B.length s1 + B.length s3 + length (B.unpack s2 >>= substCh))
  where notHeader _ s = B.null s || B.head s /= '>'
        substCh 'B' = "(c|g|t)"
        substCh 'D' = "(a|g|t)"
        substCh 'H' = "(a|c|t)"
        substCh 'K' = "(g|t)"
        substCh 'M' = "(a|c)"
        substCh 'N' = "(a|c|g|t)"
        substCh 'R' = "(a|g)"
        substCh 'S' = "(c|g)"
        substCh 'V' = "(a|c|g)"
        substCh 'W' = "(a|t)"
        substCh 'Y' = "(c|t)"
        substCh etc = [etc]

The changes should apply directly to the non-PCRE version as well: just change the import to use the Posix library. Here are the parallelizing changes:

--- regexdna2.hs        2008-09-22 07:56:49.000000000 -1000
+++ regexdna3.hs        2008-09-21 07:50:32.000000000 -1000
@@ -2,6 +2,7 @@
 -- http://shootout.alioth.debian.org/
 -- Contributed by: Sergei Matusevich 2007

+import Control.Parallel.Strategies
 import List
 import Text.Regex.PCRE -- requires regex-pcre-builtin
 import qualified Data.ByteString.Char8 as B
@@ -20,7 +21,8 @@
 main = do
   file <- B.getContents
   let [s1,s2,s3] = map (B.concat . tail) $ groupBy notHeader $ B.split '\n' file
-  mapM_ (printVars s2 s3) variants
+      showVars r = r ++ ' ' : show ((s2 =~ r :: Int) + (s3 =~ r :: Int))
+  mapM_ putStrLn $ parMap rnf showVars  variants
   putChar '\n'
   print (B.length file)
   print (B.length s1 + B.length s2 + B.length s3)
@@ -38,8 +40,4 @@
         substCh 'W' = "(a|t)"
         substCh 'Y' = "(c|t)"
         substCh etc = [etc]
-        printVars s2 s3 r = do
-                            putStr r
-                            putChar ' '
-                            print ((s2 =~ r :: Int) + (s3 =~ r :: Int))
