Wc: Difference between revisions
DonStewart (talk | contribs) m (wibble) |
DonStewart (talk | contribs) (Rerun all benchmarks) |
||
Line 10: | Line 10: | ||
$ du -hs /usr/share/dict/words | $ du -hs /usr/share/dict/words | ||
912K /usr/share/dict/words | |||
$ time wc -l /usr/share/dict/words | $ time wc -l < /usr/share/dict/words | ||
98326 | |||
wc -l /usr/share/dict/words 0.00s user 0.00s system | wc -l < /usr/share/dict/words 0.00s user 0.00s system 27% cpu 0.015 total | ||
So the best we can probably hope to get is around 0. | So the best we can probably hope to get is around 0.015s | ||
== Standard [Char] == | == Standard [Char] == | ||
Line 25: | Line 25: | ||
$ ghc -O wc.hs | $ ghc -O wc.hs | ||
$ time ./ | $ time ./wc < /usr/share/dict/words | ||
98326 | |||
./ | ./wc < /usr/share/dict/words 0.10s user 0.00s system 94% cpu 0.106 total | ||
Ok | Ok. About 10x C, as to be expected with a list representation. | ||
== Faster [Char] == | == Faster [Char] == | ||
Line 43: | Line 43: | ||
$ ghc -O wc.hs | $ ghc -O wc.hs | ||
$ time ./ | $ time ./wc < /usr/share/dict/words | ||
98326./wc < /usr/share/dict/words 0.06s user 0.00s system 87% cpu 0.073 total | |||
Ok. Not too bad. | |||
== Data.PackedString == | == Data.PackedString == | ||
Line 71: | Line 69: | ||
</haskell> | </haskell> | ||
$ time ./ | $ time ./wc | ||
98326 | |||
./ | ./wc < /usr/share/dict/words 0.14s user 0.02s system 95% cpu 0.168 total | ||
Hmm. Worse than [Char]. Unfortunately, this is not uncommon with Data.PackedString. | Hmm. Worse than [Char]. Unfortunately, this is not uncommon with Data.PackedString. | ||
Line 85: | Line 83: | ||
<haskell> | <haskell> | ||
import qualified Data.ByteString as B | import qualified Data.ByteString.Char8 as B | ||
main = print . length . B.lines =<< B.getContents | main = print . length . B.lines =<< B.getContents | ||
</haskell> | </haskell> | ||
$ time ./ | $ time ./wc < /usr/share/dict/words | ||
98326 | |||
./ | ./wc < /usr/share/dict/words 0.00s user 0.00s system 25% cpu 0.016 total | ||
Excellent! Definitely competitive with C. This is in fact as fast as | |||
ByteString library is fusing length . lines, so that the intermediate | we'll get. It helps that the ByteString library is fusing length . | ||
list is not constructed. | lines, so that the intermediate list is not constructed. | ||
== Ptr hacking == | == Ptr hacking == | ||
Line 126: | Line 124: | ||
$ ghc -O -package fps -fglasgow-exts -cpp wc.hs | $ ghc -O -package fps -fglasgow-exts -cpp wc.hs | ||
$ time ./ | $ time ./wc /usr/share/dict/words | ||
98326 | |||
./ | ./wc /usr/share/dict/words 0.00s user 0.01s system 67% cpu 0.018 total | ||
Ok, slower than using length . lines. Lets try some other things. | |||
== Use the FFI == | == Use the FFI == | ||
Line 168: | Line 166: | ||
</haskell> | </haskell> | ||
$ time ./wc /usr/share/dict/words | |||
$ time ./ | 98326 | ||
./wc /usr/share/dict/words 0.00s user 0.00s system 47% cpu 0.017 total | |||
./ | |||
Slowly inching forwards. | Slowly inching forwards. | ||
Line 245: | Line 242: | ||
$ ghc -O -package fps -cpp -ffi wc.hs | $ ghc -O -package fps -cpp -ffi wc.hs | ||
$ time ./ | $ time ./wc /usr/share/dict/words | ||
98326 | |||
./ | ./wc /usr/share/dict/words 0.00s user 0.01s system 70% cpu 0.017 total | ||
But we can't seem to squeeze any more out, at least on data this size. | But we can't seem to squeeze any more out, at least on data this size. | ||
== Going via C == | == Going via C == | ||
Line 305: | Line 263: | ||
main = do | main = do | ||
f <- head `fmap` getArgs | f <- head `fmap` getArgs | ||
B. | B.readFile f >>= \(B.PS x _ l) -> withForeignPtr x $ \p -> print (c_wc p l) | ||
foreign import ccall unsafe "wc.h wc" c_wc :: Ptr Word8 -> Int -> Int | foreign import ccall unsafe "wc.h wc" c_wc :: Ptr Word8 -> Int -> Int | ||
Line 325: | Line 283: | ||
</haskell> | </haskell> | ||
$ time ./ | $ gcc -O3 -c wc_c.c | ||
$ ghc -O -package fps wc.hs -o wc -fglasgow-exts wc_c.o | |||
./ | $ time ./wc /usr/share/dict/words | ||
98326 | |||
./wc /usr/share/dict/words 0.00s user 0.00s system 25% cpu 0.016 total | |||
And we are done. Note that the tight C loop didn't give us anything in the end over | And we are done. Note that the tight C loop didn't give us anything in the end over | ||
the naive ByteString code, which is a satisfying result. | the naive ByteString code, which is a very satisfying result. |
Revision as of 04:08, 6 May 2006
Some implementations of the 'wc -l' program in Haskell, with an eye to C-like
performance. This illustrates the balance to be made between performance and
elegance, over several increasingly fast (and more complex) examples.
Baseline
The baseline is the C program 'wc'
$ du -hs /usr/share/dict/words 912K /usr/share/dict/words $ time wc -l < /usr/share/dict/words 98326 wc -l < /usr/share/dict/words 0.00s user 0.00s system 27% cpu 0.015 total
So the best we can probably hope to get is around 0.015s
Standard [Char]
main = print . length . lines =<< getContents
$ ghc -O wc.hs $ time ./wc < /usr/share/dict/words 98326 ./wc < /usr/share/dict/words 0.10s user 0.00s system 94% cpu 0.106 total
Ok. About 10x C, as to be expected with a list representation.
Faster [Char]
Perhaps writing our loop, rather than the duplication involved in length . lines, will improve things:
main = interact (count 0)
where count i [] = show i
count i ('\n':xs) = count (i+1) xs
count i (_:xs) = count i xs
$ ghc -O wc.hs $ time ./wc < /usr/share/dict/words 98326./wc < /usr/share/dict/words 0.06s user 0.00s system 87% cpu 0.073 total
Ok. Not too bad.
Data.PackedString
Ok, lets try the old Data.PackedString library.
My first attempt to directly use hGet failed, as hGet has a stack overflow for files > ~500k.
import Data.PackedString
import System.IO
main = print . length . linesPS =<< getit "/usr/share/dict/words"
where
getit f = do
h <- openFile f ReadMode
s <- hGetContents h
length s `seq` return ()
hClose h
return $! packString s
$ time ./wc 98326 ./wc < /usr/share/dict/words 0.14s user 0.02s system 95% cpu 0.168 total
Hmm. Worse than [Char]. Unfortunately, this is not uncommon with Data.PackedString.
Data.ByteString
Try to improve performance a bit by using the new Data.ByteString library, a replacement for Data.PackedString. This uses packed byte arrays instead of heap-allocated [Char] to represent strings.
import qualified Data.ByteString.Char8 as B
main = print . length . B.lines =<< B.getContents
$ time ./wc < /usr/share/dict/words 98326 ./wc < /usr/share/dict/words 0.00s user 0.00s system 25% cpu 0.016 total
Excellent! Definitely competitive with C. This is in fact as fast as we'll get. It helps that the ByteString library is fusing length . lines, so that the intermediate list is not constructed.
Ptr hacking
ByteStrings give you access to the underlying pointers to bytes in memory, which can be used to optimise some particular code. So when the ByteString api doesn't provide what you want, you can step inside the ForeignPtr and go nuts.
This example also makes use of a cpp macro to force strictness on a function, via a seq guard case.
import Foreign
import Foreign.ForeignPtr
import System.Environment
import qualified Data.ByteString as B
#define STRICT4(f) f a b c d | a `seq` b `seq` c `seq` d `seq` False = undefined
main = head `fmap` getArgs >>= B.readFile >>= \(B.PS x _ l) ->
withForeignPtr x $ \p -> go p l 0 0
where go :: Ptr Word8 -> Int -> Int -> Int -> IO ()
STRICT4(go)
go p l n i | n >= l = print i
| otherwise = do (w::Word8) <- peek (p `plusPtr` n)
go p l (n+1) $ if w == 0x0a then (i+1) else i
$ ghc -O -package fps -fglasgow-exts -cpp wc.hs $ time ./wc /usr/share/dict/words 98326 ./wc /usr/share/dict/words 0.00s user 0.01s system 67% cpu 0.018 total
Ok, slower than using length . lines. Lets try some other things.
Use the FFI
Try and step around the inefficent need to inspect each character in Haskell, by using memchr(3), the C function to find each newline for us.
import Foreign
import Foreign.ForeignPtr
import Foreign.C.Types
import System.Environment
import qualified Data.ByteString as B
#define STRICT4(f) f a b c d | a `seq` b `seq` c `seq` d `seq` False = undefined
main = do
f <- head `fmap` getArgs
B.readFile f >>= \(B.PS x _ l) -> withForeignPtr x $ \p -> go p l 0 0
where
go :: Ptr Word8 -> Int -> Int -> Int -> IO ()
STRICT4(go)
go p l n i
| n >= l = print i
| otherwise = do
let p' = p `plusPtr` n
q = memchr p' 0x0a (fromIntegral (l-n))
if q == nullPtr
then print i
else do let k = q `minusPtr` p'
go p l (n+k+1) (i+1)
foreign import ccall unsafe "string.h memchr" memchr
:: Ptr Word8 -> CInt -> CSize -> Ptr Word8
$ time ./wc /usr/share/dict/words 98326 ./wc /usr/share/dict/words 0.00s user 0.00s system 47% cpu 0.017 total
Slowly inching forwards.
Read the Core
While we're here, we can check whether the strictness on the 'go' function makes any difference, by reading the GHC Core:
$ ghc -O -package fps -cpp -ffi wc.hs -ddump-simpl | less
Search for the 'go' function:
Main.$wgo :: GHC.Prim.Addr# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.IOBase.IO ()
And without the strictness:
Main.$wgo :: GHC.Ptr.Ptr GHC.Word.Word8 -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Base.Int -> GHC.IOBase.IO ()
So GHC is helpfully unboxing the Ptr Word8 into a raw machine Addr#.
Avoid some code
The guard that checks the length is uneeded, since memchr takes a length argument anyway. It also calculates the next pointer for us, so avoid recalculating it. (Note that this is equivalent to using the 'count' function, which has the same implementation).
import Foreign
import Foreign.ForeignPtr
import Foreign.C.Types
import System.Environment
import qualified Data.ByteString as B
#define STRICT3(f) f a b c | a `seq` b `seq` c `seq` False = undefined
main = do
f <- head `fmap` getArgs
B.readFile f >>= \(B.PS x s l) -> withForeignPtr x $ \p ->
go (p `plusPtr` s) (fromIntegral l) 0
where
go :: Ptr Word8 -> CSize -> Int -> IO ()
STRICT3(go)
go p l i = do
let q = memchr p 0x0a l
if q == nullPtr
then print i
else do let k = fromIntegral $ q `minusPtr` p
go (q `plusPtr` 1) (l-k) (i+1)
foreign import ccall unsafe "string.h memchr" memchr
:: Ptr Word8 -> CInt -> CSize -> Ptr Word8
Checking the Core, 'go' is now:
Main.$wgo :: GHC.Prim.Addr# -> GHC.Prim.Word# -> GHC.Prim.Int# -> GHC.IOBase.IO ()
The code is certainly a bit simpler, at least.
$ ghc -O -package fps -cpp -ffi wc.hs $ time ./wc /usr/share/dict/words 98326 ./wc /usr/share/dict/words 0.00s user 0.01s system 70% cpu 0.017 total
But we can't seem to squeeze any more out, at least on data this size.
Going via C
We reach a point where I can't think of any more tricks, so we can always code up a little C and call into that, for this tight loop. Sometimes we just have to do this, and that's what the ffi is for, after all.
-- wc.hs
import Foreign
import System.Environment
import qualified Data.ByteString as B
main = do
f <- head `fmap` getArgs
B.readFile f >>= \(B.PS x _ l) -> withForeignPtr x $ \p -> print (c_wc p l)
foreign import ccall unsafe "wc.h wc" c_wc :: Ptr Word8 -> Int -> Int
-- wc_c.c
#include <sys/types.h>
#include <unistd.h>
int wc(char *p, int len) {
int c;
for (c = 0; len--; ++p)
if (*p == '\n')
++c;
return c;
}
-- wc.h
int wc(char *p, int len);
$ gcc -O3 -c wc_c.c $ ghc -O -package fps wc.hs -o wc -fglasgow-exts wc_c.o $ time ./wc /usr/share/dict/words 98326 ./wc /usr/share/dict/words 0.00s user 0.00s system 25% cpu 0.016 total
And we are done. Note that the tight C loop didn't give us anything in the end over the naive ByteString code, which is a very satisfying result.