User talk:WillNess

From HaskellWiki
Jump to: navigation, search


In the past (up to 2013?), you have taken quite an active role in maintaining the Haskell Wiki page on primes at this link so I didn't want to arbitrarily jump in an make some changes, as I think that the page is still relevant for beginning Haskell programmers implementing prime sieves but you may not agree with the changes I am proposing.

Specifically, the above link refers to using Template Haskell but in fact doesn't use TH at all and is just a slightly restructured not-very-efficient odds-only version of the ST mutable array version immediately above it (also odds only). I don't see that the additional example adds anything other than confusion to the page in referring to TH when it isn't used and just being a less concise version of the above version, also referring to bit-packing which all use of Unboxed Bool arrays are. The first version should likely be noted to be an odds-only SoE to avoid any confusion and the second labeled as TH eliminated.

Using macros from TH is a nice facility to use in implementing some extreme optimizations to reduce average culling operation loop time to sub two CPU clock cycles per culling operation as it did in my contribution to The Software Drag Race linked here which is a benchmark repeatedly culling for the primes up to a million and measuring the number of times this could be done in five seconds; however, the full optimizations of extreme loop unrolling an dense culling using SIMD for small base primes less than 129 would add 100's of lines of code to the example and these optimizations, while they contribute to speed up sieving to some quite large ranges, are more in the nature of constant offset improvements and not constant factor improvements so don't really add to the considerations of the page.

Having eliminated the above confusing contribution, there is an argument that there should be a page-segmented odds-only version using mutable unboxed arrays, as only this technique gives adequate speed for large ranges producing over millions of primes with only about four CPU clock per culling operation as compared to the purely functional techniques that take an average of a of hundreds of CPU clock cycles per culling operation (constant factor decrease) to show why the functional techniques are never used for serious production prime-finding/counting work.

As example of a short odds-only page-segmented version, I refer you to my RosettaCode version via this link, reproduced here with some small reductions in code size as follows:

{-# OPTIONS_GHC -O2 #-}

import Data.Int ( Int64 )
import Data.Word ( Word64 )
import Data.Bits ( Bits(shiftR) )
import Data.Array.Base ( IArray(unsafeAt), UArray(UArray),     
                         MArray(unsafeWrite), unsafeFreezeSTUArray ) 
import Control.Monad ( forM_ )
import Data.Array.ST ( MArray(newArray), runSTUArray )

type Prime = Word64

cSieveBufferLimit :: Int
cSieveBufferLimit = 2^17 * 8 - 1 -- CPU L2 cache in bits

primes :: () -> [Prime]
primes() = 2 : _Y (listPagePrms . pagesFrom 0) where
  _Y g = g (_Y g) -- non-sharing multi-stage fixpoint combinator
  listPagePrms pgs@(hdpg@(UArray lwi _ rng _) : tlpgs) =
    let loop i | i >= fromIntegral rng = listPagePrms tlpgs
               | unsafeAt hdpg i = loop (i + 1)
               | otherwise = let ii = lwi + fromIntegral i in
                             case fromIntegral $ 3 + ii + ii of
                               p -> p `seq` p : loop (i + 1) in loop 0
  makePg lwi bps = runSTUArray $ do
    let limi = lwi + fromIntegral cSieveBufferLimit
        bplmt = floor $ sqrt $ fromIntegral $ limi + limi + 3
        strta bp = let si = fromIntegral $ (bp * bp - 3) `shiftR` 1
                   in if si >= lwi then fromIntegral $ si - lwi else
                   let r = fromIntegral (lwi - si) `mod` bp
                   in if r == 0 then 0 else fromIntegral $ bp - r
    cmpsts <- newArray (lwi, limi) False
    fcmpsts <- unsafeFreezeSTUArray cmpsts
    let cbps = if lwi == 0 then listPagePrms [fcmpsts] else bps
    forM_ (takeWhile (<= bplmt) cbps) $ \ bp ->
      forM_ (let sp = fromIntegral $ strta bp
             in [ sp, sp + fromIntegral bp .. cSieveBufferLimit ]) $ \c ->
        unsafeWrite cmpsts c True
    return cmpsts
  pagesFrom lwi bps = map (`makePg` bps)
                          [ lwi, lwi + fromIntegral cSieveBufferLimit + 1 .. ]

main :: IO ()
main = print $ last $ take (10^ (10 :: Int)) $ primes()

The above code finds the first ten million primes in about 0.73 seconds, the first hundred million primes in about 7.6 seconds, the first thousand million (billion) primes in about 85.7, and the first ten billion primes in about 1087 seconds (Intel i5-6500 at 3.6 Ghz single threaded boost) for a empirical growth factor of about 1.1 to 1.2, which is about the theoretical limit.

Most of the above times are the time taken for the primes list processing, where the actual culling time is some fraction of the total time, with further optimizations plus multi-threading on multi-core CPU's able to make this small fraction even smaller. This is why conventional functional/incremental SoE's have very limited use for finding the first million primes or so as compared to use of mutable array based page-segmented SoE's which must be used for larger ranges, have further optimizations reducing culling operation time by as much as an average of two time for these ranges, as well as the mentioned easy multi-threading by pages and further reductions in the number of operations through wheel factorizations, which is why I think at least a basic version should have a place on this Wiki page.

GordonBGood 05:55, 10 February 2023 (UTC)

@GordonBGood The TH entry was there before my time, and I just left it as is. I think I've restructured this page once, after having added many entries, and moved some other (previously existing) entries out into a separate page (or two), of course with the link to it. The reason being that I wanted WebArchive to capture the page's copy in its entirety, and IIRC it wasn't doing that, if the page was too long. So of course it'll be perfectly fine and welcomed if you add some new entries, or perhaps add them as links only, linking to the full separate pages if needed, as I did before. -- Cheers! WillNess (talk)
@WillNess, I've removed the TH entry which wasn't really TH and a rather poorer version of the ST mutable array version just above it and replaced it with the Page-Segmented ST-Mutable array version as per the above, and have also added a reasonably short version of a prime counting function in a new section. The page doesn't look to be too long as of yet, but if I do further edits, I'll look into adding just links to new pages as you have recommended above. Thanks. GordonBGood 09:51, 20 February 2023 (UTC)
Just one more thing maybe -- I've put a lot of thought into the progression of gradual exposition of all these versions, so when/if you add something new, could you perhaps continue adding it closer to the bottom of the page, as you indeed have already done with your last edits, so the flow is preserved? Entirely up to your judgment of course. -- WillNess (talk) 11:56, 21 February 2023 (UTC)