Difference between revisions of "Prime numbers"

Latest revision as of 20:34, 6 May 2023

In mathematics, amongst the natural numbers greater than 1, a prime number (or a prime) is such that has no divisors other than itself (and 1). This means that it cannot be represented as a product of any two of such numbers.

Prime Number Resources

At Wikipedia:
- Prime Numbers
- Sieve of Eratosthenes

HackageDB packages:
- arithmoi: Various basic number theoretic functions; efficient array-based sieves, Montgomery curve factorization ...
- Numbers: An assortment of number theoretic functions.
- NumberSieves: Number Theoretic Sieves: primes, factorization, and Euler's Totient.
- primes: Efficient, purely functional generation of prime numbers.

Papers:
- O'Neill, Melissa E., "The Genuine Sieve of Eratosthenes", Journal of Functional Programming, Published online by Cambridge University Press 9 October 2008 doi:10.1017/S0956796808007004.

Definition

In mathematics, amongst the natural numbers greater than 1, a prime number (or a prime) is such that has no divisors other than itself (and 1). The smallest prime is thus 2. Non-prime numbers are known as composite, i.e. those representable as product of two natural numbers greater than 1.

To find out a prime's multiples we can either a. test each new candidate number for divisibility by that prime, giving rise to a kind of trial division algorithm; or b. we can directly generate the multiples of a prime p by counting up from it in increments of p, resulting in a variant of the sieve of Eratosthenes.

The set of prime numbers is thus

P = { n ∈ N₂ : (∀ m ∈ N₂) ( (m | n) ⇒ m = n) }

= { n ∈ N₂ : (∀ m ∈ N₂) ( m < n ⇒ ¬(m | n)) }

= { n ∈ N₂ : (∀ m ∈ N₂) ( m ⋅ m ≤ n ⇒ ¬(m | n)) }

= N₂ \ { n ⋅ m : n,m ∈ N₂ }

= N₂ \ N₂ ⋅ N₂

= N₂ \ ⋃ { { n ⋅ m : m ∈ N_n } : n ∈ N₂ }

= N₂ \ ⋃ { { n ⋅ n, n ⋅ n+n, n ⋅ n+n+n, ... } : n ∈ N₂ }

= N₂ \ ⋃ { { p ⋅ p, p ⋅ p+p, p ⋅ p+p+p, ... } : p ∈ P }

( = N₂ \ ⋃ { { p ⋅ m : m ∈ N_p } : p ∈ P }

= N₂ \ ⋃_{p ∈ P} { p ⋅ N_p }

= N₂ \ P ⋅ N_P )

where N_k = { n ∈ N : n ≥ k }

Thus starting with 2, for each newly found prime we can eliminate from the rest of the numbers all the multiples of this prime, giving us the next available number as next prime. This is known as sieving the natural numbers, so that in the end all the composites are eliminated and what we are left with are just primes. (This is what the last "pseudo"-formulas are describing, though seemingly impredicative, because of being self-referential.)

Prototypically, it is

primes = -- foldl (\\) [2..] [[p+p, p+p+p..] | p <- primes]
   map head $ scanl (\\) [2..] [[p, p+p..] | p <- primes]
              where 
                    (\\) = Data.List.Ordered.minus

Having (a chain of) direct-access mutable arrays indeed enables easy marking of these multiples as is usually done in imperative languages; but to get an efficient list-based code we have to be smart about combining those streams of multiples of each prime - which gives us also the memory efficiency in generating the results incrementally, one by one.

Short exposition is here.

Sieve of Eratosthenes

Simplest, bounded, very inefficient formulation:

import Data.List (\\)  -- (\\) is set-difference for unordered lists

primesTo m = sieve [2..m]
             where 
             sieve (x:xs) = x : sieve (xs \\ [x,x+x..m])
             sieve [] = []
      -- or:
           = ps 
             where
             ps = map head $ takeWhile (not.null) 
                           $ scanl (\\) [2..m] [[p, p+p..m] | p <- ps]

The (unbounded) sieve of Eratosthenes calculates primes as integers above 1 that are not multiples of primes, i.e. not composite — whereas composites are found as enumeration of multiples of each prime, generated by counting up from prime's square in constant increments equal to that prime (or twice that much, for odd primes). This is much more efficient and runs at about n^1.2 empirical orders of growth (corresponding to n (log n)² log log n complexity, more or less, in n primes produced):

import Data.List.Ordered (minus, union, unionAll)

primes = 2 : 3 : minus [5,7..] (unionAll [[p*p, p*p+2*p..] | p <- tail primes])

{- Using `under n = takeWhile (<= n)`, with ordered increasing lists,
   `minus`, `union` and `unionAll` satisfy, for any `n` and `m`:

  under n (minus a b)         == nub . sort $ under n a \\ under n b
  under n (union a b)         == nub . sort $ under n a ++ under n b
  under n . unionAll . take m == under n . foldl union [] . take m
  under n . unionAll          == nub . sort . concat 
                                     . takeWhile (not.null) . map (under n) -}

The definition is primed with 2 and 3 as initial primes, to avoid the vicious circle.

The "big union" unionAll function could be defined as the folding of (\(x:xs) -> (x:) . union xs); or it could use a Bool array as a sorting and duplicates-removing device. The processing naturally divides up into the segments between successive squares of primes.

Stepwise development follows (the fully developed version is here).

Initial definition

First of all, working with ordered increasing lists, the sieve of Eratosthenes can be genuinely represented by

 -- genuine yet wasteful sieve of Eratosthenes
 -- primes = eratos [2.. ]               -- unbounded
primesTo m = eratos [2..m]               -- bounded, up to m
    where
    eratos []     = []
    eratos (p:xs) = p : eratos (xs `minus` [p, p+p..])
 -- eratos (p:xs) = p : eratos (xs `minus` map (p*) [1..])
 -- eulers (p:xs) = p : eulers (xs `minus` map (p*) (p:xs))    
 -- turner (p:xs) = p : turner [x | x <- xs, rem x p /= 0]  

 -- fix ( map head . scanl minus [2..] . map (\p-> [p, p+p..]) )

This should be regarded more like a specification, not a code. It runs at empirical orders of growth worse than quadratic in number of primes produced. But it has the core defining features of the classical formulation of S. of E. as a. being bounded, i.e. having a top limit value, and b. finding out the multiples of a prime directly, by counting up from it in constant increments, equal to that prime.

The canonical list-difference minus and duplicates-removing union functions (cf. Data.List.Ordered) are:

 -- ordered lists, difference and union
minus (x:xs) (y:ys) = case (compare x y) of 
           LT -> x : minus  xs  (y:ys)
           EQ ->     minus  xs     ys 
           GT ->     minus (x:xs)  ys
minus  xs     _     = xs
union (x:xs) (y:ys) = case (compare x y) of 
           LT -> x : union  xs  (y:ys)
           EQ -> x : union  xs     ys 
           GT -> y : union (x:xs)  ys
union  xs     []    = xs
union  []     ys    = ys

The name merge ought to be reserved for duplicates-preserving merging operation of the merge sort.

Analysis

So for each newly found prime p the sieve discards its multiples, enumerating them by counting up in steps of p. There are thus $O(m/p)$ multiples generated and eliminated for each prime, and $O(m\log \log(m))$ multiples in total, with duplicates, by virtues of prime harmonic series.

If each multiple is dealt with in $O(1)$ time, this will translate into $O(m\log \log(m))$ RAM machine operations (since we consider addition as an atomic operation). Indeed mutable random-access arrays allow for that. But lists in Haskell are sequential-access, and complexity of minus(a,b) for lists is $\textstyle O(|a\cup b|)$ instead of $\textstyle O(|b|)$ of the direct access destructive array update. The lower the complexity of each minus step, the better the overall complexity.

So on k-th step the argument list (p:xs) that the eratos function gets, starts with the (k+1)-th prime, and consists of all the numbers ≤ m coprime with all the primes ≤ p. According to the M. O'Neill's article (p.10) there are $\textstyle \Phi (m,p)\in \Theta (m/\log p)$ such numbers.

It looks like $\textstyle \sum _{i=1}^{n}{1/log(p_{i})}=O(n/\log n)$ for our intents and purposes. Since the number of primes below m is $n=\pi (m)=O(m/\log(m))$ by the prime number theorem (where $\pi (m)$ is a prime counting function), there will be n multiples-removing steps in the algorithm; it means total complexity of at least $O(mn/\log(n))=O(m^{2}/(\log(m))^{2})$ , or $O(n^{2})$ in n primes produced - much much worse than the optimal $O(n\log(n)\log \log(n))$ .

From Squares

But we can start each elimination step at a prime's square, as its smaller multiples will have been already produced and discarded on previous steps, as multiples of smaller primes. This means we can stop early now, when the prime's square reaches the top value m, and thus cut the total number of steps to around $\textstyle n=\pi (m^{0.5})=\Theta (2m^{0.5}/\log m)$ . This does not in fact change the complexity of random-access code, but for lists it makes it $O(m^{1.5}/(\log m)^{2})$ , or $O(n^{1.5}/(\log n)^{0.5})$ in n primes produced, a dramatic speedup:

primesToQ m = eratos [2..m] 
    where
    eratos []     = []
    eratos (p:xs) = p : eratos (xs `minus` [p*p, p*p+p..m])
 -- eratos (p:xs) = p : eratos (xs `minus` map (p*) [p..div m p])
 -- eulers (p:xs) = p : eulers (xs `minus` map (p*) (under (div m p) (p:xs)))
 -- turner (p:xs) = p : turner [x | x<-xs, x<p*p || rem x p /= 0]

Its empirical complexity is around $O(n^{1.45})$ . This simple optimization works here because this formulation is bounded (by an upper limit). To start late on a bounded sequence is to stop early (starting past end makes an empty sequence – see warning below^₁), thus preventing the creation of all the superfluous multiples streams which start above the upper bound anyway (note that Turner's sieve is unaffected by this). This is acceptably slow now, striking a good balance between clarity, succinctness and efficiency.

^₁Warning: this is predicated on a subtle point of minus xs [] = xs definition being used, as it indeed should be. If the definition minus (x:xs) [] = x:minus xs [] is used, the problem is back and the complexity is bad again.

Guarded

This ought to be explicated (improving on clarity, though not on time complexity) as in the following, for which it is indeed a minor optimization whether to start from p or p*p - because it explicitly stops as soon as possible:

primesToG m = 2 : sieve [3,5..m]
    where
    sieve (p:xs) 
       | p*p > m   = p : xs
       | otherwise = p : sieve (xs `minus` [p*p, p*p+2*p..])
                  -- p : sieve (xs `minus` map (p*) [p,p+2..])
                  -- p : eulers (xs `minus` map (p*) (p:xs))

(here we also flatly ignore all evens above 2 a priori.) It is now clear that it can't be made unbounded just by abolishing the upper bound m, because the guard can not be simply omitted without changing the complexity back for the worst.

Accumulating Array

So while minus(a,b) takes $O(|b|)$ operations for random-access imperative arrays and about $O(|a|)$ operations here for ordered increasing lists of numbers, when using Haskell's immutable array for a one could expect the array update time to be nevertheless closer to $O(|b|)$ if destructive update was used implicitly by compiler for an array being passed along as an accumulating state parameter:

{-# OPTIONS_GHC -O2 #-}
import Data.Array.Unboxed

primesToA m = sieve 3 (array (3,m) [(i,odd i) | i<-[3..m]]
                        :: UArray Int Bool)
    where
    sieve p a 
       | p*p > m   = 2 : [i | (i,True) <- assocs a]
       | a!p       = sieve (p+2) $ a // [(i,False) | i <- [p*p, p*p+2*p..m]]
       | otherwise = sieve (p+2) a

Indeed for unboxed arrays (suggested by Daniel Fischer; with regular, boxed arrays it is very slow), the above code runs pretty fast, but with empirical complexity of O(n^1.15..1.45) in n primes produced (for producing from few hundred thousands to few millions primes, memory usage also slowly growing). If the update for each index were an O(1) operation, the empirical complexity would be seen as diminishing, as O(n^1.15..1.05), reflecting the true, linearithmic complexity.

We could use explicitly mutable monadic arrays (see below) to remedy this, but we can also think about it a little bit more on the functional side of things still.

Postponed

Going back to guarded Eratosthenes, first we notice that though it works with minimal number of prime multiples streams, it still starts working with each prematurely. Fixing this with explicit synchronization won't change complexity but will speed it up some more:

primesPE1 = 2 : sieve [3..] primesPE1 
    where 
    sieve xs (p:pt) | q <- p*p , (h,t) <- span (< q) xs =
                   h ++ sieve (t `minus` [q, q+p..]) pt
                -- h ++ turner [x | x<-t, rem x p>0] pt

Inlining and fusing span and (++) we get:

primesPE = 2 : sieve [3..] [[p*p, p*p+p..] | p <- primesPE] 
               where
               sieve (x:xs) t@((q:cs):r)
                    | x < q = x : sieve xs t
                    | otherwise = sieve (minus xs cs) r

Since the removal of a prime's multiples here starts at the right moment, and not just from the right place, the code could now finally be made unbounded. Because no multiples-removal process is started prematurely, there are no extraneous multiples streams, which were the reason for the original formulation's extreme inefficiency.

Segmented

With work done segment-wise between the successive squares of primes it becomes

primesSE = 2 : ops
    where
    ops = sieve 3 9 ops []                                -- odd primes
    sieve x q ~(p:pt) fs = 
        foldr (flip minus) [x,x+2..q-2]                   -- chain of subtractions
                           [[y+s, y+2*s..q] | (s,y) <- fs]     -- OR,
        -- [x,x+2..q-2] `minus` foldl union []            -- subtraction of merged
        --                    [[y+s, y+2*s..q] | (s,y) <- fs]            --  lists
        ++ sieve (q+2) (head pt^2) pt
                ((2*p,q):[(s,q-rem (q-y) s) | (s,y) <- fs])

This "marks" the odd composites in a given range by generating them - just as a person performing the original sieve of Eratosthenes would do, counting one by one the multiples of the relevant primes. These composites are independently generated so some will be generated multiple times.

The advantage to working in spans explicitly is that this code is easily amendable to using arrays for the composites marking and removal on each finite span; and memory usage can be kept in check by using fixed sized segments.

Segmented Tree-merging

Rearranging the chain of subtractions into a subtraction of merged streams (see below) and using tree-like folding structure, further speeds up the code and significantly improves its asymptotic time behavior (down to about $O(n^{1.28}empirically)$ , space is leaking though):

primesSTE = 2 : ops
    where                  
    ops = sieve 3 9 ops []                                -- odd primes
    sieve x q ~(p:pt) fs = 
        ([x,x+2..q-2] `minus` joinST [[y+s, y+2*s..q] | (s,y) <- fs])
        ++ sieve (q+2) (head pt^2) pt
                   ((++ [(2*p,q)]) [(s,q-rem (q-y) s) | (s,y) <- fs])
    
joinST (xs:t) = (union xs . joinST . pairs) t
    where
    pairs (xs:ys:t) = union xs ys : pairs t
    pairs t         = t
joinST []     = []

Segmented merging via an array

The removal of composites is easy with arrays. Starting points can be calculated directly:

import Data.List (inits, tails)
import Data.Array.Unboxed
 
primesSAE = 2 : sieve 2 4 (tail primesSAE) (inits primesSAE)
         -- (2:) . (sieve 2 4 . tail <*> inits) $ primesSAE
  where
  sieve r q ps (fs:ft) = [n | (n,True) <- assocs (
         accumArray (\ _ _ -> False) True (r+1,q-1)
                    [(m,()) | p <- fs, let s = p * div (r+p) p,
                              m <- [s,s+p..q-1]] :: UArray Int Bool )]
      ++ sieve q (head ps^2) (tail ps) ft

The pattern of iterated calls to tail is captured by a higher-order function tails, which explicitly generates the stream of tails of a stream, making for a bit more readable (even if possibly a bit less efficient) code:

psSAGE = 2 : [n | (r:q:_, fs) <- (zip . tails . (2:) . map (^2) <*> inits) psSAGE,
                  (n,True)    <- assocs (
                                   accumArray (\_ _ -> False) True (r+1, q-1) 
                        [(m,()) | p <- fs, let s = (r+p)`div`p*p, 
                                  m <- [s,s+p..q-1]] :: UArray Int Bool )]

Linear merging

But segmentation doesn't add anything substantially, and each multiples stream starts at its prime's square anyway. What does the Postponed code do, operationally? With each prime's square passed by, there emerges a nested linear left-deepening structure, (...((xs-a)-b)-...), where xs is the original odds-producing [3,5..] list, so that each odd it produces must go through more and more minus nodes on its way up - and those odd numbers that eventually emerge on top are prime. Thinking a bit about it, wouldn't another, right-deepening structure, (xs-(a+(b+...))), be better? This idea is due to Richard Bird, seen in his code presented in M. O'Neill's article, equivalent to:

primesB = 2 : minus [3..] (foldr (\p r-> p*p : union [p*p+p, p*p+2*p..] r) 
                                 [] primesB)

or,

primesLME1 = 2 : prs
    where
    prs = 3 : minus [5,7..] (joinL [[p*p, p*p+2*p..] | p <- prs])
    
joinL ((x:xs):t) = x : union xs (joinL t)

Here, xs stays near the top, and more frequently odds-producing streams of multiples of smaller primes are above those of the bigger primes, that produce less frequently their multiples which have to pass through more union nodes on their way up. Plus, no explicit synchronization is necessary anymore because the produced multiples of a prime start at its square anyway - just some care has to be taken to avoid a runaway access to the indefinitely-defined structure, defining joinL (or foldr's combining function) to produce part of its result before accessing the rest of its input (thus making it productive).

Melissa O'Neill introduced double primes feed to prevent unneeded memoization (a memory leak). We can even do multistage. Here's the code, faster still and with radically reduced memory consumption, with empirical orders of growth of around ~ $n^{1.40}$ (initially better, yet worsening for bigger ranges):

primesLME = 2 : _Y ((3:) . minus [5,7..] . joinL . map (\p-> [p*p, p*p+2*p..]))

_Y   :: (t -> t) -> t
_Y g = g (_Y g)                -- multistage, non-sharing,  g (g (g (g ...)))
    -- g (let x = g x in x)    -- two g stages, sharing

_Y is a non-sharing fixpoint combinator, here arranging for a recursive "telescoping" multistage primes production (a tower of producers).

This allows the primesLME stream to be discarded immediately as it is being consumed by its consumer. For prs from primesLME1 definition above it is impossible, as each produced element of prs is needed later as input to the same prs corecursive stream definition. So the prs stream feeds in a loop into itself and is thus retained in memory, being consumed by self much slower than it is produced. With multistage production, each stage feeds into its consumer above it at the square of its current element which can be immediately discarded after it's been consumed. (3:) jump-starts the whole thing.

Tree merging

Moreover, it can be changed into a tree structure. This idea is due to Dave Bayer and Heinrich Apfelmus:

primesTME = 2 : _Y ((3:) . gaps 5 . joinT . map (\p-> [p*p, p*p+2*p..]))

-- joinL ((x:xs):t) = x : union xs (joinL t) 
joinT ((x:xs):t) = x : union xs (joinT (pairs t))    -- set union, ~=
  where  pairs (xs:ys:t) = union xs ys : pairs t     --    nub.sort.concat

gaps k s@(x:xs) | k < x = k:gaps (k+2) s       -- ~= [k,k+2..]\\s, 
                | True  =   gaps (k+2) xs      --   when null(s\\[k,k+2..])

This code is pretty fast, running at speeds and empirical complexities comparable with the code from Melissa O'Neill's article (about $O(n^{1.2})$ in number of primes n produced).

As an aside, joinT is equivalent to infinite tree-like folding foldi (\(x:xs) ys-> x:union xs ys) []:

Data.List.Ordered.foldt of the data-ordlist package builds the same structure, but in a lazier fashion, consuming its input at the slowest pace possible. Here this sophistication is not needed (evidently).

Tree merging with Wheel

Wheel factorization optimization can be further applied, and another tree structure can be used which is better adjusted for the primes multiples production (effecting about 5%-10% at the top of a total 2.5x speedup w.r.t. the above tree merging on odds only, for first few million primes):

primesTMWE = [2,3,5,7] ++ _Y ((11:) . tail  . gapsW 11 wheel 
                                    . joinT . hitsW 11 wheel)
    
gapsW k (d:w) s@(c:cs) | k < c     = k : gapsW (k+d) w s    -- set difference
                       | otherwise =     gapsW (k+d) w cs   --   k==c
hitsW k (d:w) s@(p:ps) | k < p     =     hitsW (k+d) w s    -- intersection
                       | otherwise = scanl (\c d->c+p*d) (p*p) (d:w) 
                                       : hitsW (k+d) w ps   --   k==p 

wheel = 2:4:2:4:6:2:6:4:2:4:6:6:2:6:4:2:6:4:6:8:4:2:4:2:
        4:8:6:4:6:2:4:6:2:6:6:4:2:4:6:2:6:4:2:4:2:10:2:10:wheel
  -- cycle $ zipWith (-) =<< tail $ [i | i <- [11..11+_210], gcd i _210 == 1]
  --   where _210 = product [2,3,5,7]

The hitsW function is there to find the starting point for rolling the wheel for each prime, but this can be found directly:

primesW = [2,3,5,7] ++ _Y ( (11:) . tail . gapsW 11 wheel . joinT .
                            map (\p ->  
                              map (p*) . dropWhile (< p) $
                                scanl (+) (p - rem (p-11) 210) wheel) )

Seems to run about 1.4x faster, too.

Above Limit - Offset Sieve

Another task is to produce primes above a given value:

{-# OPTIONS_GHC -O2 -fno-cse #-}
primesFromTMWE primes m = dropWhile (< m) [2,3,5,7,11] 
                          ++ gapsW a wh2 (compositesFrom a)
    where
    (a,wh2) = rollFrom (snapUp (max 3 m) 3 2)
    (h,p2:t) = span (< z) $ drop 4 primes           -- p < z => p*p<=a
    z = ceiling $ sqrt $ fromIntegral a + 1         -- p2>=z => p2*p2>a
    compositesFrom a = joinT (joinST [multsOf p  a    | p <- h ++ [p2]]
                                   : [multsOf p (p*p) | p <- t] )

snapUp v o step = v + (mod (o-v) step)              -- full steps from o
multsOf p from = scanl (\c d->c+p*d) (p*x) wh       -- map (p*) $
    where                                           --   scanl (+) x wh
    (x,wh) = rollFrom (snapUp from p (2*p) `div` p) --   , if p < from 

wheelNums  = scanl (+) 0 wheel
rollFrom n = go wheelNums wheel 
    where 
    m = (n-11) `mod` 210  
    go (x:xs) ws@(w:ws2) | x < m = go xs ws2
                         | True  = (n+x-m, ws)      -- (x >= m)

A certain preprocessing delay makes it worthwhile when producing more than just a few primes, otherwise it degenerates into simple trial division, which is then ought to be used directly:

primesFrom m = filter isPrime [m..]

Map-based

Runs ~1.7x slower than TME version, but with the same empirical time complexity, ~ $n^{1.2}$ (in n primes produced) and same very low (near constant) memory consumption:

import Data.List                 -- based on 
import qualified Data.Map as M   --   http://stackoverflow.com/a/1140100

primesMPE :: [Integer]
primesMPE = 2 : mkPrimes 3 M.empty prs 9   -- postponed sieve enlargement
    where                                 -- by decoupled primes feed loop
    prs = 3 : mkPrimes 5 M.empty prs 9
    mkPrimes n m ps@ ~(p:pt) q = case (M.null m, M.findMin m) of
      { (False, (n2, skips)) | n == n2 ->
             mkPrimes (n+2) (addSkips n (M.deleteMin m) skips) ps q
      ; _ -> if n < q
             then    n : mkPrimes (n+2)  m                  ps q
             else        mkPrimes (n+2) (addSkip n m (2*p)) pt (head pt^2)
      }
    addSkip n m s = M.alter (Just . maybe [s] (s:)) (n+s) m
    addSkips = foldl' . addSkip

Turner's sieve - Trial division

David Turner's (SASL Language Manual, 1983) formulation replaces non-standard minus in the sieve of Eratosthenes by stock list comprehension with rem filtering, turning it into a trial division algorithm, for clarity and simplicity:

  -- unbounded sieve, premature filters
primesT = sieve [2..]
          where
          sieve (p:xs) = p : sieve [x | x <- xs, rem x p > 0]

  -- map head
  --    $ iterate (\(p:xs) -> [x | x <- xs, rem x p > 0]) [2..]

This creates many superfluous implicit filters, because they are created prematurely. To be admitted as prime, each number will be tested for divisibility here by all its preceding primes, while just those not greater than its square root would suffice. To find e.g. the 1001st prime (7927), 1000 filters are used, when in fact just the first 24 are needed (up to 89's filter only). Operational overhead here is huge; theoretically, it has quadratic time complexity, in the number of produced primes.

Guarded Filters

But this really ought to be changed into the abortive variant, again achieving the "miraculous" complexity improvement from above quadratic to about $O(n^{1.45})$ empirically (in n primes produced) by stopping the sieving as soon as possible:

primesToGT m = sieve [2..m]
    where
    sieve (p:xs) 
        | p*p > m = p : xs
        | True    = p : sieve [x | x <- xs, rem x p > 0]

  -- (\(a,b:_) -> map head a ++ b) . span ((< m).(^2).head) $
  --     iterate (\(p:xs) -> [x | x <- xs, rem x p > 0]) [2..m]

Postponed Filters

Or it can remain unbounded, just filters creation must be postponed until the right moment:

primesPT1 = 2 : sieve primesPT1 [3..] 
    where 
    sieve (p:pt) xs = let (h,t) = span (< p*p) xs 
                      in h ++ sieve pt [x | x <- t, rem x p > 0]

  -- fix $ concatMap (fst . fst)
  --     . iterate (\((_,xs), p:pt) -> let (h,t) = span (< p*p) xs in
  --                              ((h, [x | x <- t, rem x p > 0]), pt)) 
  --     . (,) ([2],[3..])

It can be re-written with span and (++) inlined and fused into the sieve:

primesPT = 2 : oddprimes
    where 
    oddprimes = sieve [3,5..] 9 oddprimes
    sieve (x:xs) q ps@ ~(p:pt)
        | x < q = x : sieve xs q ps
        | True  =     sieve [x | x <- xs, rem x p /= 0] (head pt^2) pt

creating here as well the linear filtering nested structure at run-time, (...(([3,5..] >>= filterBy [3]) >>= filterBy [5])...), but unlike the non-postponed code each filter being created at its proper moment, not sooner than the prime's square is seen.

filterBy ds n = [n | noDivs n ds]    -- `ds` assumed to be non-decreasing

noDivs n ds = foldr (\d r -> d*d > n || (rem n d > 0 && r)) True ds

Optimal trial division

The above is algorithmically equivalent to the traditional formulation of trial division,

ps = 2 : [i | i <- [3..],  
              and [rem i p > 0 | p <- takeWhile (\p -> p^2 <= i) ps]]

or,

-- primes = filter (`noDivs`[2..]) [2..]
--        = 2 : filter (`noDivs`[3,5..]) [3,5..]
primesTD  = 2 : 3 : filter (`noDivs` tail primesTD) [5,7..]

isPrime n = n > 1 && noDivs n primesTD

except that this code is rechecking for each candidate number which primes to use, whereas for every candidate number in each segment between the successive squares of primes these will just be the same prefix of the primes list being built.

Trial division is used as a simple primality test and prime factorization algorithm.

Segmented Generate and Test

Next we turn the list of filters into one filter by an explicit list, each one in a progression of prefixes of the primes list. This seems to eliminate most recalculations, explicitly filtering composites out from batches of odds between the consecutive squares of primes.

import Data.List

primesST = 2 : ops                 
    where
    ops = sieve 3 9 ops (inits ops)               -- odd primes
        -- (sieve 3 9 <*> inits) ops              -- inits: [],[3],[3,5],...
    sieve x q ~(_:pt) (fs:ft) =
        filter ((`all` fs) . ((> 0).) . rem) [x,x+2..q-2]
        ++ sieve (q+2) (head pt^2) pt ft

This can also be coded as, arguably more readable,

 primesSGT = 
    {- 2 : [ n | (px, r:q:_) <- zip (inits primesSGT) 
                                    (tails $ 2 : map (^2) primesSGT),
                 n <- [r+1..q-1],  all ((> 0) . rem n) px ]
     -}
       2 : ops
    where
    ops = 3 : [n | (px, r:q:_) <- zip (inits ops) (tails $ 3 : map (^2) ops),
                   n <- [r+2,r+4..q-2],  all ((> 0) . rem n) px]

                -- n <- foldl (>>=) [r+2,r+4..q-2]           -- chain of filters
                --                   [filterBy [p] | p <- px]]        -- OR,

                -- n <- [r+2,r+4..q-2] >>= filterBy px]      -- a filter by a list

Generate and Test Above Limit

The following will start the segmented Turner sieve at the right place, using any primes list it's supplied with (e.g. TMWE etc.) or itself, as shown, demand computing it just up to the square root of any prime it'll produce:

primesFromST m | m <= 2 = 2 : primesFromST 3
primesFromST m | m > 2 =
         sieve (m`div`2*2+1) (head ps^2) (tail ps) (inits ps)
    where 
    (h,ps) = span (<= (floor.sqrt $ fromIntegral m+1)) ops
    sieve x q ps (fs:ft) =
        filter ((`all` (h ++ fs)) . ((> 0) .) . rem) [x,x+2..q-2]
        ++ sieve (q+2) (head ps^2) (tail ps) ft
    ops = 3 : primesFromST 5                      -- odd primes

-- ~> take 3 $ primesFromST 100000001234
-- [100000001237,100000001239,100000001249]

This is usually faster than testing candidate numbers for divisibility one by one which has to re-fetch anew the needed prime factors to test by, for each candidate. Faster is the offset sieve of Eratosthenes on odds, and yet faster the one w/ wheel optimization, on this page.

Conclusions

All these variants being variations of trial division, finding out primes by direct divisibility testing of every candidate number by sequential primes below its square root (instead of just by its factors, which is what direct generation of multiples is doing, essentially), are thus principally of worse complexity than that of Sieve of Eratosthenes.

The initial code is just a one-liner that ought to have been regarded as executable specification in the first place. It can easily be improved quite significantly with a simple use of bounded, guarded formulation to limit the number of filters it creates, or by postponement of filter creation.

Euler's Sieve

Unbounded Euler's sieve

With each found prime Euler's sieve removes all its multiples in advance so that at each step the list to process is guaranteed to have no multiples of any of the preceding primes in it (consists only of numbers coprime with all the preceding primes) and thus starts with the next prime:

primesEU = 2 : eulers [3,5..]
  where
    eulers (p:xs) = p : eulers (xs `minus` map (p*) (p:xs))
 -- eratos (p:xs) = p : eratos (xs `minus` [p*p, p*p+2*p..])

This code is extremely inefficient, running above $O({n^{2}})$ empirical complexity (and worsening rapidly), and should be regarded a specification only. Its memory usage is very high, with empirical space complexity just below $O({n^{2}})$ , in n primes produced.

In the stream-based sieve of Eratosthenes we are able to skip along the input stream xs directly to the prime's square, consuming the whole prefix at once, thus achieving the results equivalent to the postponement technique, because the generation of the prime's multiples is independent of the rest of the stream.

But here in the Euler's sieve it is dependent on all xs and we're unable in principle to skip along it to the prime's square - because all xs are needed for each prime's multiples generation. Thus efficient unbounded stream-based implementation seems to be impossible in principle, under the simple scheme of producing the multiples by multiplication.

Wheeled list representation

The situation can be somewhat improved using a different list representation, for generating lists not from a last element and an increment, but rather a last span and an increment, which entails a set of helpful equivalences:

{- fromElt (x,i) = x : fromElt (x + i,i)
                       === iterate  (+ i) x
     [n..]             === fromElt  (n,1) 
                       === fromSpan ([n],1) 
     [n,n+2..]         === fromElt  (n,2)    
                       === fromSpan ([n,n+2],4)     -}

fromSpan (xs,i)  = xs ++ fromSpan (map (+ i) xs,i)

{-                   === concat $ iterate (map (+ i)) xs
   fromSpan (p:xt,i) === p : fromSpan (xt ++ [p + i], i)  
   fromSpan (xs,i) `minus` fromSpan (ys,i) 
                     === fromSpan (xs `minus` ys, i)  
   map (p*) (fromSpan (xs,i)) 
                     === fromSpan (map (p*) xs, p*i)
   fromSpan (xs,i)   === forall (p > 0).
     fromSpan (concat $ take p $ iterate (map (+ i)) xs, p*i) -}

spanSpecs = iterate eulerStep ([2],1)
eulerStep (xs@(p:_), i) = 
       ( (tail . concat . take p . iterate (map (+ i))) xs
          `minus` map (p*) xs, p*i )

{- > mapM_ print $ take 4 spanSpecs 
     ([2],1)
     ([3],2)
     ([5,7],6)
     ([7,11,13,17,19,23,29,31],30)  -}

Generating a list from a span specification is like rolling a wheel as its pattern gets repeated over and over again. For each span specification w@((p:_),_) produced by eulerStep, the numbers in (fromSpan w) up to ${p^{2}}$ are all primes too, so that

eulerPrimesTo m = if m > 1 then go ([2],1) else []
    where
      go w@((p:_), _) 
         | m < p*p = takeWhile (<= m) (fromSpan w)
         | True    = p : go (eulerStep w)

This runs at about $O(n^{1.5..1.8})$ complexity, for n primes produced, and also suffers from a severe space leak problem (IOW its memory usage is also very high).

Using Immutable Arrays

Generating Segments of Primes

The sieve of Eratosthenes' removal of multiples on each segment of odds can be done by actually marking them in an array, instead of manipulating ordered lists, and can be further sped up more than twice by working with odds only:

import Data.Array.Unboxed
 
primesSA :: [Int]
primesSA = 2 : oddprimes ()
  where
    oddprimes = (3 :) . sieve 3 [] . oddprimes
    sieve x fs (p:ps) = [i*2 + x | (i,True) <- assocs a]
                        ++ sieve (p*p) ((p,0) :
                             [(s, rem (y-q) s) | (s,y) <- fs]) ps
      where
      q = (p*p-x)`div`2
      a :: UArray Int Bool
      a = accumArray (\ b c -> False) True (1,q-1)
                     [(i,()) | (s,y) <- fs, i <- [y+s, y+s+s..q]]

Runs significantly faster than TMWE and with better empirical complexity, of about $O(n^{1.10..1.05})$ in producing first few millions of primes, with constant memory footprint.

Calculating Primes Upto a Given Value

Equivalent to Accumulating Array above, running somewhat faster (compiled by GHC with optimizations turned on):

{-# OPTIONS_GHC -O2 #-}
import Data.Array.Unboxed

primesToNA n = 2: [i | i <- [3,5..n], ar ! i]
  where
    ar = f 5 $ accumArray (\ a b -> False) True (3,n) 
                        [(i,()) | i <- [9,15..n]]
    f p a | q > n = a
          | True  = if null x then a2 else f (head x) a2
      where q = p*p
            a2  :: UArray Int Bool
            a2 = a // [(i,False) | i <- [q, q+2*p..n]]
            x  = [i | i <- [p+2,p+4..n], a2 ! i]

Calculating Primes in a Given Range

primesFromToA a b = (if a<3 then [2] else []) 
                      ++ [i | i <- [o,o+2..b], ar ! i]
  where 
    o  = max (if even a then a+1 else a) 3   -- first odd in the segment
    r  = floor . sqrt $ fromIntegral b + 1
    ar = accumArray (\_ _ -> False) True (o,b)  -- initially all True,
          [(i,()) | p <- [3,5..r]
                    , let q  = p*p      -- flip every multiple of an odd 
                          s  = 2*p                         -- to False
                          (n,x) = quotRem (o - q) s 
                          q2 = if  o <= q  then q
                               else  q + (n + signum x)*s
                    , i <- [q2,q2+s..b] ]

Although sieving by odds instead of by primes, the array generation is so fast that it is very much feasible and even preferable for quick generation of some short spans of relatively big primes.

Using Mutable Arrays

Using mutable arrays is the fastest but not the most memory efficient way to calculate prime numbers in Haskell.

Using ST Array

This method implements the Sieve of Eratosthenes, similar to how you might do it in C, modified to work on odds only. It is fast, but about linear in memory consumption, allocating one (though apparently packed) sieve array for the whole sequence to process.

import Control.Monad
import Control.Monad.ST
import Data.Array.ST
import Data.Array.Unboxed

sieveUA :: Int -> UArray Int Bool
sieveUA top = runSTUArray $ do
    let m = (top-1) `div` 2
        r = floor . sqrt $ fromIntegral top + 1
    sieve <- newArray (1,m) True      -- :: ST s (STUArray s Int Bool)
    forM_ [1..r `div` 2] $ \i -> do
      isPrime <- readArray sieve i
      when isPrime $ do               -- ((2*i+1)^2-1)`div`2 == 2*i*(i+1)
        forM_ [2*i*(i+1), 2*i*(i+2)+1..m] $ \j -> do
          writeArray sieve j False
    return sieve
 
primesToUA :: Int -> [Int]
primesToUA top = 2 : [i*2+1 | (i,True) <- assocs $ sieveUA top]

Its empirical time complexity is improving with n (number of primes produced) from above $O(n^{1.20})$ towards $O(n^{1.16})$ . The reference C++ vector-based implementation exhibits this improvement in empirical time complexity too, from $O(n^{1.5})$ gradually towards $O(n^{1.12})$ , where tested (which might be interpreted as evidence towards the expected quasilinearithmic $O(n\log(n)\log(\log n))$ time complexity).

Using Page-Segmented ST-Mutable Unboxed Array

The following code is similar to the above in using (automatically by default) bit-packed unboxed arrays sieving "odds-only", but adds the feature of page-segmentation where each page is about the size of the CPU L2 cache (2^20 or about a million bits in the example) so can sieve reasonably efficiently without "cache thrashing" to this number squared or about 10^12. Using page segmentation, it is able to sieve to produce the primes to this limit only using the memory space required for three of these buffers, a (recursive) two used to sieve for the base primes up to a million (actually up to two million but only a million are used to this limit) and the third used for the page segments where they represent increasing values as the computation progresses.

Thus, only less than 400 Kilobytes is used sieving up to the limit of 10^12 rather than the about 62.5 Gigabytes required for the one-huge-monolithic array version (even bit-packed), and even if a computer had this much memory, the huge cost in memory access time of up up to about a hundred CPU clock cycles per access due to "cache-thrashing" as compared to the average of four or five CPU clock cycles as used in this code.

If one places the following GHC Haskell code in a file called TestPrimes.hs:

{-# OPTIONS_GHC -O2 #-}

import Data.Int ( Int64 )
import Data.Word ( Word64 )
import Data.Bits ( Bits(shiftR) )
import Data.Array.Base ( IArray(unsafeAt), UArray(UArray),     
                         MArray(unsafeWrite), unsafeFreezeSTUArray ) 
import Control.Monad ( forM_ )
import Data.Array.ST ( MArray(newArray), runSTUArray )

type Prime = Word64

cSieveBufferLimit :: Int
cSieveBufferLimit = 2^17 * 8 - 1 -- CPU L2 cache in bits

primes :: () -> [Prime]
primes() = 2 : _Y (listPagePrms . pagesFrom 0) where
  _Y g = g (_Y g) -- non-sharing multi-stage fixpoint combinator
  listPagePrms pgs@(hdpg@(UArray lwi _ rng _) : tlpgs) =
    let loop i | i >= fromIntegral rng = listPagePrms tlpgs
               | unsafeAt hdpg i = loop (i + 1)
               | otherwise = let ii = lwi + fromIntegral i in
                             case fromIntegral $ 3 + ii + ii of
                               p -> p `seq` p : loop (i + 1) in loop 0
  makePg lwi bps = runSTUArray $ do
    let limi = lwi + fromIntegral cSieveBufferLimit
        bplmt = floor $ sqrt $ fromIntegral $ limi + limi + 3
        strta bp = let si = fromIntegral $ (bp * bp - 3) `shiftR` 1
                   in if si >= lwi then fromIntegral $ si - lwi else
                   let r = fromIntegral (lwi - si) `mod` bp
                   in if r == 0 then 0 else fromIntegral $ bp - r
    cmpsts <- newArray (lwi, limi) False
    fcmpsts <- unsafeFreezeSTUArray cmpsts
    let cbps = if lwi == 0 then listPagePrms [fcmpsts] else bps
    forM_ (takeWhile (<= bplmt) cbps) $ \ bp ->
      forM_ (let sp = fromIntegral $ strta bp
             in [ sp, sp + fromIntegral bp .. cSieveBufferLimit ]) $ \c ->
        unsafeWrite cmpsts c True
    return cmpsts
  pagesFrom lwi bps = map (`makePg` bps)
                          [ lwi, lwi + fromIntegral cSieveBufferLimit + 1 .. ]

main :: IO ()
main = print $ last $ take (10^ (10 :: Int)) $ primes()

Run as (for Linux-like systems):

$ ghc TestPrimes
$ time ./TestPrimes
179424673

real    0m0.714s
user    0m0.703s
sys     0m0.009s

The above code finds the first ten million primes in about 0.71 seconds, the first hundred million primes in about 7.6 seconds, the first thousand million (billion) primes in about 85.7, and the first ten billion primes in about 1087 seconds (Intel i5-6500 at 3.6 Ghz single threaded boost) for a empirical growth factor of about 1.1 to 1.2, which is about the theoretical limit.

Most of the above times are the time taken for the primes list processing, where the actual culling time is some fraction of the total time, with further optimizations plus multi-threading on multi-core CPU's able to make this small fraction even smaller. This is why conventional functional/incremental SoE's have very limited use for finding the first million primes or so as compared to use of mutable unboxed array based page-segmented SoE's which should be used for larger ranges, and have further optimizations reducing culling operation time by as much as an average of two times for these ranges, as well as the mentioned easy multi-threading by pages and further reductions in the number of operations through wheel-factorization, etc.

Implicit Heap

See Implicit Heap.

Prime Wheels

See Prime Wheels.

Using IntSet for a traditional sieve

See Using IntSet for a traditional sieve.

Prime Counting Functions

For counting of the number of primes up to a limit, just using a prime sieve is not the fastest way, especially when the ranges are over a few millions, and a prime counting function based on the inclusion-exclusion principle is much faster as although the operations are slower integer division operations rather than just simple adding and shifting integer operations, there are many less operations because the execution complexity is something like O(n^(2/3)) for Meissel based functions and O(n^(3/4)) for Legendre functions as per the one shown here. This operational complexity reduction comes about due to the inclusion-exclusion principle, which does not require knowledge of all the primes over the counting range as only a subset is required, which are found using a sieve. In this example, sieving to the square root of the counting range is a small part of the overall time so a basic odds-only Sieve of Eratosthenes is sufficient since most of the time will be spent determining the counts of primes to various points across the range.

The following code works by using the non-recursive partial sieving definition of the Legendre Phi function where for each base prime sieving pass up to the quad root of the counting range, a "smalls" Look Up Table (LUT) array of the counts of the remaining potential primes (which are only partially sieved, so may still be compound) is maintained, a LUT array of the remaining "roughs" or potential primes above the sieved base primes is updated, and a "larges" LUT array of the counting range divided by each of the remaining "roughs" is updated per sieving pass for each of the remaining potential "roughs". Following complete sieving, a final adjustment is made to compensate for the counting range divided by the product of pairs of primes above the quad root of the counting range where the primes must never be the same value. Since the LUT arrays represent actual count values (including the count of the sieving base primes) and not Phi values where the base primes are not included in the counts but one is, there is a compensation for this discrepancy at the end of each partial sieving loop and at the end of the "products of pairs" loop. Finally, after combining the counts into one result value, one is added to the count for the only even prime of two.

The following Legendre prime counting function code is iterative (not recursive although the iterative loops are expressed recursively) through the use of partial sieving where the culling array is sieved and counted one base prime at a time as is one interpretation of the Legendre Phi function (save as a file named PrimeCount.hs):

{-# OPTIONS_GHC -O2 -fllvm #-}
{-# LANGUAGE FlexibleContexts, BangPatterns #-}

import Data.Int (Int64, Int32)
import Data.Bits (Bits(shiftR), (.|.))
import Control.Monad (forM_)
import Control.Monad.ST (ST, runST)
import Data.Array.Base (STUArray, MArray(unsafeNewArray_), newArray,
                        unsafeAt, unsafeFreezeSTUArray, unsafeRead, unsafeWrite)

primeCount :: Int64 -> Int64
primeCount n =
  if n < 9 then (if n < 2 then 0 else (n + 1) `div` 2) else
  let
    {-# INLINE divide #-}
    divide :: Int64 -> Int64 -> Int
    divide nm d = truncate $ (fromIntegral nm :: Double) / fromIntegral d
    {-# INLINE half #-}
    half :: Int -> Int
    half x = (x - 1) `shiftR` 1
    rtlmt = floor $ sqrt (fromIntegral n :: Double) 
    mxndx = (rtlmt - 1) `div` 2
    (npc, ns, smalls, roughs, larges) = runST $ do
      mss <- unsafeNewArray_ (0, mxndx) :: ST s (STUArray s Int Int32)
      forM_ [ 0 .. mxndx ] $ \ i -> unsafeWrite mss i (fromIntegral i)
      mrs <- unsafeNewArray_ (0, mxndx) :: ST s (STUArray s Int Int32)
      forM_ [ 0 .. mxndx ] $ \ i -> unsafeWrite mrs i (fromIntegral i * 2 + 1)
      mls <- unsafeNewArray_ (0, mxndx) :: ST s (STUArray s Int Int64)
      forM_ [ 0 .. mxndx ] $ \ i ->
        let d = fromIntegral (i + i + 1)
        in unsafeWrite mls i (fromIntegral (divide n d - 1) `div` 2)
      skips <- newArray (0, mxndx) False :: ST s (STUArray s Int Bool)
      let loop i !pc !cs =
            let sqri = (i + i) * (i + 1) in
            if sqri > mxndx then do
              fss <- unsafeFreezeSTUArray mss
              frs <- unsafeFreezeSTUArray mrs
              fls <- unsafeFreezeSTUArray mls
              return (pc, cs + 1, fss, frs, fls)
            else do
              v <- unsafeRead skips i
              if v then loop (i + 1) pc cs else do
                unsafeWrite skips i True
                let bp = i + i + 1
                    cull c = if c > mxndx then return () else do
                               unsafeWrite skips c True; cull (c + bp)
                    part k !s =
                      if k > cs then return (s - 1) else do
                        p <- unsafeRead mrs k
                        t <- unsafeRead skips ((fromIntegral p - 1) `shiftR` 1)
                        if t then part (k + 1) s else do
                          olv <- unsafeRead mls k
                          let d = fromIntegral p * fromIntegral bp
                          adjv <- -- return olv
                                  if d <= fromIntegral rtlmt then do
                                    sv <- unsafeRead mss (fromIntegral d `shiftR` 1)
                                    unsafeRead mls (fromIntegral sv - pc)
                                  else do
                                    sv <- unsafeRead mss (half (divide n d))
                                    return (fromIntegral sv)
                          unsafeWrite mls s (olv - adjv + fromIntegral pc)
                          unsafeWrite mrs s p; part (k + 1) (s + 1)
                    k0 = ((rtlmt `div` bp) - 1) .|. 1
                    adjc !j k =
                      if k < bp then return () else do
                        c <- unsafeRead mss (k `shiftR` 1)
                        let ac = c - fromIntegral pc
                            e = (k * bp) `shiftR` 1
                            adj m = if m < e then adjc m (k - 2)
                                    else do ov <- unsafeRead mss m
                                            unsafeWrite mss m (ov - ac)
                                            adj (m - 1)
                        adj j
                cull sqri; his <- part 0 0; adjc mxndx k0; loop (i + 1) (pc + 1) his
      loop 1 0 mxndx
    ans0 = unsafeAt larges 0 + fromIntegral ((ns + 2 * (npc - 1)) * (ns - 1) `div` 2)
    accum j !ans =
      if j >= ns then ans else
      let q = fromIntegral (unsafeAt roughs j)
          m = n `div` q
          e = fromIntegral (unsafeAt smalls (half (fromIntegral (m `div` q)))) - npc
          comp k !ac =
            if k > e then ac else
            let p = fromIntegral (unsafeAt roughs k)
                ci = half (fromIntegral (divide m p))
            in comp (k + 1) (ac + fromIntegral (unsafeAt smalls ci))
      in if e <= j then ans else
         accum (j + 1) (comp (j + 1) ans - fromIntegral ((e - j) * (npc + j - 1)))
  in accum 1 (ans0 - sum [ unsafeAt larges i | i <- [ 1 .. ns - 1 ] ]) + 1

main :: IO ()
main = print $ primeCount (10^(11 :: Int))

Run as (Linux-like OS's):

ghc PrimeCount
time ./PrimeCount
4118054813

real    0m0.033s
user    0m0.029s
sys     0m0.004s

This time of about 29 milliseconds to count the primes to 10^11 and the ability to count the primes to 10^16 in about 73 seconds (Intel i5-6500 at 3.6 GHZ) shows the empirical growth of about 0.68, which is why it can count extended count ranges so fast where using a sieve to count the range to 10^16 with a computational complexity a little above one would take over a thousand hours single-threaded as here, even with the best of constant-factor wheel-factorization and loop-operation-time-reduction optimizations. The main problem with this algorithm is that it uses memory as the square root of the counting range at about 800 Megabytes for the range to 10^16.

Other prime counting function algorithms such as Lagarias, Miller, and Odlyzko's algorithm use memory proportional to the cube root of the counting range and have a slightly further reduced computational complexity while not significantly adding to the cost per operation as well as supporting easy multi-threading, but the code is somewhat long to show on this page of basic algorithms.

Prime counting function algorithms can also be modified to produce the sum of the primes up to a limit.

Testing Primality, and Integer Factorization

See Testing primality:

One-liners

See primes one-liners.

External links

http://www.cs.hmc.edu/~oneill/code/haskell-primes.zip (archived copy)

A collection of prime generators; the file "ONeillPrimes.hs" contains one of the fastest pure-Haskell prime generators; code by Melissa O'Neill.

WARNING: Don't use the priority queue from older versions of that file for your projects: it's broken and works for primes only by a lucky chance. The revised version of the file fixes the bug, as pointed out by Eugene Kirpichov on February 24, 2009 on the haskell-cafe mailing list, and fixed by Bertram Felgenhauer.

test entries for (some of) the above code variants.

Speed/memory comparison table for sieve of Eratosthenes variants.

Empirical orders of growth on Wikipedia.

@@ Line 1: / Line 1: @@
-In mathematics, ''amongst the natural numbers greater than 1'', a ''prime number'' (or a ''prime'') is such that has no divisors other than itself (and 1).
+In mathematics, ''amongst the natural numbers greater than 1'', a ''prime number'' (or a ''prime'') is such that has no divisors other than itself (and 1). This means that it cannot be represented as a product of any two of such numbers.
 == Prime Number Resources ==
@@ Line 18: / Line 18: @@
 == Definition ==
-In mathematics, ''amongst the natural numbers greater than 1'', a ''prime number'' (or a ''prime'') is such that has no divisors other than itself (and 1). The smallest prime is thus 2. Non-prime numbers are known as ''composite'', i.e. those representable as product of two natural numbers greater than 1. The set of prime numbers is thus
+In mathematics, ''amongst the natural numbers greater than 1'', a ''prime number'' (or a ''prime'') is such that has no divisors other than itself (and 1). The smallest prime is thus 2. Non-prime numbers are known as ''composite'', i.e. those representable as product of two natural numbers greater than 1.
+To find out a prime's multiples we can either '''a.''' test each new candidate number for divisibility by that prime, giving rise to a kind of ''trial division'' algorithm; or '''b.''' we can directly generate the multiples of a prime ''p'' by counting up from it in increments of ''p'', resulting in a variant of the ''sieve of Eratosthenes''.
-:: '''P''' = {''n'' &isin; '''N'''<sub>2</sub> ''':''' (&forall; ''m'' &isin; '''N'''<sub>2</sub>) (''m''&times;''m'' &le; ''n'' &rArr; &not;(''m'' | ''n''))}
+The set of prime numbers is thus
-:::  = '''N'''<sub>2</sub> \ {''n''&times;''m'' ''':''' ''n'',''m'' &isin; '''N'''<sub>2</sub>} =
-::: = '''N'''<sub>2</sub> \ '''&cup;''' { {''n''&times;''m'' ''':''' ''m'' &isin; '''N'''<sub>n</sub>} ''':''' ''n'' &isin; '''N'''<sub>2</sub> }
+: &nbsp;&nbsp; '''P''' &nbsp; = &nbsp; {&#8202;''n'' &isin; '''N'''<sub>2</sub> ''':''' (&forall; ''m'' &isin; '''N'''<sub>2</sub>) (&#8202;(''m'' | ''n'') &rArr; m = n)&#8202;}
-::: = '''N'''<sub>2</sub> \ '''&cup;''' { {''n''&times;''n'', ''n''&times;''n''+''n'', ''n''&times;''n''+2&times;''n'', ...} ''':''' ''n'' &isin; '''N'''<sub>2</sub> } <!-- '''N'''<sub>2</sub>  -->  where '''N'''<sub>k</sub> = {''n'' &isin; '''N''' ''':''' ''n'' &ge; k}.
+:: &nbsp; = &nbsp; {&#8202;''n'' &isin; '''N'''<sub>2</sub> ''':''' (&forall; ''m'' &isin; '''N'''<sub>2</sub>) (&#8202;''m''&#8202; &lt; ''n'' &rArr; &not;(''m'' | ''n''))&#8202;}
+:: &nbsp; = &nbsp; {&#8202;''n'' &isin; '''N'''<sub>2</sub> ''':''' (&forall; ''m'' &isin; '''N'''<sub>2</sub>) (&#8202;''m''&#8202;⋅&#8202;''m'' &le; ''n'' &rArr; &not;(''m'' | ''n''))&#8202;}
-Thus starting with 2, for each newly found prime we can ''eliminate'' from the rest of the numbers ''all the multiples'' of this prime, giving us the next available number as next prime. This is known as ''sieving'' the natural numbers, so that in the end all the composites are eliminated and what we are left with are just primes.
+:: &nbsp; = &nbsp; '''N'''<sub>2</sub> \ { ''n''&#8202;⋅&#8202;''m'' ''':''' ''n'',''m'' &isin; '''N'''<sub>2</sub> }
-To eliminate a prime's multiples we can either '''a.''' test each new candidate number for divisibility by that prime, giving rise to a kind of ''trial division'' algorithm; or '''b.''' we can directly generate the multiples of a prime ''<code>p</code>'' by counting up from it in increments of ''<code>p</code>'', resulting in a variant of the ''sieve of Eratosthenes''.
+:: &nbsp; = &nbsp; '''N'''<sub>2</sub> \  '''N'''<sub>2</sub> ⋅ '''N'''<sub>2</sub>
-Having a direct-access mutable arrays indeed enables easy marking off of these multiples on pre-allocated array as it is usually done in imperative languages; but to get an [[#Tree merging with Wheel|efficient ''list''-based code]] we have to be smart about combining those streams of multiples of each prime - which gives us also the memory efficiency in generating the results incrementally, one by one.
+:: &nbsp; = &nbsp; '''N'''<sub>2</sub> \ '''&#8899;''' { {&#8202;''n''&#8202;⋅&#8202;''m'' ''':''' ''m'' &isin; '''N'''<sub>''n''</sub>&#8202;} ''':''' ''n'' &isin; '''N'''<sub>2</sub> }
+:: &nbsp; = &nbsp; '''N'''<sub>2</sub> \ '''&#8899;''' { {&#8202;''n''&#8202;⋅&#8202;''n'', ''n''&#8202;⋅&#8202;''n''+''n'', ''n''&#8202;⋅&#8202;''n''+''n''+''n'', ...&#8202;} ''':''' ''n'' &isin; '''N'''<sub>2</sub> }
+:: &nbsp; = &nbsp; '''N'''<sub>2</sub> \ '''&#8899;''' { {&#8202;''p''&#8202;⋅&#8202;''p'', ''p''&#8202;⋅&#8202;''p''+''p'', ''p''&#8202;⋅&#8202;''p''+''p''+''p'', ...&#8202;} ''':''' ''p'' &isin; '''P''' }
+:: &nbsp; &nbsp; &nbsp; (&nbsp; = &nbsp; '''N'''<sub>2</sub> \ '''&#8899;''' { {&#8202;''p''&#8202;⋅&#8202;''m'' ''':''' ''m'' &isin; '''N'''<sub>''p''</sub>&#8202;} ''':''' ''p'' &isin; '''P''' }
+:: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; = &nbsp; '''N'''<sub>2</sub> \ '''&#8899;'''<sub>''p'' &isin; '''P'''</sub> {&#8202;''p''&#8202; ⋅&#8202; '''N'''<sub>''p''</sub>&#8202;}
+:: &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; = &nbsp; '''N'''<sub>2</sub> \  '''P''' ⋅ '''N'''<sub>&#8202;'''P'''</sub> &nbsp;)
+:::: &nbsp; where &nbsp; '''N'''<sub>''k''</sub> = { ''n'' &isin; '''N''' ''':''' ''n'' &ge; ''k'' }
+Thus starting with 2, for each newly found prime we can ''eliminate'' from the rest of the numbers ''all the multiples'' of this prime, giving us the next available number as next prime. This is known as ''sieving'' the natural numbers, so that in the end all the composites are eliminated and what we are left with are just primes. <small>(This is what the last "pseudo"-formulas are describing, though seemingly [http://en.wikipedia.org/wiki/Impredicativity impredicative], because of being self-referential.)</small>
+Prototypically, it is
+<haskell>
+primes = -- foldl (\\) [2..] [[p+p, p+p+p..] | p <- primes]
+   map head $ scanl (\\) [2..] [[p, p+p..] | p <- primes]
+              where
+                    (\\) = Data.List.Ordered.minus
+</haskell>
+Having (a chain of) direct-access mutable arrays indeed enables easy marking of these multiples as is usually done in imperative languages; but to get an [[#Tree merging with Wheel|efficient ''list''-based code]] we have to be smart about combining those streams of multiples of each prime - which gives us also the memory efficiency in generating the results incrementally, one by one.
+Short exposition is [[Prime numbers miscellaneous#A Tale of Sieves|here]].
 == Sieve of Eratosthenes ==
+Simplest, bounded, ''very'' inefficient formulation:
+<haskell>
+import Data.List (\\)  -- (\\) is set-difference for unordered lists
+primesTo m = sieve [2..m]
-The sieve of Eratosthenes calculates primes as natural numbers above 1 without the composites, where composites are found as sequences of multiples of each prime, generated by counting up from the prime's square in constant increments equal to that prime (or twice that much, for odd primes):
+             where
+             sieve (x:xs) = x : sieve (xs \\ [x,x+x..m])
+             sieve [] = []
+      -- or:
+           = ps
+             where
+             ps = map head $ takeWhile (not.null)
+                           $ scanl (\\) [2..m] [[p, p+p..m] | p <- ps]
+</haskell>
+The (unbounded) sieve of Eratosthenes calculates primes as ''integers above 1 that are not multiples of primes'', i.e. ''not composite'' &mdash; whereas composites are found as enumeration of multiples of each prime, generated by counting up from prime's square in constant increments equal to that prime (or twice that much, for odd primes). This is much more efficient and runs at about <i>n<sup>1.2</sup></i> empirical orders of growth (corresponding to <i>n (log n)<sup>2</sup> log log n</i> complexity, more or less, in ''n'' primes produced):
 <haskell>
+import Data.List.Ordered (minus, union, unionAll)
-primes = 2 : 3 : ([5,7..] `minus` unionAll [[p*p, p*p+2*p..] | p <- tail primes])
+primes = 2 : 3 : minus [5,7..] (unionAll [[p*p, p*p+2*p..] | p <- tail primes])
+{- Using `under n = takeWhile (<= n)`, with ordered increasing lists,
+   `minus`, `union` and `unionAll` satisfy, for any `n` and `m`:
+  under n (minus a b)         == nub . sort $ under n a \\ under n b
+  under n (union a b)         == nub . sort $ under n a ++ under n b
+  under n . unionAll . take m == under n . foldl union [] . take m
+  under n . unionAll          == nub . sort . concat
+                                     . takeWhile (not.null) . map (under n) -}
 </haskell>
+The definition is primed with 2 and 3 as initial primes, to avoid the vicious circle.
-<code>unionAll</code> can be defined as an ever-growing, right-deepening unbounded [[#Tree_merging|tree of comparisons]] like the <code>joinT</code> function below, or it can use a (fixed-sized) array as a segment-wise sorting and duplicates-removing device, to join together the streams of multiples of primes. The processing is naturally divided up into segments between the successive squares of primes.
+The ''"big union"'' <code>unionAll</code> function could be defined as the folding of <code>(\(x:xs) -> (x:) . union xs)</code>; or it could use a <code>Bool</code> array as a sorting and duplicates-removing device. The processing naturally divides up into the segments between successive squares of primes.
 Stepwise development follows (the fully developed version is [[#Tree merging with Wheel|here]]).
@@ Line 48: / Line 104: @@
 === Initial definition ===
-To start with, the sieve of Eratosthenes can be genuinely represented by
+First of all, working with ''ordered'' increasing lists, the sieve of Eratosthenes can be genuinely represented by
 <haskell>
--- genuine yet wasteful sieve of Eratosthenes
+ -- genuine yet wasteful sieve of Eratosthenes
-primesTo m = 2 : eratos [3,5..m]  where
+ -- primes = eratos [2.. ]               -- unbounded
-   eratos []     = []
+primesTo m = eratos [2..m]               -- bounded, up to m
+    where
-   eratos (p:xs) = p : eratos (xs `minus` [p, p+2*p..m])
+    eratos []     = []
--- eulers (p:xs) = p : eulers (xs `minus` map (p*)(p:xs))
--- turner (p:xs) = p : turner [x | x <- xs, rem x p /= 0]
+    eratos (p:xs) = p : eratos (xs `minus` [p, p+p..])
+ -- eratos (p:xs) = p : eratos (xs `minus` map (p*) [1..])
+ -- eulers (p:xs) = p : eulers (xs `minus` map (p*) (p:xs))
+ -- turner (p:xs) = p : turner [x | x <- xs, rem x p /= 0]
+ -- fix ( map head . scanl minus [2..] . map (\p-> [p, p+p..]) )
 </haskell>
-This should be regarded more like a ''specification'', not a code. It is extremely slow, running at empirical time complexities worse than quadratic in number of primes produced. But it has the core defining features of the classical formulation of S. of E. as '''''a.''''' being bounded, i.e. having a top limit value, and '''''b.''''' finding out the multiples of a prime directly, by counting up from it in increments equal to that prime (or twice that, for odd primes).
+This should be regarded more like a ''specification'', not a code. It runs at [https://en.wikipedia.org/wiki/Analysis_of_algorithms#Empirical_orders_of_growth empirical orders of growth] worse than quadratic in number of primes produced. But it has the core defining features of the classical formulation of S. of E. as '''''a.''''' being bounded, i.e. having a top limit value, and '''''b.''''' finding out the multiples of a prime directly, by counting up from it in constant increments, equal to that prime.
-The canonical list-difference <code>minus</code> and duplicates-removing list-union <code>union</code> functions dealing with ordered increasing lists - infinite as well as finite - are simple enough to define (cf. Leon P.Smith's [http://hackage.haskell.org/packages/archive/data-ordlist/latest/doc/html/Data-List-Ordered.html Data.List.Ordered] package):
+The canonical list-difference <code>minus</code> and duplicates-removing <code>union</code> functions (cf. [http://hackage.haskell.org/packages/archive/data-ordlist/latest/doc/html/Data-List-Ordered.html Data.List.Ordered]) are:
 <haskell>
--- ordered lists, difference and union
+ -- ordered lists, difference and union
 minus (x:xs) (y:ys) = case (compare x y) of
            LT -> x : minus  xs  (y:ys)
@@ Line 76: / Line 137: @@
 </haskell>
-The name ''merge'' ought to be reserved for duplicates-preserving merging operation of mergesort.
+The name ''merge'' ought to be reserved for duplicates-preserving merging operation of the merge sort.
 === Analysis ===
-So for each newly found prime the sieve removes its ''odd'' multiples from further consideration. It finds them by counting up in steps of ''2p''. There are thus <math>O(m/p)</math> multiples generated and eliminated for each prime, and <math>O(m \log \log(m))</math> multiples in total, with duplicates, by virtues of [http://en.wikipedia.org/wiki/Prime_harmonic_series prime harmonic series].
+So for each newly found prime ''p'' the sieve discards its multiples, enumerating them by counting up in steps of ''p''. There are thus <math>O(m/p)</math> multiples generated and eliminated for each prime, and <math>O(m \log \log(m))</math> multiples in total, with duplicates, by virtues of [http://en.wikipedia.org/wiki/Prime_harmonic_series prime harmonic series].
 If each multiple is dealt with in <math>O(1)</math> time, this will translate into <math>O(m \log \log(m))</math> RAM machine operations (since we consider addition as an atomic operation). Indeed mutable random-access arrays allow for that. But lists in Haskell are sequential-access, and complexity of <code>minus(a,b)</code> for lists is <math>\textstyle O(|a \cup b|)</math> instead of <math>\textstyle O(|b|)</math> of the direct access destructive array update. The lower the complexity of each ''minus'' step, the better the overall complexity.
-So on ''k''-th step the argument list <code>(p:xs)</code> that the <code>eratos</code> function gets starts with the ''(k+1)''-th prime, and consists of all the numbers &le; ''m'' coprime with all the primes &le; ''p''. According to the M. O'Neill's article (p.10) there are <math>\textstyle\Phi(m,p) \in \Theta(m/\log p)</math> such numbers.
+So on ''k''-th step the argument list <code>(p:xs)</code> that the <code>eratos</code> function gets, starts with the ''(k+1)''-th prime, and consists of all the numbers &le; ''m'' coprime with all the primes &le; ''p''. According to the M. O'Neill's article (p.10) there are <math>\textstyle\Phi(m,p) \in \Theta(m/\log p)</math> such numbers.
 It looks like <math>\textstyle\sum_{i=1}^{n}{1/log(p_i)} = O(n/\log n)</math> for our intents and purposes. Since the number of primes below ''m'' is <math>n = \pi(m) = O(m/\log(m))</math> by the prime number theorem (where <math>\pi(m)</math> is a prime counting function), there will be ''n'' multiples-removing steps in the algorithm; it means total complexity of at least <math>O(m n/\log(n)) = O(m^2/(\log(m))^2)</math>, or <math>O(n^2)</math> in ''n'' primes produced - much much worse than the optimal <math>O(n \log(n) \log\log(n))</math>.
@@ Line 90: / Line 151: @@
 === From Squares ===
-But we can start each step at a prime's square, as its smaller multiples will have been already produced on previous steps. This means we can stop early, when the prime's square reaches the top value ''m'', and thus ''cut the total number of steps'' to around <math>\textstyle n = \pi(m^{0.5}) = \Theta(2m^{0.5}/\log m)</math>. This does not in fact change the complexity of random-access code, but for lists it makes it <math>O(m^{1.5}/(\log m)^2)</math>, or <math>O(n^{1.5}/(\log n)^{0.5})</math> in ''n'' primes produced, a dramatic speedup:
+But we can start each elimination step at a prime's square, as its smaller multiples will have been already produced and discarded on previous steps, as multiples of smaller primes. This means we can stop early now, when the prime's square reaches the top value ''m'', and thus cut the total number of steps to around <math>\textstyle n = \pi(m^{0.5}) = \Theta(2m^{0.5}/\log m)</math>. This does not in fact change the complexity of random-access code, but for lists it makes it <math>O(m^{1.5}/(\log m)^2)</math>, or <math>O(n^{1.5}/(\log n)^{0.5})</math> in ''n'' primes produced, a dramatic speedup:
 <haskell>
-primesToQ m = 2 : sieve [3,5..m]  where
+primesToQ m = eratos [2..m]
+    where
-    sieve []     = []
-    sieve (p:xs) = p : sieve (xs `minus` [p*p, p*p+2*p..m])
+    eratos []     = []
+    eratos (p:xs) = p : eratos (xs `minus` [p*p, p*p+p..m])
+ -- eratos (p:xs) = p : eratos (xs `minus` map (p*) [p..div m p])
+ -- eulers (p:xs) = p : eulers (xs `minus` map (p*) (under (div m p) (p:xs)))
+ -- turner (p:xs) = p : turner [x | x<-xs, x<p*p || rem x p /= 0]
 </haskell>
-Its empirical complexity is about <math>O(n^{1.45})</math>. This simple optimization works here because our formulation is bounded, as is the original algorithm. To start late on a bounded sequence is to stop early (if we start past its end we don't need to start at all &ndash; ''see warning below''), thus preventing the creation of all the superfluous multiples streams which started above the upper bound anyway. This is acceptably slow now, striking a good balance between clarity, succinctness and efficiency.
+Its empirical complexity is around <math>O(n^{1.45})</math>. This simple optimization works here because this formulation is bounded (by an upper limit). To start late on a bounded sequence is to stop early (starting past end makes an empty sequence &ndash; ''see warning below''<sup><sub> 1</sub></sup>), thus preventing the creation of all the superfluous multiples streams which start above the upper bound anyway <small>(note that Turner's sieve is unaffected by this)</small>. This is acceptably slow now, striking a good balance between clarity, succinctness and efficiency.
-<small>''Warning'': this is predicated on a subtle point of  <code>minus xs [] = xs</code> definition being used, as it indeed should be. If the definition <code>minus (x:xs) [] = x:minus xs []</code> is used, the problem is back and the complexity is bad again.</small>
+<sup><sub>1</sub></sup><small>''Warning'': this is predicated on a subtle point of  <code>minus xs [] = xs</code> definition being used, as it indeed should be. If the definition <code>minus (x:xs) [] = x:minus xs []</code> is used, the problem is back and the complexity is bad again.</small>
 === Guarded ===
 This ought to be ''explicated'' (improving on clarity, though not on time complexity) as in the following, for which it is indeed a minor optimization whether to start from ''p'' or ''p*p'' - because it explicitly ''stops as soon as possible'':
 <haskell>
-primesToG m = 2 : sieve [3,5..m]  where
+primesToG m = 2 : sieve [3,5..m]
+    where
     sieve (p:xs)
        | p*p > m   = p : xs
        | otherwise = p : sieve (xs `minus` [p*p, p*p+2*p..])
+                  -- p : sieve (xs `minus` map (p*) [p,p+2..])
+                  -- p : eulers (xs `minus` map (p*) (p:xs))
 </haskell>
-It is now clear that it can't be made unbounded just by abolishing the upper bound ''m'', because the guard can not be simply omitted without changing the complexity back for the worst.
+(here we also flatly ignore all evens above 2 a priori.) It is now clear that it ''can't'' be made unbounded just by abolishing the upper bound ''m'', because the guard can not be simply omitted without changing the complexity back for the worst.
 === Accumulating Array ===
-So while <code>minus(a,b)</code> takes <math>O(|b|)</math> operations for random-access imperative arrays and about <math>O(|a|)</math> operations for lists here, using Haskell's immutable array for ''a'' one ''could'' expect the array update time to be nevertheless closer to <math>O(|b|)</math> if destructive update were used implicitly by compiler for an array being passed along as an accumulating parameter:
+So while <code>minus(a,b)</code> takes <math>O(|b|)</math> operations for random-access imperative arrays and about <math>O(|a|)</math> operations here for ordered increasing lists of numbers, when using Haskell's immutable array for ''a'' one ''could'' expect the array update time to be nevertheless closer to <math>O(|b|)</math> if destructive update was used implicitly by compiler for an array being passed along as an accumulating state parameter:
 <haskell>
 {-# OPTIONS_GHC -O2 #-}
@@ Line 120: / Line 188: @@
 primesToA m = sieve 3 (array (3,m) [(i,odd i) | i<-[3..m]]
                         :: UArray Int Bool)
-  where
+    where
     sieve p a
-      | p*p > m   = 2 : [i | (i,True) <- assocs a]
+       | p*p > m   = 2 : [i | (i,True) <- assocs a]
-      | a!p       = sieve (p+2) $ a//[(i,False) | i <- [p*p, p*p+2*p..m]]
+       | a!p       = sieve (p+2) $ a // [(i,False) | i <- [p*p, p*p+2*p..m]]
-      | otherwise = sieve (p+2) a
+       | otherwise = sieve (p+2) a
 </haskell>
-This indeed seems to be working for unboxed arrays, with the type signature added explicitly <small>(suggested by Daniel Fischer)</small>, the above code running relatively '''very fast''', with empirical complexity of about '''<math>O(n^{1.15..1.45})</math>''' in ''n'' primes produced (for producing from few hundred thousands to few millions primes, memory usage also slowly growing). But it relies on specific compiler optimizations, and indeed if we remove the explicit type signature, the code above turns ''very'' slow.
+Indeed for unboxed arrays <small>(suggested by Daniel Fischer; with regular, boxed arrays it is ''very'' slow)</small>, the above code runs pretty fast, but with empirical complexity of ''O(n<sup>1.15..1.45</sup>)'' in ''n'' primes produced <small>(for producing from few hundred thousands to few millions primes, memory usage also slowly growing)</small>. If the update for each index were an <i>O(1)</i> operation, the empirical complexity  would be seen as diminishing, as ''O(n<sup>1.15..1.05</sup>)'', reflecting the true, linearithmic complexity.
-How can we write code that we'd be certain about? One way is to use explicitly mutable monadic arrays ([[#Using Mutable Arrays|''see below'']]), but we can also think about it a little bit more on the functional side of things still.
+We could use explicitly mutable monadic arrays ([[#Using Mutable Arrays|''see below'']]) to remedy this, but we can also think about it a little bit more on the functional side of things still.
 === Postponed ===
-Going back to ''guarded'' Eratosthenes, first we notice that though it works with minimal number of prime multiples streams, it still starts working with each a bit prematurely. Fixing this with explicit synchronization won't change complexity but will speed it up some more:
+Going back to ''guarded'' Eratosthenes, first we notice that though it works with minimal number of prime multiples streams, it still starts working with each prematurely. Fixing this with explicit synchronization won't change complexity but will speed it up some more:
 <haskell>
+primesPE1 = 2 : sieve [3..] primesPE1
-primesPE = 2 : primes'
-  where
+    where
+    sieve xs (p:pt) | q <- p*p , (h,t) <- span (< q) xs =
-    primes' = sieve [3,5..] 9 primes'
-    sieve (x:xs) q ps@ ~(p:t)
+                   h ++ sieve (t `minus` [q, q+p..]) pt
-      | x < q     = x : sieve xs q ps
+                -- h ++ turner [x | x<-t, rem x p>0] pt
-      | otherwise =     sieve (xs `minus` [q, q+2*p..]) (head t^2) t
 </haskell>
+<!-- primesPE1 = 2 : sieve [3,5..] 9 (tail primesPE1)
-Since the removal of a prime's multiples here starts at the right moment, and not just from the right place, the code could '''''now''''' finally be made unbounded. Because no multiples-removal process is started ''prematurely'', there are no ''extraneous'' multiples streams, which were the reason for the extreme wastefulness and thus inefficiency of the original formulation.
+    where
+    sieve xs q ~(p:pt) | (h,t) <- span (< q) xs =
+                   h ++ sieve (t `minus` [q, q+2*p..]) (head pt^2) pt
+                -- h ++ turner [x | x<-t, rem x p>0] ...
+-->
+Inlining and fusing <code>span</code> and <code>(++)</code> we get:
+<!--
+primesPE = 2 : ops
+    where
+    ops = sieve [3,5..] 9 ops                             -- odd primes
+    sieve (x:xs) q ps@ ~(p:pt)
+       | x < q     = x : sieve xs q ps
+       | otherwise =     sieve (xs `minus` [q, q+2*p..]) (head pt^2) pt -->
+<haskell>
+primesPE = 2 : sieve [3..] [[p*p, p*p+p..] | p <- primesPE]
+               where
+               sieve (x:xs) t@((q:cs):r)
+                    | x < q = x : sieve xs t
+                    | otherwise = sieve (minus xs cs) r
+</haskell>
+Since the removal of a prime's multiples here starts at the right moment, and not just from the right place, the code could now finally be made unbounded. Because no multiples-removal process is started ''prematurely'', there are no ''extraneous'' multiples streams, which were the reason for the original formulation's extreme inefficiency.
 === Segmented ===
@@ Line 147: / Line 235: @@
 <haskell>
-primesSE = 2 : primes'
+primesSE = 2 : ops
-  where
+    where
-    primes' = sieve 3 9 primes' []
+    ops = sieve 3 9 ops []                                -- odd primes
-    sieve x q ~(p:t) fs =
+    sieve x q ~(p:pt) fs =
-        foldr (flip minus) [x,x+2..q-2]
+        foldr (flip minus) [x,x+2..q-2]                   -- chain of subtractions
-                           [[y+s, y+2*s..q] | (s,y) <- fs]
+                           [[y+s, y+2*s..q] | (s,y) <- fs]     -- OR,
+        -- [x,x+2..q-2] `minus` foldl union []            -- subtraction of merged
-        ++ sieve (q+2) (head t^2) t
+        --                    [[y+s, y+2*s..q] | (s,y) <- fs]            --  lists
+        ++ sieve (q+2) (head pt^2) pt
                 ((2*p,q):[(s,q-rem (q-y) s) | (s,y) <- fs])
 </haskell>
@@ Line 165: / Line 255: @@
 <haskell>
-primesSTE = 2 : primes'
+primesSTE = 2 : ops
-  where
+    where
-    primes' = sieve 3 9 primes' []
+    ops = sieve 3 9 ops []                                -- odd primes
-    sieve x q ~(p:t) fs =
+    sieve x q ~(p:pt) fs =
         ([x,x+2..q-2] `minus` joinST [[y+s, y+2*s..q] | (s,y) <- fs])
-        ++ sieve (q+2) (head t^2) t
+        ++ sieve (q+2) (head pt^2) pt
                    ((++ [(2*p,q)]) [(s,q-rem (q-y) s) | (s,y) <- fs])
-joinST (x:xs) = union x (joinST (pairs xs)) where
+joinST (xs:t) = (union xs . joinST . pairs) t
+    where
-    pairs (x:y:xs) = union x y : pairs xs
-    pairs xs       = xs
+    pairs (xs:ys:t) = union xs ys : pairs t
+    pairs t         = t
 joinST []     = []
 </haskell>
-=== Linear merging ===
+====Segmented merging via an array====
-But segmentation doesn't add anything substantially, and each multiples stream starts at its prime's square anyway. What does the [[#Postponed|Postponed]] code do, operationally? For each prime's square passed over, it builds up a nested linear ''left-deepening'' structure, '''''(...((xs-a)-b)-...)''''', where '''''xs''''' is the original odds-producing ''[3,5..]'' list, so that each odd it produces must go through more and more <code>`minus`</code> nodes on its way up - and those odd numbers that eventually emerge on top are prime. Thinking a bit about it, wouldn't another, ''right-deepening'' structure, '''''(xs-(a+(b+...)))''''', be better? This idea is due to Richard Bird (in the code presented in Melissa O'Neill's article).
+The removal of composites is easy with arrays. Starting points can be calculated directly:
-Here, ''xs'' would stay near the top, and ''more frequently'' odds-producing streams of multiples of ''smaller'' primes would be above those of the ''bigger'' primes, that produce ''less frequently'' their candidates which have to pass through ''more'' <code>`union`</code> nodes on their way up. Plus, no explicit synchronization is necessary anymore because the produced multiples of a prime start at its square anyway - just some care has to be taken to avoid a runaway access to the infinitely-defined structure (specifically, if each <code>(+)</code> operation passes along unconditionally its left child's head value ''before'' polling the right child's head value, then we are safe).
-Here's the code, faster yet but still with same time complexity of about <math>O(n^{1.4})</math>:
 <haskell>
+import Data.List (inits, tails)
-primesLME = 2 : ([3,5..] `minus` joinL [[p*p, p*p+2*p..] | p <- primes'])
+import Data.Array.Unboxed
+primesSAE = 2 : sieve 2 4 (tail primesSAE) (inits primesSAE)
+         -- (2:) . (sieve 2 4 . tail <*> inits) $ primesSAE
   where
+  sieve r q ps (fs:ft) = [n | (n,True) <- assocs (
-    primes' = 3 : ([5,7..] `minus` joinL [[p*p, p*p+2*p..] | p <- primes'])
+         accumArray (\ _ _ -> False) True (r+1,q-1)
+                    [(m,()) | p <- fs, let s = p * div (r+p) p,
+                              m <- [s,s+p..q-1]] :: UArray Int Bool )]
+      ++ sieve q (head ps^2) (tail ps) ft
+</haskell>
+The pattern of iterated calls to <code>tail</code> is captured by a higher-order function <code>tails</code>, which explicitly generates the stream of tails of a stream, making for a bit more readable (even if possibly a bit less efficient) code:
+<haskell>
+psSAGE = 2 : [n | (r:q:_, fs) <- (zip . tails . (2:) . map (^2) <*> inits) psSAGE,
+                  (n,True)    <- assocs (
+                                   accumArray (\_ _ -> False) True (r+1, q-1)
+                        [(m,()) | p <- fs, let s = (r+p)`div`p*p,
+                                  m <- [s,s+p..q-1]] :: UArray Int Bool )]
+</haskell>
+=== Linear merging ===
+But segmentation doesn't add anything substantially, and each multiples stream starts at its prime's square anyway. What does the [[#Postponed|Postponed]] code do, operationally? With each prime's square passed by, there emerges a  nested linear ''left-deepening'' structure, '''''(...((xs-a)-b)-...)''''', where '''''xs''''' is the original odds-producing ''[3,5..]'' list, so that each odd it produces must go through more and more <code>minus</code> nodes on its way up - and those odd numbers that eventually emerge on top are prime. Thinking a bit about it, wouldn't another, ''right-deepening'' structure, '''''(xs-(a+(b+...)))''''', be better? This idea is due to Richard Bird, seen in his code presented in M. O'Neill's article, equivalent to:
+<haskell>
+primesB = 2 : minus [3..] (foldr (\p r-> p*p : union [p*p+p, p*p+2*p..] r)
+                                 [] primesB)
+</haskell>
+or,
+<haskell>
+primesLME1 = 2 : prs
+    where
+    prs = 3 : minus [5,7..] (joinL [[p*p, p*p+2*p..] | p <- prs])
 joinL ((x:xs):t) = x : union xs (joinL t)
 </haskell>
+Here, ''xs'' stays near the top, and ''more frequently'' odds-producing streams of multiples of smaller primes are ''above'' those of the bigger primes, that produce ''less frequently'' their multiples which have to pass through ''more'' <code>union</code> nodes on their way up. Plus, no explicit synchronization is necessary anymore because the produced multiples of a prime start at its square anyway - just some care has to be taken to avoid a runaway access to the indefinitely-defined structure, defining <code>joinL</code> (or <code>foldr</code>'s combining function) to produce part of its result ''before'' accessing the rest of its input (thus making it ''productive'').
-The double primes feed is introduced here to prevent unneeded memoization and thus prevent memory leak, as per Melissa O'Neill's code.
+Melissa O'Neill [http://hackage.haskell.org/packages/archive/NumberSieves/0.0/doc/html/src/NumberTheory-Sieve-ONeill.html introduced double primes feed] to prevent unneeded memoization (a memory leak). We can even do multistage. Here's the code, faster still and with radically reduced memory consumption, with empirical orders of growth of around ~ <math>n^{1.40}</math> (initially better, yet worsening for bigger ranges):
-=== Tree merging ===
-Moreover, it can be changed into a '''''tree''''' structure. This idea is due to Dave Bayer on haskell-cafe mailing list (though in much more complex formulation):
 <haskell>
-primesTME = 2 : (gaps 3 $ joinT [[p*p, p*p+2*p..] | p <- primes'])
+primesLME = 2 : _Y ((3:) . minus [5,7..] . joinL . map (\p-> [p*p, p*p+2*p..]))
-  where
+_Y   :: (t -> t) -> t
-    primes' = 3 : (gaps 5 $ joinT [[p*p, p*p+2*p..] | p <- primes'])
+_Y g = g (_Y g)                -- multistage, non-sharing,  g (g (g (g ...)))
+    -- g (let x = g x in x)    -- two g stages, sharing
-joinT ((x:xs):t) = x : union xs (joinT (pairs t))
-pairs ((x:xs):ys:t) = (x : union xs ys) : pairs t
-gaps k s@(x:xs) | k<x  = k:gaps (k+2) s    -- equivalent to
-                | True =   gaps (k+2) xs   --  [k,k+2..]`minus`s, k<=head s
 </haskell>
+<code>_Y</code> is a non-sharing fixpoint combinator, here arranging for a recursive ''"telescoping"'' multistage primes production (a ''tower'' of producers).
-It is [http://ideone.com/p0e81 very fast], running at speeds and empirical complexities comparable with the code from Melissa O'Neill's article (about <math>O(n^{1.2})</math> in number of primes ''n'' produced).
+This allows the <code>primesLME</code> stream to be discarded immediately as it is being consumed by its consumer. For <code>prs</code> from <code>primesLME1</code> definition above it is impossible, as each produced element of <code>prs</code> is needed later as input to the same <code>prs</code> corecursive stream definition. So the <code>prs</code> stream feeds in a loop into itself and is thus retained in memory, being consumed by self much slower than it is produced. With multistage production, each stage feeds into its consumer above it at the square of its current element which can be immediately discarded after it's been consumed. <code>(3:)</code> jump-starts the whole thing.
-As an aside, <code>joinT</code> is equivalent to calling [[Fold#Tree-like_folds|infinite tree-like folding]] <code>foldi (\(x:xs) ys->x:union xs ys) []</code>.
+=== Tree merging ===
-The above can be [http://ideone.com/qpnqe rewritten] as follows, using a standard (corecursive) definition for <code>fix</code> combinator from <code>Data.Function</code> to arrange for double primes feed automatically:
+Moreover, it can be changed into a '''''tree''''' structure. This idea [http://www.haskell.org/pipermail/haskell-cafe/2007-July/029391.html is due to Dave Bayer] and  [[Prime_numbers_miscellaneous#Implicit_Heap|Heinrich Apfelmus]]:
-<haskell>import Data.Function (fix)
+<haskell>
+primesTME = 2 : _Y ((3:) . gaps 5 . joinT . map (\p-> [p*p, p*p+2*p..]))
-primesTMEf = 2 : g (fix g)
+-- joinL ((x:xs):t) = x : union xs (joinL t)
+joinT ((x:xs):t) = x : union xs (joinT (pairs t))    -- set union, ~=
-  where
-    g xs = 3 : gaps 5 (joinT [[x*x, x*x+2*x..] | x <- xs])
+  where  pairs (xs:ys:t) = union xs ys : pairs t     --    nub.sort.concat
+gaps k s@(x:xs) | k < x = k:gaps (k+2) s       -- ~= [k,k+2..]\\s,
--- fix g = xs where xs = g xs
+                | True  =   gaps (k+2) xs      --   when null(s\\[k,k+2..])
 </haskell>
+This code is [http://ideone.com/p0e81 pretty fast], running at speeds and empirical complexities comparable with the code from Melissa O'Neill's article (about <math>O(n^{1.2})</math> in number of primes ''n'' produced).
+As an aside, <code>joinT</code> is equivalent to [[Fold#Tree-like_folds|infinite tree-like folding]] <code>foldi (\(x:xs) ys-> x:union xs ys) []</code>:
+[[Image:Tree-like_folding.gif|frameless|center|458px|tree-like folding]]
+[https://hackage.haskell.org/package/data-ordlist-0.4.7.0/docs/Data-List-Ordered.html#v:foldt <code>Data.List.Ordered.foldt</code>] of the  <i>data-ordlist</i> package builds the same structure, but in a lazier fashion, consuming its input at the slowest pace possible. Here this sophistication is not needed (evidently).
 === Tree merging with Wheel ===
-Wheel factorization optimization can be further applied, and another tree structure can be used which is better adjusted for the primes multiples production (effecting about 5%-10% at the top of a total ''2.5x speedup'' w.r.t. the above tree merging on odds only - though complexity stays roughly the same):
+Wheel factorization optimization can be further applied, and another tree structure can be used which is better adjusted for the primes multiples production (effecting about 5%-10% at the top of a total ''2.5x speedup'' w.r.t. the above tree merging on odds only, for first few million primes):
 <haskell>
+primesTMWE = [2,3,5,7] ++ _Y ((11:) . tail  . gapsW 11 wheel
-{-# OPTIONS_GHC -O2 -fno-cse #-}
+                                    . joinT . hitsW 11 wheel)
-primesTMWE = 2:3:5:7: gapsW 11 wheel (joinT3 $ rollW 11 wheel primes')
-  where
-    primes' = 11: gapsW 13 (tail wheel) (joinT3 $ rollW 11 wheel primes')
-gapsW k ws@(w:t) cs@(c:u) | k==c  = gapsW (k+w) t u
+gapsW k (d:w) s@(c:cs) | k < c     = k : gapsW (k+d) w s    -- set difference
-                          | True  = k : gapsW (k+w) t cs
+                       | otherwise =     gapsW (k+d) w cs   --   k==c
-rollW k ws@(w:t) ps@(p:u) | k==p  = scanl (\c d->c+p*d) (p*p) ws
+hitsW k (d:w) s@(p:ps) | k < p     =     hitsW (k+d) w s    -- intersection
-                                      : rollW (k+w) t u
+                       | otherwise = scanl (\c d->c+p*d) (p*p) (d:w)
-                          | True  = rollW (k+w) t ps
+                                       : hitsW (k+d) w ps   --   k==p
-joinT3 ((x:xs): ~(ys:zs:t)) = x : union xs (union ys zs)
-                                   `union` joinT3 (pairs t)
 wheel = 2:4:2:4:6:2:6:4:2:4:6:6:2:6:4:2:6:4:6:8:4:2:4:2:
 :8:6:4:6:2:4:6:2:6:6:4:2:4:6:2:6:4:2:4:2:10:2:10:wheel
+  -- cycle $ zipWith (-) =<< tail $ [i | i <- [11..11+_210], gcd i _210 == 1]
+  --   where _210 = product [2,3,5,7]
 </haskell>
+The <code>hitsW</code> function is there to find the starting point for rolling the wheel for each prime, but this can be found directly:
-The compiler switch <code>-fno-cse</code> is used here to prevent space leak.
+<haskell>
+primesW = [2,3,5,7] ++ _Y ( (11:) . tail . gapsW 11 wheel . joinT .
+                            map (\p ->
+                              map (p*) . dropWhile (< p) $
+                                scanl (+) (p - rem (p-11) 210) wheel) )
+</haskell>
+Seems to run about 1.4x faster, too.
 ====Above Limit - Offset Sieve====
-Another task is to produce primes above a given value (i.e. not by their ordinal numbers):
+Another task is to produce primes above a given value:
 <haskell>
 {-# OPTIONS_GHC -O2 -fno-cse #-}
 primesFromTMWE primes m = dropWhile (< m) [2,3,5,7,11]
-                          ++ gapsW a wh' (compositesFrom a)
+                          ++ gapsW a wh2 (compositesFrom a)
-  where
+    where
-    (a,wh') = rollFrom (snapUp (max 3 m) 3 2)
+    (a,wh2) = rollFrom (snapUp (max 3 m) 3 2)
-    (h,p':t) = span (< z) $ drop 4 primes           -- p < z => p*p<=a
+    (h,p2:t) = span (< z) $ drop 4 primes           -- p < z => p*p<=a
-    z = ceiling $ sqrt $ fromIntegral a + 1         -- p'>=z => p'*p'>a
+    z = ceiling $ sqrt $ fromIntegral a + 1         -- p2>=z => p2*p2>a
-    compositesFrom a = joinT (joinST [multsOf p a | p <- h++[p']]
+    compositesFrom a = joinT (joinST [multsOf p  a    | p <- h ++ [p2]]
-                              : [multsOf p (p*p) | p <- t])
+                                   : [multsOf p (p*p) | p <- t] )
-snapUp v o step = v + ((o-v) `mod` step)            -- full steps from o
+snapUp v o step = v + (mod (o-v) step)              -- full steps from o
 multsOf p from = scanl (\c d->c+p*d) (p*x) wh       -- map (p*) $
-  where                                             --   scanl (+) x wh
+    where                                           --   scanl (+) x wh
     (x,wh) = rollFrom (snapUp from p (2*p) `div` p) --   , if p < from
-wheelNums = scanl (+) 0 wheel
+wheelNums  = scanl (+) 0 wheel
 rollFrom n = go wheelNums wheel
+    where
-  where m = (n-11) `mod` 210
-        go (x:xs) ws@(w:ws') | x < m = go xs ws'
+    m = (n-11) `mod` 210
-                             | True  = (n+x-m, ws)  -- (x >= m)
+    go (x:xs) ws@(w:ws2) | x < m = go xs ws2
+                         | True  = (n+x-m, ws)      -- (x >= m)
 </haskell>
@@ Line 275: / Line 411: @@
 <haskell>
 primesFrom m = filter isPrime [m..]
+</haskell>
+=== Map-based ===
+Runs ~1.7x slower than [[#Tree_merging|TME version]], but with the same empirical time complexity, ~<math>n^{1.2}</math> (in ''n'' primes produced) and  same very low (near constant) memory consumption:
+<haskell>
+import Data.List                 -- based on
+import qualified Data.Map as M   --   http://stackoverflow.com/a/1140100
+primesMPE :: [Integer]
+primesMPE = 2 : mkPrimes 3 M.empty prs 9   -- postponed sieve enlargement
+    where                                 -- by decoupled primes feed loop
+    prs = 3 : mkPrimes 5 M.empty prs 9
+    mkPrimes n m ps@ ~(p:pt) q = case (M.null m, M.findMin m) of
+      { (False, (n2, skips)) | n == n2 ->
+             mkPrimes (n+2) (addSkips n (M.deleteMin m) skips) ps q
+      ; _ -> if n < q
+             then    n : mkPrimes (n+2)  m                  ps q
+             else        mkPrimes (n+2) (addSkip n m (2*p)) pt (head pt^2)
+      }
+    addSkip n m s = M.alter (Just . maybe [s] (s:)) (n+s) m
+    addSkips = foldl' . addSkip
 </haskell>
 == Turner's sieve - Trial division ==
-David Turner's original 1975 formulation ''(SASL Language Manual, 1975)'' replaces non-standard ''`minus`'' in the sieve of Eratosthenes by stock list comprehension with ''rem'' filtering, turning it into a kind of trial division algorithm:
+David Turner's ''(SASL Language Manual, 1983)'' formulation replaces non-standard <code>minus</code> in the sieve of Eratosthenes by stock list comprehension with <code>rem</code> filtering, turning it into a trial division algorithm, for clarity and simplicity:
 <haskell>
--- unbounded sieve, premature filters
+  -- unbounded sieve, premature filters
-primesT = 2 : sieve [3,5..]  where
+primesT = sieve [2..]
+          where
-    sieve (p:xs) = p : sieve [x | x <- xs, rem x p /= 0]
-                        -- filter ((/=0).(`rem`p)) xs
+          sieve (p:xs) = p : sieve [x | x <- xs, rem x p > 0]
+  -- map head
+  --    $ iterate (\(p:xs) -> [x | x <- xs, rem x p > 0]) [2..]
 </haskell>
-This creates an immense number of superfluous implicit filters in extremely premature fashion. To be admitted as prime, ''each number'' will be ''tested for divisibility'' here by all its preceding primes potentially, while just those not greater than its square root would suffice. To find e.g. the '''1001'''st prime (<code>7927</code>), '''1000''' filters are used, when in fact just the first '''24''' are needed (up to <code>89</code>'s filter only). Operational overhead is enormous here.
+This creates many superfluous implicit filters, because they are created prematurely. To be admitted as prime, ''each number'' will be ''tested for divisibility'' here by all its preceding primes, while just those not greater than its square root would suffice. To find e.g. the '''1001'''st prime (<code>7927</code>), '''1000''' filters are used, when in fact just the first '''24''' are needed (up to <code>89</code>'s filter only). Operational overhead here is huge; theoretically, it has ''quadratic'' time complexity, in the number of produced primes.
 === Guarded Filters ===
-But this really ought to be changed into bounded and guarded variant, again achieving the ''"miraculous"'' complexity improvement from above quadratic to about <math>O(n^{1.45})</math> empirically (in ''n'' primes produced):
+But this really ought to be changed into the ''abortive'' variant, [[#From Squares|again achieving]] the ''"miraculous"'' complexity improvement from above quadratic to about <math>O(n^{1.45})</math> ''empirically'' (in ''n'' primes produced) by stopping the sieving as soon as possible:
 <haskell>
-primesToGT m = 2 : sieve [3,5..m]  where
+primesToGT m = sieve [2..m]
+    where
     sieve (p:xs)
-       | p*p > m = p : xs
+        | p*p > m = p : xs
-       | True    = p : sieve [x | x <- xs, rem x p /= 0]
+        | True    = p : sieve [x | x <- xs, rem x p > 0]
+  -- (\(a,b:_) -> map head a ++ b) . span ((< m).(^2).head) $
+  --     iterate (\(p:xs) -> [x | x <- xs, rem x p > 0]) [2..m]
 </haskell>
 === Postponed Filters ===
-or as unbounded, postponed filters-creation variant:
+Or it can remain unbounded, just filters creation must be ''postponed'' until the right moment:
 <haskell>
+primesPT1 = 2 : sieve primesPT1 [3..]
-primesPT = 2 : primes'
-  where
+    where
+    sieve (p:pt) xs = let (h,t) = span (< p*p) xs
-    primes' = sieve [3,5..] 9 primes'
-    sieve (x:xs) q ps@ ~(p:t)
+                      in h ++ sieve pt [x | x <- t, rem x p > 0]
-       | x < q   = x : sieve xs q ps
+  -- fix $ concatMap (fst . fst)
-       | True    =     sieve [x | x <- xs, rem x p /= 0] (head t^2) t
+  --     . iterate (\((_,xs), p:pt) -> let (h,t) = span (< p*p) xs in
+  --                              ((h, [x | x <- t, rem x p > 0]), pt))
+  --     . (,) ([2],[3..])
+</haskell>
+It can be re-written with <code>span</code> and <code>(++)</code> inlined and fused into the <code>sieve</code>:
+<haskell>
+primesPT = 2 : oddprimes
+    where
+    oddprimes = sieve [3,5..] 9 oddprimes
+    sieve (x:xs) q ps@ ~(p:pt)
+        | x < q = x : sieve xs q ps
+        | True  =     sieve [x | x <- xs, rem x p /= 0] (head pt^2) pt
+</haskell>
+creating here [[#Linear merging |as well]] the linear filtering nested structure at run-time, <code>(...(([3,5..] >>= filterBy [3]) >>= filterBy [5])...)</code>, but unlike the non-postponed code each filter being created at its proper moment, not sooner than the prime's square is seen.
+<haskell>
+filterBy ds n = [n | noDivs n ds]    -- `ds` assumed to be non-decreasing
+noDivs n ds = foldr (\d r -> d*d > n || (rem n d > 0 && r)) True ds
 </haskell>
-creating here [[#Linear merging |as well]] the linear nested structure at run-time, <code>(...(([3,5..] >>= filterBy [3]) >>= filterBy [5])...)</code>, where <code>filterBy ds n = [n | noDivs ds n]</code> (see <code>noDivs</code> definition below) &thinsp;&ndash;&thinsp; but unlike the original code, each filter being created at its proper moment, not sooner than the prime's square is seen.
 === Optimal trial division ===
-The above is equivalent to the traditional formulation of trial division,
+The above is algorithmically equivalent to the traditional formulation of trial division,
 <haskell>
+ps = 2 : [i | i <- [3..],
-noDivs factors n = foldr (\f r -> f*f > n || (rem n f /= 0 && r))
-                         True factors
+              and [rem i p > 0 | p <- takeWhile (\p -> p^2 <= i) ps]]
+</haskell>
--- primes = filter (noDivs [2..]) [2..]
+or,
-primesTD = 2 : 3 : filter (noDivs $ tail primesTD) [5,7..]
+<haskell>
-isPrime n = n > 1 && noDivs primesTD n
+-- primes = filter (`noDivs`[2..]) [2..]
+--        = 2 : filter (`noDivs`[3,5..]) [3,5..]
+primesTD  = 2 : 3 : filter (`noDivs` tail primesTD) [5,7..]
+isPrime n = n > 1 && noDivs n primesTD
 </haskell>
 except that this code is rechecking for each candidate number which primes to use, whereas for every candidate number in each segment between the successive squares of primes these will just be the same prefix of the primes list being built.
+Trial division is used as a simple [[Testing primality#Primality Test and Integer Factorization|primality test and prime factorization algorithm]].
 === Segmented Generate and Test ===
+Next we turn [[#Postponed Filters |the list of filters]] into one filter by an ''explicit'' list, each one in a progression of prefixes of the primes list. This seems to eliminate most recalculations, explicitly filtering composites out from batches of odds between the consecutive squares of primes.
-This ''primes prefix's length'' can be explicitly maintained, achieving a certain further speedup (though not in complexity which stays the same) by turning a list of filters into one filter by an explicit list of primes:
 <haskell>
+import Data.List
-primesST = 2 : primes'
-  where
+primesST = 2 : ops
-    primes' = sieve 3 9 primes' 0
+    where
-    sieve x q ~(_:t) k = let fs = take k primes' in
-      filter ((`all` fs) . ((/=0).) . rem) [x,x+2..q-2]
+    ops = sieve 3 9 ops (inits ops)               -- odd primes
+        -- (sieve 3 9 <*> inits) ops              -- inits: [],[3],[3,5],...
-      ++ sieve (q+2) (head t^2) t (k+1)
+    sieve x q ~(_:pt) (fs:ft) =
+        filter ((`all` fs) . ((> 0).) . rem) [x,x+2..q-2]
+        ++ sieve (q+2) (head pt^2) pt ft
+</haskell>
+This can also be coded as, arguably more readable,
+<haskell>
+ primesSGT =
+    {- 2 : [ n | (px, r:q:_) <- zip (inits primesSGT)
+                                    (tails $ 2 : map (^2) primesSGT),
+                 n <- [r+1..q-1],  all ((> 0) . rem n) px ]
+     -}
+: ops
+    where
+    ops = 3 : [n | (px, r:q:_) <- zip (inits ops) (tails $ 3 : map (^2) ops),
+                   n <- [r+2,r+4..q-2],  all ((> 0) . rem n) px]
+                -- n <- foldl (>>=) [r+2,r+4..q-2]           -- chain of filters
+                --                   [filterBy [p] | p <- px]]        -- OR,
+                -- n <- [r+2,r+4..q-2] >>= filterBy px]      -- a filter by a list
 </haskell>
-This seems to eliminate most recalculations, explicitly filtering composites out from batches of odds between the consecutive squares of primes.
 ==== Generate and Test Above Limit ====
-The following will start the segmented Turner sieve at the right place, using any primes list it's supplied with (e.g. [[Prime_numbers#Tree_merging_with_Wheel | TMWE]] etc.), demand computing it up to the square root of any prime it'll produce:
+The following will start the segmented Turner sieve at the right place, using any primes list it's supplied with (e.g. [[#Tree_merging_with_Wheel | TMWE]] etc.) or itself, as shown, demand computing it just up to the square root of any prime it'll produce:
 <haskell>
-primesFromST primes m
+primesFromST m | m <= 2 = 2 : primesFromST 3
+primesFromST m | m > 2 =
-   | m>2 = sieve (m`div`2*2+1) (head ps^2) (tail ps) (length h-1)
+         sieve (m`div`2*2+1) (head ps^2) (tail ps) (inits ps)
-  where
+    where
-    (h,ps) = span (<= (floor.sqrt $ fromIntegral m+1)) primes
-    sieve x q ps k = let fs = take k $ tail primes in
+    (h,ps) = span (<= (floor.sqrt $ fromIntegral m+1)) ops
-      filter ((`all` fs) . ((/=0).) . rem) [x,x+2..q-2]
+    sieve x q ps (fs:ft) =
-      ++ sieve (q+2) (head ps^2) (tail ps) (k+1)
+        filter ((`all` (h ++ fs)) . ((> 0) .) . rem) [x,x+2..q-2]
+        ++ sieve (q+2) (head ps^2) (tail ps) ft
+    ops = 3 : primesFromST 5                      -- odd primes
+-- ~> take 3 $ primesFromST 100000001234
+-- [100000001237,100000001239,100000001249]
 </haskell>
-This is usually faster than testing candidate numbers for divisibility [[#Optimal trial division|one by one]] which has to re-fetch anew the needed prime factors to test by, for each candidate. Faster is the [[99_questions/Solutions/39#Solution_4.|offset sieve of Eratosthenes on odds]], and yet faster the above one, [[Prime_numbers#Above_Limit_-_Offset_Sieve|w/ wheel optimization]].
+This is usually faster than testing candidate numbers for divisibility [[#Optimal trial division|one by one]] which has to re-fetch anew the needed prime factors to test by, for each candidate. Faster is the [[99_questions/Solutions/39#Solution_4.|offset sieve of Eratosthenes on odds]], and yet faster the one [[#Above_Limit_-_Offset_Sieve|w/ wheel optimization]], on this page.
 === Conclusions ===
-All these variants of course being variations of trial division &ndash; finding out primes by direct divisibility testing of every odd number by sequential primes below its square root (instead of just by ''its factors'', which is what ''direct generation of multiples'' is doing, essentially) &ndash; are thus of principally worse complexities than that of Sieve of Eratosthenes.
+All these variants being variations of trial division, finding out primes by direct divisibility testing of every candidate number by sequential primes below its square root (instead of just by ''its factors'', which is what ''direct generation of multiples'' is doing, essentially), are thus principally of worse complexity than that of Sieve of Eratosthenes.
+The initial code is just a one-liner that ought to have been regarded as ''executable specification'' in the first place. It can easily be improved quite significantly with a simple use of bounded, guarded formulation to limit the number of filters it creates, or by postponement of filter creation.
-The prototype code is very inefficient, yet improving significantly with just a simple use of bounded, guarded formulation. So, there's no harm in applying simple improvements to a code instead of just labeling it "naive". BTW were divisibility testing somehow turned into an <math>O(1)</math> operation, e.g. by some kind of massive parallelization, the overall complexity of trial division sieve would drop to <math>O(n)</math>. It's the sequentiality of testing that's the culprit.
-Did Eratosthenes himself achieve the optimal complexity? It rather seems doubtful, as he probably counted boxes in the table by ''1'' to go from one number to the next, as in '''3''',''5'',''7'','''9''',''11'',''13'','''15''', ... for he had no access even to Hindu numerals, using Greek alphabet for writing down numbers instead. Was he performing a ''genuine'' sieve of Eratosthenes then? Should faithfulness of an algorithm's implementation be judged by its performance?
-So the initial Turner code is just a one-liner that ought to have been regarded as ''executable specification'' in the first place. The reason it was taught that way is probably so that it could provoke discussion among students, at the very least discovering the optimization technique of the postponement of filter-creation.
 == Euler's Sieve ==
@@ Line 366: / Line 575: @@
 <haskell>
-primesEU = 2 : euler [3,5..] where
+primesEU = 2 : eulers [3,5..]
+  where
-    euler (p:xs) = p : euler (xs `minus` map (p*) (p:xs))
+    eulers (p:xs) = p : eulers (xs `minus` map (p*) (p:xs))
+ -- eratos (p:xs) = p : eratos (xs `minus` [p*p, p*p+2*p..])
 </haskell>
-This code is extremely inefficient, running above <math>O({n^{2}})</math> complexity (and worsening rapidly), and should be regarded a ''specification'' only. Its memory usage is very high, with space complexity just below <math>O({n^{2}})</math>, in ''n'' primes produced.
+This code is extremely inefficient, running above <math>O({n^{2}})</math> empirical complexity (and worsening rapidly), and should be regarded a ''specification'' only. Its memory usage is very high, with empirical space complexity just below <math>O({n^{2}})</math>, in ''n'' primes produced.
+In the stream-based sieve of Eratosthenes we are able to ''skip'' along the input stream <code>xs</code> directly to the prime's square, consuming the whole prefix at once, thus achieving the results equivalent to the postponement technique, because the generation of the prime's multiples is independent of the rest of the stream.
+But here in the Euler's sieve it ''is'' dependent on all <code>xs</code> and we're unable ''in principle'' to skip along it to the prime's square - because all <code>xs</code> are needed for each prime's multiples generation. Thus efficient unbounded stream-based implementation seems to be impossible in principle, under the simple scheme of producing the multiples by multiplication.
 === Wheeled list representation ===
@@ Line 422: / Line 637: @@
 === Generating Segments of Primes ===
-The sieve of Eratosthenes'es [[#Segmented|removal of multiples on each segment of odds]] can be done by actually marking them in an array instead of manipulating the ordered lists, and can be further sped up more than twice by working with odds only, represented as their offsets in segment arrays:
+The sieve of Eratosthenes' [[#Segmented|removal of multiples on each segment of odds]] can be done by actually marking them in an array, instead of manipulating ordered lists, and can be further sped up more than twice by working with odds only:
 <haskell>
-import Data.Array
+import Data.Array.Unboxed
--- import Data.Array.Unboxed
-primesSA = 2 : prs
+primesSA :: [Int]
+primesSA = 2 : oddprimes ()
-  where
+  where
-    prs = 3 : sieve prs 3 []
+    oddprimes = (3 :) . sieve 3 [] . oddprimes
-    sieve (p:ps) x fs = [i*2 + x | (i,e) <- assocs a, e]
-                        ++ sieve ps (p*p) fs'
+    sieve x fs (p:ps) = [i*2 + x | (i,True) <- assocs a]
+                        ++ sieve (p*p) ((p,0) :
-     where
-      q     = (p*p-x)`div`2
+                             [(s, rem (y-q) s) | (s,y) <- fs]) ps
+      where
-      fs'   = (p,0) : [(s, rem (y-q) s) | (s,y) <- fs]
-      -- a     :: UArray Int Bool
+      q = (p*p-x)`div`2
+      a :: UArray Int Bool
-      a     = accumArray (\ b c -> False) True (1,q-1)
-                         [(i,()) | (s,y) <- fs, i <- [y+s, y+s+s..q]]
+      a = accumArray (\ b c -> False) True (1,q-1)
+                     [(i,()) | (s,y) <- fs, i <- [y+s, y+s+s..q]]
 </haskell>
+Runs significantly faster than [[#Tree merging with Wheel|TMWE]] and with better empirical complexity, of about <math>O(n^{1.10..1.05})</math> in producing first few millions of primes, with constant memory footprint.
-Run on ''[http://ideone.com/eRP1x Ideone.com]'' it is somewhat faster than [[#Tree Merging With Wheel|Tree Merging With Wheel]] in producing first million primes or two, but has worse time complexity and large memory footprint which quickly gets into hundreds of MBs.
-When the (commented out above) explicit type signature is added, making the same code work on ''unboxed arrays'', the resulting GHC-compiled code runs more than twice faster and with better empirical complexity, of about <math>O(n^{1.10..1.05})</math> in producing first few millions of primes, with smaller though still growing memory footprint. Fixed sized segments are usually used in segmented sieves to limit the memory usage.
 === Calculating Primes Upto a Given Value ===
@@ Line 458: / Line 671: @@
                         [(i,()) | i <- [9,15..n]]
     f p a | q > n = a
-          | True  = if null x then a' else f (head x) a'
+          | True  = if null x then a2 else f (head x) a2
       where q = p*p
-            a' :: UArray Int Bool
+            a2  :: UArray Int Bool
-            a'= a // [(i,False) | i <- [q, q+2*p..n]]
+            a2 = a // [(i,False) | i <- [q, q+2*p..n]]
-            x = [i | i <- [p+2,p+4..n], a' ! i]
+            x  = [i | i <- [p+2,p+4..n], a2 ! i]
 </haskell>
@@ Line 471: / Line 684: @@
                       ++ [i | i <- [o,o+2..b], ar ! i]
   where
-    o  = max (if even a then a+1 else a) 3
+    o  = max (if even a then a+1 else a) 3   -- first odd in the segment
     r  = floor . sqrt $ fromIntegral b + 1
-    ar = accumArray (\a b-> False) True (o,b)
+    ar = accumArray (\_ _ -> False) True (o,b)  -- initially all True,
           [(i,()) | p <- [3,5..r]
-                    , let q  = p*p
+                    , let q  = p*p      -- flip every multiple of an odd
-                          s  = 2*p
+                          s  = 2*p                         -- to False
                           (n,x) = quotRem (o - q) s
-                          q' = if  o <= q  then q
+                          q2 = if  o <= q  then q
                                else  q + (n + signum x)*s
-                    , i <- [q',q'+s..b] ]
+                    , i <- [q2,q2+s..b] ]
 </haskell>
-Although using odds instead of primes, the array generation is so fast that it is very much feasible and even preferable for quick generation of some short spans of relatively big primes.
+Although sieving by odds instead of by primes, the array generation is so fast that it is very much feasible and even preferable for quick generation of some short spans of relatively big primes.
 == Using Mutable Arrays ==
@@ Line 518: / Line 731: @@
 Its [http://ideone.com/KwZNc empirical time complexity] is improving with ''n'' (number of primes produced) from above <math>O(n^{1.20})</math> towards <math>O(n^{1.16})</math>. The reference [http://ideone.com/FaPOB C++ vector-based implementation] exhibits this improvement in empirical time complexity too, from <math>O(n^{1.5})</math> gradually towards <math>O(n^{1.12})</math>, where tested ''(which might be interpreted as evidence towards the expected [http://en.wikipedia.org/wiki/Computation_time#Linearithmic.2Fquasilinear_time quasilinearithmic] <math>O(n \log(n)\log(\log n))</math> time complexity)''.
+=== Using Page-Segmented ST-Mutable Unboxed Array ===
-=== Bitwise prime sieve with Template Haskell ===
+The following code is similar to the above in using (automatically by default) bit-packed unboxed arrays sieving "odds-only", but adds the feature of page-segmentation where each page is about the size of the CPU L2 cache (2^20 or about a million bits in the example) so can sieve reasonably efficiently without "cache thrashing" to this number squared or about 10^12.  Using page segmentation, it is able to sieve to produce the primes to this limit only using the memory space required for three of these buffers, a (recursive) two used to sieve for the base primes up to a million (actually up to two million but only a million are used to this limit) and the third used for the page segments where they represent increasing values as the computation progresses.
-Count the number of prime below a given 'n'. Shows fast bitwise arrays,
-and an example of [[Template Haskell]] to defeat your enemies.
+Thus, only less than 400 Kilobytes is used sieving up to the limit of 10^12 rather than the about 62.5 Gigabytes required for the one-huge-monolithic array version (even bit-packed), and even if a computer had this much memory, the huge cost in memory access time of up up to about a hundred CPU clock cycles per access due to "cache-thrashing" as compared to the average of four or five CPU clock cycles as used in this code.
-<haskell>
-{-# OPTIONS -O2 -optc-O -XBangPatterns #-}
-module Primes (nthPrime) where
+If one places the following GHC Haskell code in a file called TestPrimes.hs:
-import Control.Monad.ST
+<haskell>{-# OPTIONS_GHC -O2 #-}
-import Data.Array.ST
-import Data.Array.Base
-import System
-import Control.Monad
-import Data.Bits
+import Data.Int ( Int64 )
-nthPrime :: Int -> Int
+import Data.Word ( Word64 )
-nthPrime n = runST (sieve n)
+import Data.Bits ( Bits(shiftR) )
+import Data.Array.Base ( IArray(unsafeAt), UArray(UArray),
+                         MArray(unsafeWrite), unsafeFreezeSTUArray )
+import Control.Monad ( forM_ )
+import Data.Array.ST ( MArray(newArray), runSTUArray )
+type Prime = Word64
-sieve n = do
-    a <- newArray (3,n) True :: ST s (STUArray s Int Bool)
-    let cutoff = truncate (sqrt $ fromIntegral n) + 1
-    go a n cutoff 3 1
+cSieveBufferLimit :: Int
-go !a !m cutoff !n !c
+cSieveBufferLimit = 2^17 * 8 - 1 -- CPU L2 cache in bits
-    | n >= m    = return c
-    | otherwise = do
-        e <- unsafeRead a n
-        if e then
-          if n < cutoff then
-            let loop !j
-                 | j < m     = do
-                        x <- unsafeRead a j
-                        when x $ unsafeWrite a j False
-                        loop (j+n)
-                 | otherwise = go a m cutoff (n+2) (c+1)
-            in loop ( if n < 46340 then n * n else n `shiftL` 1)
-           else go a m cutoff (n+2) (c+1)
-         else go a m cutoff (n+2) c
-</haskell>
+primes :: () -> [Prime]
-And place in a module:
+primes() = 2 : _Y (listPagePrms . pagesFrom 0) where
+  _Y g = g (_Y g) -- non-sharing multi-stage fixpoint combinator
+  listPagePrms pgs@(hdpg@(UArray lwi _ rng _) : tlpgs) =
+    let loop i | i >= fromIntegral rng = listPagePrms tlpgs
+               | unsafeAt hdpg i = loop (i + 1)
+               | otherwise = let ii = lwi + fromIntegral i in
+                             case fromIntegral $ 3 + ii + ii of
+                               p -> p `seq` p : loop (i + 1) in loop 0
+  makePg lwi bps = runSTUArray $ do
+    let limi = lwi + fromIntegral cSieveBufferLimit
+        bplmt = floor $ sqrt $ fromIntegral $ limi + limi + 3
+        strta bp = let si = fromIntegral $ (bp * bp - 3) `shiftR` 1
+                   in if si >= lwi then fromIntegral $ si - lwi else
+                   let r = fromIntegral (lwi - si) `mod` bp
+                   in if r == 0 then 0 else fromIntegral $ bp - r
+    cmpsts <- newArray (lwi, limi) False
+    fcmpsts <- unsafeFreezeSTUArray cmpsts
+    let cbps = if lwi == 0 then listPagePrms [fcmpsts] else bps
+    forM_ (takeWhile (<= bplmt) cbps) $ \ bp ->
+      forM_ (let sp = fromIntegral $ strta bp
+             in [ sp, sp + fromIntegral bp .. cSieveBufferLimit ]) $ \c ->
+        unsafeWrite cmpsts c True
+    return cmpsts
+  pagesFrom lwi bps = map (`makePg` bps)
+                          [ lwi, lwi + fromIntegral cSieveBufferLimit + 1 .. ]
+main :: IO ()
-<haskell>
+main = print $ last $ take (10^ (10 :: Int)) $ primes()</haskell>
-{-# OPTIONS -fth #-}
-import Primes
+Run as (for Linux-like systems):
-main = print $( let x = nthPrime 10000000 in [| x |] )
-</haskell>
+<haskell>$ ghc TestPrimes
-Run as:
+$ time ./TestPrimes
+179424673
+real    0m0.714s
-<haskell>
+user    0m0.703s
-$ ghc --make -o primes Main.hs
+sys     0m0.009s</haskell>
-$ time ./primes
+The above code finds the first ten million primes in about 0.71 seconds, the first hundred million primes in about 7.6 seconds, the first thousand million (billion) primes in about 85.7, and the first ten billion primes in about 1087 seconds (Intel i5-6500 at 3.6 Ghz single threaded boost) for a empirical growth factor of about 1.1 to 1.2, which is about the theoretical limit.
-./primes  0.00s user 0.01s system 228% cpu 0.003 total
+Most of the above times are the time taken for the primes list processing, where the actual culling time is some fraction of the total time, with further optimizations plus multi-threading on multi-core CPU's able to make this small fraction even smaller. This is why conventional functional/incremental SoE's have very limited use for finding the first million primes or so as compared to use of mutable unboxed array based page-segmented SoE's which should be used for larger ranges, and have further optimizations reducing culling operation time by as much as an average of two times for these ranges, as well as the mentioned easy multi-threading by pages and further reductions in the number of operations through wheel-factorization, etc.
-</haskell>
 == Implicit Heap ==
@@ Line 588: / Line 807: @@
 See [[Prime_numbers_miscellaneous#Using_IntSet_for_a_traditional_sieve | Using IntSet for a traditional sieve]].
+== Prime Counting Functions ==
+For counting of the number of primes up to a limit, just using a prime sieve is not the fastest way, especially when the ranges are over a few millions, and a prime counting function based on the inclusion-exclusion principle is much faster as although the operations are slower integer division operations rather than just simple adding and shifting integer operations, there are many less operations because the execution complexity is something like O(n^(2/3)) for Meissel based functions and O(n^(3/4)) for Legendre functions as per the one shown here.  This operational complexity reduction comes about due to the inclusion-exclusion principle, which does not require knowledge of all the primes over the counting range as only a subset is required, which are found using a sieve.  In this example, sieving to the square root of the counting range is a small part of the overall time so a basic odds-only Sieve of Eratosthenes is sufficient since most of the time will be spent determining the counts of primes to various points across the range.
+The following code works by using the non-recursive partial sieving definition of the Legendre Phi function where for each base prime sieving pass up to the quad root of the counting range, a "smalls" Look Up Table (LUT) array of the counts of the remaining potential primes (which are only partially sieved, so may still be compound) is maintained, a LUT array of the remaining "roughs" or potential primes above the sieved base primes is updated, and a "larges" LUT array of the counting range divided by each of the remaining "roughs" is updated per sieving pass for each of the remaining potential "roughs".  Following complete sieving, a final adjustment is made to compensate for the counting range divided by the product of pairs of primes above the quad root of the counting range where the primes must never be the same value.  Since the LUT arrays represent actual count values (including the count of the sieving base primes) and not Phi values where the base primes are not included in the counts but one is, there is a compensation for this discrepancy at the end of each partial sieving loop and at the end of the "products of pairs" loop.  Finally, after combining the counts into one result value, one is added to the count for the only even prime of two.
+The following Legendre prime counting function code is iterative (not recursive although the iterative loops are expressed recursively) through the use of partial sieving where the culling array is sieved and counted one base prime at a time as is one interpretation of the Legendre Phi function (save as a file named PrimeCount.hs):
+<haskell>{-# OPTIONS_GHC -O2 -fllvm #-}
+{-# LANGUAGE FlexibleContexts, BangPatterns #-}
+import Data.Int (Int64, Int32)
+import Data.Bits (Bits(shiftR), (.|.))
+import Control.Monad (forM_)
+import Control.Monad.ST (ST, runST)
+import Data.Array.Base (STUArray, MArray(unsafeNewArray_), newArray,
+                        unsafeAt, unsafeFreezeSTUArray, unsafeRead, unsafeWrite)
+primeCount :: Int64 -> Int64
+primeCount n =
+  if n < 9 then (if n < 2 then 0 else (n + 1) `div` 2) else
+  let
+    {-# INLINE divide #-}
+    divide :: Int64 -> Int64 -> Int
+    divide nm d = truncate $ (fromIntegral nm :: Double) / fromIntegral d
+    {-# INLINE half #-}
+    half :: Int -> Int
+    half x = (x - 1) `shiftR` 1
+    rtlmt = floor $ sqrt (fromIntegral n :: Double)
+    mxndx = (rtlmt - 1) `div` 2
+    (npc, ns, smalls, roughs, larges) = runST $ do
+      mss <- unsafeNewArray_ (0, mxndx) :: ST s (STUArray s Int Int32)
+      forM_ [ 0 .. mxndx ] $ \ i -> unsafeWrite mss i (fromIntegral i)
+      mrs <- unsafeNewArray_ (0, mxndx) :: ST s (STUArray s Int Int32)
+      forM_ [ 0 .. mxndx ] $ \ i -> unsafeWrite mrs i (fromIntegral i * 2 + 1)
+      mls <- unsafeNewArray_ (0, mxndx) :: ST s (STUArray s Int Int64)
+      forM_ [ 0 .. mxndx ] $ \ i ->
+        let d = fromIntegral (i + i + 1)
+        in unsafeWrite mls i (fromIntegral (divide n d - 1) `div` 2)
+      skips <- newArray (0, mxndx) False :: ST s (STUArray s Int Bool)
+      let loop i !pc !cs =
+            let sqri = (i + i) * (i + 1) in
+            if sqri > mxndx then do
+              fss <- unsafeFreezeSTUArray mss
+              frs <- unsafeFreezeSTUArray mrs
+              fls <- unsafeFreezeSTUArray mls
+              return (pc, cs + 1, fss, frs, fls)
+            else do
+              v <- unsafeRead skips i
+              if v then loop (i + 1) pc cs else do
+                unsafeWrite skips i True
+                let bp = i + i + 1
+                    cull c = if c > mxndx then return () else do
+                               unsafeWrite skips c True; cull (c + bp)
+                    part k !s =
+                      if k > cs then return (s - 1) else do
+                        p <- unsafeRead mrs k
+                        t <- unsafeRead skips ((fromIntegral p - 1) `shiftR` 1)
+                        if t then part (k + 1) s else do
+                          olv <- unsafeRead mls k
+                          let d = fromIntegral p * fromIntegral bp
+                          adjv <- -- return olv
+                                  if d <= fromIntegral rtlmt then do
+                                    sv <- unsafeRead mss (fromIntegral d `shiftR` 1)
+                                    unsafeRead mls (fromIntegral sv - pc)
+                                  else do
+                                    sv <- unsafeRead mss (half (divide n d))
+                                    return (fromIntegral sv)
+                          unsafeWrite mls s (olv - adjv + fromIntegral pc)
+                          unsafeWrite mrs s p; part (k + 1) (s + 1)
+                    k0 = ((rtlmt `div` bp) - 1) .|. 1
+                    adjc !j k =
+                      if k < bp then return () else do
+                        c <- unsafeRead mss (k `shiftR` 1)
+                        let ac = c - fromIntegral pc
+                            e = (k * bp) `shiftR` 1
+                            adj m = if m < e then adjc m (k - 2)
+                                    else do ov <- unsafeRead mss m
+                                            unsafeWrite mss m (ov - ac)
+                                            adj (m - 1)
+                        adj j
+                cull sqri; his <- part 0 0; adjc mxndx k0; loop (i + 1) (pc + 1) his
+      loop 1 0 mxndx
+    ans0 = unsafeAt larges 0 + fromIntegral ((ns + 2 * (npc - 1)) * (ns - 1) `div` 2)
+    accum j !ans =
+      if j >= ns then ans else
+      let q = fromIntegral (unsafeAt roughs j)
+          m = n `div` q
+          e = fromIntegral (unsafeAt smalls (half (fromIntegral (m `div` q)))) - npc
+          comp k !ac =
+            if k > e then ac else
+            let p = fromIntegral (unsafeAt roughs k)
+                ci = half (fromIntegral (divide m p))
+            in comp (k + 1) (ac + fromIntegral (unsafeAt smalls ci))
+      in if e <= j then ans else
+         accum (j + 1) (comp (j + 1) ans - fromIntegral ((e - j) * (npc + j - 1)))
+  in accum 1 (ans0 - sum [ unsafeAt larges i | i <- [ 1 .. ns - 1 ] ]) + 1
+main :: IO ()
+main = print $ primeCount (10^(11 :: Int))</haskell>
+Run as (Linux-like OS's):
+<haskell>ghc PrimeCount
+time ./PrimeCount
+4118054813
+real    0m0.033s
+user    0m0.029s
+sys     0m0.004s</haskell>
+This time of about 29 milliseconds to count the primes to 10^11 and the ability to count the primes to 10^16 in about 73 seconds (Intel i5-6500 at 3.6 GHZ) shows the empirical growth of about 0.68, which is why it can count extended count ranges so fast where using a sieve to count the range to 10^16 with a computational complexity a little above one would take over a thousand hours single-threaded as here, even with the best of constant-factor wheel-factorization and loop-operation-time-reduction optimizations.  The main problem with this algorithm is that it uses memory as the square root of the counting range at about 800 Megabytes for the range to 10^16.
+Other prime counting function algorithms such as [https://www-users.cse.umn.edu/~odlyzko/doc/arch/meissel.lehmer.pdf Lagarias, Miller, and Odlyzko's algorithm] use memory proportional to the cube root of the counting range and have a slightly further reduced computational complexity while not significantly adding to the cost per operation as well as supporting easy multi-threading, but the code is somewhat long to show on this page of basic algorithms.
+Prime counting function algorithms can also be modified to produce the sum of the primes up to a limit.
 == Testing Primality, and Integer Factorization ==
@@ Line 600: / Line 932: @@
 == External links ==
-* http://www.cs.hmc.edu/~oneill/code/haskell-primes.zip
+* http://www.cs.hmc.edu/~oneill/code/haskell-primes.zip ([https://web.archive.org/web/20140121040638/http://www.cs.hmc.edu/~oneill/code/haskell-primes.zip archived copy])
 : A collection of prime generators; the file "ONeillPrimes.hs" contains one of the fastest pure-Haskell prime generators; code by Melissa O'Neill.
 :  WARNING: Don't use the priority queue from ''older versions'' of that file for your projects: it's broken and works for primes only by a lucky chance. The ''revised'' version of the file fixes the bug, as pointed out by Eugene Kirpichov on February 24, 2009 on the [http://www.mail-archive.com/haskell-cafe@haskell.org/msg54618.html haskell-cafe] mailing list, and fixed by Bertram Felgenhauer.
@@ Line 607: / Line 939: @@
 * Speed/memory [http://ideone.com/p0e81 comparison table] for sieve of Eratosthenes variants.
+* [http://en.wikipedia.org/wiki/Analysis_of_algorithms#Empirical_orders_of_growth Empirical orders of growth] on Wikipedia.
 [[Category:Code]]