What is the ideal implementation of the Sieve of Eratosthenes between lists, arrays, and mutable arrays?

In Haskell, I found three simple implementations of the Sieve of Eratosthenes on the Rosetta Code page.

Now my question is: which one should be used in which situations?

Here is my initial reasoning:

I guess the list version is the most idiomatic and the easiest to read for a Haskeller. Is that correct? I was wondering whether it suffers from the same problems as another list-based sieve, which, as I later found out, does not actually implement the algorithm:
(edit: the snippet shown here is the problematic list-based sieve I knew about, not the Rosetta Code one that I pasted below)

 primes = sieve [2..]
   where sieve (p:xs) = p : sieve [n | n <- xs, n `mod` p > 0]

In terms of performance, the immutable array seems to be the winner. With an upper bound m of 2,000,000, the times were approximately:

  • 1.3s for List
  • 0.6s for Immutable Array
  • 1.8s for Mutable Array

So, I would choose Array for performance.

And, of course, Mutable Array is also easy for me to reason about, since I have a background in imperative languages. I'm not sure why I would choose it when coding in Haskell, though, since it is slower than the others and not idiomatic.

Code copied here for reference:

List:

 primesTo m = 2 : eratos [3,5..m]
   where
     eratos (p : xs)
       | p*p > m   = p : xs
       | True      = p : eratos (xs `minus` [p*p, p*p+2*p..])
     minus a@(x:xs) b@(y:ys) = case compare x y of
       LT -> x : minus xs b
       EQ ->     minus xs ys
       GT ->     minus a  ys
     minus a b = a

Immutable Array:

 import Data.Array.Unboxed

 primesToA m = sieve 3 (array (3,m) [(i, odd i) | i <- [3..m]] :: UArray Int Bool)
   where
     sieve p a
       | p*p > m   = 2 : [i | (i,True) <- assocs a]
       | a!p       = sieve (p+2) $ a // [(i,False) | i <- [p*p, p*p+2*p..m]]
       | otherwise = sieve (p+2) a

Mutable Array:

 import Control.Monad (forM_, when)
 import Control.Monad.ST
 import Data.Array.ST
 import Data.Array.Unboxed

 primeSieve :: Integer -> UArray Integer Bool
 primeSieve top = runSTUArray $ do
     a <- newArray (2,top) True -- :: ST s (STUArray s Integer Bool)
     let r = ceiling . sqrt $ fromInteger top
     forM_ [2..r] $ \i -> do
         ai <- readArray a i
         when ai $ do
             forM_ [i*i, i*i+i..top] $ \j -> do
                 writeArray a j False
     return a

 -- Return primes from sieve as list:
 primesTo :: Integer -> [Integer]
 primesTo top = [p | (p,True) <- assocs $ primeSieve top]

EDIT

  • The sieve I showed at the top is the Turner sieve, not the list-based example that I am comparing with the other two. I only wanted to find out whether the list-based example suffers from the same problems as the "Turner-style" non-sieve.
  • It seems that the performance comparison is unfair, because the Mutable Array example also walks through the evens and uses Integer , not Int ...
performance haskell primes sieve-of-eratosthenes
3 answers

This:

 primes = sieve [2..]
   where sieve (p:xs) = p : sieve [n | n <- xs, n `mod` p > 0]

is not a sieve. It is a very inefficient trial division. Do not use it!
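To see the equivalence, here is a hypothetical direct trial-division formulation (the names `turner` and `trialDiv` are mine, for illustration); it produces exactly the same list, testing each candidate by `mod` against every smaller prime found so far:

```haskell
-- Turner's "sieve" from the question: each candidate n is filtered
-- by a `mod` test for every prime below it
turner :: [Int]
turner = sieve [2..]
  where sieve (p:xs) = p : sieve [n | n <- xs, n `mod` p > 0]

-- an explicit trial-division equivalent: keep n exactly when no
-- previously found prime divides it -- the same test as above
trialDiv :: [Int]
trialDiv = go [] [2..]
  where go ps (n:ns)
          | all (\p -> n `mod` p > 0) ps = n : go (ps ++ [n]) ns
          | otherwise                    = go ps ns

main :: IO ()
main = print (take 20 turner == take 20 trialDiv)  -- prints True
```

Both versions do division work proportional to the number of primes below each candidate, which is what makes the "sieve" quadratic rather than near-linear.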

I am curious how you obtained your times; there is no way that Turner's "sieve" can produce the primes up to 2,000,000 in a matter of seconds. Letting it find the primes up to 200,000 took

  MUT   time    6.38s  (  6.39s elapsed)
  GC    time    9.19s  (  9.20s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   15.57s  ( 15.59s elapsed)

on my box (64-bit Linux, ghc-7.6.1, compiled with -O2). The complexity of that algorithm is O(N² / log² N) , almost quadratic. Scaling it up to 2,000,000 would take about twenty minutes.

Your times for the array versions are also suspicious, albeit in the other direction. Did you measure interpreted code?

Sieving to 2,000,000, compiled with optimisations, the mutable array code took 0.35 seconds to run, and the immutable array code 0.12 seconds.

That still leaves the mutable array about three times slower than the immutable array.

But that is an unfair comparison. For the immutable array you used Int s, and for the mutable array Integer s. Changing the mutable array code to use Int - as it should, since under the hood arrays are Int -indexed, so using Integer is an unnecessary performance sacrifice that buys nothing - brought the mutable array code down to 0.15 seconds. Close to the immutable array code, but not quite there. However, you are letting the mutable array do more work: in the immutable array code you eliminate only the odd multiples of the odd primes, but in the mutable array code you mark all multiples of all primes. Changing the mutable array code to treat 2 specially and eliminate only the odd multiples of the odd primes brings it down to the same 0.12 seconds.
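For concreteness, here is a sketch (my reconstruction, not the answerer's exact code) of the mutable sieve after those two changes - Int instead of Integer , 2 treated specially, and only odd multiples of odd primes marked off - still using checked readArray / writeArray :

```haskell
import Control.Monad (forM_, when)
import Data.Array.ST (runSTUArray, newArray, readArray, writeArray)
import Data.Array.Unboxed (UArray, assocs)

-- mutable sieve with Int indices; evens are never touched, so only
-- the odd entries of the array are meaningful
primeSieve :: Int -> UArray Int Bool
primeSieve top = runSTUArray $ do
    a <- newArray (2, top) True
    let r = ceiling (sqrt (fromIntegral top :: Double))
    forM_ [3,5..r] $ \i -> do              -- odd candidates only
        ai <- readArray a i
        when ai $
            forM_ [i*i, i*i+2*i .. top] $ \j ->  -- odd multiples only
                writeArray a j False
    return a

-- 2 is prepended specially; even slots are filtered out here
primesTo :: Int -> [Int]
primesTo top = 2 : [p | (p, True) <- assocs (primeSieve top), odd p]

main :: IO ()
main = print (primesTo 30)  -- prints [2,3,5,7,11,13,17,19,23,29]
```

The remaining gap to the immutable version then comes only from the checked indexing, which the next step removes.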

But you are using range-checked array indexing, which is slow, and since the validity of the indices is checked in the code itself, it is unnecessary. Changing that to use unsafeRead and unsafeWrite brings the mutable array code down to 0.09 seconds.

Then there is the problem that using

 forM_ [x, y .. z] 

uses boxed Int values (fortunately, GHC eliminates the list itself). Writing the loops yourself so that only unboxed Int# values are used brings the time down to 0.02 seconds.

 {-# LANGUAGE MonoLocalBinds #-}
 import Control.Monad (forM_, when)
 import Control.Monad.ST
 import Data.Array.ST
 import Data.Array.Unboxed
 import Data.Array.Base

 primeSieve :: Int -> UArray Int Bool
 primeSieve top = runSTUArray $ do
     a <- newArray (0,top) True
     unsafeWrite a 0 False
     unsafeWrite a 1 False
     let r = ceiling . sqrt $ fromIntegral top
         mark step idx
           | top < idx = return ()
           | otherwise = do
               unsafeWrite a idx False
               mark step (idx+step)
         sift p
           | r < p     = return a
           | otherwise = do
               prim <- unsafeRead a p
               when prim $ mark (2*p) (p*p)
               sift (p+2)
     mark 2 4
     sift 3

 -- Return primes from sieve as list:
 primesTo :: Int -> [Int]
 primesTo top = [p | (p,True) <- assocs $ primeSieve top]

 main :: IO ()
 main = print . last $ primesTo 2000000

So, wrapping up: for a Sieve of Eratosthenes you should use an array - not surprisingly, since its efficiency depends on being able to step from one multiple to the next in short constant time.

With immutable arrays you get very simple and clear code, and that code performs decently for not-too-high limits (it becomes relatively worse for higher limits, since the arrays are still copied and garbage collected, but it's not too bad).
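As an aside, the "simple and clear" immutable-array style can be written even more compactly with accumArray , which builds the marked array in one pass instead of repeated // updates (a common idiom, not code from the answer):

```haskell
import Data.Array.Unboxed (UArray, accumArray, assocs)

-- one-shot immutable sieve: start with everything True, then fold in
-- all composite positions with a constantly-False accumulating function
primesTo :: Int -> [Int]
primesTo m = [p | (p, True) <- assocs sieve]
  where
    sieve :: UArray Int Bool
    sieve = accumArray (\_ _ -> False) True (2, m)
              [(c, ()) | p <- [2 .. isqrt m], c <- [p*p, p*p+p .. m]]
    isqrt n = floor (sqrt (fromIntegral n :: Double))

main :: IO ()
main = print (length (primesTo 1000))  -- prints 168
```

This still pays the copying cost once, but avoids the chain of intermediate arrays that the `//` version allocates.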

If you need better performance, you need mutable arrays. Writing efficient mutable array code is not entirely trivial - one has to know how the compiler translates the various constructs in order to choose the right ones - and some consider such code unidiomatic. But you can also use a library (disclaimer: I wrote it) that provides a fairly efficient implementation, rather than writing it yourself.


A mutable array will always be the winner in terms of performance (and you really should have copied the version that works on odds only; at the very least it should be the fastest of the three - also because it uses Int and not Integer ).

For lists, the tree-merging incremental sieve should perform better than the one you show. You can always use it with takeWhile (< limit) if needed. I contend that it most clearly expresses the true nature of the sieve:

 import Data.List (unfoldr)

 primes :: [Int]
 primes = 2 : _Y ((3 :) . gaps 5 . _U . map (\p -> [p*p, p*p+2*p..]))

 _Y g = g (_Y g)                                   -- recursion

 _U ((x:xs):t) = (x :) . union xs . _U             -- ~= nub . sort . concat
                   . unfoldr (\(a:b:c) -> Just (union a b, c)) $ t

 gaps k s@(x:xs) | k < x     = k : gaps (k+2) s    -- ~= [k,k+2..]\\s, when
                 | otherwise =     gaps (k+2) xs   --    k<=x && null(s\\[k,k+2..])

 union a@(x:xs) b@(y:ys) = case compare x y of     -- ~= nub . sort .: (++)
     LT -> x : union xs  b
     EQ -> x : union xs ys
     GT -> y : union a  ys

Of course, nothing beats the brevity of nubBy (((>1).).gcd) [2..] .
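That one-liner really does yield the primes: a number is kept exactly when it shares no factor greater than 1 with any number kept before it (and since gcd is symmetric, the argument order nubBy uses does not matter). A minimal runnable check:

```haskell
import Data.List (nubBy)

-- beautiful but very slow: point-free trial division via gcd
primes :: [Int]
primes = nubBy (((> 1) .) . gcd) [2..]

main :: IO ()
main = print (take 10 primes)  -- prints [2,3,5,7,11,13,17,19,23,29]
```

It is quadratic like Turner's version, so it is only ever a curiosity, not a contender.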

To your 1st new question: no. It does find the multiples by counting up, as any true sieve should (although minus on lists is of course under-performant; the above improves on it by restructuring the chain of linear subtractions ((((xs-a)-b)-c)- ... ) into a subtraction of tree-folded additions, xs-(a+((b+c)+(((d+ ... )) + ... ))) ).
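The shape of that restructuring can be seen on finite lists with a hypothetical pairwise tree fold ( union' and treeUnion are my names, not from the answer): instead of union-ing each new list onto one ever-growing chain, the lists are merged in balanced pairs, so each element passes through only a logarithmic number of merge nodes:

```haskell
-- ordered-list union, with base cases added for finite lists
union' :: Ord a => [a] -> [a] -> [a]
union' a@(x:xs) b@(y:ys) = case compare x y of
    LT -> x : union' xs b
    EQ -> x : union' xs ys
    GT -> y : union' a ys
union' a [] = a
union' [] b = b

-- merge a list of ordered lists as a balanced tree of unions
treeUnion :: Ord a => [[a]] -> [a]
treeUnion []   = []
treeUnion [xs] = xs
treeUnion xss  = treeUnion (pairs xss)
  where pairs (a:b:t) = union' a b : pairs t
        pairs t       = t

main :: IO ()
main = print (treeUnion [[9,15,21],[25,35,55],[49,77],[121]])
-- prints [9,15,21,25,35,49,55,77,121]
```

The `_U` in the sieve above does the same thing lazily on the infinite list of multiples lists.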


As has been said, mutable arrays give the best performance. The following code is derived from this "TemplateHaskell" version, converted back to a more straightforward mutable-array solution, since the "TemplateHaskell" part doesn't seem to add anything, with some further optimizations. I believe it is faster than the usual mutable unboxed array versions due to those further optimizations, and especially due to the use of the unsafeRead and unsafeWrite functions, which avoid range checking on array accesses, probably also using pointers internally for array access:

 {-# OPTIONS -O2 -optc-O3 #-}
 import Data.Word
 import Control.Monad
 import Control.Monad.ST
 import Data.Array.ST
 import Data.Array.Unboxed
 import Data.Array.Base

 primesToUA :: Word32 -> [Word32]
 primesToUA top = do
   let sieveUA top = runSTUArray $ do
         let m = ((fromIntegral top) - 3) `div` 2 :: Int
         buf <- newArray (0,m) True -- :: ST s (STUArray s Int Bool)
         let cullUA i = do
               let p = i + i + 3
                   strt = p * (i + 1) + i
               let cull j = do
                     if j > m then cullUA (i + 1)
                       else do
                         unsafeWrite buf j False
                         cull (j + p)
               if strt > m then return ()
                 else do
                   e <- unsafeRead buf i
                   if e then cull strt else cullUA (i + 1)
         cullUA 0
         return buf
   if top > 1
     then 2 : [2 * (fromIntegral i) + 3 | (i,True) <- assocs $ sieveUA top]
     else []

 main = do
   x <- read `fmap` getLine -- 1mln 2mln 10mln 100mln
                            -- 0.01 0.02 0.09  1.26  seconds
   print (last (primesToUA x))

EDIT: the code above has been corrected, and the quoted times changed to reflect the correction; the comments comparing the paged to the non-paged version have also been edited.

The times to run this for the upper ranges indicated are given in the comment table at the bottom of the source code above, as measured on ideone.com, and are almost exactly five times faster than the answer posted by @WillNess, also as measured on ideone.com. This code takes a trivial amount of time to sieve the primes up to two million and only 1.26 seconds to sieve to a hundred million. These times are about 2.86 times faster when run on my i7 (3.5 GHz): 0.44 seconds to a hundred million, and 6.81 seconds to a billion. Memory use is a bit over six megabytes for the former and sixty megabytes for the latter, which is the memory used by the huge (bit-packed) array. This array also explains the non-linear performance: once the size of the array exceeds the sizes of the CPU caches, the average memory access time degrades as the composite-number representations are marked off.

EDIT_ADD: A page-segmented sieve is more efficient in that it has better memory access efficiency when the buffer size is kept smaller than the CPU's L1 or L2 caches. It also has the advantage that it is unbounded, so the upper range need not be specified in advance, and it needs much less memory: only the base primes less than the square root of the range used, plus the page buffer memory. The following code was written as a page-segmented version and is somewhat faster than the unsegmented one; it also offers the advantage that one can change the output range specification at the top of the code from "Word32" to "Word64" so it is no longer limited to the 32-bit number range, at only a small cost in processing time (for 32-bit compiled code) over the range they have in common. The code is as follows:

 -- from http://www.haskell.org/haskellwiki/Prime_numbers#Using_ST_Array
 {-# OPTIONS -O2 -optc-O3 #-}
 import Data.Word
 import Control.Monad
 import Control.Monad.ST
 import Data.Array.ST
 import Data.Array.Unboxed
 import Data.Array.Base

 primesUA :: () -> [Word32]
 primesUA () = do
   let pgSZBTS = 262144 * 8
   let sieveUA low bps = runSTUArray $ do
         let nxt = (fromIntegral low) + (fromIntegral pgSZBTS)
         buf <- newArray (0, pgSZBTS - 1) True -- :: ST s (STUArray s Int Bool)
         let cullUAbase i = do
               let p = i + i + 3
                   strt = p * (i + 1) + i
               when (strt < pgSZBTS) $ do
                 e <- unsafeRead buf i
                 if e then do
                     let loop j = do
                           if j < pgSZBTS then do
                               unsafeWrite buf j False
                               loop (j + p)
                             else cullUAbase (i + 1)
                     loop strt
                   else cullUAbase (i + 1)
         let cullUA ~(p:t) = do
               let bp = (fromIntegral p)
                   i = (bp - 3) `div` 2
                   s = bp * (i + 1) + i
               when (s < nxt) $ do
                 let strt = do
                       if s >= low then fromIntegral (s - low)
                         else do
                           let b = (low - s) `rem` bp
                           if b == 0 then 0 else fromIntegral (bp - b)
                 let loop j = do
                       if j < pgSZBTS then do
                           unsafeWrite buf j False
                           loop (j + (fromIntegral p))
                         else cullUA t
                 loop strt
         if low <= 0 then cullUAbase 0 else cullUA bps
         return buf
   let sieveList low bps = do
         [2 * ((fromIntegral i) + low) + 3 | (i,True) <- assocs $ sieveUA low bps]
   let sieve low bps = do
         (sieveList low bps) ++ sieve (low + (fromIntegral pgSZBTS)) bps
   let primes' = ((sieveList 0 []) ++ sieve (fromIntegral pgSZBTS) primes') :: [Word32]
   2 : sieve 0 primes'

 main = do
   x <- read `fmap` getLine -- 1mln 2mln 10mln 100mln
                            -- 0.02 0.03 0.13  1.13  seconds
   print (length (takeWhile ((>=) (fromIntegral x)) (primesUA ())))

The code above contains quite a few more lines than the unsegmented case because of the need to cull the composite-number representations from the first page differently from those of subsequent pages. The code also has fixes so that there are no memory leaks caused by the base primes list and the output list being the same list (which would hold the whole list in memory).

Note that this code approaches linear time (over the sieved range) as the range gets larger, owing to the culling buffer being a constant size smaller than the CPU's L2 cache. Memory use is a small fraction of that used by the unsegmented version: a little under 600 kilobytes for a hundred million and a little over 600 kilobytes for a billion, where the slight increase is just the extra space needed for the base primes less than the square root of the range.

On ideone.com, this code sieves up to a hundred million in 1.13 seconds and to a billion in 12 seconds (32-bit). Wheel factorization and some multi-core processing would likely make it even faster on a multi-core CPU. On my i7 (3.5 GHz), it takes 0.44 seconds to sieve to a hundred million and 4.7 seconds to a billion, with roughly linear performance as the range increases, as expected. The version of GHC on ideone.com seems to have some non-linear overhead - a performance penalty for larger ranges that is absent on the i7 - possibly due to different garbage collection, since the page buffers are allocated anew for each page. END_EDIT_ADD

EDIT_ADD2: It turns out that much of the processing time for the page-segmented code above is spent in (lazy) list processing, so the code was reformulated accordingly, with several improvements:

  • Implemented a prime-counting function that does not use list processing and uses "popCount" to count the number of "one" bits a 32-bit word at a time. With it, the time to produce the results is negligible compared to the actual sieving time.

  • Stored the base primes as a list of bit-packed page segments, which is much more space-efficient than storing a list of primes; the time needed to convert the page segments to primes on demand is not much computational overhead.

  • Tuned the page-segment culling function so that for the initial zero page it uses its own bit pattern as the source of base primes, making the composite-culling code shorter and simpler.
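The word-at-a-time counting and the masking of the final partial word can be illustrated in isolation ( countBits and maskTo are illustrative names of mine, not from the code below):

```haskell
import Data.Bits (popCount, shiftL, complement, (.&.))
import Data.Word (Word32)

-- count the 'one' bits of a bit-packed buffer one 32-bit word at a time,
-- instead of walking a list of Bool element by element
countBits :: [Word32] -> Int
countBits = sum . map popCount

-- keep only bits 0..(lmtb .&. 31) of the last word, the same trick the
-- counting code uses to avoid counting entries past the limit
maskTo :: Int -> Word32 -> Word32
maskTo lmtb w = w .&. complement ((-2) `shiftL` (lmtb .&. 31))

main :: IO ()
main = do
    print (countBits [0xFFFFFFFF, 0x0F0F0F0F, 0])  -- 32 + 16 + 0 = 48
    print (maskTo 3 0xFFFFFFFF)                    -- bits 0..3 survive: 15
```

On modern hardware popCount compiles to a single instruction, which is why the counting pass all but disappears from the run time.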

Then the code becomes:

 {-# OPTIONS -O3 -rtsopts #-} -- -fllvm ide.com doesn't support LLVM
 import Data.Word
 import Data.Bits
 import Control.Monad
 import Control.Monad.ST
 import Data.Array.ST (runSTUArray)
 import Data.Array.Unboxed
 import Data.Array.Base

 pgSZBTS = (2^18) * 8 :: Int -- size of L2 data cache

 type PrimeType = Word32
 type Chunk = UArray PrimeType Bool

 -- makes a new page chunk and culls it
 --   if the base primes list provided is empty then
 --   it uses the current array as source (for zero page base primes)
 mkChnk :: Word32 -> [Chunk] -> Chunk
 mkChnk low bschnks = runSTUArray $ do
   let nxt = (fromIntegral low) + (fromIntegral pgSZBTS)
   buf <- nxt `seq` newArray (fromIntegral low, fromIntegral nxt - 1) True
   let cull ~(p:ps) =
         let bp = (fromIntegral p)
             i = (bp - 3) `shiftR` 1
             s = bp * (i + 1) + i
             cullp j = do
               if j >= pgSZBTS then cull ps
                 else do
                   unsafeWrite buf j False
                   cullp (j + (fromIntegral p))
         in when (s < nxt) $ do
              let strt = do
                    if s >= low then fromIntegral (s - low)
                      else do
                        let b = (low - s) `rem` bp
                        if b == 0 then 0 else fromIntegral (bp - b)
              cullp strt
   case bschnks of
     [] -> do
       bsbf <- unsafeFreezeSTUArray buf
       cull (listChnkPrms [bsbf])
     _ -> cull $ listChnkPrms bschnks
   return buf

 -- creates a page chunk list starting at the lw value
 chnksList :: Word32 -> [Chunk]
 chnksList lw =
   mkChnk lw basePrmChnks : chnksList (lw + fromIntegral pgSZBTS)

 -- converts a page chunk list to a list of primes
 listChnkPrms :: [Chunk] -> [PrimeType]
 listChnkPrms [] = []
 listChnkPrms ~(hdchnk@(UArray lw _ rng _):tlchnks) =
   let nxtp i =
         if i >= rng then []
           else if unsafeAt hdchnk i
             then (case ((lw + fromIntegral i) `shiftL` 1) + 3 of
                     np -> np) : nxtp (i + 1)
             else nxtp (i + 1)
   in (hdchnk `seq` lw `seq` nxtp 0) ++ listChnkPrms tlchnks

 -- the base page chunk list used to cull the higher order pages,
 --   note that it has special treatment for the zero page.
 --   It is more space efficient to store this as chunks rather than
 --   as a list of primes or even a list of deltas (gaps), with the
 --   extra processing to convert as needed not too much.
 basePrmChnks :: [Chunk]
 basePrmChnks = mkChnk 0 [] : chnksList (fromIntegral pgSZBTS)

 -- the full list of primes could be accessed with the following function.
 primes :: () -> [PrimeType]
 primes () = 2 : (listChnkPrms $ chnksList 0)

 -- a quite fast prime counting up to the given limit using
 --   chunk processing to avoid innermost list processing loops.
 countPrimesTo :: PrimeType -> Int
 countPrimesTo limit =
   let lmtb = (limit - 3) `div` 2 in
   let sumChnks acc chnks@(chnk@(UArray lo hi rng _):chnks') =
         let cnt :: UArray PrimeType Word32 -> Int
             cnt bfw =
               case if lmtb < hi then fromIntegral (lmtb - lo) else rng of
                 crng -> case crng `shiftR` 5 of
                   rngw ->
                     let cnt' i ac =
                           ac `seq`
                           if i >= rngw then
                               if (i `shiftL` 5) >= rng then ac
                                 else case (-2) `shiftL` fromIntegral (lmtb .&. 31) of
                                   msk -> msk `seq`
                                     case (unsafeAt bfw rngw) .&. (complement msk) of
                                       bts -> bts `seq` case popCount bts of
                                         c -> c `seq` case ac + c of nac -> nac
                             else case ac + (popCount $ unsafeAt bfw i) of
                               nacc -> nacc `seq` cnt' (i + 1) (nacc)
                     in cnt' 0 0
         in acc `seq`
            case runST $ do -- make UArray _ Bool into a UArray _ Word32
                   stbuf <- unsafeThawSTUArray chnk
                   stbufw <- castSTUArray stbuf
                   bufw <- unsafeFreezeSTUArray stbufw
                   return $ cnt bufw of
              c -> c `seq` case acc + c of
                nacc -> nacc `seq`
                  if hi >= lmtb then nacc
                    else sumChnks nacc chnks'
   in if limit < 2 then 0
        else if limit < 3 then 1
          else lmtb `seq` sumChnks 1 (chnksList 0)

 main = do
   x <- read `fmap` getLine -- 1mln 2mln 10mln 100mln 1000mln
                            -- 0.02 0.03 0.06  0.45   4.60  seconds
                            -- 7328 7328 8352  8352   9424  Kilobytes
   -- this takes 14.34 seconds and 9424 Kilobytes to 3 billion on ideone.com,
   --   and 9.12 seconds for 3 billion on an i7-2700K (3.5 GHz).
   -- The above ratio of about 1.6 is much better than usual due to
   --   the extremely low memory use of the page segmented algorithm.
   -- It seems that the Windows Native Code Generator (NCG) from GHC
   --   is particularly poor, as the Linux 32-bit version takes
   --   less than two thirds of the time for exactly the same program...
   print $ countPrimesTo x
   -- print $ length $ takeWhile ((>=) x) $ primes () -- the slow way to do this

The times and memory requirements given in the code are as observed running on ideone.com; running on my i7-2700K (3.5 GHz) takes 0.02, 0.03, 0.05, 0.30, 3.0 and 9.1 seconds for one, two, ten, a hundred, a thousand (one billion) and three thousand (three billion) million, respectively, with a near-constant memory footprint that grows slowly with the number of base primes less than the square root of the range. When compiled with the LLVM back end, these times become 0.01, 0.02, 0.02, 0.12, 1.35 and 4.15 seconds, respectively, due to more efficient use of registers and machine instructions; the last is quite close to the speed obtained when compiling with the 64-bit compiler rather than the 32-bit one used here, since given efficient register use the availability of additional registers hardly matters.

As the code comments note, the ratio between the performance on my real machine and on the ideone.com servers becomes much smaller than for more memory-wasteful algorithms, because this one is not throttled by memory access bottlenecks, so the speed difference reduces mostly to the ratio of CPU clock rates and per-clock processing efficiency. However, as also commented, there is some strange inefficiency in GHC's Native Code Generator (NCG) when running under Windows (32-bit compiler): run times are about 50% slower than under Linux (as used on the ideone.com servers). AFAIK this is a weakness of the GHC NCG as compared to the LLVM back end or to GCC (as used via mingw32 on Windows).

With the LLVM back end, Haskell's speed on this kind of tight-loop code comes quite close to that of "C/++", so there is little reason to leave Haskell for raw performance; the remaining gap between Haskell and C/++ here is small.

In short: the Sieve of Eratosthenes can be implemented efficiently in Haskell when the culling buffer is kept within the L1/L2 caches, and performance then holds up even for very large sieve ranges, into the region of 100 billion. END_EDIT_ADD2

