As mentioned, using mutable unboxed arrays gives much better performance. The following code is derived from the linked "TemplateHaskell" version, converted back to a plain direct mutable-array solution (the "TemplateHaskell" machinery doesn't seem to add anything), with some further optimizations. I believe it is faster than the usual modified unboxed-array versions due to those further optimizations, and especially due to the use of the unsafeRead and unsafeWrite functions, which skip the array bounds checks and likely also access the array internally through raw pointers:
{-# OPTIONS -O2 -optc-O3 #-}

import Data.Word
import Control.Monad
import Control.Monad.ST
import Data.Array.ST
import Data.Array.Unboxed
import Data.Array.Base

primesToUA :: Word32 -> [Word32]
primesToUA top = do
  let sieveUA top = runSTUArray $ do
        let m = ((fromIntegral top) - 3) `div` 2 :: Int
        buf <- newArray (0, m) True -- :: ST s (STUArray s Int Bool)
        let cullUA i = do
              let p = i + i + 3
                  strt = p * (i + 1) + i
              let cull j = do
                    if j > m then cullUA (i + 1)
                    else do
                      unsafeWrite buf j False
                      cull (j + p)
              if strt > m then return ()
              else do
                e <- unsafeRead buf i
                if e then cull strt else cullUA (i + 1)
        cullUA 0
        return buf
  if top > 1
    then 2 : [2 * (fromIntegral i) + 3 | (i, True) <- assocs $ sieveUA top]
    else []

main = do
  x <- read `fmap` getLine    --  1mln  2mln  10mln  100mln
  print (last (primesToUA x)) --  0.01  0.02  0.09   1.26  seconds
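As a hedged micro-illustration of the `unsafeWrite` point (not part of the original answer; `markMultiples` is an invented name): the bounds are established once when the array is created, after which the hot loop can write without per-element range checks:

```haskell
import Control.Monad (forM_)
import Data.Array.ST (runSTUArray, newArray)
import Data.Array.Base (unsafeWrite)
import Data.Array.Unboxed (UArray, elems)

-- clear every k-th flag out of n flags; the loop indices are known
-- to be in range by construction, so the unchecked write is safe here
markMultiples :: Int -> Int -> UArray Int Bool
markMultiples n k = runSTUArray $ do
  buf <- newArray (0, n - 1) True
  forM_ [0, k .. n - 1] $ \j -> unsafeWrite buf j False
  return buf
```

For example, `elems (markMultiples 10 3)` has indices 0, 3, 6 and 9 cleared; substituting `writeArray` would give the same result but pay for a bounds test on every write.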
EDIT: the code above has been fixed, and the quoted times have been edited to reflect the correction, as have the comments comparing the paged and non-paged versions.
The times to run this for the indicated upper ranges are given in the comment table at the bottom of the source code above, as measured on ideone.com, and are roughly five times faster than the answer posted by @WillNess, also as measured on ideone.com. This code takes a trivial amount of time to sieve the primes to two million and only 1.26 seconds to sieve to one hundred million. These times are about 2.86 times faster when run on my i7 (3.5 GHz): 0.44 seconds to one hundred million, and it takes 6.81 seconds to one billion. Memory use is a bit over six megabytes for the former and sixty megabytes for the latter, which is the memory used by the huge (one-bit-per-odd-number) array. This array also explains the non-linear performance: once the size of the array exceeds the size of the CPU caches, the average memory access time degrades as the sieved range grows.
EDIT_ADD: A page-segmented sieve is more efficient in that it has better memory access efficiency when the buffer size is kept smaller than the L1 or L2 processor caches; it also has the advantage of being unbounded, so the upper range need not be specified in advance, and it needs a much smaller amount of memory: only the base primes less than the square root of the range used, plus the page buffer memory. The following code was written as a page-segmented implementation and is somewhat faster than the unsegmented version; it also offers the advantage that one can change the output-range type specification at the top of the code from "Word32" to "Word64", so it is no longer limited to the 32-bit number range, at only a small cost in processing time (for 32-bit compiled code) over the range the two have in common. The code is as follows:
-- from http://www.haskell.org/haskellwiki/Prime_numbers
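Purely as a hedged sketch of the technique (all names here are illustrative, and for simplicity the base primes come from a small auxiliary list rather than recursively from the paged output itself), an odds-only page-segmented sieve in this spirit might look like:

```haskell
import Data.Array.Unboxed (UArray, assocs)
import Data.Array.ST (runSTUArray, newArray)
import Data.Array.Base (unsafeWrite)
import Control.Monad (forM_)

pageSize :: Int
pageSize = 8192 -- culling buffer sized to fit comfortably in cache

-- a small auxiliary list supplying the base primes (only those up
-- to the square root of each page's top value are ever demanded)
basePrimes :: [Int]
basePrimes = go [3, 5 ..]
  where go (p : xs) = p : go [x | x <- xs, x `rem` p /= 0]

-- cull one page of odd-number flags; index i stands for 2*i + 3,
-- and the page covers global indices low .. low + pageSize - 1
sievePage :: Int -> UArray Int Bool
sievePage low = runSTUArray $ do
  buf <- newArray (0, pageSize - 1) True
  let top = 2 * (low + pageSize - 1) + 3
  forM_ (takeWhile (\p -> p * p <= top) basePrimes) $ \p -> do
    let i = (p - 3) `div` 2
        s = 2 * i * (i + 3) + 3 -- global index of p * p
        strt | s >= low  = s - low
             | otherwise = let r = (low - s) `rem` p
                           in if r == 0 then 0 else p - r
    forM_ [strt, strt + p .. pageSize - 1] $ \j ->
      unsafeWrite buf j False
  return buf

-- an unbounded list of primes from successive pages
primes :: [Int]
primes = 2 : concatMap page [0, pageSize ..]
  where page low = [2 * (low + i) + 3 | (i, True) <- assocs (sievePage low)]
```

The key points are that each page is a fixed-size, cache-friendly buffer, and that a base prime whose first composite falls before the page start resumes at the correct offset via the remainder calculation in `strt`.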
The paged code contains quite a few more lines than the unsegmented case, due to the need to cull the composite representations for the first page array differently than for the subsequent pages. The code also has fixes so that there are no memory leaks from retaining the list of base primes, and the output list no longer shares that list (thus avoiding holding the entire output list in memory).
Note that this code approaches linear time (over the sieved range) as the range becomes larger, due to the culling buffer being a constant size smaller than the L2 CPU cache. Memory use is a small fraction of that used by the unsegmented version, at slightly less than 600 kilobytes for one hundred million and a little over 600 kilobytes for one billion; the slight increase is just the extra space needed for the base primes less than the square root of the range.
On ideone.com, this code sieves the primes up to one hundred million in 1.13 seconds and takes about 12 seconds to one billion (32-bit). Wheel factorization and some multi-core processing would likely make it even faster on a multi-core CPU. On my i7 (3.5 GHz), it takes 0.44 seconds to sieve to one hundred million and 4.7 seconds to one billion, with approximately linear performance as the range increases, as expected. It seems that the GHC version on ideone.com has some non-linear overhead that limits performance for larger ranges, which is not the case on the i7; this may be due to different garbage collection behavior, since the page buffers are allocated anew for each new page. END_EDIT_ADD
EDIT_ADD2: It seems that most of the processing time for the above page-segmented code is spent in (lazy) list processing, so the code was accordingly reformulated with several improvements, as follows:
A prime counting function is implemented that does not use list processing, instead using a "popCount" table look-up to count the number of 'one' bits a whole 32-bit word at a time. With this, the time to produce the count is negligible compared to the actual sieving time.
The base primes are stored as a list of bit-packed page segments, which is much more space efficient than storing a list of primes, and the time needed to convert the page segments back to primes on demand is not much computational overhead.
The segment-culling function is set up so that, for the initial zero page segment, it uses its own bit pattern as its source of base primes, which makes the composite-culling code both shorter and simpler.
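The segment-to-primes conversion mentioned above can be sketched as follows (a hedged illustration with invented names, assuming the odds-only representation where chunk index i stands for the number 2*i + 3):

```haskell
import Data.Array.Unboxed (UArray, assocs, listArray)

-- expand one page chunk of odd-number flags into the numbers whose
-- flags are still set; index i in the chunk represents 2 * i + 3
chunkToPrimes :: UArray Int Bool -> [Int]
chunkToPrimes chunk = [2 * i + 3 | (i, True) <- assocs chunk]

-- e.g. a five-flag chunk covering 3,5,7,9,11 with 9 culled:
example :: [Int]
example = chunkToPrimes (listArray (0, 4) [True, True, True, False, True])
```

The bit-packed chunks cost 1 bit per candidate, versus a full machine word (or more, with list-cell overhead) per prime when stored as a list, which is why keeping chunks and converting on demand is the more space-efficient choice.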
Then the code becomes:
{-# OPTIONS -O3 -rtsopts #-} -- -fllvm ide.com doesn't support LLVM

import Data.Word
import Data.Bits
import Control.Monad
import Control.Monad.ST
import Data.Array.ST (runSTUArray)
import Data.Array.Unboxed
import Data.Array.Base

pgSZBTS = (2^18) * 8 :: Int -- size of L2 data cache

type PrimeType = Word32
type Chunk = UArray PrimeType Bool

-- makes a new page chunk and culls it
-- if the base primes list provided is empty then
-- it uses the current array as source (for zero page base primes)
mkChnk :: Word32 -> [Chunk] -> Chunk
mkChnk low bschnks = runSTUArray $ do
  let nxt = (fromIntegral low) + (fromIntegral pgSZBTS)
  buf <- nxt `seq` newArray (fromIntegral low, fromIntegral nxt - 1) True
  let cull ~(p:ps) =
        let bp = (fromIntegral p)
            i = (bp - 3) `shiftR` 1
            s = bp * (i + 1) + i in
        let cullp j = do
              if j >= pgSZBTS then cull ps
              else do
                unsafeWrite buf j False
                cullp (j + (fromIntegral p)) in
        when (s < nxt) $ do
          let strt = do
                if s >= low then fromIntegral (s - low)
                else do
                  let b = (low - s) `rem` bp
                  if b == 0 then 0
                  else fromIntegral (bp - b)
          cullp strt
  case bschnks of
    [] -> do
      bsbf <- unsafeFreezeSTUArray buf
      cull (listChnkPrms [bsbf])
    _ -> cull $ listChnkPrms bschnks
  return buf

-- creates a page chunk list starting at the lw value
chnksList :: Word32 -> [Chunk]
chnksList lw =
  mkChnk lw basePrmChnks : chnksList (lw + fromIntegral pgSZBTS)

-- converts a page chunk list to a list of primes
listChnkPrms :: [Chunk] -> [PrimeType]
listChnkPrms [] = []
listChnkPrms ~(hdchnk@(UArray lw _ rng _):tlchnks) =
  let nxtp i =
        if i >= rng then []
        else if unsafeAt hdchnk i then
          (case ((lw + fromIntegral i) `shiftL` 1) + 3 of
             np -> np) : nxtp (i + 1)
        else nxtp (i + 1)
  in (hdchnk `seq` lw `seq` nxtp 0) ++ listChnkPrms tlchnks

-- the base page chunk list used to cull the higher order pages;
-- note that it has special treatment for the zero page.
-- It is more space efficient to store this as chunks rather than
-- as a list of primes or even a list of deltas (gaps), with the
-- extra processing to convert as needed not too much.
basePrmChnks :: [Chunk]
basePrmChnks = mkChnk 0 [] : chnksList (fromIntegral pgSZBTS)

-- the full list of primes could be accessed with the following function.
primes :: () -> [PrimeType]
primes () = 2 : (listChnkPrms $ chnksList 0)

-- a quite fast prime counting up to the given limit using
-- chunk processing to avoid innermost list processing loops.
countPrimesTo :: PrimeType -> Int
countPrimesTo limit =
  let lmtb = (limit - 3) `div` 2 in
  let sumChnks acc chnks@(chnk@(UArray lo hi rng _):chnks') =
        let cnt :: UArray PrimeType Word32 -> Int
            cnt bfw =
              case if lmtb < hi then fromIntegral (lmtb - lo) else rng of
                crng -> case crng `shiftR` 5 of
                  rngw ->
                    let cnt' i ac =
                          ac `seq`
                          if i >= rngw then
                            if (i `shiftL` 5) >= rng then ac
                            else
                              case (-2) `shiftL` fromIntegral (lmtb .&. 31) of
                                msk -> msk `seq`
                                  case (unsafeAt bfw rngw) .&. (complement msk) of
                                    bts -> bts `seq`
                                      case popCount bts of
                                        c -> c `seq` case ac + c of
                                          nac -> nac
                          else
                            case ac + (popCount $ unsafeAt bfw i) of
                              nacc -> nacc `seq` cnt' (i + 1) (nacc)
                    in cnt' 0 0
        in acc `seq`
           case runST $ do -- make UArray _ Bool into a UArray _ Word32
                  stbuf <- unsafeThawSTUArray chnk
                  stbufw <- castSTUArray stbuf
                  bufw <- unsafeFreezeSTUArray stbufw
                  return $ cnt bufw of
             c -> c `seq` case acc + c of
               nacc -> nacc `seq`
                 if hi >= lmtb then nacc
                 else sumChnks nacc chnks'
  in if limit < 2 then 0
     else if limit < 3 then 1
     else lmtb `seq` sumChnks 1 (chnksList 0)

main = do
  x <- read `fmap` getLine
  --   1mln  2mln  10mln  100mln  1000mln
  --   0.02  0.03  0.06   0.45    4.60 seconds
  --   7328  7328  8352   8352    9424 Kilobytes
  -- this takes 14.34 seconds and 9424 Kilobytes to 3 billion on ideone.com,
  -- and 9.12 seconds for 3 billion on an i7-2700K (3.5 GHz).
  -- The above ratio of about 1.6 is much better than usual due to
  -- the extremely low memory use of the page segmented algorithm.
  -- It seems that the Windows Native Code Generator (NCG) from GHC
  -- is particularly poor, as the Linux 32-bit version takes
  -- less than two thirds of the time for exactly the same program...
  print $ countPrimesTo x
  -- print $ length $ takeWhile ((>=) x) $ primes () -- the slow way to do this
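The word-at-a-time counting and final-word masking used in `countPrimesTo` above can be sketched in isolation as follows (a hedged illustration; `countOnes` and `maskAbove` are invented names):

```haskell
import Data.Bits (popCount, (.&.), complement, shiftL)
import Data.Word (Word32)

-- count set bits a whole 32-bit word at a time, avoiding any
-- per-bit list processing in the counting loop
countOnes :: [Word32] -> Int
countOnes = foldl (\acc w -> acc + popCount w) 0

-- keep only bits 0 .. n (inclusive) of the final partial word,
-- mirroring the `msk` computation in countPrimesTo
maskAbove :: Int -> Word32 -> Word32
maskAbove n w = w .&. complement ((-2) `shiftL` (n .&. 31))
```

For example, `countOnes [15, 255]` is 12, and `maskAbove 3 255` keeps only the low four bits, giving 15; the final word of a range is masked this way so that flags beyond the counting limit are not included.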
The time and memory requirements stated in the code comments are as observed when running on ideone.com; it takes 0.02, 0.03, 0.05, 0.30, 3.0 and 9.1 seconds to run on my i7-2700K (3.5 GHz) for one, two, ten, one hundred, one thousand (one billion) and three thousand (three billion) millions, respectively, with an essentially constant memory footprint that grows slowly only with the number of base primes less than the square root of the range. When compiled with the LLVM compiler back end, these times become 0.01, 0.02, 0.02, 0.12, 1.35 and 4.15 seconds, respectively, due to more efficient use of registers and machine instructions; this last is quite close to the speed obtained when compiling with a 64-bit compiler rather than the 32-bit compiler used, since the efficient use of registers means that the availability of additional registers does not matter much.
As commented in the code, the ratio between the performance on my real machine and on the ideone.com servers becomes much smaller than for the more memory-wasteful algorithms, because this algorithm is not throttled by memory access bottlenecks, so the speed ratio is basically just the ratio of CPU clock rates together with per-clock CPU processing efficiency. However, as also commented, there is some strange inefficiency in the GHC Native Code Generator (NCG) when running under Windows (32-bit compiler), in that execution is about 50% slower than when running under Linux (as the ideone.com servers do). AFAIK the Windows build of Haskell GHC, when not using the LLVM back end, relies on the GCC-based mingw32 tool chain behind the NCG output, which may account for the difference.
With the LLVM back end, the performance of this Haskell code gets quite close to that of the same algorithm written in "C/C++". This shows that Haskell need not be slow for this kind of tight-loop work: written with care, Haskell code can run at speeds comparable to "C/C++".
Final note: as discussed above, a page-segmented Sieve of Eratosthenes in Haskell with its culling buffers sized to fit the L1 cache, combined with wheel factorization and multi-threading, should be faster still, scaling efficiently to ranges of 100's of millions and beyond. END_EDIT_ADD2