I don't think lazy IO is considered very idiomatic in Haskell these days. It's fine for one-liners, but for serious IO-intensive work Haskellers use iteratees / conduits / pipes / Oleg-knows-what.
First, to establish a baseline, some statistics from running your source versions on my machine, compiled with GHC 7.6.3 (`-O2 --make`), on Linux x86-64. The slow lazy version:
```
$ ./rnd +RTS -s | pv | head -c 100M > /dev/null
 100MB 0:00:09 [10,4MB/s] [ <=> ]
   6,843,934,360 bytes allocated in the heap
       2,065,144 bytes copied during GC
          68,000 bytes maximum residency (2 sample(s))
          18,016 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)
  ...
  Productivity  99.2% of total user, 97.7% of total elapsed
```
It's not blazingly fast, but it has essentially zero GC and memory overhead. I wonder how you got 37% GC time with this code.
The fast version with manually unrolled loops:
```
$ ./rndfast +RTS -s | pv | head -c 500M > /dev/null
 500MB 0:00:04 [ 110MB/s] [ <=> ]
  69,434,953,224 bytes allocated in the heap
       9,225,128 bytes copied during GC
          68,000 bytes maximum residency (2 sample(s))
          18,016 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)
  ...
  Productivity  85.0% of total user, 72.7% of total elapsed
```
This is much faster, but interestingly we now have 15% GC overhead, presumably from all the short-lived intermediate allocations (note the 69 GB of total heap allocation for 500 MB of output).
And finally, my version using conduits and blaze builders. It generates 512 random Word64s at a time, yielding a 4 KB chunk of data to be consumed downstream. Performance increased steadily as I grew the chunk size from 32 to 512 words, but the gains were only slight above 128.
```haskell
import           Blaze.ByteString.Builder (Builder)
import           Blaze.ByteString.Builder.Word
import           Control.Monad (forever)
import           Control.Monad.IO.Class (liftIO)
import           Data.ByteString (ByteString)
import           Data.Conduit
import qualified Data.Conduit.Binary as CB
import           Data.Conduit.Blaze (builderToByteString)
import           Data.Word
import           System.IO (stdout)
import qualified System.Random.Mersenne as RM

-- Infinite source of Builders, each packing 512 host-endian Word64s (4 KB).
randomStream :: RM.MTGen -> Source IO Builder
randomStream gen = forever $ do
    ws <- liftIO $ RM.randoms gen
    yield $ fromWord64shost $ take 512 ws

main :: IO ()
main = do
    gen <- RM.newMTGen Nothing
    randomStream gen $= builderToByteString $$ CB.sinkHandle stdout
```
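Incidentally, if you would rather cap the output inside the program than pipe through `head -c`, conduit has that built in; an untested sketch (`mainCapped` is my own name) using `Data.Conduit.Binary.isolate` to let only the first 500 MB through:

```haskell
-- Untested: same pipeline, but stop after 500 MB instead of relying on head.
mainCapped :: IO ()
mainCapped = do
    gen <- RM.newMTGen Nothing
    randomStream gen $= builderToByteString
                    $= CB.isolate (500 * 1024 * 1024)
                    $$ CB.sinkHandle stdout
```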
I noticed that, unlike the two programs above, this one is slightly (3-4%) faster when compiled with -fllvm, so the output below is from a binary produced by LLVM 3.3.
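(That just means adding the flag to the usual build line; assuming the source file is named rndconduit.hs to match the binary below:)

```
$ ghc -O2 -fllvm --make rndconduit.hs
```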
```
$ ./rndconduit +RTS -s | pv | head -c 500M > /dev/null
 500MB 0:00:09 [53,2MB/s] [ <=> ]
   8,889,236,736 bytes allocated in the heap
      10,912,024 bytes copied during GC
          36,376 bytes maximum residency (2 sample(s))
          19,024 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)
  ...
  Productivity  99.0% of total user, 91.9% of total elapsed
```
So it's half the speed of the hand-unrolled version, but almost as short and readable as the lazy I/O version, with next to no GC overhead and predictable memory behavior. There is probably room for improvement: comments are welcome.
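One concrete thing to try (an untested sketch; `randomStream'` is my own name): `RM.randoms` hands back a lazy infinite list, so every chunk still allocates cons cells lazily; `replicateM` with the single-value `RM.random` builds each chunk's list strictly instead:

```haskell
import Control.Monad (forever, replicateM)

-- Untested variant: build a strict 512-element list per chunk instead of
-- taking a prefix of the lazy infinite list from RM.randoms.
randomStream' :: RM.MTGen -> Source IO Builder
randomStream' gen = forever $ do
    ws <- liftIO $ replicateM 512 (RM.random gen)
    yield $ fromWord64shost ws
```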
UPDATE:
By combining low-level buffer filling via the Foreign modules with conduits, I was able to come up with a program that generates 300+ MB/s of random data. It seems that plain, specialized tail-recursive functions beat both lazy lists and manual unrolling.
```haskell
import           Control.Monad (forever)
import           Control.Monad.IO.Class (liftIO)
import           Data.ByteString (ByteString)
import qualified Data.ByteString as B
import           Data.Conduit
import qualified Data.Conduit.Binary as CB
import           Data.Word
import           Foreign.Marshal.Array
import           Foreign.Ptr
import           Foreign.Storable
import           System.IO (stdout)
import qualified System.Random.Mersenne as RM

-- Fill a temporary buffer of bufsize Word64s with a tight tail-recursive
-- loop, then copy it out as a ByteString.
randomChunk :: RM.MTGen -> Int -> IO ByteString
randomChunk gen bufsize = allocaArray bufsize $ \ptr -> do
    loop ptr bufsize
    B.packCStringLen (castPtr ptr, bufsize * sizeOf (undefined :: Word64))
  where
    loop :: Ptr Word64 -> Int -> IO ()
    loop _   0 = return ()
    loop ptr n = do
        x <- RM.random gen
        pokeElemOff ptr (n - 1) x  -- fill indices bufsize-1 down to 0
        loop ptr (n - 1)

chunkStream :: RM.MTGen -> Source IO ByteString
chunkStream gen = forever $ liftIO (randomChunk gen 512) >>= yield

main :: IO ()
main = do
    gen <- RM.newMTGen Nothing
    chunkStream gen $$ CB.sinkHandle stdout
```
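A possible further tweak (an untested sketch; `randomChunk'` is my own name): `packCStringLen` copies the buffer a second time, and `Data.ByteString.Internal.create` can fill the ByteString's own pinned buffer directly, saving that copy:

```haskell
import qualified Data.ByteString.Internal as BI

-- Untested variant of randomChunk: write straight into the ByteString's
-- buffer rather than into a temporary array that packCStringLen then copies.
randomChunk' :: RM.MTGen -> Int -> IO ByteString
randomChunk' gen bufsize =
    BI.create (bufsize * sizeOf (undefined :: Word64)) $ \p ->
        loop (castPtr p) bufsize
  where
    loop :: Ptr Word64 -> Int -> IO ()
    loop _   0 = return ()
    loop ptr n = do
        x <- RM.random gen
        pokeElemOff ptr (n - 1) x
        loop ptr (n - 1)
```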
At this speed the IO overhead actually becomes noticeable: the program spends more than a quarter of its running time in system calls, and adding head to the pipeline, as in the examples above, slows it down significantly.
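(If you want to see that directly rather than inferring it from the user/elapsed gap in the RTS statistics below, something like strace's syscall summary should show where the kernel time goes; interrupt with ^C to get the table:)

```
$ strace -c ./rndcond | pv > /dev/null
```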
```
$ ./rndcond +RTS -s | pv > /dev/null
^C2,7GB 0:00:10 [ 338MB/s] [ <=> ]
   8,708,628,512 bytes allocated in the heap
       1,646,536 bytes copied during GC
          36,168 bytes maximum residency (2 sample(s))
          17,080 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)
  ...
  Productivity  98.7% of total user, 73.6% of total elapsed
```