How can I make this fast random-byte generation in Haskell idiomatic?

I am trying to generate random data at high speed in Haskell, but every idiomatic approach I try gives low throughput and high GC overhead.

Here is the short (idiomatic) version:

    import qualified System.Random.Mersenne as RM
    import qualified Data.ByteString.Lazy as BL
    import qualified System.IO as SI
    import Data.Word

    main = do
        r <- RM.newMTGen Nothing :: IO RM.MTGen
        rnd <- RM.randoms r :: IO [Word8]
        BL.hPutStr SI.stdout $ BL.pack rnd

Here is the fast version:

    import qualified System.Random.Mersenne as RM
    import qualified Data.ByteString as B
    import qualified Data.ByteString.Lazy as BL
    import qualified Data.Binary.Put as DBP
    import qualified System.IO as SI
    import Data.List
    import Control.Monad (void, forever)
    import Data.Word

    main = do
        r <- RM.newMTGen Nothing :: IO RM.MTGen
        forever $ do
            x0 <- RM.random r :: IO Word32
            x1 <- RM.random r :: IO Word32
            x2 <- RM.random r :: IO Word32
            x3 <- RM.random r :: IO Word32
            x4 <- RM.random r :: IO Word32
            x5 <- RM.random r :: IO Word32
            x6 <- RM.random r :: IO Word32
            x7 <- RM.random r :: IO Word32
            x8 <- RM.random r :: IO Word32
            x9 <- RM.random r :: IO Word32
            xA <- RM.random r :: IO Word32
            xB <- RM.random r :: IO Word32
            xC <- RM.random r :: IO Word32
            xD <- RM.random r :: IO Word32
            xE <- RM.random r :: IO Word32
            xF <- RM.random r :: IO Word32
            c0 <- RM.random r :: IO Word32
            c1 <- RM.random r :: IO Word32
            c2 <- RM.random r :: IO Word32
            c3 <- RM.random r :: IO Word32
            c4 <- RM.random r :: IO Word32
            c5 <- RM.random r :: IO Word32
            c6 <- RM.random r :: IO Word32
            c7 <- RM.random r :: IO Word32
            c8 <- RM.random r :: IO Word32
            c9 <- RM.random r :: IO Word32
            cA <- RM.random r :: IO Word32
            cB <- RM.random r :: IO Word32
            cC <- RM.random r :: IO Word32
            cD <- RM.random r :: IO Word32
            cE <- RM.random r :: IO Word32
            cF <- RM.random r :: IO Word32
            v0 <- RM.random r :: IO Word32
            v1 <- RM.random r :: IO Word32
            v2 <- RM.random r :: IO Word32
            v3 <- RM.random r :: IO Word32
            v4 <- RM.random r :: IO Word32
            v5 <- RM.random r :: IO Word32
            v6 <- RM.random r :: IO Word32
            v7 <- RM.random r :: IO Word32
            v8 <- RM.random r :: IO Word32
            v9 <- RM.random r :: IO Word32
            vA <- RM.random r :: IO Word32
            vB <- RM.random r :: IO Word32
            vC <- RM.random r :: IO Word32
            vD <- RM.random r :: IO Word32
            vE <- RM.random r :: IO Word32
            vF <- RM.random r :: IO Word32
            b0 <- RM.random r :: IO Word32
            b1 <- RM.random r :: IO Word32
            b2 <- RM.random r :: IO Word32
            b3 <- RM.random r :: IO Word32
            b4 <- RM.random r :: IO Word32
            b5 <- RM.random r :: IO Word32
            b6 <- RM.random r :: IO Word32
            b7 <- RM.random r :: IO Word32
            b8 <- RM.random r :: IO Word32
            b9 <- RM.random r :: IO Word32
            bA <- RM.random r :: IO Word32
            bB <- RM.random r :: IO Word32
            bC <- RM.random r :: IO Word32
            bD <- RM.random r :: IO Word32
            bE <- RM.random r :: IO Word32
            bF <- RM.random r :: IO Word32
            BL.hPutStr SI.stdout $ DBP.runPut $ do
                DBP.putWord32be x0
                DBP.putWord32be x1
                DBP.putWord32be x2
                DBP.putWord32be x3
                DBP.putWord32be x4
                DBP.putWord32be x5
                DBP.putWord32be x6
                DBP.putWord32be x7
                DBP.putWord32be x8
                DBP.putWord32be x9
                DBP.putWord32be xA
                DBP.putWord32be xB
                DBP.putWord32be xC
                DBP.putWord32be xD
                DBP.putWord32be xE
                DBP.putWord32be xF
                DBP.putWord32be c0
                DBP.putWord32be c1
                DBP.putWord32be c2
                DBP.putWord32be c3
                DBP.putWord32be c4
                DBP.putWord32be c5
                DBP.putWord32be c6
                DBP.putWord32be c7
                DBP.putWord32be c8
                DBP.putWord32be c9
                DBP.putWord32be cA
                DBP.putWord32be cB
                DBP.putWord32be cC
                DBP.putWord32be cD
                DBP.putWord32be cE
                DBP.putWord32be cF
                DBP.putWord32be v0
                DBP.putWord32be v1
                DBP.putWord32be v2
                DBP.putWord32be v3
                DBP.putWord32be v4
                DBP.putWord32be v5
                DBP.putWord32be v6
                DBP.putWord32be v7
                DBP.putWord32be v8
                DBP.putWord32be v9
                DBP.putWord32be vA
                DBP.putWord32be vB
                DBP.putWord32be vC
                DBP.putWord32be vD
                DBP.putWord32be vE
                DBP.putWord32be vF
                DBP.putWord32be b0
                DBP.putWord32be b1
                DBP.putWord32be b2
                DBP.putWord32be b3
                DBP.putWord32be b4
                DBP.putWord32be b5
                DBP.putWord32be b6
                DBP.putWord32be b7
                DBP.putWord32be b8
                DBP.putWord32be b9
                DBP.putWord32be bA
                DBP.putWord32be bB
                DBP.putWord32be bC
                DBP.putWord32be bD
                DBP.putWord32be bE
                DBP.putWord32be bF
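For reference, each DBP.putWord32be call serializes one Word32 as four big-endian bytes. A minimal, self-contained sketch of this (using only the binary and bytestring packages that ship with GHC):

```haskell
import qualified Data.Binary.Put as DBP
import qualified Data.ByteString.Lazy as BL

main :: IO ()
main = do
    -- 0x01020304 serialized big-endian: most significant byte first
    let bs = DBP.runPut (DBP.putWord32be 0x01020304)
    print (BL.unpack bs)  -- prints [1,2,3,4]
```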

The short code outputs about 6 megabytes of random bytes per second on my computer; the fast code manages about 150 megabytes per second.

If I reduce the number of variables in the fast code from 64 to 16, the speed drops to about 78 megabytes per second.

How can I make this code compact and idiomatic without slowing it down?

2 answers

I don't think lazy I/O is considered very idiomatic in Haskell these days. It can work for one-liners, but for large, I/O-intensive programs Haskellers use iteratees / conduits / pipes / Oleg-knows-what.

First, to establish a baseline, some statistics from running your two original versions on my machine, compiled with GHC 7.6.3 (-O2 --make) on Linux x86-64. The slow lazy version:

    $ ./rnd +RTS -s | pv | head -c 100M > /dev/null
    100MB 0:00:09 [10,4MB/s] [ <=> ]
       6,843,934,360 bytes allocated in the heap
           2,065,144 bytes copied during GC
              68,000 bytes maximum residency (2 sample(s))
              18,016 bytes maximum slop
                   1 MB total memory in use (0 MB lost due to fragmentation)
    ...
    Productivity  99.2% of total user, 97.7% of total elapsed

It is not incredibly fast, but there is essentially no GC or memory pressure here. I wonder how and where you got 37% GC time with this code.

The fast version with unrolled loops:

    $ ./rndfast +RTS -s | pv | head -c 500M > /dev/null
    500MB 0:00:04 [ 110MB/s] [ <=> ]
      69,434,953,224 bytes allocated in the heap
           9,225,128 bytes copied during GC
              68,000 bytes maximum residency (2 sample(s))
              18,016 bytes maximum slop
                   2 MB total memory in use (0 MB lost due to fragmentation)
    ...
    Productivity  85.0% of total user, 72.7% of total elapsed

This is much faster but, interestingly, we now have 15% GC overhead.

And finally, my version, using conduits and blaze-builder. It generates 512 random Word64s at a time, yielding 4 KB chunks of data to be consumed downstream. Performance increased steadily as I grew the buffer size from 32 to 512, but the improvements above 128 were slight.

    import Blaze.ByteString.Builder (Builder)
    import Blaze.ByteString.Builder.Word
    import Control.Monad (forever)
    import Control.Monad.IO.Class (liftIO)
    import Data.ByteString (ByteString)
    import Data.Conduit
    import qualified Data.Conduit.Binary as CB
    import Data.Conduit.Blaze (builderToByteString)
    import Data.Word
    import System.IO (stdout)
    import qualified System.Random.Mersenne as RM

    randomStream :: RM.MTGen -> Source IO Builder
    randomStream gen = forever $ do
        words <- liftIO $ RM.randoms gen
        yield $ fromWord64shost $ take 512 words

    main :: IO ()
    main = do
        gen <- RM.newMTGen Nothing
        randomStream gen $= builderToByteString $$ CB.sinkHandle stdout
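The 4 KB figure follows directly from the element size: 512 Word64 values at 8 bytes each. A quick, standalone sketch of the arithmetic (base only):

```haskell
import Data.Word (Word64)
import Foreign.Storable (sizeOf)

-- Bytes produced per yielded chunk when buffering n Word64 values.
chunkBytes :: Int -> Int
chunkBytes n = n * sizeOf (undefined :: Word64)

main :: IO ()
main = print (chunkBytes 512)  -- 512 * 8 = 4096 bytes, i.e. 4 KB
```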

I noticed that, unlike the two programs above, this one is slightly (3-4%) faster when compiled with -fllvm, so the output below is from a binary built with LLVM 3.3.

    $ ./rndconduit +RTS -s | pv | head -c 500M > /dev/null
    500MB 0:00:09 [53,2MB/s] [ <=> ]
       8,889,236,736 bytes allocated in the heap
          10,912,024 bytes copied during GC
              36,376 bytes maximum residency (2 sample(s))
              19,024 bytes maximum slop
                   1 MB total memory in use (0 MB lost due to fragmentation)
    ...
    Productivity  99.0% of total user, 91.9% of total elapsed

So it is half the speed of the hand-unrolled version, but it is almost as short and readable as the lazy I/O version, has almost no GC overhead, and shows predictable memory behavior. There is probably room for improvement: comments are welcome.

UPDATE:

By combining some unsafe ByteString construction with conduits, I was able to produce a program that generates 300+ MB/s of random data. It seems that a plain, specialized tail-recursive function performs better than both lazy lists and manual unrolling.

    import Control.Monad (forever)
    import Control.Monad.IO.Class (liftIO)
    import Data.ByteString (ByteString)
    import qualified Data.ByteString as B
    import Data.Conduit
    import qualified Data.Conduit.Binary as CB
    import Data.Word
    import Foreign.Marshal.Array
    import Foreign.Ptr
    import Foreign.Storable
    import System.IO (stdout)
    import qualified System.Random.Mersenne as RM

    randomChunk :: RM.MTGen -> Int -> IO ByteString
    randomChunk gen bufsize = allocaArray bufsize $ \ptr -> do
        loop ptr bufsize
        B.packCStringLen (castPtr ptr, bufsize * sizeOf (undefined :: Word64))
      where
        loop :: Ptr Word64 -> Int -> IO ()
        loop _   0 = return ()
        loop ptr n = do
            x <- RM.random gen
            pokeElemOff ptr (n - 1) x  -- fill indices bufsize-1 down to 0
            loop ptr (n - 1)

    chunkStream :: RM.MTGen -> Source IO ByteString
    chunkStream gen = forever $ liftIO (randomChunk gen 512) >>= yield

    main :: IO ()
    main = do
        gen <- RM.newMTGen Nothing
        chunkStream gen $$ CB.sinkHandle stdout
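If you want to experiment with the buffer-filling loop without installing mersenne-random, here is a self-contained sketch of the same technique with a simple, non-cryptographic 64-bit LCG standing in for the generator (the LCG constants are Knuth's MMIX values; everything except the RNG call mirrors the structure above):

```haskell
import qualified Data.ByteString as B
import Data.Word (Word64)
import Foreign.Marshal.Array (allocaArray)
import Foreign.Ptr (Ptr, castPtr)
import Foreign.Storable (pokeElemOff, sizeOf)

-- A toy linear congruential generator, just for illustration.
lcg :: Word64 -> Word64
lcg s = s * 6364136223846793005 + 1442695040888963407

-- Fill a temporary buffer of bufsize Word64s tail-recursively,
-- then copy it into a strict ByteString. Returns the new seed too.
randomChunk :: Word64 -> Int -> IO (B.ByteString, Word64)
randomChunk seed bufsize = allocaArray bufsize $ \ptr -> do
    s' <- loop ptr seed bufsize
    bs <- B.packCStringLen (castPtr ptr, bufsize * sizeOf (undefined :: Word64))
    return (bs, s')
  where
    loop :: Ptr Word64 -> Word64 -> Int -> IO Word64
    loop _   s 0 = return s
    loop ptr s n = do
        let x = lcg s
        pokeElemOff ptr (n - 1) x  -- fill indices bufsize-1 down to 0
        loop ptr x (n - 1)

main :: IO ()
main = do
    (bs, _) <- randomChunk 42 512
    print (B.length bs)  -- 512 * 8 = 4096
```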

At this speed the I/O overhead actually becomes noticeable: the program spends more than a quarter of its running time in system calls, and adding head -c to the pipeline, as in the examples above, slows it down significantly.

    $ ./rndcond +RTS -s | pv > /dev/null
    ^C27GB 0:00:10 [ 338MB/s] [ <=> ]
       8,708,628,512 bytes allocated in the heap
           1,646,536 bytes copied during GC
              36,168 bytes maximum residency (2 sample(s))
              17,080 bytes maximum slop
                   2 MB total memory in use (0 MB lost due to fragmentation)
    ...
    Productivity  98.7% of total user, 73.6% of total elapsed

I can confirm that the first version is slower than the second, though not by the same factor on my machine: after 10 seconds, the short code had generated 111M of data and the long code 833M. This was on Mac OS X Lion, compiled with GHC 7.6.3 and -O3.

Although I don't know why the first solution is so slow, the second can be simplified using replicateM and mapM_ to remove the duplication:

    main3 = do
        r <- RM.newMTGen Nothing :: IO RM.MTGen
        forever $ do
            vals <- sequence $ replicate 64 (RM.random r)
            BL.hPutStr SI.stdout $ DBP.runPut $ mapM_ DBP.putWord32be vals
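For what it's worth, sequence $ replicate n m is exactly replicateM n m from Control.Monad. A minimal demonstration with an IORef counter standing in for the RNG (base only; the counter is just an illustration):

```haskell
import Control.Monad (replicateM)
import Data.IORef

main :: IO ()
main = do
    ref <- newIORef (0 :: Int)
    -- a stand-in "random" action: returns successive integers
    let next = atomicModifyIORef' ref (\n -> (n + 1, n))
    xs <- replicateM 5 next
    ys <- sequence (replicate 5 next)
    print xs  -- [0,1,2,3,4]
    print ys  -- [5,6,7,8,9]
```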

This solution is still slower, though: it generated 492M of data in 10 seconds. The ultimate solution is to use Template Haskell to generate the unrolled loop:

    -- needs {-# LANGUAGE TemplateHaskell #-} plus imports of
    -- Language.Haskell.TH, Data.Generics (everywhere, mkT),
    -- Data.Data, and Control.Monad (replicateM, forever)
    main4 = do
        r <- RM.newMTGen Nothing :: IO RM.MTGen
        forever $ do
            $(let
                varCount = 64

                -- | replaces every instance of oldName with newName in the exp
                replaceNames :: (Typeable t, Data t) => String -> Name -> t -> t
                replaceNames oldName replacementName expr =
                    everywhere (mkT changeName) expr
                  where
                    changeName name
                        | nameBase name == oldName = replacementName
                        | otherwise                = name

                singleVarExp :: Name -> ExpQ -> ExpQ
                singleVarExp varName next =
                    replaceNames "patternvar" varName <$>
                        [| RM.random r >>= \patternvar -> $(next) |]

                allVarExps :: [Name] -> ExpQ -> ExpQ
                allVarExps (n:ns) next =
                    foldr (\var result -> singleVarExp var result)
                          (singleVarExp n next)
                          ns

                singleOutputter :: Name -> ExpQ -> ExpQ
                singleOutputter varName next =
                    [| DBP.putWord32be $(varE varName) >> $(next) |]

                allVarOutput :: [Name] -> ExpQ
                allVarOutput (n:ns) =
                    foldr (\var result -> singleOutputter var result)
                          (singleOutputter n [| return () |])
                          ns

                printResultExp :: [Name] -> ExpQ
                printResultExp names =
                    [| BL.hPutStr SI.stdout $ DBP.runPut ($(allVarOutput names)) |]

                result = do
                    vars <- replicateM varCount $ newName "x"
                    allVarExps vars (printResultExp vars)
              in result)

This runs at about the same speed as your original fast version. It is not very tidy (your hand-unrolled version is easier to read), but now you can easily change the number of variables and still get an unrolled loop. I tried 512, but apart from the huge compile time, it didn't seem to have much impact on performance.

