How to get good performance when writing a list of integers from 1 to 10 million per file?

Question

I need a program that will write a sequence, for example,

1 ... 10000000 

to a file. What is the simplest code one could write and still get decent performance? My intuition is that the problem is a lack of buffering. My C code runs at 100 MB/s, while for reference the Linux command-line tool dd runs at 9 GB/s 3 GB/s (sorry for the imprecision, see the comments; I am more interested in the big-picture orders of magnitude).

One would think this would be a solved problem by now ... that is, that any modern compiler would make it hard to write programs that perform this poorly.

C code

 #include <stdio.h>

 int main(int argc, char **argv)
 {
     int len = 10000000;
     for (int a = 1; a <= len; a++) {
         printf("%d\n", a);
     }
     return 0;
 }

I am compiling with clang -O3 . A performance skeleton that calls putchar('\n') 8 times per number gets comparable performance.

Haskell Code

A naive Haskell implementation runs at 13 MB/s, compiled with ghc -O2 -optc-O3 -optc-ffast-math -fllvm -fforce-recomp -funbox-strict-fields . (I have not recompiled my libraries with -fllvm ; maybe I need to do that.) Code:

 import Control.Monad

 main = forM [1..10000000 :: Int] $ \j -> putStrLn (show j)

My best attempt with Haskell is still slow, at 17 MB/s. The problem is that I cannot find a good way to convert a Vector to a ByteString (maybe there is a solution using iteratees?).

 import qualified Data.Vector.Unboxed as V
 import Data.Vector.Unboxed (Vector, Unbox, (!))
 import qualified System.IO

 writeVector :: (Unbox a, Show a) => Vector a -> IO ()
 writeVector v = V.mapM_ (System.IO.putStrLn . show) v

 main = writeVector (V.generate 10000000 id)
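For the Vector-to-ByteString conversion asked about here, a minimal correctness sketch (my own, going through a plain list, so not a performance fix) might look like:

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.Vector.Unboxed as V

-- Render each element on its own line, then pack the whole String
-- once. The intermediate String makes this a correctness sketch,
-- not a fast path.
vectorToByteString :: V.Vector Int -> B.ByteString
vectorToByteString = B.pack . unlines . map show . V.toList
```

A builder-based route (see the last answer below) avoids the intermediate String entirely.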

It seems that writing a ByteString itself is fast, as shown by this code, which writes an equivalent number of characters:

 import Data.ByteString.Char8 as B

 main = B.putStrLn (B.replicate 76000000 '\n')

This achieves 1.3 GB/s, which is not as fast as dd , but obviously much better.

+7
3 answers

To start off, some completely unscientific benchmarking:

All programs were compiled with the default optimization level (-O3 for gcc, -O2 for GHC) and run with

 time ./prog > outfile 

As a baseline, the C program took 1.07 s to create a ~76 MB file (78888897 bytes), approximately 70 MB/s.

  • The "naive" Haskell program ( forM [1 .. 10000000] $ \j -> putStrLn (show j) ) took 8.64 s, about 8.8 MB/s.
  • The same with forM_ instead of forM took 5.64 s, about 13.5 MB/s.
  • The lazy ByteString version from dflemstr's answer took 9.13 s, about 8.3 MB/s.
  • The Text version from dflemstr's answer took 5.64 s, about 13.5 MB/s.
  • The Vector version from the question took 5.54 s, about 13.7 MB/s.
  • main = mapM_ (C.putStrLn . C.pack . show) $ [1 :: Int .. 10000000] , where C is Data.ByteString.Char8 , took 4.25 s, about 17.9 MB/s.
  • putStr . unlines . map show $ [1 :: Int .. 10000000] took 3.06 s, about 24.8 MB/s.
  • A hand-rolled loop,

     main = putStr $ go 1
       where
         go :: Int -> String
         go i | i > 10000000 = ""
              | otherwise    = shows i . showChar '\n' $ go (i+1)

    took 2.32 s, about 32.75 MB/s.

  • main = putStrLn $ replicate 78888896 'a' took 1.15 s, about 66 MB/s.
  • main = C.putStrLn $ C.replicate 78888896 'a' , where C is Data.ByteString.Char8 , took 0.143 s, about 530 MB/s; roughly the same numbers hold for lazy ByteStrings.

What can we learn from this?

First, do not use forM or mapM unless you really want to collect the results. Performance-wise, it sucks.
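To make the forM/forM_ point concrete, here is a small sketch of my own (not one of the benchmarked programs): both loops perform the same effects, but the forM loop also allocates a throwaway list of unit results.

```haskell
import Control.Monad (forM, forM_)
import Data.IORef (newIORef, modifyIORef', readIORef)

-- Runs the same summing action twice: once under forM, which also
-- allocates an n-element list of () results, and once under forM_,
-- which discards each result as it goes. The sums are identical;
-- only the needless allocation differs.
sumBoth :: Int -> IO (Int, Int)
sumBoth n = do
  r1 <- newIORef 0
  _units <- forM [1 .. n] $ \i -> modifyIORef' r1 (+ i)  -- _units :: [()]
  r2 <- newIORef 0
  forM_ [1 .. n] $ \i -> modifyIORef' r2 (+ i)           -- no result list
  (,) <$> readIORef r1 <*> readIORef r2
```

For the ten-million-element programs benchmarked above, that throwaway result list is pure overhead.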

Second, ByteString output can be very fast (10.), but if building the ByteString for output is slow (3.), you end up with code slower than the naive String output.

What is so terrible about 3.? Well, all the Strings involved are very short. So you get a list of chunks of the form

 Chunk "1234567" Empty

and between any two of these a Chunk "\n" Empty is placed; the resulting list is then concatenated, which means all these Emptys are discarded while a (Chunk "1234567" (Chunk "\n" (Chunk "1234568" (...)))) is built. That is a lot of wasteful construction-deconstruction-reconstruction. Speed comparable to the Text version and the fixed "naive" String version can be achieved by packing to strict ByteStrings and using fromChunks (with Data.List.intersperse for the newlines). Better performance, slightly better than 6., can be obtained by eliminating the costly singletons: if you glue the newline onto the Strings using \k -> shows k "\n" instead of show, the concatenation has only half as many chunks to deal with, each slightly longer.
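The two fixes sketched in that paragraph could look as follows (my sketch, assuming the bytestring package; not one of the benchmarked programs):

```haskell
import qualified Data.ByteString.Char8 as S
import qualified Data.ByteString.Lazy.Char8 as L
import Data.List (intersperse)

-- Fix 1: pack each number into a strict chunk, intersperse strict
-- newline chunks, and assemble the lazy ByteString with fromChunks.
withIntersperse :: Int -> L.ByteString
withIntersperse n =
  L.fromChunks (intersperse (S.pack "\n") (map (S.pack . show) [1 .. n]))

-- Fix 2: glue the newline onto each number with shows, halving the
-- number of chunks the concatenation has to deal with.
withGluedNewline :: Int -> L.ByteString
withGluedNewline n =
  L.fromChunks (map (\k -> S.pack (shows k "\n")) [1 .. n])
```

Note that the interspersed variant ends without a trailing newline, while the glued variant puts one after every number.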

I am not familiar enough with the internals of either text or vector to offer more than a semi-educated guess about the reasons for the observed performance, so I will leave them alone. Suffice it to say that the performance gain is at best minimal compared to the fixed naive String version.

Now, 6. shows that ByteString output is faster than String output, so much so that in this case the extra work of packing is more than compensated. However, do not let that fool you into believing it is always so: if the Strings to pack are long, the packing can take more time than the String output.

But ten million putStrLn calls, whether with String or ByteString, take a lot of time. It is faster to grab the stdout Handle just once and construct the output String in non-IO code. unlines already does well, but we still suffer from the construction of the list map show [1 .. 10^7] . Unfortunately, the compiler cannot eliminate that (but it fused [1 .. 10^7] , which is already very good). So let us do it ourselves, leading to 8. That is not too terrible, but it still takes more than twice as long as the C program.

One can write a faster Haskell program by going low-level and filling a ByteString directly, without going through String via show , but I do not know whether C speed is attainable that way. Anyway, that low-level code is not very pretty, so I will spare you what I have; sometimes one has to get one's hands dirty if speed matters, though.

+7

Using lazy ByteStrings gives you some buffering, because the string is written out as it is produced, and further numbers are only generated as needed. This code shows the basic idea (there are probably some optimizations that could be made):

 import qualified Data.ByteString.Lazy.Char8 as ByteString

 main = ByteString.putStrLn
      . ByteString.intercalate (ByteString.singleton '\n')
      . map (ByteString.pack . show)
      $ ([1..10000000] :: [Int])

I still use String for the numbers here, which leads to terrible slowdowns. If we switch to the text library instead of bytestring , we get access to "native" show functions for ints and can do this:

 import Data.Monoid
 import Data.List
 import Data.Text.Lazy.IO as Text
 import Data.Text.Lazy.Builder as Text
 import Data.Text.Lazy.Builder.Int as Text

 main :: IO ()
 main = Text.putStrLn
      . Text.toLazyText
      . mconcat
      . intersperse (Text.singleton '\n')
      . map Text.decimal
      $ ([1..10000000] :: [Int])

I do not know how you are measuring the "speed" of these programs (with the pv tool?), but I imagine one of these procedures will be the fastest trivial program you can get.

+3

If you are aiming for maximum performance, it helps to take a holistic view; i.e., you want to write a function that maps from [Int] to a series of system calls that write chunks of memory to a file.

Lazy bytestrings are a good representation for a sequence of chunks of memory. Mapping a lazy bytestring to a series of system calls that write chunks of memory is what L.hPut does (assuming import qualified Data.ByteString.Lazy as L ). Hence, we just need a means to efficiently construct the corresponding lazy bytestring. This is what lazy bytestring builders are good at. With the new bytestring builder (here is the API documentation), the following code does the job.

 import qualified Data.ByteString.Lazy as L
 import Data.ByteString.Lazy.Builder (toLazyByteString, charUtf8)
 import Data.ByteString.Lazy.Builder.ASCII (intDec)
 import Data.Foldable (foldMap)
 import Data.Monoid (mappend)
 import System.IO (openFile, IOMode(..))

 main :: IO ()
 main = do
     h <- openFile "/dev/null" WriteMode
     L.hPut h $ toLazyByteString $
         foldMap ((charUtf8 '\n' `mappend`) . intDec) [1..10000000]

Note that I am outputting to /dev/null in order to avoid interference from the disk driver; the effort of moving the data to the OS remains the same. On my machine, the above code runs in 0.45 s, which is 12 times faster than the 5.4 s of your original code. This implies a throughput of 168 MB/s. We can squeeze out an additional 30% of speed (220 MB/s) using bounded encodings.

 import qualified Data.ByteString.Lazy.Builder.BasicEncoding as E

 L.hPut h $ toLazyByteString $
     E.encodeListWithB
         ((\x -> (x, '\n')) E.>$< E.intDec `E.pairB` E.charUtf8)
         [1..10000000]

The syntax looks a bit baroque because a BoundedEncoding a specifies the conversion of a Haskell value of type a to a bounded-length sequence of bytes, such that the bound can be computed at compile time. This allows functions such as E.encodeListWithB to perform additional optimizations when implementing the actual buffer filling. See the documentation of Data.ByteString.Lazy.Builder.BasicEncoding at the API documentation link above for more information (phew, silly hyperlink limit for new users).

Here is the source of all my tests .

The conclusion is that we can get very good performance from a declarative solution, provided that we understand the cost model of our implementation and use the right data structures. Whenever you construct a packed sequence of values (for example, a sequence of bytes represented as a bytestring), the right data structure to use is a bytestring Builder .
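As a hedged footnote to this answer: in more recent bytestring releases the builder modules used above were merged into Data.ByteString.Builder (the module name here is my assumption about current bytestring, not the answer's), so on a modern GHC the same construction might be written as:

```haskell
import qualified Data.ByteString.Lazy.Char8 as L
import Data.ByteString.Builder (Builder, toLazyByteString, intDec, charUtf8)

-- Decimal rendering of 1..n, one number per line, via the builder
-- that ships with newer bytestring versions.
numberLines :: Int -> L.ByteString
numberLines n = toLazyByteString (foldMap line [1 .. n])
  where
    line :: Int -> Builder
    line i = intDec i <> charUtf8 '\n'
```

The resulting lazy ByteString can then be written with L.hPut exactly as in the answer.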

+3
