IO over large files in haskell: performance issue

I am trying to work with large files using Haskell. I would like to view the bytes of the input file after the byte and generate the output byte after the byte. Of course, I need the IO to be buffered by blocks of a reasonable size (several kilobytes). I cannot do this, and I need your help.

import System import qualified Data.ByteString.Lazy as BL import Data.Word import Data.List main :: IO () main = do args <- System.getArgs let filename = head args byteString <- BL.readFile filename let wordsList = BL.unpack byteString let foldFun acc word = doSomeStuff word : acc let wordsListCopy = foldl' foldFun [] wordsList let byteStringCopy = BL.pack (reverse wordsListCopy) BL.writeFile (filename ++ ".cpy") byteStringCopy where doSomeStuff = id 

I will name this file TestCopy.hs and then do the following:

 $ ls -l *MB -rwxrwxrwx 1 root root 10000000 2011-03-24 13:11 10MB -rwxrwxrwx 1 root root 5000000 2011-03-24 13:31 5MB $ ghc --make -O TestCopy.hs [1 of 1] Compiling Main ( TestCopy.hs, TestCopy.o ) Linking TestCopy ... $ time ./TestCopy 5MB real 0m5.631s user 0m1.972s sys 0m2.488s $ diff 5MB 5MB.cpy $ time ./TestCopy 10MB real 3m6.671s user 0m3.404s sys 1m21.649s $ diff 10MB 10MB.cpy $ time ./TestCopy 10MB +RTS -K500M -RTS real 2m50.261s user 0m3.808s sys 1m13.849s $ diff 10MB 10MB.cpy $ 

My problem: there is a huge difference between a 5 MB file and a 10 MB file. I would like the characteristics to be linear in size of the input file. Please, what am I doing wrong, and how can I achieve this? I don't mind using lazy bytes or anything else while it works, but it should be the standard ghc library.

Accuracy: this is for a university project. And I'm not trying to copy files. The doSomeStuff function should perform the compression / decompression actions that I have to configure.

+6
io haskell lazy-evaluation bytestring
source share
2 answers

I understand that this is the following: Reading a large file in haskell? since yesterday.

Try compiling with -rtsopts -O2 instead of -O.

You state: "I would like to look at the bytes of the input file after the byte and generate the output byte after the byte." but your code reads all the input before trying to create any output. This is not very indicative of the goal.

On my system, I see "ghc -rtsopts -make -O2 b.hs" giving

 (! 741)-> time ./b five real 0m2.789s user 0m2.235s sys 0m0.518s (! 742)-> time ./b ten real 0m5.830s user 0m4.623s sys 0m1.027s 

Which now looks linear to me.

+2
source share

For processing tagged input, I would use the enumerator package.

 import Data.Enumerator import Data.Enumerator.Binary (enumFile) 

We use bytestrings

 import Data.ByteString as BS 

and IO

 import Control.Monad.Trans (liftIO) import Control.Monad (mapM_) import System (getArgs) 

Your main function might look like this:

 main = do (filepath:_) <- getArgs let destination run_ $ enumFile filepath $$ writeFile (filepath ++ ".cpy") 

enumFile reads 4096 bytes per piece and passes them to writeFile, which writes it.

enumWrite is defined as:

 enumWrite :: FilePath -> Iteratee BS.ByteString IO () enumWrite filepath = do liftIO (BS.writeFile filepath BS.empty) -- ensure the destination is empty continue step where step (Chunks xs) = do liftIO (mapM_ (BS.appendFile filepath) xs) continue step step EOF = yield () EOF 

As you can see, the step function takes pieces of bytestrings and adds them to the destination file. These pieces are of type Stream BS.Bytestring, where Stream is defined as:

 data Stream a = Chunks [a] | EOF 

In step EOF completes, yielding ().

To do a more detailed reading, I personally recommend Michael Snoymans tutorial

Figures

 $ time ./TestCopy 5MB ./TestCopy 5MB 2,91s user 0,32s system 96% cpu 3,356 total $ time ./TestCopy2 5MB ./TestCopy2 5MB 0,04s user 0,03s system 93% cpu 0,075 total 

It has completely improved. Now, to realize your fold, you probably want to write Enumeratee, which is used to convert the input stream. Fortunately, there is already a display function defined in the enumerator package that can be changed for your needs, i.e. It can be changed to transfer state.

On the construction of an intermediate result

You build the wordList in reverse order and then change it. I think difference lists work better because appends only accept O (1) time due to the fact that adding is only a function. I'm not sure they take up more space. Here is an approximate sketch of difference lists:

 type DList a = [a] -> [a] emptyList :: DList a emptyList = id snoc :: DList a -> a -> DList a snoc dlist a = dlist . (a:) toList :: DList a -> [a] toList dlist = dlist [] 

This answer is probably no longer needed, but I added it for completeness.

+8
source share

All Articles