How to reduce garbage collection while parsing a binary in Haskell?

I am writing a program to analyze .TGA files using Haskell - however, the performance is absolutely terrible. A 2048 x 2048 image takes a few seconds to parse.

I ran my code using +RTS -p -RTS and got the following interesting snippets from the report:

 total time = 1.08 secs total alloc = 3,037,568,120 bytes COST CENTRE MODULE %time %alloc readPixelMap Main 33.0 11.0 readPixelMap.getByte Main 32.7 75.1 readPixelMap.getColor Main 27.0 13.3 

It seems that my program allocates a huge amount of memory in the readPixelMap function. This function looks like this:

 readPixelMap width height = do pixels <- replicateM height (replicateM width getColor) return $ PixMap pixels where getByte = liftM toInteger getWord8 getColor = do (red:green:blue:alpha:xs) <- replicateM 4 getByte return (red, green, blue, alpha) 

and a PixMap is defined as

 data PixMap = PixMap [[RGBAColor]] deriving (Show) type RGBAColor = (Integer, Integer, Integer, Integer) 

readPixelMap is called using the Get monad from Data.Binary :

 main = do binary <- BS.readFile "test.tga" let (pixelMap, binary', nil) = runGetState (readPixelMap 2048 2048) binary 0 

I was under the impression that a 2048 x 2048.TGA image should be loaded into several hundred megabytes (no more) of memory. Is there an obvious way to improve garbage collection / placement?

+7
optimization haskell
source share
2 answers

Your RGBColor type uses 5 machine words, 4 of which are pointers to each Integer . Each Integer consists of one machine word for GC / tagging and one additional word for a small representation or large representation, consisting of a pointer to an array of bytes for GMP data and the number of limbs.

In addition, you use a list of lists to store values. Each non-empty list node is a word for GC / tagging, a word pointer to a value, and a word pointer to a tail.

To use less memory, use the appropriate data types. Use unpacked values โ€‹โ€‹whenever possible. Use Word8 instead of Integer for 8-bit values. Use arrays instead of lists. You know the basics of behavior, how you care about memory usage.

See http://johantibell.com/files/slides.pdf for some basics.

+12
source share
  • Using the Integer tuple to store color data (which is always 32 bits of data) does not make sense. Why not use one of the following options: (Word8, Word8, Word8, Word8) ; data Colour = C Word8 Word8 Word8 Word ; data Colour = C Word32

  • lazy ByteString is just a list of strict ByteString . Therefore, if you read a data byte by byte, you will essentially create a list of strict ByteString , whose length is 2048 * 2048. Not very efficient.

  • The same can be said about using [[a]] to store a two-dimensional array. In terms of memory performance, this is about the worst type of data you could use. Try any of them: Array (Int, Int) Color ; UArray (Int, Int) Color ; (Int, Int, Ptr Word8) . My preferences are arrays from Data.Array.Repa , which are highly efficient, have many built-in bypasses, bends, etc. And, of course, support parallelization. Another big advantage: Data.Array.Repa.Repr.ByteString.fromByteString converts a strict ByteString into an array for further manipulation. The best part of this is that you no longer have to worry about performance when reading an array; it was made for you!

Example:

 import Data.Array.Repa.Repr.ByteString import Data.Array.Repa import qualified Data.ByteString as B import Data.Word (Word8) data TGA = TGA (Array B DIM2 Word8) readTGAFromFile :: FilePath -> (Int, Int) -> IO TGA readTGAFromFile file (width,height) = do dat <- B.readFile file return $ TGA $ fromByteString (Z :. height :. width) dat 
+2
source share

All Articles