I get a Heap exhausted message when I run the next short Haskell program on a fairly large dataset. For example, a program crashes (with heap overflow) on an input file of 20 MB with lines of about 900 thousand lines. The heap size has been set (via -with-rtsopts ) to 1 GB. It works fine if longestCommonSubstrB is defined as something simpler, for example. commonPrefix . I need to process files in 100 MB order.
I compiled the program with the following command line (GHC 7.8.3):
ghc -Wall -O2 -prof -fprof-auto "-with-rtsopts=-M512M -p -s -h -i0.1" SampleB.hs
I would appreciate any help to ensure that this work is carried out in a reasonable amount (in the order of the size of the input file), but I am especially grateful for the thoughtful process of finding the bottleneck and where and how to force rigor.
I suppose that some coercive function longestCommonSubstrB for evaluation will strictly solve the problem, but I don't know how to do it.
{-# LANGUAGE BangPatterns #-} module Main where import System.Environment (getArgs) import qualified Data.ByteString.Lazy.Char8 as B import Data.List (maximumBy, sort) import Data.Function (on) import Data.Char (isSpace) -- | Returns a list of lexicon items, ie [[w1,w2,w3]] readLexicon :: FilePath -> IO [[B.ByteString]] readLexicon filename = do text <- B.readFile filename return $ map (B.split '\t' . stripR) . B.lines $ text where stripR = B.reverse . B.dropWhile isSpace . B.reverse transformOne :: [B.ByteString] -> B.ByteString transformOne (w1:w2:w3:[]) = B.intercalate (B.pack "|") [w1, longestCommonSubstrB w2 w1, w3] transformOne a = error $ "transformOne: unexpected tuple " ++ show a longestCommonSubstrB :: B.ByteString -> B.ByteString -> B.ByteString longestCommonSubstrB xs ys = maximumBy (compare `on` B.length) . concat $ [f xs' ys | xs' <- B.tails xs] ++ [f xs ys' | ys' <- tail $ B.tails ys] where f xs' ys' = scanl g B.empty $ B.zip xs' ys' gz (x, y) = if x == y then z `B.snoc` x else B.empty main :: IO () main = do (input:output:_) <- getArgs lexicon <- readLexicon input let flattened = B.unlines . sort . map transformOne $ lexicon B.writeFile output flattened
This is the profile for the test data set (100 thousand lines, the heap size is set to 1 GB, i.e. generateSample.exe 100000 , the resulting file size is 2.38 MB):
Heap Profile Over Time:

Execution Statistics:
3,505,737,588 bytes allocated in the heap 785,283,180 bytes copied during GC 62,390,372 bytes maximum residency (44 sample(s)) 216,592 bytes maximum slop 96 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 6697 colls, 0 par 1.05s 1.03s 0.0002s 0.0013s Gen 1 44 colls, 0 par 4.14s 3.99s 0.0906s 0.1935s INIT time 0.00s ( 0.00s elapsed) MUT time 7.80s ( 9.17s elapsed) GC time 3.75s ( 3.67s elapsed) RP time 0.00s ( 0.00s elapsed) PROF time 1.44s ( 1.35s elapsed) EXIT time 0.02s ( 0.00s elapsed) Total time 13.02s ( 12.85s elapsed) %GC time 28.8% (28.6% elapsed) Alloc rate 449,633,678 bytes per MUT second Productivity 60.1% of total user, 60.9% of total elapsed
Time Distribution and Distribution Profile Report:
SampleB.exe +RTS -M1G -p -s -h -i0.1 -RTS sample.txt sample_out.txt total time = 3.97 secs (3967 ticks @ 1000 us, 1 processor) total alloc = 2,321,595,564 bytes (excludes profiling overheads) COST CENTRE MODULE %time %alloc longestCommonSubstrB Main 43.3 33.1 longestCommonSubstrB.f Main 21.5 43.6 main.flattened Main 17.5 5.1 main Main 6.6 5.8 longestCommonSubstrB.g Main 5.0 5.8 readLexicon Main 2.5 2.8 transformOne Main 1.8 1.7 readLexicon.stripR Main 1.8 1.9 individual inherited COST CENTRE MODULE no. entries %time %alloc %time %alloc MAIN MAIN 45 0 0.1 0.0 100.0 100.0 main Main 91 0 6.6 5.8 99.9 100.0 main.flattened Main 93 1 17.5 5.1 89.1 89.4 transformOne Main 95 100000 1.8 1.7 71.6 84.3 longestCommonSubstrB Main 100 100000 43.3 33.1 69.8 82.5 longestCommonSubstrB.f Main 101 1400000 21.5 43.6 26.5 49.5 longestCommonSubstrB.g Main 104 4200000 5.0 5.8 5.0 5.8 readLexicon Main 92 1 2.5 2.8 4.2 4.8 readLexicon.stripR Main 98 0 1.8 1.9 1.8 1.9 CAF GHC.IO.Encoding.CodePage 80 0 0.0 0.0 0.0 0.0 CAF GHC.IO.Encoding 74 0 0.0 0.0 0.0 0.0 CAF GHC.IO.FD 70 0 0.0 0.0 0.0 0.0 CAF GHC.IO.Handle.FD 66 0 0.0 0.0 0.0 0.0 CAF System.Environment 65 0 0.0 0.0 0.0 0.0 CAF Data.ByteString.Lazy.Char8 54 0 0.0 0.0 0.0 0.0 CAF Main 52 0 0.0 0.0 0.0 0.0 transformOne Main 99 0 0.0 0.0 0.0 0.0 readLexicon Main 96 0 0.0 0.0 0.0 0.0 readLexicon.stripR Main 97 1 0.0 0.0 0.0 0.0 main Main 90 1 0.0 0.0 0.0 0.0
UPDATE: The following program can be used to generate sample data. It expects one argument - the number of rows in the generated dataset. The generated data will be saved in the sample.txt file. When I create a dataset of 900 thousand lines with it (when I run generateSample.exe 900000 ), the created dataset makes the above program unsuccessful with heap overflow (the heap size was set to 1 GB). The resulting dataset is about 20 MB.
module Main where import System.Environment (getArgs) import Data.List (intercalate, permutations) generate :: Int -> [(String,String,String)] generate n = take n $ zip3 (f "banana") (f "ruanaba") (f "kikiriki") where f = cycle . permutations main :: IO () main = do (n:_) <- getArgs let flattened = unlines . map f $ generate (read n :: Int) writeFile "sample.txt" flattened where f (w1,w2,w3) = intercalate "\t" [w1, w2, w3]