Fixing a particularly obscure Haskell space leak

So, after squeezing the last bit of performance out of some Haskell I use to break tweet data into n-grams, I've run into a space leak problem. When I profile, the GC takes about 60-70% of the runtime, and significant chunks of memory show up as drag. I'm hoping some Haskell gurus can suggest where I'm going wrong.

{-# LANGUAGE OverloadedStrings, BangPatterns #-}
import Data.Maybe
import qualified Data.ByteString.Char8 as B
import qualified Data.HashMap.Strict as H
import Text.Regex.Posix
import Data.List
import qualified Data.Char as C

-- Characters worth keeping in a tweet: alphanumerics plus a few symbols.
isClassChar a = C.isAlphaNum a || a == ' ' || a == '\'' || a == '-'
             || a == '#' || a == '@' || a == '%'

-- Lowercase a word and strip everything outside the character class.
cullWord :: B.ByteString -> B.ByteString
cullWord w = B.map C.toLower $ B.filter isClassChar w

-- Count all n-grams in the text, one line at a time.
procTextN :: Int -> B.ByteString -> [([B.ByteString],Int)]
procTextN n t = H.toList $ foldl' ngram H.empty lines
  where
    !lines = B.lines $ cullWord t
    ngram tr line = snd $ foldl' breakdown (base,tr) (B.split ' ' line)
    base = replicate (n-1) ""

    -- Slide the window of previous words along and count the new n-gram.
    breakdown :: ([B.ByteString], H.HashMap [B.ByteString] Int)
              -> B.ByteString
              -> ([B.ByteString], H.HashMap [B.ByteString] Int)
    breakdown (st@(s:ss),tree) word =
        newStack `seq` expandedWord `seq` (newStack,expandedWord)
      where
        newStack = ss ++ [word]
        expandedWord = updateWord (st ++ [word]) tree

    updateWord :: [B.ByteString]
               -> H.HashMap [B.ByteString] Int
               -> H.HashMap [B.ByteString] Int
    updateWord w h = H.insertWith (+) w 1 h

main = do
    test2 <- B.readFile "canewobble"
    print $ filter (\(a,b) -> b > 100)
          $ sortBy (\(a,b) (c,d) -> compare d b)
          $ procTextN 3 test2
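For reference, the GC percentage and the drag figures above come from GHC's profiler; a typical invocation that produces them looks roughly like this (the file name and exact flags are my assumption, not from the original post):

ghc -O2 -prof -fprof-auto -rtsopts ngram.hs
./ngram +RTS -s -hb -RTS     # -s prints GC statistics; -hb produces a
                             # biographical heap profile whose DRAG band
                             # is memory retained after its last use
hp2ps -c ngram.hp            # render the heap profile for viewing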
1 answer

One small optimization is to filter the data (using HashMap.filter) before sorting it. This shaved 2 seconds off the final runtime for me. Another thing I tried was using sequences (Data.Sequence) instead of lists for the word windows (no noticeable difference :-(). My version can be found here.
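A minimal sketch of that reordering, assuming a hypothetical variant of procTextN that returns the HashMap itself rather than a list:

import Data.List (sortBy)
import qualified Data.ByteString.Char8 as B
import qualified Data.HashMap.Strict as H

-- Keep only the frequent n-grams while still inside the HashMap, so the
-- sort runs over a few hundred survivors instead of all 377141 entries.
report :: H.HashMap [B.ByteString] Int -> [([B.ByteString], Int)]
report = sortBy (\(_,x) (_,y) -> compare y x) . H.toList . H.filter (> 100)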

Looking at the heap profile, I don't think there is a space leak in your program:

[Space profile]

You simply build a fairly large hash table (377141 key/value pairs) in memory and then discard it after some processing. According to Johan Tibell's post, a hash table of this size takes approx. 5N + 4(N-1) words = 3394265 words, which at 4 bytes per word is about 13 MiB, consistent with what the heap profile shows. The rest of the space is occupied by the keys and values. On my machine the time spent in GC is around 40%, which does not seem unreasonable given that you are constantly updating the hash table and the temporary word "stacks" without doing anything computationally expensive with the data. Since the only operation the hash table is needed for is insertWith, wouldn't it be better to use a mutable data structure?
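A minimal sketch of what the core update looks like with a mutable table, assuming the hashtables package; the type alias and bumpCount are illustrative names of mine, not the actual rewrite:

import qualified Data.ByteString.Char8 as B
import qualified Data.HashTable.IO as HT

-- A mutable hash table living in IO; updates happen in place instead of
-- allocating new immutable nodes on every insertWith.
type Table = HT.BasicHashTable [B.ByteString] Int

-- The one operation the algorithm needs: bump an n-gram's count.
bumpCount :: Table -> [B.ByteString] -> IO ()
bumpCount table key = do
    old <- HT.lookup table key
    HT.insert table key $! maybe 1 (+1) old

The fold over words then runs in IO, threading only the (n-1)-word window, while the table itself is updated destructively.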

Update : I rewrote your program using a mutable hash table. Interestingly, the speed difference is not big, but memory usage is slightly better:

[Heap profile of the mutable hash table version]

As you can see, the size of the block allocated for the hash table remains constant throughout the execution.

