So, after squeezing the last bit of performance out of some Haskell code I use to break tweet data into n-grams, I'm running into a space leak. When I profile, the GC takes up about 60-70% of the process, and the heap profile shows significant chunks of memory attributed to drag. I'm hoping some Haskell gurus can suggest where I'm going wrong.
{-# LANGUAGE OverloadedStrings, BangPatterns #-}

import Data.Maybe
import qualified Data.ByteString.Char8 as B
import qualified Data.HashMap.Strict as H
import Text.Regex.Posix
import Data.List
import qualified Data.Char as C

isClassChar a = C.isAlphaNum a || a == ' ' || a == '\'' || a == '-'
             || a == '#' || a == '@' || a == '%'

cullWord :: B.ByteString -> B.ByteString
cullWord w = B.map C.toLower $ B.filter isClassChar w

procTextN :: Int -> B.ByteString -> [([B.ByteString], Int)]
procTextN n t = H.toList $ foldl' ngram H.empty lines
  where !lines = B.lines $ cullWord t
        ngram tr line = snd $ foldl' breakdown (base, tr) (B.split ' ' line)
        base = replicate (n - 1) ""

breakdown :: ([B.ByteString], H.HashMap [B.ByteString] Int) -> B.ByteString
          -> ([B.ByteString], H.HashMap [B.ByteString] Int)
breakdown (st@(s:ss), tree) word = newStack `seq` expandedWord `seq` (newStack, expandedWord)
  where newStack = ss ++ [word]
        expandedWord = updateWord (st ++ [word]) tree

updateWord :: [B.ByteString] -> H.HashMap [B.ByteString] Int -> H.HashMap [B.ByteString] Int
updateWord w h = H.insertWith (+) w 1 h

main = do
  test2 <- B.readFile "canewobble"
  print $ filter (\(a, b) -> b > 100)
        $ sortBy (\(a, b) (c, d) -> compare d b)
        $ procTextN 3 test2
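For reference, here is a minimal, self-contained sketch of the counting pattern the code above relies on: a strict left fold that accumulates n-gram counts in a strict HashMap via H.insertWith (+). The bigrams function and the sample input are my own simplification, not part of the original program; it just isolates the fold/insertWith behavior on one line of input.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Minimal sketch (hypothetical simplification of the program above):
-- count bigrams from a single line using a strict left fold over a
-- strict HashMap, the same foldl' + H.insertWith (+) pattern as procTextN.
import Data.List (foldl')
import qualified Data.ByteString.Char8 as B
import qualified Data.HashMap.Strict as H

bigrams :: B.ByteString -> H.HashMap [B.ByteString] Int
bigrams line = foldl' step H.empty (zip ws (drop 1 ws))
  where
    ws = B.words line
    -- insertWith (+) bumps the count for an existing key, inserts 1 otherwise
    step m pair = H.insertWith (+) [fst pair, snd pair] 1 m

main :: IO ()
main = print (H.toList (bigrams "a b a b c"))
```

Note that even with Data.HashMap.Strict, insertWith (+) only forces the value to weak head normal form when the map is demanded; the keys here are lists of ByteStrings, and unevaluated key lists (the st ++ [word] appends above) are one plausible place for the retained memory to hide.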
memory-leaks haskell space
Erik Hinton