I wrote a daemon in Haskell that scrapes information from a web page every 5 minutes.
The daemon initially ran fine for about 50 minutes, but then it unexpectedly died with out of memory (requested 1048576 bytes). Every time I ran it, it died after the same amount of time. With the sleep interval set to only 30 seconds, it instead died after 8 minutes.
I realized that the code that scrapes the website was incredibly memory-inefficient (starting at around 30M and creeping up to 250M while parsing 9M of HTML), so I rewrote it; now it uses only about 15M for parsing. Thinking the problem was fixed, I ran the daemon overnight, and when I woke up it was actually using less memory than the night before. I thought I was done, but roughly 20 hours after it started, it crashed with the same error.
I started looking into GHC profiling, but I couldn't get it to work. Next I started messing around with RTS options: I tried setting -H64m to make the default heap size larger than my program was using, and also using -Ksize to shrink the maximum stack size to see whether that would make it crash sooner.
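For reference, this is roughly how I'm building and running it with those RTS options (a sketch; the binary name is the one from my error output, and -rtsopts is needed for GHC to accept +RTS flags at run time):

```shell
# Build with run-time RTS options enabled.
ghc -O2 -rtsopts bannerstalkerd.hs

# Run with a 64M suggested heap size and a reduced maximum stack
# (the -K value here is just an example).
./bannerstalkerd +RTS -H64m -K8m -RTS
```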
Despite every change I made, the daemon still crashes after a constant number of iterations. Making the parsing more memory-efficient raised that number, but it still crashes. This makes no sense to me, because none of these runs has even come close to using all of my memory, much less swapping. The heap size is supposed to be unlimited by default, shrinking the stack size made no difference, and all of my ulimits are either unlimited or significantly above what the daemon uses.
In the source code, I traced the crash to somewhere in the HTML parsing, but I haven't done the same for the more memory-efficient version, because that one takes 20 hours to crash. I don't know whether that would even be useful to know, since it doesn't seem like any particular part of the program is broken: it runs successfully for dozens of iterations before the crash.
Out of ideas, I even looked through the GHC source code for this error, and it seems to be a failed mmap call, which wasn't very helpful to me, because I assume that's not the root of the problem.
(Edit: code rewritten and moved to the end of the message)
I'm new to Haskell, so I'm hoping this is some quirk of lazy evaluation or something else with a quick fix. Otherwise, I'm fresh out of ideas.
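In case it matters, this is the kind of laziness quirk I mean — a toy sketch, not my actual code, showing how a lazy accumulator can build up unevaluated thunks until memory runs out:

```haskell
import Data.List (foldl')

-- Lazy foldl builds a chain of thunks ((0 + 1) + 2) + ... that is only
-- evaluated when the final result is demanded, so memory use grows
-- with the length of the input.
leakySum :: [Int] -> Int
leakySum = foldl (+) 0

-- foldl' forces the accumulator at every step and runs in constant space.
strictSum :: [Int] -> Int
strictSum = foldl' (+) 0

main :: IO ()
main = print (strictSum [1 .. 1000000])  -- prints 500000500000
```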
I am using GHC version 7.4.2 on FreeBSD 9.1.
Edit:
Replacing the download with static HTML got rid of the problem, so I've narrowed it down to the way I'm using http-conduit. I've edited the code above to include my networking code. The Hackage docs mention sharing a Manager, so I've done that. They also say that for http you have to explicitly close connections, but I don't think I need to do that with httpLbs.
Here is my code.
    import Control.Monad.IO.Class (liftIO)
    import qualified Data.Text as T
    import qualified Data.ByteString.Lazy as BL
    import Text.Regex.PCRE
    import Network.HTTP.Conduit

    main :: IO ()
    main = do
        manager <- newManager def
        daemonLoop manager

    daemonLoop :: Manager -> IO ()
    daemonLoop manager = do
        rows <- scrapeWebpage manager
        putStrLn $ "number of rows parsed: " ++ (show $ length rows)
        doSleep
        daemonLoop manager

    scrapeWebpage :: Manager -> IO [[BL.ByteString]]
    scrapeWebpage manager = do
        putStrLn "before makeRequest"
        html <- makeRequest manager
        -- Force evaluation of html.
        putStrLn $ "html length: " ++ (show $ BL.length html)
        putStrLn "after makeRequest"
        -- Breaks ~10M html table into 2d list of bytestrings.
        -- Max memory usage is about 45M, which is about 15M more than when sleeping.
        return $ map tail $ html =~ pattern
        where
            pattern :: BL.ByteString
            pattern = BL.concat $ replicate 12 "<td[^>]*>([^<]+)</td>\\s*"

    makeRequest :: Manager -> IO BL.ByteString
    makeRequest manager = runResourceT $ do
        defReq <- parseUrl url
        let request = urlEncodedBody params $ defReq
                -- Don't throw errors for bad statuses.
                { checkStatus = \_ _ -> Nothing
                -- 1 minute.
                , responseTimeout = Just 60000000
                }
        response <- httpLbs request manager
        return $ responseBody response
and its output:
    before makeRequest
    html length: 1555212
    after makeRequest
    number of rows parsed: 3608
    ...
    before makeRequest
    html length: 1555212
    after makeRequest
    bannerstalkerd: out of memory (requested 2097152 bytes)
Getting rid of the regex computations fixed the problem, but it seems that the error occurs after the networking and during the regex, presumably because of something I'm doing wrong with http-conduit. Any ideas?
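One thing I'm going to try is fully evaluating the parsed rows before returning them, so no thunk can keep the whole response body alive across iterations. A toy sketch of the idea (`parseAndForce` is a made-up helper working on Strings, and deepseq ships with GHC; my real rows are lazy ByteStrings, which would need an NFData instance or forcing via BL.length instead):

```haskell
import Control.DeepSeq (force)
import Control.Exception (evaluate)

-- Hypothetical helper: parse, then deep-evaluate the result so the
-- returned rows hold no thunks pointing back at the original input.
parseAndForce :: String -> IO [[String]]
parseAndForce html = evaluate (force (map words (lines html)))

main :: IO ()
main = parseAndForce "a b\nc d" >>= print  -- prints [["a","b"],["c","d"]]
```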
Also, when I try to compile with profiling enabled, I get this error:
    Could not find module `Network.HTTP.Conduit'
    Perhaps you haven't installed the profiling libraries for package `http-conduit-1.8.9'?
Indeed, I have not installed the profiling libraries for http-conduit, and I don't know how to do that.
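From what I can gather (an assumption on my part, not something I've verified yet), the fix is to reinstall the package with library profiling enabled, roughly:

```shell
# Rebuild http-conduit (and any dependencies that need it) with
# profiling libraries; -p is short for --enable-library-profiling.
cabal install -p --reinstall http-conduit

# Or enable it for all future installs in ~/.cabal/config:
#   library-profiling: True
```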