Haskell requires more memory than Python when I read a map from a file. Why?

I have this simple code in Python:

    input = open("baseforms.txt", "r", encoding='utf8')
    S = {}
    for i in input:
        words = i.split()
        S.update({j: words[0] for j in words})
    print(S.get("sometext", "not found"))
    print(len(S))

It needs 300 MB to run, and "baseforms.txt" is 123 MB. I wrote the same code in Haskell:

    {-# LANGUAGE OverloadedStrings #-}
    import qualified Data.Map as M
    import qualified Data.ByteString.Lazy.Char8 as B
    import Data.Text.Lazy.Encoding (decodeUtf8)
    import qualified Data.Text.Lazy as T
    import qualified Data.Text.Lazy.IO as I
    import Control.Monad (liftM)

    main = do
        text <- B.readFile "baseforms.txt"
        let m = (M.fromList . concatMap (parseLine . decodeUtf8)) (B.lines text)
        print (M.lookup "sometext" m)
        print (M.size m)
      where
        parseLine line = let base:forms = T.words line
                         in [(f, base) | f <- forms]

It takes 544 MB and is slower than the Python version. Why? Is it possible to optimize the Haskell version?

+7
haskell
2 answers

A little late, but I studied this a bit and think Dietrich Epp's account is right, though it can be simplified a little. Notice that there is no real Python programming going on in the Python file: it orchestrates a very simple sequence of calls to C string operations and then to a C hash-table implementation. (This is often the trouble with really simple Python-vs-Haskell benchmarks.) Haskell, by contrast, is building an immense persistent Map , which is a fancy tree. So the main points of opposition here are C vs. Haskell, and a hash table with destructive update vs. a persistent map. Since there is little overlap in the input file, the tree you construct includes all of the information in the input string, some of it repeated, and then rearranged with a pile of Haskell constructors. This, I think, is the source of the alarm you are experiencing, but it can be explained.
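
As an aside, the nearer Haskell analogue of Python's hash table is a strict HashMap rather than a tree. A minimal sketch of that variant, assuming the unordered-containers package and untested against the file above:

    import qualified Data.ByteString.Char8 as B
    import qualified Data.HashMap.Strict as HM

    main :: IO ()
    main = do
        text <- B.readFile "baseforms.txt"
        -- Build a hash map instead of an ordered tree; keys are hashed
        -- rather than compared, which is closer to what Python's dict does.
        let pairs (base : forms) = [(f, base) | f <- forms]
            pairs []             = []          -- tolerate blank lines
            m = HM.fromList (concatMap (pairs . B.words) (B.lines text))
        print (HM.lookup (B.pack "sometext") m)
        print (HM.size m)
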

Compare these two files, the first using ByteString:

    import qualified Data.Map as M
    import qualified Data.ByteString.Char8 as B

    main = do
        m <- fmap proc (B.readFile "baseforms.txt")
        print (M.lookup (B.pack "sometext") m)
        print (M.size m)

    proc = M.fromList
         . concatMap (\(a:bs) -> map (flip (,) a) bs)
         . map B.words
         . B.lines

and the second a Text-ified equivalent:

    import qualified Data.Map as M
    import qualified Data.ByteString.Char8 as B
    import Data.Text.Encoding (decodeUtf8)
    import qualified Data.Text as T

    main = do
        m <- fmap proc (B.readFile "baseforms.txt")
        print (M.lookup (T.pack "sometext") m)
        print (M.size m)

    proc = M.fromList
         . concatMap (\(a:bs) -> map (flip (,) a) bs)
         . map T.words
         . T.lines
         . decodeUtf8

On my machine, the Python/C version takes a little under 6 seconds, the ByteString file takes 8 seconds, and the Text file a little over 10.

The ByteString implementation seems to use a little more memory than Python, and the Text implementation noticeably more. The Text implementation takes longer because, of course, it adds a conversion to Text and then uses Text operations to break the string and Text comparisons to build the map.

Here is an attempt at analyzing the memory phenomena in the Text case. First we have the bytestring in memory (130 MB). Once the Text is built (~250 MB, judging unscientifically from what happens in top ), the bytestring can be garbage collected while we build the tree. In the end the Text tree (~380 MB) uses more memory than the ByteString tree (~260 MB) because the Text fragments in the tree are bigger. The program as a whole uses more because the Text held in memory during tree construction is itself bigger. To put it crudely: each bit of whitespace is turned into a tree constructor and two Text constructors, together with the Text version of whatever the first "word" of the line was and the Text representation of the next word. The weight of the constructors is about 130 MB either way, so at the last moment of tree construction we are using something like 130 MB + 130 MB + 130 MB = 390 MB in the ByteString case, and 250 MB + 130 MB + 250 MB = 630 MB in the Text case.
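
For numbers less unscientific than watching top , the GHC runtime can report them itself. A sketch, assuming GHC 8.2 or later: compile with -rtsopts and run with +RTS -T -RTS so the runtime gathers statistics, then call something like this after the map has been built:

    import GHC.Stats (RTSStats (max_live_bytes), getRTSStats)
    import System.Mem (performGC)

    reportResidency :: IO ()
    reportResidency = do
        performGC             -- force a major GC so the figure is current
        stats <- getRTSStats  -- throws unless run with +RTS -T
        putStrLn ("max live bytes: " ++ show (max_live_bytes stats))
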

+2

There is a lot going on in the Haskell version that is not happening in the Python version.

  • readFile uses lazy IO, which is a bit weird in general. I would generally avoid lazy IO.

  • The file, as a bytestring, is split into lines, which are then decoded as UTF-8. This seems a bit unnecessary, given the existence of Text IO functions (a sketch using them appears after this list).

  • The Haskell version uses a tree ( Data.Map ), while the Python version uses a hash table.

  • The strings are all lazy, which is probably unnecessary if they are relatively short. Lazy strings carry a couple of extra words of overhead per string, which can add up. You could convert the lazy strings to strict ones, read the file all at once, or use something like conduit.

  • GHC uses a copying garbage collector, while the default Python implementation uses malloc() with reference counting and the occasional GC. This fact alone can account for large differences in memory usage, depending on your program.

  • Who knows how many thunks are created in the Haskell version.

  • It is not known whether you have enabled optimizations.

  • It is not known how much slower the Haskell version is.

  • We do not have your data file, so we cannot verify it ourselves.
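
Putting the first two points together, here is a sketch of a strict variant using the Text IO functions, so the lazy IO and the separate decode step both disappear. One assumption to check: Data.Text.IO.readFile decodes with the locale encoding, so this only matches the original behaviour when the locale is UTF-8.

    import qualified Data.Map as M
    import qualified Data.Text as T
    import qualified Data.Text.IO as TIO

    main :: IO ()
    main = do
        -- Strict read: the whole file is read and decoded in one step,
        -- so no lazy IO and no lazy Text chunks are involved.
        text <- TIO.readFile "baseforms.txt"
        let pairs (base : forms) = [(f, base) | f <- forms]
            pairs []             = []          -- tolerate blank lines
            m = M.fromList (concatMap (pairs . T.words) (T.lines text))
        print (M.lookup (T.pack "sometext") m)
        print (M.size m)
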

+20
