A bit late, but I studied this a little and I think Dietrich Epp's accounting is right, though it can be simplified a bit. Notice that no real Python programming goes on in the Python file: it orchestrates a very simple sequence of calls to C string operations and then to the C hash-table implementation. (This is often a problem with really simple Python vs. Haskell benchmarks.) The Haskell, by contrast, is building an immense persistent Map, which is a fancy tree. So the main points of opposition here are C vs. Haskell, and a hash table with destructive update vs. a persistent map. Since there is little overlap in the input file, the tree being built contains all of the information in the input file, some of it repeated, and all of it rearranged with a pile of Haskell constructors. This, I think, is the source of the alarm you are experiencing, but it can be explained.
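As a minimal sketch of what "persistent" means here (this is just ordinary Data.Map semantics, not anything from the benchmark): an insert never overwrites anything, it allocates a new version of the map that shares structure with the old one, whereas a Python dict insert writes into the existing table in place.

    import qualified Data.Map as M

    main :: IO ()
    main = do
      let m1 = M.fromList [(1 :: Int, "a"), (2, "b")]
          m2 = M.insert 3 "c" m1  -- allocates a new version; m1 is untouched
      print (M.size m1)           -- 2: the old map is still fully usable
      print (M.size m2)           -- 3: m2 shares most of its nodes with m1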
Compare these two files. The first uses ByteString throughout:
    import qualified Data.Map as M
    import qualified Data.ByteString.Char8 as B

    main = do
      m <- fmap proc (B.readFile "baseforms.txt")
      print (M.lookup (B.pack "sometext") m)
      print (M.size m)

    proc = M.fromList . concatMap (\(a:bs) -> map (flip (,) a) bs)
         . map B.words . B.lines
and the Text-ified equivalent:
    import qualified Data.Map as M
    import qualified Data.ByteString.Char8 as B
    import Data.Text.Encoding (decodeUtf8)
    import qualified Data.Text as T

    main = do
      m <- fmap proc (B.readFile "baseforms.txt")
      print (M.lookup (T.pack "sometext") m)
      print (M.size m)

    proc = M.fromList . concatMap (\(a:bs) -> map (flip (,) a) bs)
         . map T.words . T.lines . decodeUtf8
On my machine, the Python/C version takes just under 6 seconds, the bytestring file takes 8 seconds, and the text file a little over 10.
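If you want to reproduce the timings and watch the memory at the same time, GHC's runtime can print its own allocation and residency summary; assuming GHC and a hypothetical file name, something like:

    $ ghc -O2 Words.hs
    $ ./Words +RTS -s

The `+RTS -s` flag reports total allocation and maximum residency, which is less crude than watching top.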
The bytestring implementation seems to use a bit more memory than the Python, and the text implementation distinctly more. The text program takes longer because, of course, it adds the decoding step to Text and then uses Text operations to break the lines and Text comparisons to build the map.
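To see how much of the gap is just the decoding step, you can force the decode and line-splitting without building any map at all; a small sketch under the same assumptions as the programs above (same input file, with `evaluate` only there to force the work):

    import qualified Data.ByteString.Char8 as B
    import qualified Data.Text as T
    import Data.Text.Encoding (decodeUtf8)
    import Control.Exception (evaluate)

    main :: IO ()
    main = do
      bytes <- B.readFile "baseforms.txt"
      -- decode to Text and split into lines, but build no Map
      n <- evaluate (length (T.lines (decodeUtf8 bytes)))
      print n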
Here is an analysis of the memory phenomena in the text case. First we have the bytestring in memory (130m). Once the text is built (~250m, to judge unscientifically from what is going on in top), the bytestring is garbage collected while we build the tree. In the end the text tree (~380m) uses more memory than the bytestring tree (~260m) because the text fragments in the tree are bigger. The program as a whole uses more because the text held in memory during tree construction is itself bigger. To put it crudely: each bit of whitespace is turned into a tree constructor and two text constructors, together with the text version of whatever the first "word" of the line was and the text representation of the next word. The weight of the constructors is about 130m in either case, so at the last moments of tree construction we are using something like 130m + 130m + 130m = 390m in the bytestring case, and 250m + 130m + 250m = 630m in the text case.
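Restating that back-of-the-envelope arithmetic (all figures eyeballed from top, so only roughly right):

    peak ≈ live input + tree constructors + string fragments in the tree

    bytestring:  130m + 130m + 130m ≈ 390m
    text:        250m + 130m + 250m ≈ 630m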