Haskell Performance with Mutable Structures

I studied the answers given in Haskell Code Optimization, and noticed that with a small input, Haskell actually runs faster than Python.

But as the dataset grew, Python took the lead. A HashMap-based version improved performance, but it still lagged behind.

Even worse, I tried transliterating Python's dictionaries into mutable hash tables and saw a massive performance hit. I really want to understand what is going on, because the future application will need mutable structures.

Here's the slightly modified Python code:

#!/usr/bin/env python2.7
import random
import re
import cPickle

class Markov:
    def __init__(self, filenames):
        self.filenames = filenames
        self.cache = self.train(self.readfiles())
        picklefd = open("dump", "w")
        cPickle.dump(self.cache, picklefd)
        print "Built a db of length "+str(len(self.cache))
        picklefd.close()

    def train(self, text):
        splitted = text.split(' ')
        print "Total of %d splitted words" % (len(splitted))
        cache = {}
        for i in xrange(len(splitted)-2):
            pair = (splitted[i], splitted[i+1])
            followup = splitted[i+2]
            if pair in cache:
                if followup not in cache[pair]:
                    cache[pair][followup] = 1
                else:
                    cache[pair][followup] += 1
            else:
                cache[pair] = {followup: 1}
        return cache

    def readfiles(self):
        data = ""
        for filename in self.filenames:
            fd = open(filename)
            data += fd.read()
            fd.close()
        return data

Markov(["76.txt"])

Haskell, with the original answer (train4), its hash map variant (trainHM2), and the hash table transliteration (trainHT):

{-# LANGUAGE BangPatterns,DeriveGeneric #-}
import GHC.Generics (Generic)
import Data.List (foldl')
import Data.Hashable
import qualified Data.Map as M
import qualified Data.HashMap.Strict as HM
import qualified Data.ByteString.Char8 as B
import qualified Data.HashTable.IO as HT

-- Using this instead of tuples yielded a 5~10% speedup
data StringTuple = STP !B.ByteString !B.ByteString deriving(Ord,Eq,Generic)
instance Hashable StringTuple

type Database3 = M.Map StringTuple (M.Map B.ByteString Int)
type DatabaseHM = HM.HashMap StringTuple (HM.HashMap B.ByteString Int)
type DatabaseHT = HT.BasicHashTable StringTuple DatabaseInnerHT
type DatabaseInnerHT = (HT.BasicHashTable B.ByteString Int)

train4 :: [B.ByteString] -> Database3
train4 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
    where update m (x,y,z) = M.insertWith' (inc z) (STP x y) (M.singleton z 1) m
          inc k _ = M.insertWith' (+) k 1

trainHM2 :: [B.ByteString] -> DatabaseHM
trainHM2 words = trainHM2G words HM.empty
    where trainHM2G (x:y:[]) !hm = hm
          trainHM2G (x:y:z:rem) !hm =
              trainHM2G (y:z:rem) (HM.insertWith (inc z) (STP x y) (HM.singleton z 1) hm)
              where inc k _ = HM.insertWith (+) k 1

trainHT :: [B.ByteString] -> IO (DatabaseHT)
trainHT words = do
    hm <- HT.new
    trainHT' words hm
    where trainHT' (x:y:[]) !hm = return hm
          trainHT' (x:y:z:rem) !hm = do
              let pair = STP x y
              inCache <- HT.lookup hm pair
              case inCache of
                  Nothing -> do
                      htN <- HT.new :: IO (DatabaseInnerHT)
                      HT.insert htN z $! 1
                      HT.insert hm pair $! htN
                  Just ht -> do
                      cvM <- HT.lookup ht z
                      case cvM of
                          Nothing -> HT.insert ht z 1
                          Just cv -> HT.insert ht z $! (cv+1)
              trainHT' (y:z:rem) hm

main = do
    contents <- B.readFile "76.txt"
    let bcont = B.split ' ' $ contents
    print $ length bcont
    let db = train4 $ bcont
    print $ "Built a DB of " ++ show (M.size db) ++ " words"
    --let db = trainHM2 $ bcont
    --print $ "Built a DB of " ++ show (HM.size db) ++ " words"
    --db <- trainHT $ (bcont)
    --print $ "Built a DB"

A line-by-line transliteration to C++11 (feel free to fix it):

#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <unordered_map>
#include <tuple>

/* Hash stuff here
   Taken from https://stackoverflow.com/a/7111460/314327 */
size_t hash_combiner(size_t left, size_t right) //replaceable
{ return left^right; }

template<int index, class...types>
struct hash_impl {
    size_t operator()(size_t a, const std::tuple<types...>& t) const {
        typedef typename std::tuple_element<index, std::tuple<types...>>::type nexttype;
        hash_impl<index-1, types...> next;
        size_t b = std::hash<nexttype>()(std::get<index>(t));
        return next(hash_combiner(a, b), t);
    }
};
template<class...types>
struct hash_impl<0, types...> {
    size_t operator()(size_t a, const std::tuple<types...>& t) const {
        typedef typename std::tuple_element<0, std::tuple<types...>>::type nexttype;
        size_t b = std::hash<nexttype>()(std::get<0>(t));
        return hash_combiner(a, b);
    }
};
namespace std {
    template<class...types>
    struct hash<std::tuple<types...>> {
        size_t operator()(const std::tuple<types...>& t) {
            const size_t begin = std::tuple_size<std::tuple<types...>>::value-1;
            return hash_impl<begin, types...>()(1, t); //1 should be some larger value
        }
    };
}
/* Hash stuff end */

using namespace std;

/* Split, from https://stackoverflow.com/a/236803/314327 */
vector<string> &split(const string &s, char delim, vector<string> &elems)
{
    stringstream ss(s);
    string item;
    while (getline(ss, item, delim)) {
        elems.push_back(item);
    }
    return elems;
}
vector<string> split(const string &s, char delim)
{
    vector<string> elems;
    split(s, delim, elems);
    return elems;
}
/* Split end */

typedef tuple<string,string> STP;

unordered_map< STP,unordered_map< string,int > > train(vector<string> &words)
{
    unordered_map< STP,unordered_map< string,int > > cache;
    for(int i=0;i<words.size()-2;i++)
    {
        STP tup = make_tuple(words[i],words[i+1]);
        auto it = cache.find(tup);
        if(it!=cache.end())
        {
            auto it2 = it->second.find(words[i+2]);
            if(it2!=it->second.end())
                it2->second += 1;
            else
                it->second[words[i+2]] = 1;
        }
        else
        {
            unordered_map< string,int > cacheInner;
            cacheInner[words[i+2]] = 1;
            cache[tup] = cacheInner;
        }
    }
    return cache;
}

int main()
{
    ifstream ifs("76.txt");
    stringstream buf;
    buf << ifs.rdbuf();
    string contents(buf.str());
    auto words = split(contents,' ');
    cout << words.size();
    auto wordCache = train(words);
    cout << "\nHashtable count " << wordCache.size();
    cout << "\n";
    return 0;
}

And the results:

C++ (GCC 4.6.3)

$ g++ -O3 -fpermissive -std=c++0x cpptest.cpp -o cpptest
$ /usr/bin/time -f "%E" ./cpptest
1255153
Hashtable count 64442
0:01.02

Python (2.7)

$ /usr/bin/time -f "%E" ./pytest.py
Total of 1255153 splitted words
Built a db of length 64442
0:02.62

Haskell (GHC 7.4.1) - "train4"

$ ghc -fllvm -O2 -rtsopts -fforce-recomp -funbox-strict-fields hasktest.hs -o hasktest
[1 of 1] Compiling Main             ( hasktest.hs, hasktest.o )
Linking hasktest ...
$ /usr/bin/time -f "%E" ./hasktest
1255153
"Built a DB of 64442 words"
0:06.35

Haskell - "trainHM2"

$ /usr/bin/time -f "%E" ./hasktest
1255153
"Built a DB of 64442 words"
0:04.23

Haskell - "trainHT" - Using the basic hash table variant (which is close to what Python does for dictionaries, I think?)

$ /usr/bin/time -f "%E" ./hasktest
1255153
"Built a DB"
0:10.42

Using Linear or Cuckoo for both tables

0:06.06
0:05.69

Using cuckoo for the outermost table and linear inside

 0:04.17 

Profiling showed there was quite a lot of GC activity, so I reran with +RTS -A256M

 0:02.11 

For the input, I selected 76.txt, as indicated in one of the answers, and duplicated the entire text 12 times; it comes to about 7 MB.

Tests were performed on Ubuntu 12.04 in a VirtualBox instance using a single i5-520M core. Each test was run more than once, and all results were close together.

The last result is pretty good for this microbenchmark, but there is still room to improve the code, given that:

  • Cuckoo and Linear hashing may be better suited to this dataset, yet Python's "generic" dictionary solution gets by without any comparable tuning.
  • Valgrind reports that the C++ and Python versions use approximately 60MBs, while the Haskell RTS reports anywhere from 125MBs (Cuckoo & Linear) to 409MBs (basic variant, large heap) of memory for the same task. Won't this stress the garbage collector in a production environment? Is it possible to rework the code so that it uses less memory?
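One common source of extra residency in the Haskell versions is thunk buildup in lazily updated maps, which is why the original answer reached for insertWith'. A minimal containers-only sketch of the strict approach (toy input; the counts name is illustrative, not from the code above):

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as MS

-- Strict insertWith forces each count as it is stored, so no
-- chain of (+) thunks accumulates behind frequently seen keys.
counts :: [String] -> MS.Map String Int
counts = foldl' (\m w -> MS.insertWith (+) w 1 m) MS.empty

main :: IO ()
main = print (MS.toList (counts ["a","b","a","a"]))
```

With a lazy map and plain insertWith, each hot key would instead hold an unevaluated tower of additions until the value is demanded.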

Update:

I think "garbage reduction" is what I am looking for. I know that Haskell does not work the way C++ does, but I want to know whether the garbage produced by the imperative-style code can be reduced, since the C++ version consumed half the memory without any space leaks. That would hopefully improve both memory usage and runtime (since there would be less GC).

Update 2:

Calculating the length while building the table certainly reduced the memory footprint (down to 40MBs, actually!), but it causes the GC to take longer, leading to a slower runtime (due to dropping the values that were lazily read from the list, I suppose?).

And yes, hashtable operations do take a considerable amount of time. I will try mimicking the suggested alterations to see whether it improves things further.
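The length-counting change described in Update 2 can be sketched as a single fold that threads the count alongside the map, so the input list is consumed once and never retained. The names and the toy input here are illustrative, not taken from the code above:

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- Count elements in the same fold that builds the map; a separate
-- (length xs) pass would keep a reference to the whole list alive.
trainCounting :: [(String, Int)] -> (Int, M.Map String Int)
trainCounting = foldl' step (0, M.empty)
  where
    step (!n, !m) (k, v) = (n + 1, M.insertWith (+) k v m)

main :: IO ()
main = do
  let (n, db) = trainCounting [("a",1),("b",2),("a",3)]
  print n
  print (M.toList db)
```

The bang patterns on the accumulator matter: without them the count and the map would themselves pile up as thunks across the fold.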

1 answer

This is not really an answer, but it is too much to put in a comment, so I will leave it here until something better appears. I don't see anything obviously wrong with your hash table code (the only part I really looked at), other than a bit of refactoring/golfing.

Firstly, the memory usage is not very surprising to me. With -A256M you are requiring the RTS to have a minimum allocation area of 256M, which puts a floor on your memory usage. If data gets promoted or copied during GC, memory usage grows further. Also note that Haskell data structures tend to be a bit memory-hungry relative to other languages; see, for example, Memory footprint of Haskell data types. Given both of these factors, I am not surprised by the total memory usage with a large allocation area.

Structures such as a HashMap or a bytestring trie could use less memory, with the attendant drawback of using a data structure other than a hash table.

Speaking of the allocation area, this code is a bit of a pathological microbenchmark in that nearly all of the allocated data (mostly ByteString data and internal hash table values) is long-lived (it persists until the end of the program). This puts your test program in a situation where a very large allocation area pays off especially well, whereas if this database construction were just one part of a larger program, the costs of the larger area might come to dominate.

As for the optimal GC settings for a production environment, it is very hard to say outside the context of an actual, complete program. I can say that if performance really matters, it is worth spending some time experimenting with GC flags. Even more so if you use the threaded runtime.
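A concrete starting point for that kind of experiment might look like the following; the flag values are just the ones already tried in the question, not recommendations:

```shell
# Compile with RTS options enabled (as in the question)
ghc -O2 -rtsopts hasktest.hs -o hasktest

# Baseline run, printing GC statistics to compare against
./hasktest +RTS -s -RTS

# Larger nursery, as tried in the question; vary -A and re-check -s output
./hasktest +RTS -A256M -s -RTS
```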

Aside from the memory issues, I strongly suspect the hashtables package is working against you here. According to a profile, the 4 most expensive functions are lookup/go, lookup, insert, and delete'.go. I think that if it had an equivalent of Data.Map.alter, some of your operations could be coalesced for performance gains. I would be very surprised if Python dictionaries were not optimized for cases like cache[key] += 1.
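To show what such a coalesced update looks like, here is a sketch using Data.Map.alter from containers, which walks the structure once instead of the separate lookup-then-insert dance in trainHT. The bump helper is illustrative, not part of the hashtables API:

```haskell
import qualified Data.Map.Strict as M

-- One traversal handles both the "key absent" and "key present"
-- cases, standing in for Python's cache[key] += 1 idiom.
bump :: Ord k => k -> M.Map k Int -> M.Map k Int
bump = M.alter (Just . maybe 1 (+1))

main :: IO ()
main = print . M.toList $ bump "a" (bump "a" (bump "b" M.empty))
```

A hash table API with the same shape would let trainHT do one probe per word triple instead of two.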

