Need a fast map between strings and integers

I have a map of string to unsigned in which I store each word and its frequency, like this:

map<string,unsigned> mapWordFrequency; //contains 1 billion such mappings 

Then I read a huge file (100 GB) and keep only those words from the file that have a frequency of more than 1000. I check each word's frequency using: mapWordFrequency[word] > 1000. However, since my mapWordFrequency has 1 billion mappings and my file is huge, checking mapWordFrequency[word] > 1000 for every word in the file is very slow and takes more than 2 days. Can anyone suggest how I can improve the efficiency of the above code?

The map does not fit in my memory and swapping takes a lot of time.

Would erasing all words with a frequency < 1000, using the map's erase function, help?

+5
6 answers

I suggest you use unordered_map rather than map. As already discussed in the comments, the former will give you O(1) insertion/lookup time as opposed to O(log n) with map.

As you said, swapping takes a lot of time, so how about tackling the problem incrementally. Load as much of the data as fits into an unordered_map in memory, process it, and continue. After one pass you will have many unordered_maps, and you can start merging them in subsequent passes.
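
A minimal sketch of that chunked idea, just to illustrate it; the file names, the chunk limit, and the assumption that the merged counts fit in memory are all placeholders, not taken from the question:

    #include <cstddef>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <unordered_map>

    // Pass 1: count words in memory-sized chunks and flush each chunk's
    // partial counts to its own temporary file.
    int flushChunks(const char* inputFile, std::size_t chunkLimit)
    {
        std::ifstream in(inputFile);
        std::unordered_map<std::string, unsigned> partial;
        int chunkId = 0;
        std::string word;
        auto flush = [&] {
            std::ofstream out("chunk" + std::to_string(chunkId++) + ".txt");
            for (const auto& kv : partial)
                out << kv.first << ' ' << kv.second << '\n';
            partial.clear();
        };
        while (in >> word) {
            ++partial[word];
            if (partial.size() >= chunkLimit) // map grew too big: spill to disk
                flush();
        }
        if (!partial.empty())
            flush();
        return chunkId; // number of chunk files written
    }

    // Pass 2 (simplified): merge the partial counts. This still assumes the
    // merged map fits in memory; if not, merge only a few chunks at a time.
    void mergeChunks(int chunkCount, unsigned threshold)
    {
        std::unordered_map<std::string, unsigned> total;
        for (int i = 0; i < chunkCount; ++i) {
            std::ifstream in("chunk" + std::to_string(i) + ".txt");
            std::string word;
            unsigned count;
            while (in >> word >> count)
                total[word] += count;
        }
        for (const auto& kv : total)
            if (kv.second > threshold)
                std::cout << kv.first << ' ' << kv.second << '\n';
    }

    int main()
    {
        int chunks = flushChunks("input.txt", 10000000); // both values are placeholders
        mergeChunks(chunks, 1000);
    }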

You can speed this up further by doing it in a distributed way: process chunks of the data on different machines and then combine the results (which will be in the form of unordered maps). However, I have no prior experience with distributed computing, so I cannot help with that part.

Also, if implementing something like this is too cumbersome, I suggest you use an external mergesort. This is a method for sorting a file that is too large to fit in memory, by sorting smaller chunks and merging them. The reason I suggest it is that external mergesort is a fairly common technique, and you may find already implemented solutions that fit your needs. Even though the time complexity of sorting is higher than your idea of using a map, it will reduce the swapping overhead compared to a map. As pointed out in the comments, sort on linux implements an external mergesort.

+4

You can use a hash map where the hash of your string is the key and the word count is the value. It will be faster. You can choose a good string hash function based on your requirements. Here is a link to some good hash functions:

http://eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx

You can also use some third-party libraries.

EDIT: Pseudo Code

    int mapWordFrequency[MAX_SIZE] = {0}; // if MAX_SIZE is large, go with dynamic memory allocation
    int someHashMethod(string input);

    // loop over each currString in ListOfString:
    int key = someHashMethod(currString);
    ++mapWordFrequency[key];
    if (mapWordFrequency[key] > 1000)
        doSomeThing();

Update: As @Jens pointed out, there may be cases where someHashMethod() returns the same int (hash) for two different strings. In that case we have to resolve the collision, and the lookup time will be worse than constant. Also, since the input size is very large, creating a single array of that size may not be possible. In that case we could use distributed computing concepts, but the actual lookup time will again increase compared to a single machine.
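
For illustration, one common way to resolve those collisions is separate chaining: keep the actual strings in each bucket so colliding words stay distinct. A rough sketch, where the table size and the FNV-1a-style hash are placeholders chosen here rather than something from the answer:

    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    const std::size_t TABLE_SIZE = 1 << 20; // placeholder size
    std::vector<std::vector<std::pair<std::string, unsigned>>> table(TABLE_SIZE);

    std::size_t someHashMethod(const std::string& s) // any good string hash will do
    {
        unsigned long long h = 0xcbf29ce484222325ULL; // FNV-1a, for illustration
        for (unsigned char c : s) {
            h ^= c;
            h *= 0x100000001b3ULL;
        }
        return static_cast<std::size_t>(h);
    }

    unsigned& frequencyOf(const std::string& word)
    {
        auto& bucket = table[someHashMethod(word) % TABLE_SIZE];
        for (auto& entry : bucket)     // resolve collisions by comparing
            if (entry.first == word)   // the actual strings
                return entry.second;
        bucket.emplace_back(word, 0u); // first time we see this word
        return bucket.back().second;
    }

Note that lookups in a bucket still degrade toward linear time in the worst case, which is exactly the point the update above is making.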

+2

Depending on the statistical distribution of your words, it may be worth compressing each word before adding it to the map. Since this compression is lossless, you can recover the original words after filtering. The idea is that it may reduce the average word size (and therefore save memory as well as key comparison time). Here is a simple compression/decompression routine you could use:

    #include <string>
    #include <sstream>
    #include <boost/iostreams/filtering_streambuf.hpp>
    #include <boost/iostreams/filter/zlib.hpp>
    #include <boost/iostreams/copy.hpp>

    inline std::string compress(const std::string& data)
    {
        std::stringstream decompressed {data};
        boost::iostreams::filtering_streambuf<boost::iostreams::input> stream;
        stream.push(boost::iostreams::zlib_compressor());
        stream.push(decompressed);
        std::stringstream compressed {};
        boost::iostreams::copy(stream, compressed);
        return compressed.str();
    }

    inline std::string decompress(const std::string& data)
    {
        std::stringstream compressed {data};
        boost::iostreams::filtering_streambuf<boost::iostreams::input> stream;
        stream.push(boost::iostreams::zlib_decompressor());
        stream.push(compressed);
        std::stringstream decompressed;
        boost::iostreams::copy(stream, decompressed);
        return decompressed.str();
    }

In addition to using std::unordered_map, as others have suggested, you could also move any word that has already been seen more than 1000 times out of the map and into a std::unordered_set. That requires checking the set before the map, but you may see better hash performance by doing so, which can make this strategy worthwhile.
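
A rough sketch of that promotion strategy; the names and the exact threshold handling are illustrative only:

    #include <string>
    #include <unordered_map>
    #include <unordered_set>

    std::unordered_map<std::string, unsigned> counts;  // words still being counted
    std::unordered_set<std::string> frequentWords;     // words already seen > 1000 times
    const unsigned threshold = 1000;

    void addWord(const std::string& word)
    {
        if (frequentWords.count(word))     // already known to be frequent:
            return;                        // no need to keep counting
        if (++counts[word] > threshold) {  // crossed the threshold: promote it
            frequentWords.insert(word);    // and stop tracking the exact count
            counts.erase(word);
        }
    }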

+1

You need a different approach to this problem; your data is too large to process all at once. For example, you could split the file into several smaller files, and the easiest way is to split them logically by starting letter.

 100GB/24 letters = 4.17 GB 

You will now have twenty-four 4.17 GB files. You know that words in any one of these files cannot appear in any of the other files, which helps, because you will not have to merge the results. A 4 GB file is much easier to work with in RAM.

std::map has a problem once you start using a lot of memory, because it fragments a lot. Try std::unordered_map, and if that still doesn't perform well, you can load a file into memory and sort it. Counting occurrences will then be quite simple.

Assuming you have plenty of duplicate words, your map or unordered_map will have a significantly smaller memory footprint.

Run your code in a loop over each file and append the results to a single output file. It should finish fairly quickly.
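
A sketch of that split-then-count loop; the file names, the 26-letter a-z bucketing (slightly different from the 24-way split above), and the extra "other" bucket for words not starting with a letter are assumptions made here for illustration:

    #include <cctype>
    #include <fstream>
    #include <string>
    #include <unordered_map>

    // Pass 1: split the huge file into one smaller file per starting letter.
    void splitByFirstLetter(const char* inputFile)
    {
        std::ifstream in(inputFile);
        std::ofstream out[27]; // 'a'..'z' plus one "other" bucket
        for (int i = 0; i < 26; ++i)
            out[i].open(std::string("words_") + char('a' + i) + ".txt");
        out[26].open("words_other.txt");

        std::string word;
        while (in >> word) {
            char c = static_cast<char>(std::tolower(static_cast<unsigned char>(word[0])));
            int bucket = (c >= 'a' && c <= 'z') ? c - 'a' : 26;
            out[bucket] << word << '\n';
        }
    }

    // Pass 2: each per-letter file should now be small enough to count in RAM.
    void countOneFile(const std::string& file, std::ofstream& results, unsigned threshold)
    {
        std::ifstream in(file);
        std::unordered_map<std::string, unsigned> freq;
        std::string word;
        while (in >> word)
            ++freq[word];
        for (const auto& kv : freq)
            if (kv.second > threshold)
                results << kv.first << ' ' << kv.second << '\n';
    }

    int main()
    {
        splitByFirstLetter("input.txt");
        std::ofstream results("frequent_words.txt");
        for (int i = 0; i < 26; ++i)
            countOneFile(std::string("words_") + char('a' + i) + ".txt", results, 1000);
        countOneFile("words_other.txt", results, 1000);
    }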

+1

The main problem seems to be memory, so we are looking for a solution that uses little memory. One way to save memory is to use a sorted vector instead of a map. Now, the vector has a lookup cost of ~log(n) comparisons and an average insertion cost of n/2, which is bad. The upside is a mostly small constant factor: the memory you need to move is small because of how the data is laid out, and you get sequential memory (cache friendliness), which can easily outperform a map. The required memory would be 2 (word count) + 4 (index) + 1 (\0 character) + x (word length) bytes per word. To achieve that, we need to get rid of std::string, because it is too big in this case.

You can split the map into a vector<char>, which stores the strings one after another separated by \0 characters, a vector<unsigned int> for the index, and a vector<short int> for the word counts. The code would look something like this (tested):

    #include <vector>
    #include <algorithm>
    #include <cstring>
    #include <string>
    #include <fstream>
    #include <iostream>

    std::vector<char> strings;
    std::vector<unsigned int> indexes;
    std::vector<short int> wordcount;

    const int countlimit = 1000;

    void insertWord(const std::string &str) {
        //find the word
        auto stringfinder = [](unsigned int lhs, const std::string &rhs) {
            return &strings[lhs] < rhs;
        };
        auto index = lower_bound(begin(indexes), end(indexes), str, stringfinder);
        //increment counter
        if (index == end(indexes) || strcmp(&strings[*index], str.c_str())) {
            //unknown word
            wordcount.insert(begin(wordcount) + (index - begin(indexes)), 1);
            indexes.insert(index, strings.size());
            strings.insert(end(strings), str.c_str(), str.c_str() + str.size() + 1);
        }
        else {
            //known word
            auto &count = wordcount[index - begin(indexes)];
            if (count <= countlimit) //prevent overflow, but let the count go one past the limit so the final check below can trigger
                count++;
        }
    }

    int main() {
        std::ifstream f("input.txt");
        std::string s;
        while (f >> s) { //not a good way to read in words
            insertWord(s);
        }
        for (size_t i = 0; i < indexes.size(); ++i) {
            if (wordcount[i] > countlimit) {
                std::cout << &strings[indexes[i]] << ": " << wordcount[i] << '\n';
            }
        }
    }

This approach stores all the words in memory. According to Wolfram Alpha, the average word length in English is 5.1 characters. This gives a total memory requirement of (5.1 + 7) * 1 billion bytes = 12.1 billion bytes ≈ 12.1 GB. Assuming you have a halfway modern computer with 16+ GB of RAM, it all fits in RAM.

If this does not work (because the words are not English and do not fit in memory), the next approach would be a memory-mapped file. That way you can make indexes point into the memory-mapped file instead of into strings, so you can get rid of strings altogether, but access time will suffer.
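
For what it is worth, here is a minimal POSIX-only sketch of that idea; the file name and the assumption that the words were preprocessed into a \0-separated blob on disk are illustrative, not part of the answer:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <iostream>

    int main()
    {
        // Map a preprocessed file of '\0'-separated words so the character
        // data stays on disk and only the indexes need to live in RAM.
        int fd = open("words.bin", O_RDONLY);
        if (fd < 0)
            return 1;

        struct stat sb;
        if (fstat(fd, &sb) != 0)
            return 1;

        void* mapping = mmap(nullptr, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (mapping == MAP_FAILED)
            return 1;
        const char* strings = static_cast<const char*>(mapping);

        // indexes[i] would now be an offset into this mapping, e.g.:
        //   const char* word = strings + indexes[i];
        std::cout << "mapped " << sb.st_size << " bytes, first word: " << strings << '\n';

        munmap(mapping, sb.st_size);
        close(fd);
    }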

If this fails due to poor performance, you should look into map-reduce, which maps very well onto this case. It gives you as much throughput as you have machines.

+1

@TonyD Can you give a small example with a trie? - Rose of Sharma

Here is an example of a trie-based approach to this problem:

    #include <iostream>
    #include <string>
    #include <limits>
    #include <array>
    #include <cstdint>

    class trie
    {
      public:
        void insert(const std::string& s) { node_.insert(s.c_str()); }

        friend std::ostream& operator<<(std::ostream& os, const trie& t)
        { return os << t.node_; }

      private:
        struct Node
        {
            Node() : freq_(0) { }
            uint16_t freq_;
            std::array<Node*, 26> next_letter_{};

            void insert(const char* p)
            {
                if (*p)
                {
                    Node*& p_node = next_letter_[*p - 'a'];
                    if (!p_node)
                        p_node = new Node;
                    p_node->insert(++p);
                }
                else if (freq_ < std::numeric_limits<decltype(freq_)>::max())
                    ++freq_;
            }
        } node_;

        friend std::ostream& operator<<(std::ostream& os, const Node& n)
        {
            os << '(';
            if (n.freq_)
                os << n.freq_ << ' ';
            for (size_t i = 0; i < 26; ++i)
                if (n.next_letter_[i])
                    os << char('a' + i) << *(n.next_letter_[i]);
            return os << ')';
        }
    };

    int main()
    {
        trie my_trie;
        my_trie.insert("abc");
        my_trie.insert("abcd");
        my_trie.insert("abc");
        my_trie.insert("bc");
        std::cout << my_trie << '\n';
    }

Output:

 (a(b(c(2 d(1 ))))b(c(1 ))) 

The output is a compressed, tree-like representation of your word-frequency histogram: abc appears 2 times, abcd 1, bc 1. The parentheses can be thought of as pushing and popping characters from a "stack" to form the current prefix, or - when there is a number - a word.

How much of an improvement over map this gives depends on the variation in the input words, but it is worth a try. A more memory-efficient implementation could use a vector or set - or even a string of, say, space-separated suffixes - while there are only a few elements under the current prefix, then switch to the array of 26 pointers when that is likely to need less memory.

+1
