The main problem seems to be memory, so we are looking for a solution that uses little of it. One way to save memory is to use sorted vectors instead of a map. A vector gives you a search with ~log(n) comparisons and an average insertion cost of n/2 element moves, which sounds bad. In practice it is not: because the data is split across several small-element vectors, the memory that has to be moved per insertion is small, and you get contiguous storage (cache friendliness), which can easily beat a map. The memory required is 2 bytes (word count) + 4 bytes (index) + 1 byte ('\0' terminator) + x bytes (word length) per word. To get there we have to get rid of std::string, whose per-object overhead (typically 24 to 32 bytes on common 64-bit implementations, plus a heap allocation for longer strings) is far too big here.
You can split the map into a vector&lt;char&gt; that stores the words back to back, separated by '\0' characters, a vector&lt;unsigned int&gt; holding the starting offset of each word (kept sorted by the word it refers to), and a vector&lt;short int&gt; for the word counts. The code might look something like this:
    #include <vector>
    #include <algorithm>
    #include <cstring>
    #include <string>
    #include <fstream>
    #include <iostream>

    std::vector<char> strings;         // all words back to back, separated by '\0'
    std::vector<unsigned int> indexes; // starting offset of each word in strings, sorted by word
    std::vector<short int> wordcount;  // count per word, parallel to indexes

    const int countlimit = 1000;

    void insertWord(const std::string &str) {
        // find the word: compare the word stored at offset lhs with the new word
        auto stringfinder = [](unsigned int lhs, const std::string &rhs) {
            return &strings[lhs] < rhs;
        };
        auto index = std::lower_bound(begin(indexes), end(indexes), str, stringfinder);
        if (index == end(indexes) || strcmp(&strings[*index], str.c_str())) {
            // unknown word: insert its count, its offset and its characters
            wordcount.insert(begin(wordcount) + (index - begin(indexes)), 1);
            indexes.insert(index, strings.size());
            strings.insert(end(strings), str.c_str(), str.c_str() + str.size() + 1);
        } else {
            // known word: increment its counter
            auto &count = wordcount[index - begin(indexes)];
            if (count < countlimit) // prevent overflow of the short counter
                count++;
        }
    }

    int main() {
        std::ifstream f("input.txt");
        std::string s;
        while (f >> s) { // not a good way to read in words
            insertWord(s);
        }
        for (size_t i = 0; i < indexes.size(); ++i) {
            if (wordcount[i] >= countlimit) { // counts are capped at countlimit
                std::cout << &strings[indexes[i]] << ": " << wordcount[i] << '\n';
            }
        }
    }
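Note that indexes stores offsets into strings rather than raw pointers, so the entries stay valid when strings reallocates, and the counters are capped at countlimit so the short int cannot overflow; the output loop then reports every word that reached the cap.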
This approach stores all the words in memory. According to Wolfram Alpha, the average word length in English is 5.1 characters. That gives an overall memory requirement of (5.1 + 7) bytes per word * 1 billion words = 12.1 billion bytes ≈ 12.1 GB (the worst case, in which every word is distinct). Assuming you have a halfway modern computer with 16+ GB of RAM, it fits in memory.
If this fails (because the words are not English and do not fit into memory), the next approach would be memory-mapped files. You can then make indexes point into the memory-mapped file instead of into strings, which lets you get rid of strings entirely, but access times will suffer.
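To make that concrete, here is a minimal sketch of the memory-mapping idea on POSIX systems (the file name input.txt and the whitespace handling are assumptions for illustration). It only maps the file and scans it in place; a real version would store offsets into the mapping in indexes and compare words only up to the next whitespace, since words in the file are not '\0'-terminated:

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <iostream>

    // Minimal sketch: map the whole input file read-only and walk it in place,
    // so the word characters never have to be copied into a separate vector.
    int main() {
        int fd = open("input.txt", O_RDONLY);
        if (fd < 0) return 1;

        struct stat st;
        if (fstat(fd, &st) != 0) return 1;

        void *mapped = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (mapped == MAP_FAILED) return 1;
        const char *data = static_cast<const char *>(mapped);

        // indexes could now store offsets into data instead of into a strings
        // vector; here we only demonstrate scanning the mapping word by word.
        size_t words = 0;
        bool inWord = false;
        for (off_t i = 0; i < st.st_size; ++i) {
            char c = data[i];
            bool space = (c == ' ' || c == '\n' || c == '\t' || c == '\r');
            if (!space && !inWord) ++words;
            inWord = !space;
        }
        std::cout << "words: " << words << '\n';

        munmap(mapped, st.st_size);
        close(fd);
    }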
If this fails due to poor performance, you should look into map-reduce, which is very easy to apply to this problem and gives you roughly as much performance as you are willing to throw hardware at it.
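For illustration, here is a minimal single-machine sketch of the map-reduce shape (the input words, the worker count and the use of std::thread are assumptions, not a particular framework): words are routed to partitions by hash, each worker counts its partition independently, and the per-partition results together form the final answer. On a cluster, each partition would simply live on a different machine.

    #include <cstddef>
    #include <string>
    #include <vector>
    #include <thread>
    #include <unordered_map>
    #include <iostream>

    int main() {
        std::vector<std::string> input = {"the", "cat", "sat", "on", "the", "mat", "the"};
        const std::size_t workers = 4;

        // "Map"/shuffle phase: route every word to a partition by hash,
        // so all occurrences of a word end up in the same partition.
        std::vector<std::vector<std::string>> partitions(workers);
        for (const auto &w : input)
            partitions[std::hash<std::string>{}(w) % workers].push_back(w);

        // "Reduce" phase: each worker counts its own partition in parallel.
        std::vector<std::unordered_map<std::string, long>> counts(workers);
        std::vector<std::thread> pool;
        for (std::size_t i = 0; i < workers; ++i)
            pool.emplace_back([&, i] {
                for (const auto &w : partitions[i]) ++counts[i][w];
            });
        for (auto &t : pool) t.join();

        // The partial results are already disjoint, so printing them one
        // after another gives the complete word counts.
        for (const auto &part : counts)
            for (const auto &kv : part)
                std::cout << kv.first << ": " << kv.second << '\n';
    }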