In the worst case, all your numbers are different from each other, except for one number that appears twice, and there is no way to find it in the main memory if you do not have two duplicate numbers loaded into the main memory at the same time, which is unlikely without sorting if your the total data size is much larger than the size of the main memory. In this case, aysmptotically, the best thing to do is sort the numbers in batches and save them to disk in files, and then read the merge merge step in all sorted files in memory several lines at a time and output the merged sorted list to a new file. Then you produce the complete sorted file in order and count how many times you see each number, keeping track of which number happened in most cases.
If you think that the most frequent number is 50% frequency or higher, then you can do much better. You can solve the problem with permanent extra memory just passing through the list of numbers once. Basically, you start by initializing the most common value (MCV) to the first number and initializing the counter N to 1. Then you look at the list. If the next number on the list is MCV, you increase N by one. Otherwise, you decrease N by 1. If N is 0 and the next number is different from MCV, you set MCV to a new number and set N to 1. It is easy to prove that this will end with the most common value stored in MCV.
user2566092
source share