Find the most frequent number in an array with limited memory

How do you find the most frequent number in an array? The array can be very large, for example 2 GB, and we have limited memory, say 100 MB.

I am thinking of an external sort, which would place duplicate numbers next to each other. Or a hash table. But I don't know how to deal with the limited memory, and I'm not even sure this is a good idea.

+7
sorting arrays algorithm
5 answers

Assumptions:

  • An integer is 4 bytes.
  • The 2 GB array contains fewer than 100 MB / 4 B = 104857600 / 4 = 26214400 distinct integers, and every value lies in the range 0 to 26214399.

Make a histogram.

  • Make buckets in our 100 MB of space: an integer array named histogram that can store up to 26214400 counters, all initialized to 0.
  • Iterate once through the 2 GB array. When you read x, do histogram[x]++.
  • Find the maximum in histogram by iterating through it once. If the maximum is histogram[i], then i is the most frequent number.

The bottleneck is step 2, iterating through a 2 GB array, but we only do this once.
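A minimal sketch of this in Python, assuming the input is a binary file of little-endian 4-byte non-negative integers that all fall in the counter range (the file name, chunk size, and binary layout are illustrative assumptions):

    import array

    NUM_COUNTERS = 26214400      # 100 MB / 4 bytes per counter
    CHUNK = 1 << 20              # integers read per disk access

    def most_frequent(path):
        # Step 1: the histogram uses essentially all of the 100 MB budget.
        histogram = array.array('i', [0]) * NUM_COUNTERS

        # Step 2: one pass over the 2 GB file of 4-byte integers.
        with open(path, 'rb') as f:
            while buf := f.read(4 * CHUNK):
                values = array.array('i')   # 'i' is 4 bytes on most platforms
                values.frombytes(buf)
                for x in values:
                    histogram[x] += 1       # assumes 0 <= x < NUM_COUNTERS

        # Step 3: one pass over the histogram to find the winner.
        i = max(range(NUM_COUNTERS), key=histogram.__getitem__)
        return i, histogram[i]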

If the second assumption does not hold (i.e. there are more than 26214400 distinct integers):

Make a histogram for the numbers in the range 0 to 26214399 and store its most frequent number. Then make a histogram for the numbers in the range 26214400 to 52428799, keeping the more frequent of its winner and the previously stored one. And so on.

In the worst case, with 2^32 distinct numbers, it will make (2^32 / 26214400 + 1) = 164 passes over the 2 GB array.

In general, it will make (NUMBER_OF_DISTINCT_NUMBERS / 26214400 + 1) passes.
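The multi-pass variant can be sketched the same way; each pass counts only the values that fall into the current window and keeps the best (value, count) pair across passes (window size and file layout are the same assumptions as above):

    import array

    WINDOW = 26214400    # values counted per pass

    def most_frequent_multipass(path, max_value=2**32):
        best_val, best_count = None, 0
        for lo in range(0, max_value, WINDOW):
            histogram = array.array('i', [0]) * WINDOW
            with open(path, 'rb') as f:
                while buf := f.read(4 * (1 << 20)):
                    values = array.array('I')   # unsigned, full 32-bit range
                    values.frombytes(buf)
                    for x in values:
                        if lo <= x < lo + WINDOW:
                            histogram[x - lo] += 1
            i = max(range(WINDOW), key=histogram.__getitem__)
            if histogram[i] > best_count:
                best_val, best_count = lo + i, histogram[i]
        return best_val, best_count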

0

In the worst case, all your numbers are distinct except for one number that appears twice, and there is no way to find it in main memory unless the two duplicates happen to be loaded at the same time, which is unlikely without sorting when your total data size is much larger than main memory. In that case, asymptotically the best approach is to sort the numbers in batches, save the sorted batches to disk as files, and then do a merge step: read from all the sorted files a few values at a time and write the merged, sorted output to a new file. Then scan the complete sorted file in order, counting how many times you see each number and keeping track of the number that occurred the most.
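One way to sketch this external-sort approach in Python, assuming the input is an iterable of integers and that heapq.merge is acceptable for the merge step (the chunk size is an illustrative assumption):

    import heapq
    import itertools
    import tempfile

    def most_frequent_external(numbers, chunk_size=1000000):
        # Phase 1: sort fixed-size chunks and spill each to its own temp file.
        runs = []
        it = iter(numbers)
        while True:
            chunk = sorted(itertools.islice(it, chunk_size))
            if not chunk:
                break
            f = tempfile.TemporaryFile(mode='w+')
            f.writelines(f'{x}\n' for x in chunk)
            f.seek(0)
            runs.append(f)

        # Phase 2: k-way merge of the sorted runs; duplicates now come out
        # adjacent, so counting each run of equal values needs O(1) memory.
        merged = heapq.merge(*(map(int, run) for run in runs))
        best_val, best_count = None, 0
        for val, group in itertools.groupby(merged):
            count = sum(1 for _ in group)
            if count > best_count:
                best_val, best_count = val, count
        return best_val, best_count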

If you can assume that the most frequent number has a frequency of 50% or higher, you can do much better. You can solve the problem with constant extra memory in a single pass over the list. Basically, you start by initializing the most common value (MCV) to the first number and a counter N to 1. Then you walk through the list. If the next number is the MCV, increment N by one; otherwise decrement N by 1. If N is 0 and the next number differs from the MCV, set the MCV to that new number and reset N to 1. It is easy to prove that this ends with the most common value stored in the MCV.
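This is the Boyer-Moore majority vote; a minimal sketch under the answer's assumption that some value really does occupy at least half the input:

    def most_common_value(numbers):
        # Single pass, constant extra memory. The result is guaranteed
        # correct only when one value is a strict majority; otherwise it
        # is merely a candidate and a second counting pass should verify it.
        mcv, n = None, 0
        for x in numbers:
            if n == 0:
                mcv, n = x, 1
            elif x == mcv:
                n += 1
            else:
                n -= 1
        return mcv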

+2

If you have limited memory but reasonable processing power, and blazing speed is not a requirement, then depending on your data set you might try something like:

Iterate over the array counting the occurrences of the numbers 1 to 1000, and keep the one with the largest count. Then count 1001 to 2000, keeping the larger of that batch's winner and the first batch's winner. Repeat until every number has been counted.
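A hedged sketch of this batch-counting idea using collections.Counter; the batch width of 1000 comes from the answer, while the value range is an illustrative assumption, and numbers must be re-iterable (e.g. a list, or re-read from disk on each pass):

    from collections import Counter

    def most_frequent_batched(numbers, lo=1, hi=1000000, batch=1000):
        best_val, best_count = None, 0
        for start in range(lo, hi + 1, batch):
            # Count only the values that fall inside the current batch.
            counts = Counter(x for x in numbers if start <= x < start + batch)
            if counts:
                val, cnt = counts.most_common(1)[0]
                if cnt > best_count:
                    best_val, best_count = val, cnt
        return best_val, best_count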

I am sure that there are many optimizations based on the specifics of your data set.

0

Assuming 4-byte integers, you can fit (100 / 4) = 25M counters in the available memory.

  • Read your large file, counting each occurrence of a number in the range 0 ... 25M-1. Use a big array to accumulate the counts.

  • Find the number that occurs most often, save the number and its frequency, and clear the array.

  • Read the large file again, repeating the counting process for numbers in the range 25M ... 50M-1.

  • Find the number that occurs most often in the new array. Compare it with the number/frequency you saved in step 2, keep the one with the higher frequency, and clear the array.

  • Lather, rinse, repeat.

ETA: if we can assume there is a single answer, i.e. that no two different numbers occur with the same maximal frequency, then you can discard all the numbers in a range if its array shows a tie. Otherwise, the problem of storing a winner for each range becomes more complex.
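If ties do have to be handled, one hedged sketch is to carry a list of every value achieving the best count so far instead of a single winner (the helper name is illustrative):

    def merge_range_winners(winners, best_count, histogram, lo):
        # winners: values seen so far whose frequency equals best_count.
        for i, count in enumerate(histogram):
            if count > best_count:
                winners, best_count = [lo + i], count
            elif count and count == best_count:
                winners.append(lo + i)
        return winners, best_count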

0

Depending on your assumptions, perhaps the best way to do this is the MJRTY algorithm:

http://www.cs.utexas.edu/~moore/best-ideas/mjrty/

Or its generalization:

http://www.cs.yale.edu/homes/el327/datamining2011aFiles/ASimpleAlgorithmForFindingFrequentElementsInStreamsAndBags.pdf

The idea is that with exactly two variables (a counter and a value store) you can determine, if there is a majority element (one that appears more than 50% of the time), which element it is. The generalization requires (k + 1) counters and value stores to find the elements that appear more than 100/k % of the time.

Because these are only candidates for majority (if there are k-majority elements, it is them, but if there are none, the candidates are essentially arbitrary elements), a second pass over the data lets you get the exact counts of the candidates and determine which of them, if any, is actually a majority element.
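A sketch of the two-pass scheme in the spirit of the linked paper: the first pass keeps at most k counter/value slots, the second pass computes exact counts for the surviving candidates (k and the helper names are assumptions):

    def frequent_candidates(numbers, k):
        # Misra-Gries style first pass with at most k (value, counter)
        # slots: any sufficiently frequent value is guaranteed to survive,
        # but survivors are only candidates until verified.
        slots = {}
        for x in numbers:
            if x in slots:
                slots[x] += 1
            elif len(slots) < k:
                slots[x] = 1
            else:
                # Decrement all counters; evict those that reach zero.
                for v in list(slots):
                    slots[v] -= 1
                    if slots[v] == 0:
                        del slots[v]
        return list(slots)

    def exact_counts(numbers, candidates):
        # Second pass: exact frequencies for the candidates only.
        counts = dict.fromkeys(candidates, 0)
        for x in numbers:
            if x in counts:
                counts[x] += 1
        return counts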

It is very fast and efficient.

There are several other possible optimizations, but with around 4 KB of memory you can most likely find the majority element of 2 GB of data, depending on the type of data you have.

0
