Finding the Nth smallest value in a group of numbers as they are created

I am writing a program, than I need to find the Nth largest value in a group of numbers. These numbers are generated by the program, but I do not have enough memory to store N numbers. Is there a better upper bound than N that can be reached for storage? The upper bound for the size of a group of numbers (and for N) is approximately 100,000,000.

Note. Numbers are decimal places, and the list may contain duplicates.

[Edit]: my memory limit is 16 MB.

+4
source share
5 answers

This is a multi-pass algorithm (therefore, you should be able to generate the same list several times or store the list in secondary storage).

First pass:

  • Find the maximum value and the lowest value. This is your starting range.

Passes after the first:

  • Divide the range up to 10 equally spaced bins. We do not need to store any numbers in baskets. We are just going to count bunker membership. Thus, we only have an array of integers (or bigints - everything that can accurately hold our accounts). Note that 10 is an arbitrary choice for the number of boxes. Your choice of size and distribution will determine the best choice.
  • Scroll each number in the data, increasing the counter of the place where you hold the number that you see.
  • Find out which bunker your answer is in, and add how many numbers are above this bin to the number of numbers above the winning box.
  • The upper and lower ranges of the winning basket are your new range.
  • Repeat these steps again until you have enough memory to store the numbers in the current bin.

Last pass:

  • You need to know how many digits are already above the current bin.
  • You have enough memory to capture all the numbers in your current bin range so you can scroll and capture the actual numbers. Just sort them and take the correct number.

Example: if the range you see is from 0.0 to 1000.0, the ranges of your bins will be:

(- 0.0 - 100.0] (100.0 - 200.0] (200.0 - 300.0] ... (900.0 - 1000.0) 

If you find that your number is in the bunker (100.0 - 2000.0), the following set of bins will be as follows:

  (100.0 - 110.0] (110.0 - 120.0] etc. 

Another multi-pass idea:

Just do a binary search. Select the midpoint of the range as your first guess. Your omissions simply have to make the score higher / lower to determine the next score (which can be weighted by a score or a simple average for code simplicity).

+3
source

Can you restore the same group of numbers from the beginning? If so, you can make several passes over the output: start by finding the highest value, restart the generator, find the largest number less than this, restart the generator and repeat this until you get your result.

It will be a real productivity killer because you have many numbers and many passes will be required - but from the point of view of memory you will only need to save 2 elements (current maximum and โ€œlimitโ€), the number that you found during the last pass) and passage counter.

You can speed it up by using the priority turn to find the M largest elements (by selecting some M that you can store in memory), allowing you to reduce the number of passes needed for N / M.

If you need to find, say, the 10th largest item in a list of 15 numbers, you could save time by working the other way around. Since this is the 10th largest element, this means that there are 15-10 = 5 elements smaller than that element, so instead you can search for the 6th smallest element.

+3
source

This seems like another question - A program to search for the nth smallest element in an array without sorting? - where you can get answers.
> Logic will work for the Nth smallest search. Note. I am not saying that this is a duplicate.


Since you have many (about 1 billion?) Numbers, here is another way to optimize space.
Assume that your numbers correspond to 32-bit values, so about 1 billion will be required for approximately 32 GB of space. Now, if you can afford about 128 MB of working memory, we can do it in one go.

  • Imagine a 1-bit bit vector stored as an array of 32-bit words
    • Let it be initialized with all zeros
    • Start executing your numbers and keep setting the correct bit position for the value of the number
    • When you are done with one run, start counting from the beginning of this bit vector for the Nth bit
    • This bit position gives you the value for your Nth number.
    • In fact, you sorted all the numbers in the process (however, the number of duplicates is not tracked).
+1
source

If I understood correctly, the use of the upper memory limit for your program is O (N) (possibly N + 1). You can save a list of generated values โ€‹โ€‹that are larger than the current X (by far the largest N value), ordered by smallest. As soon as a new larger value is generated, you can replace the current X with the first element of the list and insert the value just generated at its corresponding position in the list.

0
source

sort -n | uniq -c and Nth should be the Nth line

0
source

All Articles