Search for the k largest elements of a very large file (where k itself is very big)

Suppose we have a very large file containing billions of integers, and we want to find the k largest of these values.

The tricky part is that k itself is also very large, which means we cannot hold k elements in memory (for example, the file has 100 billion elements and we want the 10 billion largest).

How can this be done in O(n)?

What I thought:

We start reading the file and compare each element against a second file that stores the k largest elements seen so far (sorted in ascending order). If the element just read is larger than the first line of the second file, we delete that first line and insert the new element into the second file. The time complexity is O(N log k) if we have random access to that file, otherwise it is O(N*k).

Any idea how to do this in O(n)? I think that with an external version of the selection algorithm (the partitioning step of quicksort) it could be done in O(n), but I could not find one anywhere.

+7
algorithm large-files
5 answers

PS: My definition of K is different. Here it is a small number, for example 2, 100, or 1000, and m below corresponds to the OP's definition of k. Sorry for the confusion.

It depends on how many passes you can make over the source data and how much extra space you have. This approach assumes that you have extra space equivalent to the source data.

Step 1: Pick K random numbers from the data.
Step 2: Sort these K numbers (suppose they are indexed 1 to K).
Step 3: Create K + 1 separate files and name them 0 to K.
Step 4: For each element in the data, if it falls between the i-th and the (i+1)-th sampled number, put it in the i-th file.
Step 5: Based on the sizes of the files, pick the file that contains the m-th element.
Step 6: Repeat with that file and the new m (new_m = m - sum_of_size_of_all_lower_files).

Regarding the last step: if K = 2, m = 1000, and the file sizes are 800 for file 0, 900 for file 1, and 200 for file 2, then new_m = m - 800 = 200 and you iterate on file 1.
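A minimal sketch of the partitioning pass (step 4) in Python, assuming the data is stored as one integer per line; the names partition_pass and out_prefix are made up for the example and are not part of the answer:

    import bisect

    def partition_pass(input_path, pivots, out_prefix):
        """One pass: split the input into len(pivots)+1 bucket files around the sorted pivots."""
        pivots = sorted(pivots)
        outs = [open(f"{out_prefix}.{i}", "w") for i in range(len(pivots) + 1)]
        counts = [0] * (len(pivots) + 1)
        try:
            with open(input_path) as f:
                for line in f:
                    x = int(line)
                    # bucket i gets values in (pivots[i-1], pivots[i]];
                    # bucket len(pivots) gets values above all pivots
                    i = bisect.bisect_left(pivots, x)
                    outs[i].write(line)
                    counts[i] += 1
        finally:
            for o in outs:
                o.close()
        return counts   # the bucket sizes needed for steps 5 and 6

The returned sizes are what steps 5 and 6 need: find the bucket whose cumulative count reaches m, adjust m accordingly, and recurse on that bucket file alone.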

+3

You can do this quite easily with a standard merge type algorithm.

Say you have 100 billion numbers and you want to get 10 billion. We will say that you can store 1 billion numbers in memory at any time.

So you make a pass:

 while not end of input
     read 1 billion numbers
     sort them in descending order
     save position of output file
     write sorted numbers to output file

Then you have a file containing 100 blocks of 1 billion numbers. Each block is sorted in descending order.
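A minimal sketch of this first pass in Python, assuming one integer per line and a chunk_size that fits in memory; for simplicity the sketch writes each sorted run to its own file instead of recording offsets within one big output file, and the names make_sorted_runs and run_prefix are made up for the example:

    def make_sorted_runs(input_path, chunk_size, run_prefix):
        """Pass 1: read chunk_size numbers at a time, sort each chunk descending, write it out."""
        runs = []
        with open(input_path) as f:
            while True:
                chunk = []
                for line in f:
                    chunk.append(int(line))
                    if len(chunk) == chunk_size:
                        break
                if not chunk:
                    break
                chunk.sort(reverse=True)                  # descending: largest number first
                run_path = f"{run_prefix}.{len(runs)}"
                with open(run_path, "w") as out:
                    out.write("\n".join(map(str, chunk)) + "\n")
                runs.append(run_path)
        return runs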

Now create a max-heap. Add the first number of each block to the heap. You will also need to store the block number, or the block's position in the file, so that you can read that block's next number.

Then:

 while num_selected < 10 billion
     selected = heap.remove()
     ++num_selected
     write selected to output
     read next number from the selected block and place on heap

There is a little bookkeeping involved in tracking which block each number came from, but it's not too bad.

The max-heap never contains more than 100 elements (basically, one element per block), so memory is not a problem in the second pass. With a little extra work you can avoid a large number of reads by keeping a small read buffer for each block, so that you do not pay the cost of a disk read for every selected number.
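A sketch of the second pass in Python, continuing the run files from the previous sketch and again assuming one integer per line. Python's heapq is a min-heap, so values are negated to get max-heap behaviour, and Python's default file buffering plays the role of the small per-block read buffer mentioned above. The name top_k_from_runs is made up:

    import heapq

    def top_k_from_runs(run_paths, k, output_path):
        """Pass 2: k-way merge of the descending runs via a heap, stopping after k numbers."""
        files = [open(p) for p in run_paths]
        heap = []
        try:
            # seed the heap with the first (largest) number of each run;
            # values are negated because heapq is a min-heap
            for i, f in enumerate(files):
                line = f.readline()
                if line:
                    heapq.heappush(heap, (-int(line), i))
            written = 0
            with open(output_path, "w") as out:
                while heap and written < k:
                    neg_val, i = heapq.heappop(heap)
                    out.write(f"{-neg_val}\n")
                    written += 1
                    line = files[i].readline()        # refill from the run the number came from
                    if line:
                        heapq.heappush(heap, (-int(line), i))
        finally:
            for f in files:
                f.close()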

Basically, it's just an external (disk) sort, except that you stop as soon as you have output the k largest numbers.

The complexity of the first pass is O(b * m log m), where b is the number of blocks and m is the number of elements per block. N, the total number of elements in the file, is b * m. The complexity of the second pass is O(k log b), where k is the number of elements to select and b is the number of blocks.

+8

You can do this by maintaining a min-heap of maximum size k.

  • Every time a new number arrives, check whether the heap holds fewer than k elements; if so, add the number.

  • If not, check whether the heap's minimum is smaller than the new element; if it is, pop the minimum and insert the new element instead.

When you are done, you have a heap containing the k largest elements. This solution is O(n log k), where n is the number of elements and k is the number of required elements. (A sketch follows below.)

  • This can also be done in O(n) using a selection algorithm: store all the elements, find the (k+1)-th largest element, and return everything larger than it. But it is harder to implement, and for reasonably sized input it might not be better. Also, if the stream contains duplicates, more processing is required.
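A minimal sketch of the heap approach in Python, using the standard heapq module; unlike the question's setting, it assumes the k retained elements fit in memory, which is the small-k case this answer has in mind:

    import heapq

    def k_largest(stream, k):
        """Keep a min-heap of at most k elements; its root is the smallest of the current top k."""
        heap = []
        for x in stream:
            if len(heap) < k:
                heapq.heappush(heap, x)
            elif x > heap[0]:
                heapq.heapreplace(heap, x)   # pop the current minimum, push the new element
        return heap                          # the k largest elements, in heap order

    # example: the 3 largest of a small stream
    # print(sorted(k_largest(iter([5, 1, 9, 3, 7, 8, 2, 6]), 3), reverse=True))  # [9, 8, 7]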
+3

If all the values are distinct, or we can ignore duplicates, and we have 32-bit integers, I would just use one bit per possible value (2^32 bits = 2^29 bytes = 512 megabytes, which should fit into your RAM):

  • Initialize the 512 MB with zeros.
  • While reading the file linearly (O(n)), set the bit corresponding to each value read.
  • Finally, scan the bit array from the top and take the first k set bits to get the k highest values (O(2^32)).

If the values are not all distinct and you want to know how often each value occurs, you can add a 4th step: read the file again and count the occurrences of the values found in the first 3 steps. This is still O(n).
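A sketch of this bitmap approach in Python, assuming signed 32-bit integers stored one per line; values are shifted into the unsigned range so that a top-down scan visits them in decreasing order. A pure-Python scan of 2^32 positions is slow in practice (a compiled language or a vectorised bit array would be used for real data), but the structure is the same; the function name is made up:

    def k_largest_distinct_int32(path, k):
        """Mark each distinct 32-bit value in a 512 MB bit array, then scan it from the top down."""
        bits = bytearray(1 << 29)                 # 2^32 bits = 2^29 bytes = 512 MB
        with open(path) as f:
            for line in f:
                v = int(line) + (1 << 31)         # map signed int32 to 0 .. 2^32-1, preserving order
                bits[v >> 3] |= 1 << (v & 7)
        result = []
        v = (1 << 32) - 1
        while v >= 0 and len(result) < k:         # highest possible value first
            if bits[v >> 3] & (1 << (v & 7)):
                result.append(v - (1 << 31))      # shift back to the original signed value
            v -= 1
        return result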

+2

Use randomized selection to find the k-th largest element in the file. You can do this in linear time with a few passes over the input, provided the file is not too ridiculously many times bigger than memory. Then just keep everything that is at least as big as that element and discard the rest.
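A sketch of such an external randomized selection in Python, assuming one integer per line; each round makes one pass to reservoir-sample a pivot from the remaining candidate range and one pass to count its rank, so the expected number of passes is small. All the helper names are made up for the example:

    import random

    def count_pass(path, pivot):
        """One pass: count elements strictly greater than, and equal to, the pivot."""
        greater = equal = 0
        with open(path) as f:
            for line in f:
                x = int(line)
                if x > pivot:
                    greater += 1
                elif x == pivot:
                    equal += 1
        return greater, equal

    def kth_largest(path, k):
        """Randomized selection over a file using only counting passes (no temporary files)."""
        lo = hi = None                       # the answer lies strictly between lo and hi
        while True:
            # pass 1: reservoir-sample one random pivot from the current candidate range
            pivot, seen = None, 0
            with open(path) as f:
                for line in f:
                    x = int(line)
                    if (lo is None or x > lo) and (hi is None or x < hi):
                        seen += 1
                        if random.randrange(seen) == 0:
                            pivot = x
            # pass 2: find the pivot's rank in the whole file
            greater, equal = count_pass(path, pivot)
            if greater < k <= greater + equal:
                return pivot                 # pivot is the k-th largest element
            elif k <= greater:
                lo = pivot                   # the k-th largest is bigger than the pivot
            else:
                hi = pivot                   # the k-th largest is smaller than the pivot

Once the k-th largest value is known, one final pass writes out every element greater than it, plus enough elements equal to it to reach exactly k.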

0
