You can do this quite easily with a standard merge type algorithm.
Say you have 100 billion numbers and you want to get 10 billion. We will say that you can store 1 billion numbers in memory at any time.
So you make a pass:
while not end of input read 1 billion numbers sort them in descending order save position of output file write sorted numbers to output file
Then you have a file containing 100 blocks of 1 billion numbers. Each block is sorted in descending order.
Now create the maximum heap. Add the first number of each block to the heap. You will also need to add the block number or position number in the file so that you can read the next number.
Then:
while num_selected < 10 billion selected = heap.remove() ++num_selected write selected to output read next number from the selected block and place on heap
There was a little difficulty, keeping track of what number the number came from, but it's not so bad.
The maximum heap never contains more than 100 elements (basically, one element per block), so memory is not a problem in the second pass. With a small amount of work, you can avoid a large number of readings by creating a small buffer for each block so as not to incur the cost of reading a disk for each selected number.
Basically, it's just a disk sort, but from the very beginning.
The complexity of the first pass is b * (m log m) , where b is the number of blocks, and m is the number of elements in the block. N, the total number of elements in the file is b * m . The complexity of the second pass is k log b , where k is the number of elements to select, and b is the number of blocks.
Jim mischel
source share