N-way merge sorts 2G file

This is another issue related to hacking encoding interviews. I still have some doubts after reading it.

9.4 If you have a 2 GB file with one string per line, which sorting algorithm would you use to sort the file and why? 

DECISION

When the interviewer gives a 2 GB size limit, he should tell you something - in this case, he assumes that they do not want you to put all the data into memory. So what do we do? We only bring some data into memory. Algorithm:

How much memory do we have? Suppose we have X M memory.

  • Divide the file into K chunks, where X * K = 2 GB. Bring each piece to memory and sort the lines, as usual, using any O (n log n) algorithm. Save the lines back to the file.

  • Now add the following snippet to memory and sort.

  • Once this has been done, combine them one by one.

The above algorithm is also known as appearance. Step 3 is known as N-shaped merging. The rationale for using external sorting is data size. Since the data is too large, and we cannot bring all this into memory, we need to switch to a disk-based sorting algorithm.

Doubt:

When in step 3, doing a merge sort, comparing 2 arrays, do we need 2 * X space every time we compare? And the limit was X MB. Should we make chunks in (X / 2) * 2K = 2 GB? So each piece will be X / 2 MB, and there will be 2K pieces. Or do I just understand that the merger is wrong? Thanks!

+8
java sorting algorithm
source share
3 answers

Firstly, step 3 itself is not a merge sort, it's all a merge sort. Step 3 is just a merge, without sorting at all.

And as for the required storage, there are two possibilities.

The first is to combine the sorted data into groups of two. Let's say you have three groups:

 A: 1 3 5 7 9 B: 0 2 4 6 8 C: 2 3 5 7 

Using this method, you combine A and B into one group Y , then merge Y and C into the final result Z :

 Y: 0 1 2 3 4 5 6 7 8 9 (from merging A and B). Z: 0 1 2 2 3 3 4 5 5 6 7 7 8 9 (from merging Y and C). 

This has the advantage of a very small requirement for read-only memory in that you only need to store the β€œnext” item from each of the two lists, but, of course, you need to perform several merge operations.

The second way is the N-way β€œright” merge when you select the next item from any of the groups. In doing so, you will check the lowest value in each list to see which one will be next:

 Z: 0 1 2 2 3 3 4 5 5 6 7 7 8 9 (from merging A, B and C). 

This includes only one merge operation, but it requires more memory, basically one item in the list.

Which one you choose depends on the available memory and the size of the element.

For example, if you have 100M of memory available to you and the element size is 100K, you can use the latter. This is because for a 2G file you will need 20 groups (100 M each) for the sort phase, which means that the correct N-shaped merge will require 100 K by 20 or about 2 M, which is significantly lower than your memory availability.

Alternatively, suppose you only have 1M. It will be about 2000 (2G / 1M) groups and multiplying by 100K gives 200M, which greatly exceeds your capabilities.

So you will need to do this merge in a few passes. Keep in mind that this should not be multiple passes combining the two lists.

You can find an average site where, for example, each passage combines ten lists. Ten groups of 100K is only mega, so they will fit into your memory limit, and this will result in fewer merge passes.

+6
source share

http://en.wikipedia.org/wiki/External_sorting

A quick look at Wikipedia tells me that during the merger process, you never keep a whole bunch in mind. So basically, if you have K-chunks, you will have K open file pointers, but you will only hold one line from each file in memory at any given time. You compare the lines that you have in memory, and then output the smallest (say, from fragment 5) to the sorted file (also the open file pointer, not the memory), and then rewrite this line with the next line from this file ( in our example, file 5) into memory and repeat until you reach the end of all the pieces.

+9
source share

The merge process is much simpler. You will output them to a new file, but basically you only need read-only memory: you only need to read one of the two input files once.

+2
source share

All Articles