How to optimize merge sort?

I have two 1 GB files, each of which contains only numbers in sorted order. I know how to read the contents of the files, sort them using the merge sort algorithm, and output the result to another file, but I am wondering how to do this using only a 100 MB buffer (I am not worried about scratch space). For example, one way is to read 50 MB chunks from both files and sort them, and as I sort, read new elements and continue the process until I reach the end of both files (can someone give me an idea of how to implement this?).

+4
5 answers

It sounds like you only need to merge the numbers in your files, not sort them, since they are already sorted within each file. The merge part of merge sort looks like this:

function merge(left, right)
    var list result
    while length(left) > 0 or length(right) > 0
        if length(left) > 0 and length(right) > 0
            if first(left) ≤ first(right)
                append first(left) to result
                left = rest(left)
            else
                append first(right) to result
                right = rest(right)
        else if length(left) > 0
            append left to result
            break
        else if length(right) > 0
            append right to result
            break
    end while
    return result

Now you can simply read the first 50 MB of numbers from both files into two buffers, apply the merge algorithm, and when one of the buffers is exhausted (all of its numbers consumed), read another 50 MB from the corresponding file. You do not need to sort anything.

You just need a condition that checks when one of your buffers is empty. When that happens, read more from the file that buffer is associated with.
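For concreteness, here is a minimal sketch of that scheme in C++. The file names, the 50 MB buffer size, and the use of long long are assumptions, and the files are assumed to hold whitespace-separated integers in text form:

#include <cstddef>
#include <fstream>
#include <vector>

// Refill `buf` with up to `max_count` numbers from `in`; returns false once the
// file is exhausted and nothing could be read.
static bool refill(std::ifstream& in, std::vector<long long>& buf, std::size_t max_count) {
    buf.clear();
    long long v;
    while (buf.size() < max_count && in >> v)
        buf.push_back(v);
    return !buf.empty();
}

int main() {
    std::ifstream a("in1.txt"), b("in2.txt");   // placeholder file names
    std::ofstream out("merged.txt");

    // Roughly 50 MB worth of 8-byte values per input buffer; tune as needed.
    const std::size_t kMax = 50u * 1024 * 1024 / sizeof(long long);

    std::vector<long long> bufA, bufB;
    std::size_t ia = 0, ib = 0;
    bool haveA = refill(a, bufA, kMax);
    bool haveB = refill(b, bufB, kMax);

    while (haveA || haveB) {
        // Take the smaller head element; if one side is exhausted, drain the other.
        bool takeA = haveA && (!haveB || bufA[ia] <= bufB[ib]);
        if (takeA) {
            out << bufA[ia++] << '\n';
            if (ia == bufA.size()) { haveA = refill(a, bufA, kMax); ia = 0; }
        } else {
            out << bufB[ib++] << '\n';
            if (ib == bufB.size()) { haveB = refill(b, bufB, kMax); ib = 0; }
        }
    }
}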

+6

Why not use the standard library?

#include <fstream>
#include <iterator>
#include <algorithm>

int main() {
    std::ifstream in1("in1.txt");
    std::ifstream in2("in2.txt");
    std::ofstream ut("ut.txt");

    std::istream_iterator<int> in1_it(in1);
    std::istream_iterator<int> in2_it(in2);
    std::istream_iterator<int> in_end;
    std::ostream_iterator<int> ut_it(ut, "\n");

    std::merge(in1_it, in_end, in2_it, in_end, ut_it);
}
+4

You probably want to read/write in reasonably sized chunks to avoid I/O overhead. So use three buffers of roughly 30 MB each: input1, input2, and output.

Merge until one of the input buffers is empty or the output buffer is full, then read/write to refill the empty buffer or flush the full one.

This way you read and write large chunks of data to and from the disk.

In addition, you could use asynchronous I/O to read/write data while you merge, but that is probably overkill.
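If you would rather not manage the buffers by hand, one way to sketch the same idea is to hand each stream a large buffer via pubsetbuf. The buffer sizes and file names here are placeholders, and note that the effect of pubsetbuf on a file stream is implementation-defined; it generally has to be called before the file is opened:

#include <fstream>
#include <vector>

int main() {
    // ~30 MB of buffer for each stream; sizes and file names are placeholders.
    std::vector<char> buf1(30 * 1024 * 1024), buf2(30 * 1024 * 1024), buf3(30 * 1024 * 1024);

    std::ifstream in1, in2;
    std::ofstream out;

    // pubsetbuf generally must be called before open() to take effect.
    in1.rdbuf()->pubsetbuf(buf1.data(), buf1.size());
    in2.rdbuf()->pubsetbuf(buf2.data(), buf2.size());
    out.rdbuf()->pubsetbuf(buf3.data(), buf3.size());

    in1.open("in1.txt");
    in2.open("in2.txt");
    out.open("ut.txt");

    // ... merge loop as in the other answers ...
}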

+3

Since you are only doing a merge, not a full sort, this is just the basic merge loop. Purely sequential I/O. No need to worry about buffers. Think of zipping up a jacket. It is that simple. (Note: this could be much faster if the numbers were stored in binary format in the files. Not only would the files be smaller, the program would be I/O-limited and the numbers would be perfectly accurate.)

double GetNumberFromFile(FILE* file) {
    if (feof(file)) {
        return BIGBIGNUMBER;
    } else {
        return ReadADouble(file);
    }
}

double A = GetNumberFromFile(AFILE);
double B = GetNumberFromFile(BFILE);
while (A < BIGBIGNUMBER && B < BIGBIGNUMBER) {
    if (A < B) {
        write A;
        A = GetNumberFromFile(AFILE);
    } else if (B < A) {
        write B;
        B = GetNumberFromFile(BFILE);
    } else {
        write A;
        write B;   // or not, if you want to eliminate duplicates
        A = GetNumberFromFile(AFILE);
        B = GetNumberFromFile(BFILE);
    }
}
while (A < BIGBIGNUMBER) {
    write A;
    A = GetNumberFromFile(AFILE);
}
while (B < BIGBIGNUMBER) {
    write B;
    B = GetNumberFromFile(BFILE);
}
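As a concrete (assumed) version of the binary-format note above: if the files held raw 8-byte doubles, GetNumberFromFile could be written with fread, using DBL_MAX as the BIGBIGNUMBER sentinel. This assumes no real value in the data equals DBL_MAX:

#include <cfloat>
#include <cstdio>

// Sentinel larger than any value that can appear in the data.
static const double BIGBIGNUMBER = DBL_MAX;

// Read the next raw 8-byte double, or return the sentinel at end of file.
static double GetNumberFromFile(FILE* file) {
    double v;
    if (fread(&v, sizeof v, 1, file) != 1)
        return BIGBIGNUMBER;
    return v;
}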

In thinking about your question, consider the simpler problem of copying one file to another. You are doing only sequential I/O, which the file system is really good at. You write a simple loop that reads small units, such as bytes or ints, from one file and writes them to the other. As soon as you try to read a byte, the system allocates a nice big buffer, slurps a large chunk of the file into it, and hands you the byte out of that buffer. It keeps doing this until you need another buffer, at which point it invisibly fetches another one for you. The same thing happens with the file you are writing. Now, the CPU is quite fast, so it can iterate over the input bytes, copying them to the output, in a fraction of the time it takes to read or write a buffer, because reading or writing cannot go any faster than the external hardware. The only reason a larger buffer would help is that part of the read/write time is so-called latency, basically the time it takes to move the head to the desired track and wait for the desired sector to come around. Most file systems break files up into chunks that are sprinkled around the disk, so the head jumps anyway. You can hear it.

The only difference between copying and a merge algorithm like yours is that you are reading two files, not one. Either way, the basic time sequence is a series of buffer reads and writes interspersed with a small amount of CPU activity. (It is possible to do overlapped I/O, so that the CPU work happens while the I/O is in progress and there is essentially no delay between reading and writing buffers, but that mattered a lot more when CPUs were 1000 times slower.)

Of course, if you can arrange for the files being read and written to be on separate physical disks, and the disks are not badly fragmented, then head movement is minimized and larger buffers might help more. But basically, with a simple program, you can pretty much expect the simple code to run about as fast as the disk can move data; giant buffers might help, but not by much.

0

Benchmark it. Read value by value, then read in blocks. Feel the difference! =)
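For example, a rough sketch of such a benchmark (the file name is a placeholder and the file is assumed to contain raw binary doubles; since stdio buffers both passes, the measured difference is mostly per-call overhead):

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    using clock = std::chrono::steady_clock;

    // Pass 1: read one double at a time.
    auto t0 = clock::now();
    std::FILE* f = std::fopen("in1.bin", "rb");   // placeholder: raw binary doubles
    if (!f) return 1;
    double v;
    std::size_t n = 0;
    while (std::fread(&v, sizeof v, 1, f) == 1) ++n;
    std::fclose(f);
    auto t1 = clock::now();

    // Pass 2: read 1 MB blocks at a time.
    f = std::fopen("in1.bin", "rb");
    if (!f) return 1;
    std::vector<double> block(1 << 17);           // 128K doubles = 1 MB
    std::size_t m = 0, got = 0;
    while ((got = std::fread(block.data(), sizeof(double), block.size(), f)) > 0) m += got;
    std::fclose(f);
    auto t2 = clock::now();

    using ms = std::chrono::milliseconds;
    std::printf("per-value: %lld ms (%zu values), per-block: %lld ms (%zu values)\n",
                (long long)std::chrono::duration_cast<ms>(t1 - t0).count(), n,
                (long long)std::chrono::duration_cast<ms>(t2 - t1).count(), m);
}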

0
