Sort a 20 GB file with one string per line

In question 11.5 of Gayle Laakmann McDowell's book Cracking the Coding Interview:

"Imagine you have a 20 GB file with one line per line. Explain how you sort the file."

My initial reaction was exactly the solution she proposes: break the file into smaller pieces, read X MB at a time, sort each piece in memory, write it back to disk, and at the very end merge the sorted files.
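To make that first phase concrete, here is a minimal sketch in Python. It is not from the book; the 100 MB chunk size, the function names, and the use of temporary files are my own assumptions. It reads roughly X MB of lines at a time, sorts each batch in memory, and writes each batch out as its own sorted chunk file:

```python
import os
import tempfile

CHUNK_SIZE = 100 * 1024 * 1024  # hypothetical chunk size: ~100 MB of lines per piece


def _write_chunk(lines):
    # Sort one in-memory batch and write it to its own temporary chunk file.
    lines.sort()
    fd, chunk_path = tempfile.mkstemp(suffix=".chunk")
    with os.fdopen(fd, "w") as out:
        out.writelines(lines)
    return chunk_path


def split_into_sorted_chunks(path):
    """Read the big file a batch at a time, sort each batch in memory,
    and return the paths of the sorted chunk files."""
    chunk_paths = []
    with open(path, "r") as f:
        lines, size = [], 0
        for line in f:
            lines.append(line)
            size += len(line)
            if size >= CHUNK_SIZE:
                chunk_paths.append(_write_chunk(lines))
                lines, size = [], 0
        if lines:  # flush the final partial batch
            chunk_paths.append(_write_chunk(lines))
    return chunk_paths
```

At no point does this phase hold more than one chunk's worth of lines in memory; the open question is only about the merge that follows.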

I was hesitant about this approach because the final merge seems to require holding all of the data in main memory, which we have assumed is impossible. If that is not the case, how exactly is the merge implemented?

My other approach is based on the assumption that we have nearly unlimited disk space, or at least enough to hold twice the data we already have. We can read X MB at a time and generate hash keys for the data, each key corresponding to a line in the file, and keep doing this until all values have been hashed. Then we just need to write the values back to the source file.

Let me know what you think.

1 answer

http://en.wikipedia.org/wiki/External_sorting gives a more detailed explanation of how external sorting works. It addresses exactly your concern about pulling the entire 20 GB into memory: the final merge of the N sorted chunks is performed by reading a small piece of each sorted chunk at a time, rather than reading all of the sorted chunks in at once.
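Here is a hedged sketch of that k-way merge in Python (the function name is assumed, continuing the chunking sketch above). It keeps only a single line from each sorted chunk in memory at any time; `heapq.merge` consumes the already-sorted chunk files lazily:

```python
import heapq
from contextlib import ExitStack


def merge_chunks(chunk_paths, out_path):
    """K-way merge of already-sorted chunk files.

    Each open file is an iterator over its lines, so heapq.merge only
    needs one pending line per chunk at a time, never the whole 20 GB."""
    with ExitStack() as stack, open(out_path, "w") as out:
        files = [stack.enter_context(open(p, "r")) for p in chunk_paths]
        for line in heapq.merge(*files):
            out.write(line)
```

The memory footprint of the merge is proportional to the number of chunks (one buffered line each), not to the total file size, which is why the assumption "we cannot fit 20 GB in RAM" does not rule out this approach.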

