In question 11.5 of Gayle Laakmann McDowell's book Cracking the Coding Interview:
"Imagine you have a 20 GB file with one string per line. Explain how you would sort the file."
My initial reaction was exactly the solution she proposes: break the file into smaller pieces, read X MB of data at a time, sort it in memory, and write each sorted chunk to disk, then merge all the chunk files at the end.
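As I understand it, the chunking phase looks roughly like this (a minimal Python sketch; the 100 MB chunk size and the temp-file handling are just assumptions on my part):

```python
import os
import tempfile

def split_into_sorted_chunks(path, chunk_bytes=100 * 1024 * 1024):
    """Read roughly chunk_bytes worth of lines at a time, sort each batch
    in memory, and write it to its own temporary chunk file."""
    chunk_paths = []
    with open(path, "r", encoding="utf-8") as src:
        while True:
            lines = src.readlines(chunk_bytes)  # reads about chunk_bytes of lines
            if not lines:
                break
            lines.sort()
            fd, tmp_path = tempfile.mkstemp(suffix=".chunk")
            with os.fdopen(fd, "w", encoding="utf-8") as out:
                out.writelines(lines)
            chunk_paths.append(tmp_path)
    return chunk_paths
```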
I decided against this approach because, as I understand it, the final merge would require holding all the data in main memory, and we assume that is impossible. If so, how exactly is this merge implemented?
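The only memory-friendly merge I can picture is a k-way streaming merge, something like the Python sketch below (using heapq.merge over the chunk files from the previous step; the details are just my guess at what the book intends):

```python
import heapq

def merge_sorted_chunks(chunk_paths, out_path):
    """Stream a k-way merge of already-sorted chunk files into out_path."""
    files = [open(p, "r", encoding="utf-8") for p in chunk_paths]
    try:
        with open(out_path, "w", encoding="utf-8") as out:
            # heapq.merge pulls lazily from each file iterator, so memory
            # usage grows with the number of chunks, not the total data size.
            for line in heapq.merge(*files):
                out.write(line)
    finally:
        for f in files:
            f.close()
```

Is something along these lines what the book has in mind, or does the merge really need all the data at once?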
My other approach is based on the assumption that we have almost unlimited disk space, or at least enough to hold 2X the data we already have. We can read X MB of data at a time and generate a hash key for each line in it, and keep going until every line has been hashed. Then we just need to write the values from that hash file back into the source file.
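For concreteness, the keying pass I have in mind looks roughly like this (a Python sketch; the sha1 key and the tab-separated index file are placeholders I made up, and I have not yet worked out how the keys would actually produce a sorted order):

```python
import hashlib

def build_key_index(path, index_path, batch_bytes=100 * 1024 * 1024):
    """First pass of the idea above: read the file in batches and record a
    (key, byte offset) pair for every line in a separate index file,
    trading extra disk space for main memory."""
    with open(path, "rb") as src, open(index_path, "w", encoding="utf-8") as idx:
        offset = 0
        while True:
            lines = src.readlines(batch_bytes)
            if not lines:
                break
            for line in lines:
                key = hashlib.sha1(line).hexdigest()  # placeholder "hash key"
                idx.write(f"{key}\t{offset}\n")
                offset += len(line)
```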
Let me know what you think.