Split the file into N parts, one per machine. On each machine, load as much of its part as fits into memory, sort the lines, and write the sorted chunk to that machine's local disk; repeat until the whole part has been processed. On each machine, merge the sorted chunks into a single sorted stream, then merge the streams from all machines into one stream containing every line in sorted order. Compare each line with the previous one. If they are equal, the line is a duplicate.
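As a rough illustration, here is a minimal single-machine sketch of the same idea in Python: sort chunks in memory, spill them to temporary files, merge them, and compare neighbours. The function names (`sorted_chunks`, `read_lines`, `find_duplicates`) and the `chunk_size` parameter are made up for this example, and it assumes the input lines contain no embedded newline characters.

```python
import heapq
import itertools
import os
import tempfile


def sorted_chunks(lines, chunk_size):
    """Sort up to chunk_size lines at a time and write each sorted run to a temp file."""
    paths = []
    it = iter(lines)
    while True:
        chunk = sorted(itertools.islice(it, chunk_size))
        if not chunk:
            break
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(line + "\n" for line in chunk)
        paths.append(path)
    return paths


def read_lines(path):
    """Stream lines back from one sorted run without loading the whole file."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")


def find_duplicates(lines, chunk_size=100_000):
    """Yield every line that equals the line before it in global sorted order."""
    paths = sorted_chunks(lines, chunk_size)
    try:
        # heapq.merge does a lazy k-way merge of already sorted inputs.
        merged = heapq.merge(*(read_lines(p) for p in paths))
        previous = None
        for line in merged:
            if line == previous:
                yield line
            previous = line
    finally:
        for p in paths:
            os.remove(p)
```

A line that occurs three times is reported twice, once for each repeat. For the multi-machine case, the same lazy k-way merge could presumably be applied to the per-machine streams instead of local files; how those streams are transported is left open here.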