Find duplicate rows in a large file

The file contains a large number of lines (e.g. 10 billion), and you need to find the duplicate lines. You have N systems available. How do you find the duplicates?

2 answers

Divide the file into N parts, one per machine. On each machine, load as much of its part as fits into memory, sort those lines, and write the sorted piece to that machine's mass storage; repeat until the whole part has been processed. On each machine, merge the pieces into one sorted stream, and then merge the streams from all machines into a single stream containing all the lines in sorted order. Compare each line with the previous one. If they are the same, the line is a duplicate.
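The steps above can be sketched in Python as follows. This is a minimal single-process illustration, not a distributed implementation: the function names (`write_sorted_runs`, `read_run`, `machine_stream`, `find_duplicates`) and the `run_size` parameter are hypothetical, each "machine" is simulated by a list of lines, and lines are assumed not to contain embedded newlines.

```python
import heapq
import os
import tempfile
from itertools import islice


def write_sorted_runs(lines, run_size, directory):
    """Sort `run_size` lines at a time in memory and spill each sorted run to disk."""
    paths = []
    it = iter(lines)
    while True:
        chunk = sorted(islice(it, run_size))
        if not chunk:
            break
        fd, path = tempfile.mkstemp(dir=directory, text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(line + "\n" for line in chunk)
        paths.append(path)
    return paths


def read_run(path):
    """Stream one sorted run back from disk, one line at a time."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")


def machine_stream(fragment, run_size, directory):
    """One machine's contribution: its runs merged into a single sorted stream."""
    runs = write_sorted_runs(fragment, run_size, directory)
    return heapq.merge(*(read_run(p) for p in runs))


def find_duplicates(sorted_streams):
    """Merge sorted streams and yield each duplicated line once."""
    previous = None
    reported = None
    for line in heapq.merge(*sorted_streams):
        # In sorted order, equal lines are adjacent, so comparing with the
        # previous line is enough to detect a duplicate.
        if line == previous and line != reported:
            yield line
            reported = line
        previous = line


if __name__ == "__main__":
    fragments = [["b", "a", "c"], ["a", "d", "b"], ["e", "a"]]  # N = 3 parts
    with tempfile.TemporaryDirectory() as workdir:
        streams = [machine_stream(f, 2, workdir) for f in fragments]
        for dup in find_duplicates(streams):
            print(dup)
```

`heapq.merge` keeps only one line per input stream in memory, which is why the merge phase works even when the sorted runs are far larger than RAM. In a real deployment the per-machine streams would travel over the network rather than being merged in one process.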


erickson, , , .

N -:

  • For each line, compute a hash h.
  • Assign it to machine n, where n = h % N.
  • , - h, , .
  • , , .
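A minimal sketch of partitioning by n = h % N, under the assumption that each line is routed to the machine selected by its hash and that every machine then looks for duplicates among the lines it received. Only the hash h and the formula n = h % N come from the answer; the function names (`machine_for`, `partition`, `local_duplicates`), the choice of SHA-1, and the local counting step are illustrative assumptions.

```python
import hashlib
from collections import Counter


def machine_for(line, n_machines):
    """Pick machine n = h % N from a stable hash h of the line."""
    # Python's built-in hash() is salted per process for strings, so a
    # cryptographic digest is used here to get the same h on every machine.
    digest = hashlib.sha1(line.encode("utf-8")).digest()
    h = int.from_bytes(digest[:8], "big")
    return h % n_machines


def partition(lines, n_machines):
    """Route every line to the bucket (machine) chosen by its hash."""
    buckets = [[] for _ in range(n_machines)]
    for line in lines:
        buckets[machine_for(line, n_machines)].append(line)
    return buckets


def local_duplicates(bucket):
    """Assumed local step: report lines that occur more than once in a bucket."""
    counts = Counter(bucket)
    return {line for line, count in counts.items() if count > 1}


if __name__ == "__main__":
    lines = ["x", "y", "x", "z", "y", "w"]
    buckets = partition(lines, 4)
    duplicates = set().union(*(local_duplicates(b) for b in buckets))
    print(sorted(duplicates))
```

Identical lines always produce the same h, so they always land on the same machine; no cross-machine comparison is needed after the partitioning step.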

, 10 . - 80-120 32- , -. , , "", , , .
