Find duplicate rows in a large file

Question

The file contains a very large number of lines (e.g. 10 billion), and you need to find the duplicate lines. You have N systems available. How would you find the duplicates?

+5

string algorithm

Tushar gupta Oct 9 '10 at 18:19

2 answers

erickson, , , .

Use the N machines as hash tables:

For each line, compute a hash value, h.
Send the line and h to machine n, where n = h % N.
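On the receiving machine, store the line in a hash table keyed by h; if the same line is already there, it is a duplicate.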
, , .
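
A minimal sketch of the hash-partitioning idea (n = h % N), assuming each machine simply remembers the lines routed to it and reports any line it sees a second time. The N machines are simulated here by in-memory sets, and `machine_for` and `find_duplicates` are hypothetical names chosen for illustration. SHA-1 is used only because it gives the same value on every sender; any hash that all senders compute identically should work.

```python
import hashlib


def machine_for(line: str, n_machines: int) -> int:
    """Return n = h % N, where h is a stable hash of the line."""
    digest = hashlib.sha1(line.encode("utf-8")).digest()
    h = int.from_bytes(digest[:8], "big")  # first 64 bits are plenty for routing
    return h % n_machines


def find_duplicates(lines, n_machines):
    """Route each line to one simulated machine and collect repeated lines."""
    # One set per simulated machine; a real system would keep this on each host.
    seen = [set() for _ in range(n_machines)]
    duplicates = set()
    for line in lines:
        n = machine_for(line, n_machines)
        if line in seen[n]:
            # Identical lines always hash to the same machine,
            # so a repeat is detected locally with no cross-machine traffic.
            duplicates.add(line)
        else:
            seen[n].add(line)
    return duplicates
```

Because equal lines always land on the same machine, each machine can decide on its own which lines are duplicates; the per-machine results only need to be concatenated at the end.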

, 10 . - 80-120 32- , -. , , "", , , .

+8

Steve Jessop Oct 9 '10 at 19:27

erickson · Accepted Answer · 2010-10-09T18:26:14+0000

Split the file into N pieces. On each machine, load as much of its piece as fits into memory and sort the lines. Write these sorted chunks to mass storage on that machine. On each machine, merge the chunks into a single sorted stream, and then merge the streams from all machines into one stream that contains all of the lines in sorted order. Compare each line with the previous one. If they are identical, it is a duplicate.
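
The sort-and-merge approach can be sketched on a single machine as follows. This is only an illustration under stated assumptions: sorted chunks are spilled to temporary files, `heapq.merge` stands in for both the per-machine merge and the cross-machine merge, lines contain no embedded newlines, and `sorted_runs`, `read_run`, `find_duplicates_sorted` and the `max_lines_in_memory` parameter are hypothetical names.

```python
import heapq
import os
import tempfile


def sorted_runs(lines, max_lines_in_memory):
    """Sort fixed-size chunks in memory and spill each one to a temp file."""
    paths, chunk = [], []

    def spill():
        chunk.sort()
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.writelines(line + "\n" for line in chunk)
        paths.append(path)
        chunk.clear()

    for line in lines:
        chunk.append(line)
        if len(chunk) >= max_lines_in_memory:
            spill()
    if chunk:
        spill()
    return paths


def read_run(path):
    """Yield the lines of one sorted run without loading it all into memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


def find_duplicates_sorted(lines, max_lines_in_memory=2):
    """Merge the sorted runs and report each line that equals its predecessor."""
    paths = sorted_runs(lines, max_lines_in_memory)
    duplicates = []
    try:
        previous = None
        for line in heapq.merge(*(read_run(p) for p in paths)):
            # In sorted order, equal lines are adjacent; record each one once.
            if line == previous and (not duplicates or duplicates[-1] != line):
                duplicates.append(line)
            previous = line
    finally:
        for p in paths:
            os.remove(p)
    return duplicates
```

For example, `find_duplicates_sorted(["b", "a", "c", "a", "b", "a"])` sorts three two-line chunks, merges them, and returns the repeated lines in sorted order. In a real deployment each machine would stream its merged run over the network rather than read local files.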
