The awk one-liner '!x[$0]++' is one of the most elegant ways to deduplicate a file or stream without sorting. However, it is memory-inefficient and unsuitable for large files, since it stores every unique line in memory.
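For reference, this is what the trick does: it keeps the first occurrence of every line and preserves the input order. For example:

printf 'a\nb\na\nc\n' | awk '!x[$0]++'

prints a, b and c.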
A much more memory-efficient approach is to store a hash of each line in the array instead of the entire line. You can do this in a Perl one-liner that is very similar to the awk script:
perl -ne 'use Digest::MD5 qw(md5_base64); print unless $seen{md5_base64($_)}++' huge.txt
Here I used md5_base64 instead of md5_hex because the base64 encoding of an MD5 digest takes 22 bytes, while the hexadecimal representation takes 32.
However, since Perl's hash implementation still needs roughly 120 bytes per key, you can quickly run out of memory on your huge file.
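You can double-check the digest sizes yourself; Digest::MD5 ships with core Perl, so this should work anywhere:

perl -MDigest::MD5=md5_base64,md5_hex -le 'print length(md5_base64("test")), " vs ", length(md5_hex("test"))'

which prints 22 vs 32.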
The solution in that case is to process the file in chunks, either by splitting it manually or by using GNU Parallel with its --pipe, --keep-order and --block options (taking advantage of the fact that duplicate lines are not far apart, as you mentioned). Here is how you could do it with parallel:
cat huge.txt | pv | parallel --pipe --keep-order --block 100M -j4 -q \
  perl -ne 'use Digest::MD5 qw(md5_base64); print unless $seen{md5_base64($_)}++' > uniq.txt
The --block 100M option tells parallel to process the input in 100 MB chunks, and -j4 starts 4 jobs in parallel. The important argument here is --keep-order, since you want the unique output lines to stay in the same order as the input. I included pv in the pipeline to get nice statistics while the long-running process executes.
In a test I ran on a 1 GB file of random data, I reached a throughput of 130 MB/s with the settings above, which means you could deduplicate your 40 GB file in about 4 minutes (provided your disk can keep up with writes at that speed).
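If GNU Parallel is not available, the manual splitting mentioned above can be approximated with split. This is only a sketch (the chunk_ prefix is arbitrary, and -C cuts chunks without breaking lines); each chunk gets its own fresh %seen hash, just like the --pipe blocks do:

# split into ~100 MB chunks without splitting lines, then dedup each chunk
split -C 100M huge.txt chunk_
for f in chunk_*; do
  perl -ne 'use Digest::MD5 qw(md5_base64); print unless $seen{md5_base64($_)}++' "$f"
done > uniq.txt
rm chunk_*

This processes the chunks sequentially, so it is simpler but gives up the -j4 speedup.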
Other options:
- Use an efficient trie structure to store the keys and check for duplicates. For example, marisa-trie is a very efficient implementation written in C++ with a Python wrapper.
- Sort your huge file with an external merge sort or a distribution/bucket sort (see the sort sketch after this list)
- Load the file into a database and use SELECT DISTINCT on an indexed column containing your lines or, more efficiently, the md5 sums of your lines.
- Or use Bloom filters
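For the sorting option, note that GNU sort already implements an external merge sort that spills to temporary files on disk, so a sketch can be as simple as the line below (the -S buffer size, --parallel count and temp directory are arbitrary choices):

sort -u -S 2G --parallel=4 -T /tmp huge.txt > uniq_sorted.txt

Keep in mind that, unlike the other approaches here, this does not preserve the original line order.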
Here is an example of using the Bloom::Faster Perl module:
perl -e 'use Bloom::Faster; my $f = new Bloom::Faster({n => 100000000, e => 0.00001}); while(<>) { print unless $f->add($_); }' huge.txt > uniq.txt
You can install Bloom::Faster from CPAN (run sudo cpan and then install "Bloom::Faster").
Explanation:
- You must specify the acceptable false-positive rate e and the number of available buckets n. The memory required per bucket is about 2.5 bytes, so if your file has 100 million unique lines you will need about 100 million buckets and roughly 260 MB of memory.
- The $f->add($_) call adds the hash of the line to the filter and returns true if the key (i.e. the line here) already exists, which is how duplicates are detected.
- You can estimate the number of unique lines in your file by deduplicating a small section of it with dd if=huge.txt bs=400M count=1 | awk '!a[$0]++' | wc -l (the first 400 MB) and multiplying the result by 100 (for 40 GB). Then set n a little higher to be safe.
In my tests, this method reached a processing speed of 6 MB/s. You can combine this approach with the GNU parallel suggestion above to use multiple cores and achieve higher throughput.
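A sketch of what that combination might look like, assuming (as before) that duplicates fall within the same 100 MB block; each parallel job gets its own filter, so the n => 10000000 value is a hypothetical per-block size and should be set to the number of lines you expect per block:

cat huge.txt | pv | parallel --pipe --keep-order --block 100M -j4 -q \
  perl -e 'use Bloom::Faster; my $f = new Bloom::Faster({n => 10000000, e => 0.00001}); while(<>) { print unless $f->add($_); }' > uniq.txt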