`uniq` without sorting a huge text file?

I have a stupidly large text file (i.e. 40 gigabytes today) that I would like to filter for unique lines without sorting the file.

The file has unix line endings and all content matches [[:print:]]. I tried the following awk script to display only unique lines:

 awk 'a[$0] {next} 1' stupid.txt > less_stupid.txt 

The idea was that I would populate the array by referring to its elements, using the contents of the file as keys, and then skip lines that were already in the array. But this fails for two reasons: first, because it inexplicably just doesn't work (even on small test files), and second, because I know my memory will be exhausted before awk has loaded the entire set of unique lines.

After searching, I found this answer, which recommended:

 awk '!x[$0]++' 

And while this works on small files, it too will run out of memory before my entire file has been read.

What is the best (i.e. working) solution? I am open to almost anything, though I am more partial to solutions in languages I know (hence the bash and awk tags). Trying to visualize the problem, the best I have come up with would be to store an array of line checksums or MD5 sums rather than the lines themselves, but that only saves a little space and runs the risk of checksum collisions.

Any advice would be greatly appreciated. That's not to say this is impossible, and I won't stop trying to figure it out. :-P

+5
6 answers

The awk '!x[$0]++' trick is one of the most elegant ways to deduplicate a file or stream without sorting. However, it is memory-inefficient and unsuitable for large files, since it stores all unique lines in memory.

A much more memory-efficient approach is to store a hash of each line in the array rather than the whole line. You can achieve this with a Perl one-liner that is very similar to the awk script.

 perl -ne 'use Digest::MD5 qw(md5_base64); print unless $seen{md5_base64($_)}++' huge.txt 

Here I used md5_base64 instead of md5_hex because the base64 encoding is 22 bytes, while the hexadecimal representation is 32.

However, since Perl's implementation of hashes still requires about 120 bytes per key, you may still quickly run out of memory on your huge file.

The solution in that case is to process the file in chunks, either by splitting it manually or by using GNU Parallel with the --pipe, --keep-order and --block options (taking advantage of the fact that duplicate lines are not far apart, as you mentioned). Here is how you could do it with parallel:

 cat huge.txt | pv | parallel --pipe --keep-order --block 100M -j4 -q \
   perl -ne 'use Digest::MD5 qw(md5_base64); print unless $seen{md5_base64($_)}++' > uniq.txt

The --block 100M option tells parallel to process the input in 100 MB chunks. -j4 means run 4 processes in parallel. The important argument here is --keep-order, since you want the unique output lines to stay in the same relative order. I included pv in the pipeline to get nice statistics while the long-running process executes.

In a test I ran with a 1 GB file of random data, I reached 130 MB/s with the above settings, which means you could deduplicate your 40 GB file in about 4 minutes (if your hard drive can write that fast).

Other options:

  • Use an efficient trie structure to store the keys and check for duplicates. For example, marisa-trie is a very efficient implementation, written in C++ with a Python wrapper.
  • Sort your huge file with an external merge sort or a distribution/bucket sort (a GNU sort sketch follows this list).
  • Load the file into a database and use SELECT DISTINCT on an indexed column containing your lines or, more efficiently, the md5 sums of your lines.
  • Or use Bloom filters
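For the external-sort route, note that plain GNU sort already does an on-disk merge sort when the data does not fit in RAM. A minimal sketch might look like the following; the buffer size, temp directory and core count are placeholders you would tune to your machine, and unlike the approaches above this does not preserve the original line order:

 # GNU sort falls back to an external merge sort on disk; -u drops duplicates during the merge.
 # The buffer size, temp directory and --parallel value below are illustrative only.
 sort -u -S 8G -T /scratch/sorttmp --parallel=4 huge.txt > uniq_sorted.txt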

Here is an example using the Bloom::Faster Perl module:

 perl -e 'use Bloom::Faster; my $f = new Bloom::Faster({n => 100000000, e => 0.00001}); while(<>) { print unless $f->add($_); }' huge.txt > uniq.txt 

You can install Bloom::Faster from CPAN ( sudo cpan and then install "Bloom::Faster" )

Explanation:

  • You must specify the acceptable error rate e and the number of available buckets n . The amount of memory required for each bucket is about 2.5 bytes. If your file has 100 million unique lines, you will need 100 million buckets and about 260 MB of memory.
  • The function $f->add($_) adds the hash of a line to the filter and returns true if the key (i.e. the line here) is a duplicate.
  • You can estimate the number of unique lines in your file by analyzing a small section of it with dd if=huge.txt bs=400M count=1 | awk '!a[$0]++' | wc -l (400 MB) and multiplying the result by 100 (40 GB). Then set n a little higher to be safe.

In my tests, this method achieved a processing speed of 6 MB/s. You can combine this approach with the GNU parallel suggestion above to use multiple cores and achieve higher throughput.
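For example, a sketch of that combination might look like the following (untested; each parallel job gets its own filter, so, like the chunked MD5 variant, it only removes duplicates that fall within the same block, and the per-chunk n => 10000000 is only illustrative):

 # Bloom::Faster inside GNU parallel; one filter per 100 MB block, output order preserved.
 cat huge.txt | pv | \
   parallel --pipe --keep-order --block 100M -j4 -q \
   perl -e 'use Bloom::Faster; my $f = new Bloom::Faster({n => 10000000, e => 0.00001}); while (<>) { print unless $f->add($_); }' > uniq.txt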

+5

I don't have your data (or anything like it) handy, so I can't test this, but here is a proof of concept:

 $ t='one\ntwo\nthree\none\nfour\nfive\n'
 $ printf "$t" | nl -w14 -nrz -s, | sort -t, -k2 -u | sort -n | cut -d, -f2-
 one
 two
 three
 four
 five

Our raw data includes one duplicated line. The pipeline works as follows:

  • nl adds line numbers. It is a standard, if inefficient, unix tool.
  • The first sort sorts on the SECOND field, which is what the beginning of each line was before nl added its numbers. Adjust it as you need.
  • The second sort returns everything to the order recorded by the nl command.
  • cut just removes line numbers. There are several ways to do this, but some of them depend on your OS. This one is portable and works for my example.

Now... For indecently large files, the sort command will need some additional options, in particular --buffer-size and --temporary-directory. See man sort for details.
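For example, applied to the file from the question (the buffer size and temp directory below are placeholders to adjust to your hardware):

 # Same pipeline, but with sort told how much RAM it may use and where to spill to disk.
 nl -w14 -nrz -s, stupid.txt | \
   sort -t, -k2 -u --buffer-size=8G --temporary-directory=/scratch/sorttmp | \
   sort -n --buffer-size=8G --temporary-directory=/scratch/sorttmp | \
   cut -d, -f2- > less_stupid.txt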

I can't say I expect this to be fast, and I suspect you will be doing a huge amount of disk I/O, but I don't see why it wouldn't at least work.

+3

Assuming you can sort the file first (i.e. you can get sort file to work), then I think something like this might work (whether it does depends on whether a large awk script file is better than a large awk array in terms of memory usage, etc.).

 sort file | uniq -dc | awk '{gsub("\"", "\\\"", $0); print "$0==\""substr($0, index($0, $1) + 2)"\"{x["NR"]++; if (x["NR"]>1){next}}"} END{print 7}' > dedupe.awk
 awk -f dedupe.awk file

Which, for a test input file like:

 line 1
 line 2
 line 3
 line 2
 line 2
 line 3
 line 4
 line 5
 line 6

produces the following awk script:

 $0=="line 2"{x[1]++; if (x[1]>1){next}} $0=="line 3"{x[2]++; if (x[2]>1){next}} 7 

which, run as awk -f dedupe.awk file , outputs:

 line 1
 line 2
 line 3
 line 4
 line 5
 line 6

If the size of the awk script itself is a problem (probably unlikely), you can reduce it by using a different sentinel value, for example:

 sort file | uniq -dc | awk 'BEGIN{print "{f=1}"} {gsub("\"", "\\\"", $0); print "$0==\""substr($0, index($0, $1) + 2)"\"{x["NR"]++;f=(x["NR"]<=1)}"} END{print "f"}' 

which cuts out seven characters from each line (six if you also remove the space from the original) and generates:

 {f=1}
 $0=="line 2"{x[1]++;f=(x[1]<=1)}
 $0=="line 3"{x[2]++;f=(x[2]<=1)}
 f

This solution will probably run a little slower, though, because it does not short-circuit the script when a match is found.

If the run time of the awk script is too long, it might even be possible to improve it by sorting the duplicated lines by match count (though whether that matters will be data dependent).

+3

I would do it like this:

 #! /bin/sh
 usage () {
   echo "Usage: ${0##*/} <file> [<lines>]" >&2
   exit 1
 }

 if [ $# -lt 1 -o $# -gt 2 -o ! -f "$1" ]; then usage; fi
 if [ "$2" ]; then
   expr "$2" : '[1-9][0-9]*$' >/dev/null || usage
 fi

 LC_ALL=C
 export LC_ALL

 split -l ${2:-10000} -d -a 6 "$1"

 for x in x*; do
   awk '!x[$0]++' "$x" >"y${x}" && rm -f "$x"
 done

 cat $(sort -n yx*) | sort | uniq -d | \
 while IFS= read -r line; do
   fgrep -x -n "$line" /dev/null yx* | sort -n | sed 1d | \
   while IFS=: read -r file nr rest; do
     sed -i -e ${nr}d "$file"
   done
 done

 cat $(sort -n yx*) >uniq_"$1" && rm -f yx*

(proof of concept; more polishing required before use in production).

What's going on here:

  • split splits the file into pieces of 10,000 lines (configurable); the pieces are called x000000 , x000001 , ...
  • awk removes duplicates from each piece without disturbing the line order; the resulting files are yx000000 , yx000001 , ... (since awk cannot make the change in place)
  • cat $(sort -n yx*) | sort | uniq -d reassembles the pieces and finds the list of duplicates; because of how the pieces were built, each duplicated line can appear at most once in each piece.
  • fgrep -x -n "$line" /dev/null yx* finds where each duplicated line lives; the result is a list of lines like yx000005:23:some text
  • sort -n | sed 1d removes the first piece from the list above (that is the first occurrence of the line, and it should be left alone)
  • IFS=: read -r file nr rest splits yx000005:23:some text into file=yx000005 , nr=23 , and the rest
  • sed -i -e ${nr}d "$file" removes the string $nr from the fragment $file
  • cat $(sort -n yx*) puts the pieces back together; they need to be sorted to make sure they come back in the right order.

This is probably not very fast, but I would say it should work. Increasing the number of lines per piece from 10,000 can speed things up, at the cost of more memory. The operation is O(N^2) in the number of duplicated lines across pieces; with luck, that won't be too large.

The above assumes GNU sed (for -i ). It also assumes there are no files in the current directory named x* or yx* (this is the part that could use some cleanup, perhaps by moving the junk into a directory created by mktemp -d ).

Edit: Second version, after feedback from @EtanReisner:

 #! /bin/sh
 usage () {
   echo "Usage: ${0##*/} <file> [<lines>]" >&2
   exit 1
 }

 if [ $# -lt 1 -o $# -gt 2 -o ! -f "$1" ]; then usage; fi
 if [ "$2" ]; then
   expr "$2" : '[1-9][0-9]*$' >/dev/null || usage
 fi

 tdir=$(mktemp -d -p "${TEMP:-.}" "${0##*/}_$$_XXXXXXXX") || exit 1
 dupes=$(mktemp -p "${TEMP:-.}" "${0##*/}_$$_XXXXXXXX") || exit 1
 trap 'rm -rf "$tdir" "$dupes"' EXIT HUP INT QUIT TERM

 LC_ALL=C
 export LC_ALL

 split -l ${2:-10000} -d -a 6 "$1" "${tdir}/x"

 ls -1 "$tdir" | \
 while IFS= read -r x; do
   awk '!x[$0]++' "${tdir}/${x}" >"${tdir}/y${x}" && \
   rm -f "${tdir}/$x" || exit 1
 done

 find "$tdir" -type f -name 'yx*' | \
 xargs -n 1 cat | \
 sort | \
 uniq -d >"$dupes" || exit 1

 find "$tdir" -type f -name 'yx*' -exec fgrep -x -n -f "$dupes" /dev/null {} + | \
 sed 's!.*/!!' | \
 sort -t: -n -k 1.3,1 -k 2,2 | \
 perl -e '
   while (<STDIN>) {
     chomp;
     m/^(yx\d+):(\d+):(.*)$/o;
     if ($dupes{$3}++) { push @{$del{$1}}, int($2) }
     else { $del{$1} = [] }
   }
   undef %dupes;

   chdir $ARGV[0];

   for $fn (sort <"yx*">) {
     open $fh, "<", $fn or die qq(open $fn: $!);
     $line = $idx = 0;
     while (<$fh>) {
       $line++;
       if ($idx < @{$del{$fn}} and $line == $del{$fn}->[$idx]) { $idx++ }
       else { print }
     }
     close $fh or die qq(close $fn: $!);
     unlink $fn or die qq(remove $fn: $!);
   }
 ' "$tdir" >uniq_"$1" || exit 1
+3

If there is a lot of duplication, one possibility is to split the file with split(1) into manageable pieces and use something conventional, such as sort/uniq, to produce a summary of unique lines for each piece. That summary will be smaller than the piece itself. After that, you can merge the summaries to arrive at the actual summary.
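A rough sketch of that idea (chunk size and file names are arbitrary, and unlike some of the answers above it does not preserve the original line order):

 # Split into manageable pieces, summarize each, then merge the sorted summaries.
 split -l 10000000 -d huge.txt chunk_
 for c in chunk_*; do
   sort -u "$c" > "$c.uniq" && rm -f "$c"   # per-piece summary of unique lines
 done
 sort -m -u chunk_*.uniq > uniq.txt          # -m merges the already-sorted summaries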

+1

Maybe not the answer you were looking for, but here it is: use a Bloom filter. https://en.wikipedia.org/wiki/Bloom_filter This kind of problem is one of the main reasons they exist.

+1
