The fastest way to remove duplicates in a large list of words?

A similar question was asked here, but it did not address why there is a difference in speed between sort and awk.

I first asked this question on Unix Stack Exchange, but since I was told it would be a better fit for Stack Overflow, I am posting it here.

I need to deduplicate a large list of words. I tried several commands and did some research here and here, where it was explained that the fastest way to deduplicate a word list seems to be awk, because awk does not sort the list: it uses a hash lookup to track the items it has already seen and drops the duplicates. Because awk uses a hash lookup, the claimed complexities are:

awk → O(n)?
sort → O(n log n)?

However, I found that this is not the case. Here are my test results. I created two random word lists using this Python script.

List1 = 7 MB
List2 = 690 MB
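
The generator script itself is behind the link above; purely as an illustration, a minimal sketch of such a generator might look like the following (the file names, word length, and word counts are assumptions chosen to roughly match the sizes reported above, at about 7 bytes per line):

# Hypothetical sketch of a random word list generator; the real script is
# the one linked above. Word length and counts are guesses that roughly
# reproduce the 7 MB and 690 MB file sizes.
import random
import string

def make_wordlist(path, n_words, word_len=6):
    """Write n_words random lowercase words, one per line."""
    with open(path, "w") as f:
        for _ in range(n_words):
            word = "".join(random.choices(string.ascii_lowercase, k=word_len))
            f.write(word + "\n")

make_wordlist("list1.txt", 1_000_000)    # ~7 MB
make_wordlist("list2.txt", 100_000_000)  # ~690 MB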

Test commands

sort -u input.txt -o output.txt 

awk '!x[$0]++' input.txt > output.txt
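
The awk one-liner keeps an associative array x of every line it has already seen and prints a line only on its first occurrence, so it preserves the input order and never sorts. A minimal Python sketch of the same hash-based, order-preserving deduplication (for illustration only, not one of the commands under test):

import sys

def dedup(lines):
    # 'seen' plays the role of awk's x[] array: one average O(1) lookup
    # and insert per line, so the whole pass is O(n) in the number of lines.
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line

if __name__ == "__main__":
    # usage: python dedup.py < input.txt > output.txt
    sys.stdout.writelines(dedup(sys.stdin))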

AWK:
List1
real 0m1.643s
user 0m1.565s
sys 0m0.062s

List2
real 2m6.918s
user 2m4.499s
sys 0m1.345s

SORT:
List1
real 0m0.724s
user 0m0.666s
sys 0m0.048s

List2
real 1m27.254s
user 1m25.013s
sys 0m1.251s

As these tests show, sort -u is clearly faster than awk in both cases. Why is that, if awk only needs a hash lookup while sort has to sort the whole list?

How many duplicates do your lists of 1,000,000 and 100,000,000 words actually contain? If it is only about 1%, sort -u has almost nothing to remove, while awk still has to keep every word in its hash table. For comparison, I generated a list of 1,000,000 words of which about 500,000 are duplicates (roughly 50% instead of 1%) and got:

% time awk '!x[$0]++' randomwordlist.txt > /dev/null
awk ...  1.32s user 0.02s system 99% cpu 1.338 total
% time sort -u randomwordlist.txt -o /dev/null
sort ...  14.25s user 0.04s system 99% cpu 14.304 total
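
The answer does not show how the 50%-duplicate list was built; one hypothetical way to generate a list with a controlled duplicate rate (the file name, word length, and counts here are assumptions) is to draw the lines from a limited pool of distinct words:

import random
import string

def make_list_with_dupes(path, n_words, n_distinct, word_len=6):
    # Pre-generate a pool of n_distinct words, then sample n_words lines
    # from it. With n_distinct = n_words // 2, roughly half of the lines
    # end up being repeats of an earlier line.
    pool = ["".join(random.choices(string.ascii_lowercase, k=word_len))
            for _ in range(n_distinct)]
    with open(path, "w") as f:
        for _ in range(n_words):
            f.write(random.choice(pool) + "\n")

make_list_with_dupes("randomwordlist.txt", 1_000_000, 500_000)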
A few general points:

  • Big-O notation only tells you how the running time scales with N; it does not tell you the actual running time. O(N) really means something like k1 * N + c1, and O(N * log N) means something like k2 * N * log(N) + c2. Which one wins for a given N depends on the constants k and c (a toy numeric example follows this list).
  • The constants k and c depend on the hardware and on the implementation.
  • The only way to find out which command is faster for your list 1 or list 2 is to benchmark them on your machine with your data, which is exactly what you did.
  • If you make the list big enough, the O(N) approach will eventually win :-)
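
To make the constants argument concrete, here is a toy calculation with invented constants (k1, c1, k2, c2 are made up purely for illustration): an O(N log N) cost with a small per-item constant stays below an O(N) cost with a large per-item constant until N passes the crossover point, which, ignoring c1 and c2, lies at N = 2^(k1/k2).

import math

# Invented constants, purely illustrative: a "cheap" O(N log N) algorithm
# versus an "expensive" O(N) one. Ignoring c1 and c2, the curves cross at
# log2(N) = k1/k2, i.e. N = 2**30 (about 10**9) for these values.
k1, c1 = 30.0, 5.0   # pretend cost per item of the hash-based approach
k2, c2 = 1.0, 5.0    # pretend cost per item of the sort-based approach

for n in (10**6, 10**8, 10**10):
    t_hash = k1 * n + c1
    t_sort = k2 * n * math.log2(n) + c2
    winner = "sort" if t_sort < t_hash else "hash"
    print(f"N={n:>14,}  hash~{t_hash:.2e}  sort~{t_sort:.2e}  -> {winner} wins")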
