A similar question was asked here, but it did not address why there is a difference in speed between sort and awk.
I originally asked this question on Unix Stack Exchange, but since I was told it would be a better fit for Stack Overflow, I am posting it here.
I need to deduplicate a large list of words. I tried several commands and did some research here and here, where it was explained that the fastest way to deduplicate a word list appears to be awk, because awk doesn't sort the list: it uses hash lookups to track items and drop duplicates (a rough Python sketch of that approach is below). Since awk uses hash lookups, the claimed big-O complexities are:
awk → O(n)?
sort → O(n log n)?
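For reference, here is a minimal Python sketch of that hash-based approach. This is only my reading of what `awk '!x[$0]++'` does (keep the first occurrence of every line, using a hash table for membership tests), not awk's actual implementation:

```python
import sys

def dedup_preserving_order(lines):
    """Print each line only the first time it appears, like awk '!x[$0]++'."""
    seen = set()                 # hash table of lines already emitted
    for line in lines:
        if line not in seen:     # expected O(1) hash lookup per line
            seen.add(line)
            sys.stdout.write(line)

if __name__ == "__main__":
    with open(sys.argv[1], "r", encoding="utf-8", errors="replace") as f:
        dedup_preserving_order(f)
```

With one expected-constant-time lookup per line, the whole pass should be roughly O(n), which is where the awk → O(n) estimate comes from.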
However, I found that this is not the case in practice. Here are my test results. I created two random word lists using this Python script.
List1 = 7 MB
List2 = 690 MB
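I won't paste the linked script here; a hypothetical generator along the same lines (one random lowercase word per line) would look roughly like this. The alphabet, word lengths and counts are just placeholders, not the exact parameters I used:

```python
import random
import string

def write_random_words(path, n_words, min_len=3, max_len=12):
    """Write n_words random lowercase words, one per line (illustrative only)."""
    with open(path, "w") as f:
        for _ in range(n_words):
            length = random.randint(min_len, max_len)
            word = "".join(random.choices(string.ascii_lowercase, k=length))
            f.write(word + "\n")

if __name__ == "__main__":
    # Adjust the counts to reach whatever file size you want to test with.
    write_random_words("list1.txt", 1_000_000)
    write_random_words("list2.txt", 100_000_000)
```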
Test commands:
sort -u input.txt -o output.txt
awk '!x[$0]++' input.txt > output.txt
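The timings below come from running each command under time. If you want to reproduce the comparison, a small harness like the following works too (same commands and file names as above; the wall-clock measurement corresponds only to the "real" figure, and the harness itself is just one way of scripting it):

```python
import subprocess
import time

COMMANDS = {
    "sort": "sort -u input.txt -o output_sort.txt",
    "awk":  "awk '!x[$0]++' input.txt > output_awk.txt",
}

def time_command(cmd):
    """Run a shell command once and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, shell=True, check=True)
    return time.perf_counter() - start

if __name__ == "__main__":
    for label, cmd in COMMANDS.items():
        print(f"{label}: {time_command(cmd):.3f} s")
```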
AWK:
List1
real 0m1.643s
user 0m1.565s
sys 0m0.062s
List2
real 2m6.918s
user 2m4.499s
sys 0m1.345s
SORT:
List1
real 0m0.724s
user 0m0.666s
sys 0m0.048s
List2
real 1m27.254s
user 1m25.013s
sys 0m1.251s
As the tests show, sort turned out to be noticeably faster than awk in both cases. Why is that, if awk is supposed to have the lower complexity and doesn't sort at all?