What is the difference in performance between gawk and ....?

This question was discussed here on Meta, and my answer is to provide a link to a test harness for answering it.


The question often arises whether to use gawk or mawk or C or some other language for performance reasons, so let's create a canonical question/answer for a trivial but typical awk program.

The result will be an answer that compares the performance of various tools performing the basic text-processing tasks of matching a regular expression and splitting fields on a simple input file. If tool X is twice as fast as every other tool at this task, that is useful information. If all the tools take about the same amount of time, that is also useful information.

How this will work: over the next few days many people will contribute "answers", i.e. tested programs, and then one person (volunteers?) will time them all on the same platform (or several people will time subsets on their own platforms so we can compare), and then all the results will be collected into a single answer.

Given the 10 million line input file created by this script:

    $ awk 'BEGIN{for (i=1;i<=10000000;i++) print (i%5?"miss":"hit"),i," third\t \tfourth"}' > file
    $ wc -l file
    10000000 file
    $ head -10 file
    miss 1  third    fourth
    miss 2  third    fourth
    miss 3  third    fourth
    miss 4  third    fourth
    hit 5  third    fourth
    miss 6  third    fourth
    miss 7  third    fourth
    miss 8  third    fourth
    miss 9  third    fourth
    hit 10  third    fourth

and given this awk script, which prints the 4th, then the 1st, then the 3rd field of every line that starts with "hit" followed by a number ending in 0:

    $ cat tst.awk
    /hit [[:digit:]]*0 / { print $4, $1, $3 }

Here are the first 5 lines of the expected output:

    $ awk -f tst.awk file | head -5
    fourth hit third
    fourth hit third
    fourth hit third
    fourth hit third
    fourth hit third

and here is the result when it is piped into a second awk script, to verify that the main script above really behaves exactly as intended:

    $ awk -f tst.awk file | awk '!seen[$0]++{unq++;r=$0} END{print ((unq==1) && (seen[r]==1000000) && (r=="fourth hit third")) ? "PASS" : "FAIL"}'
    PASS

Here are the timing results from the third execution of that script under gawk 4.1.1, running in bash 4.3.33 on cygwin64:

    $ time awk -f tst.awk file > /dev/null

    real    0m4.711s
    user    0m4.555s
    sys     0m0.108s

Note that the third execution is shown in order to eliminate caching differences.

Can anyone provide equivalent C, perl, python, or other code for this:

    $ cat tst.awk
    /hit [[:digit:]]*0 / { print $4, $1, $3 }

i.e. find that regexp on a line (we are not looking for alternative solutions that work around the need for the regexp), split the line at every series of contiguous blanks, and print its 4th, then 1st, then 3rd fields, separated by a single blank char?

If so, we can test them on the same platform to see / record performance differences.


Code entered so far:

AWK (can be tested with gawk, etc., but mawk, nawk, and possibly others will require [0-9] instead of [[:digit:]])

 awk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file 

PHP

 php -R 'if(preg_match("/hit \d*0 /", $argn)){$f=preg_split("/\s+/", $argn); echo $f[3]." ".$f[0]." ".$f[2];}' < file 

Shell

    egrep 'hit [[:digit:]]*0 ' file | awk '{print $4, $1, $3}'
    grep --mmap -E "^hit [[:digit:]]*0 " file | awk '{print $4, $1, $3 }'

Ruby

    $ cat tst.rb
    File.open("file").readlines.each do |line|
      line.gsub(/(hit)\s[0-9]*0\s+(.*?)\s+(.*)/) { puts "#{$3} #{$1} #{$2}" }
    end
    $ ruby tst.rb

Perl

    $ cat tst.pl
    #!/usr/bin/perl -nl
    # A solution much like the Ruby one but with atomic grouping
    print "$4 $1 $3" if /^(hit)(?>\s+)(\d*0)(?>\s+)((?>[^\s]+))(?>\s+)(?>([^\s]+))$/
    $ perl tst.pl file

Python

 none yet 
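No Python entry has been posted yet, so here is one possible sketch following the same match-then-split shape as tst.awk. It has not been timed, so it carries no performance claim, and the names (the process helper, a file tst.py) are my own. [[:digit:]] is spelled [0-9] because a bare \d in Python also matches non-ASCII digits unless re.ASCII is passed.

```python
import re

# Same regexp as tst.awk: "hit", a blank, digits ending in 0, a blank.
PAT = re.compile(r"hit [0-9]*0 ")

def process(lines):
    """Yield "$4 $1 $3" for every line the regexp matches."""
    for line in lines:
        if PAT.search(line):
            f = line.split()  # split on runs of whitespace, like awk
            yield f"{f[3]} {f[0]} {f[2]}"

# Tiny in-memory check; for timing, feed it the real file instead:
#     with open("file") as fh:
#         for rec in process(fh):
#             print(rec)
sample = ["miss 9  third\t \tfourth\n", "hit 10  third\t \tfourth\n"]
print(list(process(sample)))  # -> ['fourth hit third']
```

Saved as tst.py with the file-reading driver from the comment, it could be timed the same way as the other entries: `time python tst.py > /dev/null`.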

C

 none yet 
+5
5 answers

Using egrep before awk gives a large speedup:

    paul@home ~ % wc -l file
    10000000 file
    paul@home ~ % for i in {1..5}; do time egrep 'hit [[:digit:]]*0 ' file | awk '{print $4, $1, $3}' | wc -l ; done
    1000000
    egrep --color=auto 'hit [[:digit:]]*0 ' file  0.63s user 0.02s system 85% cpu 0.759 total
    awk '{print $4, $1, $3}'  0.70s user 0.01s system 93% cpu 0.760 total
    wc -l  0.00s user 0.02s system 2% cpu 0.760 total
    1000000
    egrep --color=auto 'hit [[:digit:]]*0 ' file  0.65s user 0.01s system 85% cpu 0.770 total
    awk '{print $4, $1, $3}'  0.71s user 0.01s system 93% cpu 0.771 total
    wc -l  0.00s user 0.02s system 2% cpu 0.771 total
    1000000
    egrep --color=auto 'hit [[:digit:]]*0 ' file  0.64s user 0.02s system 82% cpu 0.806 total
    awk '{print $4, $1, $3}'  0.73s user 0.01s system 91% cpu 0.807 total
    wc -l  0.02s user 0.00s system 2% cpu 0.807 total
    1000000
    egrep --color=auto 'hit [[:digit:]]*0 ' file  0.63s user 0.02s system 86% cpu 0.745 total
    awk '{print $4, $1, $3}'  0.69s user 0.01s system 92% cpu 0.746 total
    wc -l  0.00s user 0.02s system 2% cpu 0.746 total
    1000000
    egrep --color=auto 'hit [[:digit:]]*0 ' file  0.62s user 0.02s system 88% cpu 0.727 total
    awk '{print $4, $1, $3}'  0.67s user 0.01s system 93% cpu 0.728 total
    wc -l  0.00s user 0.02s system 2% cpu 0.728 total

against

    paul@home ~ % for i in {1..5}; do time gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null; done
    gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null  2.46s user 0.04s system 97% cpu 2.548 total
    gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null  2.43s user 0.03s system 98% cpu 2.508 total
    gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null  2.40s user 0.04s system 98% cpu 2.489 total
    gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null  2.38s user 0.04s system 98% cpu 2.463 total
    gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null  2.39s user 0.03s system 98% cpu 2.465 total

'nawk' is even slower!

    paul@home ~ % for i in {1..5}; do time nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null; done
    nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null  6.05s user 0.06s system 92% cpu 6.606 total
    nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null  6.11s user 0.05s system 96% cpu 6.401 total
    nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null  5.78s user 0.04s system 97% cpu 5.975 total
    nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null  5.71s user 0.04s system 98% cpu 5.857 total
    nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null  6.34s user 0.05s system 93% cpu 6.855 total
+4

On OS X Yosemite:

    time bash -c 'grep --mmap -E "^hit [[:digit:]]*0 " file | awk '\''{print $4, $1, $3 }'\''' >/dev/null

    real    0m5.741s
    user    0m6.668s
    sys     0m0.112s
+3

Here's the equivalent in PHP:

    $ time php -R 'if(preg_match("/hit \d*0 /", $argn)){$f=preg_split("/\s+/", $argn); echo $f[3]." ".$f[0]." ".$f[2];}' < file > /dev/null

    real    2m42.407s
    user    2m41.934s
    sys     0m0.355s

compared to your awk:

    $ time awk -f tst.awk file > /dev/null

    real    0m3.271s
    user    0m3.165s
    sys     0m0.104s

I tried a different approach in PHP where I iterate over the file manually; it makes things a lot faster, but I'm still not impressed:

tst.php

    <?php
    $fd = fopen('file', 'r');
    while($line = fgets($fd)){
        if(preg_match("/hit \d*0 /", $line)){
            $f = preg_split("/\s+/", $line);
            echo $f[3]." ".$f[0]." ".$f[2]."\n";
        }
    }
    fclose($fd);

Results:

    $ time php tst.php > /dev/null

    real    0m27.354s
    user    0m27.042s
    sys     0m0.296s
+2
  • first idea

     File.open("file").readlines.each do |line|
       line.gsub(/(hit)\s[0-9]*0\s+(.*?)\s+(.*)/) { puts "#{$3} #{$1} #{$2}" }
     end
  • Second idea

     File.read("file").scan(/(hit)\s[[:digit:]]*0\s+(.*?)\s+(.*)/) { |f,s,t| puts "#{t} #{f} #{s}" } 

Trying to get something that makes the answers comparable, I ended up creating a github repo here. Every push to that repo triggers a build on travis-ci, which generates a markdown file that is then pushed to the gh-pages branch, updating a page with an overview of the build results.

Anyone who wants to take part can fork the github repo, add tests, and open a pull request, which I will merge as soon as possible provided it does not break the rest of the tests.

+2

mawk is slightly faster than gawk.

    $ time bash -c 'mawk '\''/hit [[:digit:]]*0 / { print $4, $1, $3 }'\'' file | wc -l'
    0

    real    0m1.160s
    user    0m0.484s
    sys     0m0.052s
    $ time bash -c 'gawk '\''/hit [[:digit:]]*0 / { print $4, $1, $3 }'\'' file | wc -l'
    100000

    real    0m1.648s
    user    0m0.996s
    sys     0m0.060s

(Only 1,000,000 lines in my input file. Best results of many runs shown, though they were fairly consistent.)

+1
