This question was discussed on Meta, and my answer there was to create a canonical question/answer that we can link to. The question of whether to use gawk or mawk or C or some other language often comes up on performance grounds, so let's create a canonical question/answer for a trivial and typical awk program.
The result of this will be an answer that compares the performance of various tools performing the basic text-processing tasks of matching regexps and splitting fields on a simple input file. If tool X is twice as fast as every other tool for this task, then that is useful information. If all the tools take about the same amount of time, then that is also useful information.
The way this will work is that over the next few days many people will contribute "answers", which are tested programs, and then one person (volunteers?) will time them all on the same platform (or a few people will time subsets on their own platforms so we can compare), and then all of the results will be collected into a single answer.
Given the 10 million line input file created by this script:
$ awk 'BEGIN{for (i=1;i<=10000000;i++) print (i%5?"miss":"hit"),i," third\t \tfourth"}' > file

$ wc -l file
10000000 file

$ head -10 file
miss 1  third    fourth
miss 2  third    fourth
miss 3  third    fourth
miss 4  third    fourth
hit 5  third    fourth
miss 6  third    fourth
miss 7  third    fourth
miss 8  third    fourth
miss 9  third    fourth
hit 10  third    fourth
and given this awk script, which prints the 4th then the 1st then the 3rd field of every line that starts with "hit" followed by a number ending in zero:
$ cat tst.awk
/hit [[:digit:]]*0 / { print $4, $1, $3 }
Here are the first 5 lines of the expected output:
$ awk -f tst.awk file | head -5
fourth hit third
fourth hit third
fourth hit third
fourth hit third
fourth hit third
and here is the result when that output is piped to a second awk script, to sanity-check that the main script above really does exactly what it is intended to do:
$ awk -f tst.awk file | awk '!seen[$0]++{unq++;r=$0} END{print ((unq==1) && (seen[r]==1000000) && (r=="fourth hit third")) ? "PASS" : "FAIL"}'
PASS
Here are the timing results of the 3rd execution of gawk 4.1.1, running in bash 4.3.33 on cygwin64:
$ time awk -f tst.awk file > /dev/null

real    0m4.711s
user    0m4.555s
sys     0m0.108s
Note that I show the 3rd execution to remove caching differences.
Can anyone provide the equivalent C, perl, python, whatever code for this:
$ cat tst.awk
/hit [[:digit:]]*0 / { print $4, $1, $3 }
i.e. find that regexp on a line (we are not looking for some other solution that works around the need for the regexp), split the line at each series of contiguous blanks, and print the 4th, then 1st, then 3rd fields separated by a single blank char?
If so, we can time them all on the same platform to see/record the performance differences.
Code entered so far:
AWK (can be tested with gawk, etc., but mawk, nawk, and perhaps others will need [0-9] instead of [[:digit:]])
awk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file
PHP
php -R 'if(preg_match("/hit \d*0 /", $argn)){$f=preg_split("/\s+/", $argn); echo $f[3]." ".$f[0]." ".$f[2]."\n";}' < file
Shell
egrep 'hit [[:digit:]]*0 ' file | awk '{print $4, $1, $3}'
grep --mmap -E "^hit [[:digit:]]*0 " file | awk '{print $4, $1, $3 }'
Ruby
$ cat tst.rb
File.open("file").readlines.each do |line|
  line.gsub(/(hit)\s[0-9]*0\s+(.*?)\s+(.*)/) { puts "#{$3} #{$1} #{$2}" }
end
$ ruby tst.rb
Perl
$ cat tst.pl
Python
none yet
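Since there is no Python entry yet, here is a minimal, untimed sketch of one (the name tst.py and the choice to read from stdin are mine; run it as python tst.py < file). It uses the mawk-style [0-9] form of the regexp, and str.split(), which, like awk's default field splitting, splits on runs of whitespace:

```python
import re
import sys

# Same regexp as tst.awk, using [0-9] rather than [[:digit:]].
PATTERN = re.compile(r"hit [0-9]*0 ")

def process(line):
    """Return '4th 1st 3rd' fields for a matching line, else None."""
    if PATTERN.search(line):
        f = line.split()  # split on runs of whitespace, like awk
        return "{} {} {}".format(f[3], f[0], f[2])
    return None

if __name__ == "__main__":
    # Equivalent of: awk -f tst.awk file, with the file fed on stdin.
    for line in sys.stdin:
        result = process(line)
        if result is not None:
            print(result)
```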
C
none yet