Check whether two lines start with the same value: if so, print the average; if not, print the actual value

I would like to check whether two rows start with the same number in the first column. If they do, the average of their second-column values should be displayed. Example file:

 01 21 6 10% 93.3333%
 01 22 50 83.3333% 93.3333%
 02 20.5 23 18.1102% 96.8504%
 02 21.5 100 78.7402% 96.8504%
 03 22.2 0 0% 100%
 03 21.2 29 100% 100%
 04 22.5 1 5.55556% 100%
 04 23.5 17 94.4444% 100%
 05 22.7 9 7.82609% 100%
 05 21.7 106 92.1739% 100%
 06 23 11 17.4603% 96.8254%
 06 22 50 79.3651% 96.8254%
 07 20.5 14 18.6667% 96%
 07 21.5 58 77.3333% 96%
 08 21.8 4 100% 100%
 09 22.6 0 0% 100%
 09 21.6 22 100% 100%

For example, the first two lines begin with 01, but there is only one line starting with 08 (the 15th line). Therefore, the result for these two cases should be:

 01 21.5 ... ... ...
 08 21.8 ... ... ...

I ended up with the following awk line, which works fine when the file always has two similar lines, but it doesn't work using the file shown above (because of the 15th line):

 awk '{sum+=$2} (NR%2)==0{print sum/2; sum=0;}' 

Any hints are welcome,

4 answers

This awk should work:

 awk 'function dump(){if (n>0) printf "%s%s%.2f\n", p, OFS, sum/n} NR>1 && $1 != p{dump(); sum=n=0} {p=$1; sum+=$2; n++} END{dump()}' file

 01 21.50
 02 21.00
 03 21.70
 04 23.00
 05 22.20
 06 22.50
 07 21.00
 08 21.80
 09 22.10

Explanation: We use 3 variables:

 p   -> holds the previous row's $1 value
 n   -> count of rows sharing the same $1 value
 sum -> sum of the $2 values for rows sharing the same $1

How it works:

 NR>1 && $1 != p     # past the first record and the current $1 differs from the previous one: a new group starts
 dump()              # prints the previous $1 and the average for its group (only if n > 0)
 p=$1; sum+=$2; n++  # sets p to $1, adds the current $2 to sum and increments n
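
The steps above can be traced on a three-line sample. This is only a sketch of the same logic spread over several lines: the wrapper name `avg_by_key` and the sample rows are made up for illustration and are not part of the answer.

```shell
# Hypothetical wrapper: average column 2 over each run of equal column-1 keys.
avg_by_key() {
  awk '
    function dump() {                          # print key and mean of the finished group
      if (n > 0) printf "%s%s%.2f\n", p, OFS, sum / n
    }
    NR > 1 && $1 != p { dump(); sum = n = 0 }  # key changed: flush, then reset
    { p = $1; sum += $2; n++ }                 # accumulate the current row
    END { dump() }                             # flush the last group
  ' "$@"
}

# Sample input: key 01 appears twice, key 08 only once.
printf '%s\n' '01 21 6' '01 22 50' '08 21.8 4' | avg_by_key
```

Because `dump()` divides by the actual group size `n` rather than a fixed 2, a key that appears only once is printed with its own value.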

Using GNU awk

 gawk '
     {sum[$1]+=$2; n[$1]++}
     END {
         PROCINFO["sorted_in"] = "@ind_num_asc"
         for (key in sum) print key, sum[key]/n[key]
     }
 ' file

 01 21.5
 02 21
 03 21.7
 04 23
 05 22.2
 06 22.5
 07 21
 08 21.8
 09 22.1

The PROCINFO line makes the for loop traverse the array sorted numerically by index. Without it, the traversal order is unspecified, so the output lines may appear in any order.
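
Where gawk is not available, one possible alternative (not what this answer does) is to record each key the first time it is seen and loop over that list in END. The sketch below assumes output in first-seen order is acceptable; the wrapper name `avg_in_input_order` and the array names `order` and `k` are made up for illustration.

```shell
# Hypothetical wrapper: per-key mean of column 2, keys printed in first-seen order.
avg_in_input_order() {
  awk '
    !($1 in n) { order[++k] = $1 }      # remember each new key once
    { sum[$1] += $2; n[$1]++ }          # accumulate sum and count per key
    END {
      for (i = 1; i <= k; i++)          # indexed loop gives a fixed order
        print order[i], sum[order[i]] / n[order[i]]
    }
  ' "$@"
}

# Sample input with interleaved keys: 02 is seen before 01.
printf '%s\n' '02 20.5 23' '01 21 6' '02 21.5 100' '01 22 50' | avg_in_input_order
```

Unlike the `PROCINFO["sorted_in"]` setting, this does not sort anything; it only makes the order deterministic.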


awk piped to sort

 awk '{s[$1]+=$2;c[$1]++} END{for(i in s) print i, s[i]/c[i]}' file | sort 
 awk '
 second{
     if($1 == first){
         print (second + $2) / 2
         second = 0
         next
     }
     else
         print second
 }
 {
     printf "%s ", $1
     first = $1
     second = $2
 }
 END{
     if(second)
         print second
 }' file