Why doesn't AWK treat this array index as a number if I don't use int()?

Question

I have genomics files of the following type:

 $ cat test-file_long.txt
 2 41647 AG
 2 45895 AG
 2 45953 TC
 2 224919 AG
 2 230055 CG
 2 233239 AG
 2 234130 TG
 2 23454 TC

When I use the following short AWK script, it does not return all elements that are larger than the element used in the if statement:

 { a[$2] } END{ for (i in a){ if(i > 45895) print i } }

The script returns this:

 $ awk -f practice.awk test-file_long.txt
 45953

However, when I wrap the index in int() in the if statement, it returns every value that is numerically greater than the threshold, which is what I want:

 { a[$2] } END{ for (i in a){ if(int(i) > 45895) print i } }

Result:

 $ awk -f practice.awk test-file_long.txt
 233239
 230055
 234130
 224919
 45953

It seems that awk only compares the first digit and, if those are the same, moves on to the next digit, rather than treating the value as an integer. Can someone explain what this says about the internal mechanism of the associative array, that it does not make a numerical > / < comparison unless I specify that I want int() of the array element? What if my array elements were floats and int() was not an option?

arrays bash awk

isosceleswheel Apr 24 '14 at 15:13

1 answer

Tom Fenech · Accepted Answer · 2014-04-24T15:16:22+0000

Array keys in awk are always strings, so i > 45895 is a string (character-by-character) comparison rather than a numeric one. In your first example, 45953 passes the test because "459" sorts after "458", while values such as 224919 fail because "2" sorts before "4".
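A minimal sketch of the difference, using two of the values from the question. The function name compare_demo and the variables s and n are made up for illustration, and the output is piped through sort -n only because the order of for (i in a) is unspecified:

```shell
compare_demo() {
  printf '%s\n' 45953 224919 |
  awk '{ a[$1] }
       END {
         for (i in a) {
           s = (i > 45895)   ? "yes" : "no"   # i is a string: string comparison
           n = (i+0 > 45895) ? "yes" : "no"   # i+0 is a number: numeric comparison
           print i, "string:", s, "numeric:", n
         }
       }' | sort -n
}
compare_demo
```

This should report "yes" for both comparisons on 45953, but "no" for the string comparison and "yes" for the numeric one on 224919.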

If your only goal is to print lines whose second column > 45895 numerically, this would do:

 awk '$2 > 45895' test-file_long.txt

Variables are treated differently depending on the context in which they are evaluated, so placing a variable in an explicitly numeric context makes awk treat it as a number. @glenn's suggestion of i+0 demonstrates this perfectly.

Alternatively, the unary plus operator, as in +i, can be used to convert an expression to a number. Your longer example can therefore be changed to:

 awk '{a[$2]} END { for (i in a) { if (+i > 45895) print i } }' test-file_long.txt
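The same idiom should also cover the decimal case raised in the question, because adding zero converts the key to a number without truncating it the way int() does. A small sketch with invented decimal values (they are not from the original data, and the function name filter_decimals is hypothetical):

```shell
filter_decimals() {
  printf '%s\n' 9.5 10.25 100.75 |
  awk '{ a[$1] }
       END {
         for (i in a)
           if (i+0 > 10) print i   # numeric comparison on the string key
       }' | sort -n                # for-in order is unspecified, so sort
}
filter_decimals
```

Here 10.25 and 100.75 are printed and 9.5 is not. With a plain i > 10 the string "9.5" would also pass, because "9" sorts after "1".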
