Awk Script Issues

Question

I'm trying to use awk to parse a tab-delimited table. There are several duplicate entries in the first column, and for each duplicated entry I need to remove the row whose remaining 4 columns have the smaller total. I can easily remove the first or second duplicate row, and I can sum the columns, but I have trouble combining the two. For my purposes there will be no more than two duplicates.

Example file: http://pastebin.com/u2GBnm2D

The desired output in this case would be to delete the lines:

lmo0330 1       1       0       1
lmo0506 7       21      2       10

and to keep the other two rows that have the same gene identifier in the first column. The final parsed file would look like this: http://pastebin.com/WgDkm5ui

Here is what I tried. Combined like this it does not work, but on its own the first part removes the second duplicate and the second part sums the columns:

awk 'BEGIN {!a[$1]++} {for(i=1;i<=NF;i++) t+=$i; print t; t=0}'

I also tried modifying the second part of the script from the top answer to this question: Removing lines containing a unique first field with awk?

awk 'FNR==NR{a[$1]++;next}(a[$1] > 1)' ./infile ./infile

Unfortunately, I don't understand what is going on well enough to get it working. Can someone help me? I think I need to replace the a[$1] > 1 part with something like [remove the first encountered duplicate or the second duplicate, depending on which is larger].

EDIT: I'm using GNU Awk 3.1.7, in case it matters.

linux bash awk

1225 Jul 12 '15 at 6:22

1 answer

anubhava · Accepted Answer · 2015-07-12T07:56:22+0000

You can use this awk command:

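awk 'NR == 1 {
   print;                  # pass the header line through unchanged
   next
} {
   s = $2+$3+$4+$5         # total of the four data columns
} s >= sum[$1] {           # true for the first row of a gene (unset sum counts as 0) or an equal-or-larger total
   sum[$1] = s;
   if (!($1 in rows))
      a[++n] = $1;         # remember the order in which genes first appear
   rows[$1] = $0           # keep the row with the largest total seen so far
} END {
   for(i=1; i<=n; i++)
      print rows[a[i]]
}' file | column -t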

Output:

gene     SRR034450.out.rpkm_0  SRR034451.out.rpkm_0  SRR034452.out.rpkm_0  SRR034453.out.rpkm_0
lmo0001  160                   323                   533                   293
lmo0002  135                   317                   504                   306
lmo0003  1                     4                     5                     3
lmo0004  35                    59                    58                    48
lmo0005  113                   218                   257                   187
lmo0006  279                   519                   653                   539
lmo0007  563                   1053                  1165                  1069
lmo0008  34                    84                    203                   107
lmo0009  13                    45                    90                    49
lmo0010  57                    210                   237                   169
lmo0011  65                    224                   247                   179
lmo0012  65                    226                   250                   215
lmo0013  342                   500                   738                   682
lmo0014  662                   1032                  1283                  1311
lmo0015  321                   413                   631                   637
lmo0016  175                   253                   273                   325
lmo0017  3                     6                     6                     6
lmo0018  33                    38                    46                    45
lmo0019  13                    1                     39                    1
lmo0020  3                     12                    28                    15
lmo0021  3                     4                     14                    12
lmo0022  2                     3                     5                     1
lmo0023  2                     0                     3                     2
lmo0024  1                     0                     2                     6
lmo0330  1                     1                     1                     3
lmo0506  151                   232                   60                    204
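
For comparison, here is a minimal sketch of a two-pass variant built on the FNR==NR idiom the question mentions. It is not part of the answer above, only an illustration. The first pass records the largest row total per gene, and the second pass prints the header plus the first row that reaches that total. The sample file, its short column names (c1 to c4) and the order of its rows are assumptions made up for the demonstration; only the row values come from the question.

```shell
# Hypothetical sample file: a header plus a few rows taken from the question.
sample=$(mktemp)
printf 'gene\tc1\tc2\tc3\tc4\n'        >  "$sample"
printf 'lmo0001\t160\t323\t533\t293\n' >> "$sample"
printf 'lmo0330\t1\t1\t0\t1\n'         >> "$sample"
printf 'lmo0330\t1\t1\t1\t3\n'         >> "$sample"
printf 'lmo0506\t151\t232\t60\t204\n'  >> "$sample"
printf 'lmo0506\t7\t21\t2\t10\n'       >> "$sample"

# The same file is passed twice so that awk reads it in two passes.
result=$(awk -F'\t' '
FNR == NR {                      # first pass: record the largest row total per gene
   if (FNR > 1) {                # skip the header line
      s = $2 + $3 + $4 + $5
      if (!($1 in max) || s > max[$1])
         max[$1] = s
   }
   next
}
FNR == 1 { print; next }         # second pass: header goes through unchanged
{ s = $2 + $3 + $4 + $5 }
s == max[$1] && !seen[$1]++      # print only the first row that reaches the maximum
' "$sample" "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Compared with the single-pass command, this keeps rows in their original input order without an END block, at the cost of reading the file twice. On a tie it keeps the earlier row, whereas the s >= sum[$1] test in the single-pass command keeps the later one.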
