I'm trying to use awk to parse a tab-delimited table. There are several duplicate entries in the first column, and whenever the first column is duplicated I need to remove the row whose remaining 4 columns have the smaller total. I can remove the first or second duplicate, and I can sum the columns, but I'm having trouble combining the two. For my purposes there will never be more than two rows with the same first column.
Example file: http://pastebin.com/u2GBnm2D
The desired output in this case would be to delete the lines:
lmo0330 1 1 0 1
lmo0506 7 21 2 10
and keep the other two rows that share those gene identifiers in the first column. The final parsed file would look like this: http://pastebin.com/WgDkm5ui
Here is what I tried (as written it doesn't do anything useful, but the first part is meant to drop the second duplicate and the second part sums the columns):
awk 'BEGIN {!a[$1]++} {for(i=1;i<=NF;i++) t+=$i; print t; t=0}'
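For reference, here is a single-pass sketch of what I think I'm aiming for. It's untested and only a guess at the right approach; it assumes tab-delimited input, that columns 2-5 are the ones to total, that the input file is called infile, and that on an exact tie the first row should be kept:

awk -F'\t' '
{
    s = $2 + $3 + $4 + $5              # total of the four count columns
    if (!($1 in max)) {                # first time this gene id is seen
        keys[++n] = $1                 # remember first-seen order
        max[$1] = s; row[$1] = $0
    } else if (s > max[$1]) {          # a duplicate with a larger total wins
        max[$1] = s; row[$1] = $0
    }
}
END {
    for (i = 1; i <= n; i++)           # print one row per gene id, in input order
        print row[keys[i]]
}' infile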
I also tried modifying the second half of the script from the top answer to this question: Removing lines containing a unique first field with awk?
awk 'FNR==NR{a[$1]++;next}(a[$1] > 1)' ./infile ./infile
Unfortunately, I don't understand what is going on well enough to make it work. Can someone help me? I think I need to replace the a[$1] > 1 part with something that keeps whichever duplicate (the first or the second) has the greater total, and removes the other.
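Based on that answer, I imagine a two-pass version would look something like the sketch below, but I'm not sure it's correct or idiomatic. It reads ./infile twice and assumes tab-delimited input; the !printed guard is my attempt to stop an exact tie from printing twice:

awk -F'\t' '
FNR == NR {                                    # first pass: record the largest total per gene id
    s = $2 + $3 + $4 + $5
    if (!($1 in max) || s > max[$1]) max[$1] = s
    next
}
{                                              # second pass: only print a row holding that maximum
    s = $2 + $3 + $4 + $5
    if (s == max[$1] && !printed[$1]++) print
}' ./infile ./infile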
EDIT: I'm using GNU Awk 3.1.7, if that matters.