How to sort a tab-delimited file by the length of the K-th column

I have a tab-delimited file that looks like this:

>NODE 28 length 23 cov 11.043478 ACATCCCGTTACGGTGAGCCGAAAGACCTTATGTATTTTGTGG
>NODE 32 length 21 cov 13.857142 ACAGATGTCATGAAGAGGGCATAGGCGTTATCCTTGACTGG
>NODE 33 length 28 cov 14.035714 TAGGCGTTATCCTTGACTGGGTTCCTGCCCACTTCCCGAAGGACGCAC

How can I use Unix sort to order the records by the length of the DNA sequence (the [ATCG] column)?

4 answers

This pipeline computes the length as well. My Unix is a little rusty; I have been doing other things for a while.

 $ awk '{printf("%d %s\n", length($NF), $0)}' junk.lst | sort -n -k1,1 | sed 's/^[0-9]* //' 

If the length is in the 4th column, sort -n -k4 should do the trick.

If the question is how to compute the length, then you are looking for a preprocessing step before sorting. Perhaps a Python script that just prints the length of the sequence field (the 7th space-separated field) as an extra first or last column.
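A minimal sketch of that preprocessing idea in Python, folded into a single sort. It assumes that the sequence is always the last whitespace-separated field of a record. The function name `sort_by_sequence_length` is made up for illustration. Instead of printing the length as an extra column, it passes the length to `sorted` as the key.

```python
def sort_by_sequence_length(lines):
    """Return the non-blank records ordered by the length of their last field."""
    # Skip blank lines so that split()[-1] always has a field to return.
    records = [line for line in lines if line.strip()]
    # Assumption: the DNA sequence is the last whitespace-separated field.
    # sorted() is stable, so records with equal lengths keep their input order.
    return sorted(records, key=lambda line: len(line.split()[-1]))


# The three records from the question.
SAMPLE = [
    ">NODE 28 length 23 cov 11.043478 ACATCCCGTTACGGTGAGCCGAAAGACCTTATGTATTTTGTGG\n",
    ">NODE 32 length 21 cov 13.857142 ACAGATGTCATGAAGAGGGCATAGGCGTTATCCTTGACTGG\n",
    ">NODE 33 length 28 cov 14.035714 TAGGCGTTATCCTTGACTGGGTTCCTGCCCACTTCCCGAAGGACGCAC\n",
]

for record in sort_by_sequence_length(SAMPLE):
    print(record, end="")
```

The function accepts any iterable of lines, so an open file object or `sys.stdin` can be passed in place of `SAMPLE`.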

  awk '{print length($NF) $0|"sort -n"}' file | sed 's/^.[^>]*>/>/' 

With Perl:

 perl -e' print sort { length +($a =~ /(\S+)$/)[0] <=> length +($b =~ /(\S+)$/)[0] } <>' infile 

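With GNU awk (this relies on the undocumented WHINY_USERS environment variable, which makes older gawk releases traverse arrays in sorted index order; NR is part of the key so that records with equal sequence lengths do not overwrite each other):

 WHINY_USERS= gawk 'END { for (L in l) print l[L] } { l[sprintf("%15s%15s", length($NF), NR)] = $0 }' infile 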

Source: https://habr.com/ru/post/1313594/

