How to sort by string length and then in reverse alphabetical order

I have a large set (600-odd) of search-and-replace terms that I need to run as a sed script over some files. The problem is that the search terms are NOT orthogonal ... but I think I can get away with it by sorting by string length (that is, pulling out the longest matches first), and then alphabetically within each length. So given the unsorted set:

aaba aa ab abba bab aba 

what I want is a sorted set, for example:

 abba aaba bab aba ab aa 

Is there a way to do this by, say, prepending the line length and sorting on that field?

For bonus points :-) !!! The search and replace is really just a case of replacing term with _term_, and the sed code I was going to use was s/term/_term_/g. How do I write the regular expression so that it does not replace terms that are already within _ pairs?

+4
6 answers

You can compress all this into one regular expression:

 $ sed -e 's/\(aaba\|aa\|abba\)/_\1_/g'
 testing words aa, aaba, abba.
 testing words _aa_, _aaba_, _abba_.

If I understand your question correctly, this solves both of your problems: it avoids double replacement, and it always matches the longest term.
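
As a rough sketch of the same single-pattern idea outside sed: Python's `re` module tries alternatives left to right rather than picking the longest one, so there the term list would need to be ordered longest-first before it is joined into one pattern. The function name `wrap_terms` and the sample call are made up for illustration.

```python
import re

def wrap_terms(text, terms):
    """Wrap each occurrence of any term in underscores, in a single pass."""
    # Longest first, so "aaba" is tried before "aa" at the same position.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    # A single pass never rescans replaced text, so nothing is wrapped twice.
    return pattern.sub(lambda m: "_" + m.group(0) + "_", text)

print(wrap_terms("testing words aa, aaba, abba.", ["aa", "aaba", "abba"]))
# prints: testing words _aa_, _aaba_, _abba_.
```

Because everything happens in one substitution, the order of the replacements no longer matters once the alternatives themselves are sorted.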

+2

You can do this in a single line Perl script:

 perl -e 'print sort { length $b<=>length $a || $b cmp $a } <>' input 
+10
 $ awk '{print length($1),$1}' file | sort -rn
 4 abba
 4 aaba
 3 bab
 3 aba
 2 ab
 2 aa

I leave getting rid of the first column as an exercise for you.
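
For comparison, the same ordering can be sketched in Python without a helper column at all, by sorting on a (length, term) key. `sort_terms` is a hypothetical name, and the input is the sample set from the question.

```python
def sort_terms(terms):
    """Longest first; reverse alphabetical within each length."""
    # Tuples compare element by element, so (length, term) with reverse=True
    # gives descending length and, for ties, descending alphabetical order.
    return sorted(terms, key=lambda t: (len(t), t), reverse=True)

print(sort_terms(["aaba", "aa", "ab", "abba", "bab", "aba"]))
# prints: ['abba', 'aaba', 'bab', 'aba', 'ab', 'aa']
```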

+2

Just pipe your input through a script like this:

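 #!/usr/bin/python
 import sys

 # Group the input lines by their length
 by_length = {}
 for line in sys.stdin:
     line = line.rstrip()
     if len(line) in by_length:
         by_length[len(line)].append(line)
     else:
         by_length[len(line)] = [line]

 # Longest first; reverse alphabetical within each length
 for l in reversed(sorted(by_length)):
     print("\n".join(reversed(sorted(by_length[l]))))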

As for the bonus question: once again, do it in Python (unless there really is a reason not to, which I would be very interested to hear).
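
A minimal sketch of what that could look like, assuming that any occurrence with an underscore directly on either side should be left alone (the question does not spell this out). The name `wrap_term` is hypothetical.

```python
import re

def wrap_term(text, term):
    """Wrap `term` in underscores unless an underscore already touches it."""
    # Lookbehind/lookahead only inspect the neighbouring characters; they do
    # not consume them, so two occurrences separated by one space both match.
    pattern = re.compile(r"(?<!_)" + re.escape(term) + r"(?!_)")
    return pattern.sub(lambda m: "_" + m.group(0) + "_", text)

print(wrap_term("a term and a _term_", "term"))
# prints: a _term_ and a _term_
```

Running the function twice on the same text gives the same result as running it once, which is the property the bonus question asks for.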

+1

This will sort the file by line length, longest lines first:

 while IFS= read -r LINE; do printf '%s\t%s\n' "${#LINE}" "$LINE"; done < file.txt | sort -rn | cut -f 2-

This will replace term with _term_, but will not turn _term_ into __term__:

 sed -r 's/(^|[^_])term([^_]|$)/\1_term_\2/g'
 sed -r -e 's/(^|[^_])term/\1_term_/g' -e 's/term([^_]|$)/_term_\1/g'

The first works well, except that it skips _term and term_, mistakenly leaving them alone. Use the second if that matters. Here is my silly test case:

 # echo here is _term_ and then a term you terminator haha _terminator and then _term_inator term_inator | sed -re 's/(^|[^_])term([^_]|$)/\1_term_\2/g'
 here is _term_ and then a _term_ you _term_inator haha _terminator and then _term_inator term_inator
 # echo here is _term_ and then a term you terminator haha _terminator and then _term_inator term_inator | sed -r -e 's/(^|[^_])term/\1_term_/g' -e 's/term([^_]|$)/_term_\1/g'
 here is _term_ and then a _term_ you _term_inator haha __term_inator and then _term_inator _term__inator
0

Sort by length first, then in reverse alphabetical order:

 for mask in `tr -c "\n" "." < $FILE | sort -ur`
 do
     grep "^$mask$" $FILE | sort -r
 done

Here tr replaces every character in $FILE with a period, which matches any single character in grep.

0
