How to sort by string length and then reverse alphabetically

Question

I have a large set (600-odd) of search-and-replace terms that I need to run as a sed script over some files. The problem is that the search terms are NOT orthogonal... but I think I can get away with it by sorting by string length (that is, replacing the longest matches first) and then reverse alphabetically within each length. So, given the unsorted set

aaba aa ab abba bab aba

what I want is a sorted set, for example:

 abba aaba bab aba ab aa

Is there a way to do this by, say, prepending the line length and sorting on that field?

For bonus points :-) !!! The search and replace is really just a case of replacing term with _term_, and the sed command I was going to use is s/term/_term_/g. How would I write the regular expression so that it does not replace terms that are already inside _ pairs?

+4

sorting bash regex sed

Dycey Nov 03 '09 at 21:56

6 answers

You can do this in a single line Perl script:

 perl -e 'print sort { length $b<=>length $a || $b cmp $a } <>' input

+10

mob Nov 03 '09 at 10:15

 $ awk '{print length($1),$1}' file | sort -rn
 4 abba
 4 aaba
 3 bab
 3 aba
 2 ab
 2 aa

I leave it to you to get rid of the first column yourself.
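
For comparison, here is a minimal Python sketch of the same decorate, sort, undecorate idea, including the step that drops the length column again. The function name `sort_by_length_desc` is made up for illustration, and the sketch assumes the terms contain no surrounding whitespace.

```python
def sort_by_length_desc(terms):
    """Return terms longest first, reverse alphabetical within each length."""
    # Decorate: pair each term with its length, like the awk step.
    decorated = [(len(term), term) for term in terms]
    # Reverse sort on (length, term); ties on length fall back to the term itself.
    decorated.sort(reverse=True)
    # Undecorate: drop the length column again.
    return [term for _, term in decorated]


if __name__ == "__main__":
    print(" ".join(sort_by_length_desc("aaba aa ab abba bab aba".split())))
    # prints: abba aaba bab aba ab aa
```

In the shell pipeline, the equivalent of the last step would be cutting away everything up to the first space.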

+2

ghostdog74 Nov 04 '09 at 0:12

Just pipe your input through a script like this:

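 #!/usr/bin/python
 import sys
 # Group the input lines by their length.
 by_length = {}
 for line in sys.stdin:
     line = line.rstrip()
     if len(line) in by_length:
         by_length[len(line)].append(line)
     else:
         by_length[len(line)] = [line]
 # Longest groups first, reverse alphabetical order within each group.
 for l in reversed(sorted(by_length)):
     print("\n".join(reversed(sorted(by_length[l]))))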

And for the bonus question: once again, do it in Python (unless there really is a reason not to, in which case I would be very interested to hear it).

+1

Gyom Nov 03 '09 at 10:08

This will sort the file by line length, longest lines first:

 while IFS= read -r LINE; do printf '%s\t%s\n' "${#LINE}" "$LINE"; done < file.txt | sort -rn | cut -f 2-

This will replace term with _term_ , but will not turn _term_ into __term__ :

 sed -r 's/(^|[^_])term([^_]|$)/\1_term_\2/g'
 sed -r -e 's/(^|[^_])term/\1_term_/g' -e 's/term([^_]|$)/_term_\1/g'

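The first works well, except that it also skips _term and term_, mistakenly leaving them alone. Use the second if that matters; note that it goes the other way and double-wraps such cases (_terminator becomes __term_inator in the test below). Here is my silly test case: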

 # echo here is _term_ and then a term you terminator haha _terminator and then _term_inator term_inator | sed -re 's/(^|[^_])term([^_]|$)/\1_term_\2/g'
 here is _term_ and then a _term_ you _term_inator haha _terminator and then _term_inator term_inator
 # echo here is _term_ and then a term you terminator haha _terminator and then _term_inator term_inator | sed -r -e 's/(^|[^_])term/\1_term_/g' -e 's/term([^_]|$)/_term_\1/g'
 here is _term_ and then a _term_ you _term_inator haha __term_inator and then _term_inator _term__inator
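
One caveat with the first pattern: it consumes the character on each side of the match, so two occurrences separated by a single character cannot both match. The Python sketch below shows this on a direct translation of that pattern. It also tries zero-width lookarounds as one possible alternative; the lookaround variant is my assumption, not something the sed commands above use, and the helper names are made up.

```python
import re

# Direct translation of the first sed pattern: the neighbours are captured and put back.
CONSUMING = re.compile(r"(^|[^_])term([^_]|$)")
# Assumed alternative: lookarounds check the neighbours without consuming them.
LOOKAROUND = re.compile(r"(?<!_)term(?!_)")


def wrap_consuming(text):
    """Wrap 'term' using the capturing pattern."""
    return CONSUMING.sub(r"\1_term_\2", text)


def wrap_lookaround(text):
    """Wrap 'term' using zero-width lookarounds."""
    return LOOKAROUND.sub("_term_", text)


if __name__ == "__main__":
    sample = "term term _term_"
    print(wrap_consuming(sample))   # prints: _term_ term _term_
    print(wrap_lookaround(sample))  # prints: _term_ _term_ _term_
```

The space after the first occurrence is swallowed by that match, so the second occurrence has nothing left to satisfy the leading group. Like the first sed command, the lookaround version still leaves _term and term_ alone.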

0

John kugelman Nov 03 '09 at 10:02

This sorts first by length, then in reverse alphabetical order:

 for mask in `tr -c "\n" "." < $FILE | sort -ur`
 do
     grep "^$mask$" $FILE | sort -r
 done

The tr command replaces every character in $FILE with a period, which matches any single character in grep. Each resulting mask therefore selects all lines of one particular length.

0

martin clayton Nov 03 '09 at 10:11

Johannes Hoff · Accepted Answer · 2009-11-03T22:08:55+0000

You can compress all this into one regular expression:

 $ sed -e 's/\(aaba\|aa\|abba\)/_\1_/g'
 testing words aa, aaba, abba.
 testing words _aa_, _aaba_, _abba_.

If I understand your question correctly, this solves both of your problems: it avoids "double replacement" and it always matches the longest word.
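
If you end up doing this in Python instead, a rough sketch of the same single-pass idea could look like the one below. The names `build_pattern` and `wrap_terms` are made up for illustration. One difference to be aware of: Python's re module takes the first alternative that matches rather than the longest, so the sketch orders the terms longest first itself instead of relying on the regex engine.

```python
import re


def build_pattern(terms):
    """Compile one alternation of all terms, longest term first."""
    # Longest first, so "aaba" is tried before "aa" at the same position.
    ordered = sorted(set(terms), key=lambda t: (len(t), t), reverse=True)
    # re.escape keeps terms containing regex metacharacters literal.
    return re.compile("|".join(re.escape(t) for t in ordered))


def wrap_terms(text, terms):
    """Wrap every matched term in underscores in a single pass."""
    # Replaced text is never rescanned, so nothing gets wrapped twice.
    return build_pattern(terms).sub(lambda m: "_" + m.group(0) + "_", text)


if __name__ == "__main__":
    print(wrap_terms("testing words aa, aaba, abba.", ["aaba", "aa", "abba"]))
    # prints: testing words _aa_, _aaba_, _abba_.
```

The sketch assumes a non-empty list of terms; an empty list would produce an empty pattern that matches everywhere.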
