Awk: åäö umlaut characters have a length of 2

I use awk (on Mac OS X) to print only lines that are at least n characters long.

When I run it on a text file (strings.txt) that looks like this:

four
foo
bar
föö
bår
fo
ba
fö
bå

with this awk script:

awk ' { if( length($0) >= 3 ) print $0 } ' <strings.txt 

the output is:

four
foo
bar
föö
bår
fö
bå

(The last two lines should not be printed.) Characters such as å, ä and ö seem to be counted as two characters each.

(The input file is saved in UTF-8 format.)

3 answers

Try setting the locale:

LC_ALL=en_US.UTF-8 awk 'length >= 3' infile

Replace en_US.UTF-8 with a UTF-8 locale that is installed on your system.
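To see which UTF-8 locales are actually installed, the output of locale -a can be filtered. This is a minimal sketch assuming a POSIX shell; first_utf8_locale is a made-up helper name, not a standard command:

```shell
# Hypothetical helper: read a list of locale names on stdin and print the
# first one whose name mentions UTF-8 (spelled "UTF-8" or "utf8").
first_utf8_locale() {
    grep -i 'utf-*8' | head -n 1
}

# Typical use (the result depends on the locales installed on the machine):
#   locale -a | first_utf8_locale
#   LC_ALL=$(locale -a | first_utf8_locale) awk 'length >= 3' infile
```

Whether the setting changes what length returns still depends on the awk implementation in use.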


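BSD awk (aka BWK awk), as used on macOS (still true as of macOS 10.13), is not Unicode-aware: it counts each byte of a multi-byte UTF-8 character as a separate character.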
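The byte counting can be made visible with a short check. The sketch below assumes a POSIX shell and writes the word "föö" with octal escapes so that the bytes do not depend on how the script itself is saved; the variable names are only illustrative:

```shell
# "föö" spelled with octal escapes so the bytes are unambiguous:
# "ö" (U+00F6) is the two bytes 0xC3 0xB6 in UTF-8, i.e. octal 303 266.
word=$(printf 'f\303\266\303\266')

# wc -c counts bytes in any locale; tr strips the padding some wc versions add.
bytes=$(printf '%s' "$word" | wc -c | tr -d ' ')
echo "$word is $bytes bytes long"

# A byte-counting awk reports the same number. LC_ALL=C forces byte counting
# here so the result does not depend on which awk is installed.
awk_len=$(printf '%s\n' "$word" | LC_ALL=C awk '{ print length($0) }')
echo "awk in the C locale says: $awk_len"
```

Three characters take five bytes, which is why a two-character word such as "fö" passes a length >= 3 test in a byte-counting awk.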

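Workarounds:

  • If you know that all input characters can be represented in a single-byte encoding such as ISO-8859-1, convert with iconv: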

    iconv -f UTF-8 -t ISO-8859-1 file | awk 'length >= 3' | iconv -f ISO-8859-1 -t UTF-8
    
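  • Switch to another awk implementation, such as gawk (GNU Awk), which is Unicode-aware, or mawk (check that your mawk version counts characters rather than bytes); both are available via Homebrew: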
    • brew info gawk
    • brew info mawk
  • In this simple case, use a different utility that is Unicode-aware, such as sed:

    sed -n '/^.\{3,\}/p' file
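
The iconv workaround from the list above can be wrapped in a small function. This is only a sketch: min_chars_latin1 is a hypothetical name, and LC_ALL=C is an added assumption that makes awk count bytes no matter which implementation is installed. iconv stops with an error if the input contains a character that ISO-8859-1 cannot represent.

```shell
# Hypothetical helper: print the lines of UTF-8 input on stdin that are at
# least $1 characters long, by counting bytes in a single-byte encoding.
min_chars_latin1() {
    iconv -f UTF-8 -t ISO-8859-1 |
        LC_ALL=C awk -v n="$1" 'length($0) >= n' |
        iconv -f ISO-8859-1 -t UTF-8
}

# Input lines: four, föö, fö, fo (umlauts written as octal escapes).
printf 'four\nf\303\266\303\266\nf\303\266\nfo\n' | min_chars_latin1 3
# prints "four" and "föö"
```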
    

Try the following:

$  echo "four
foo
bar
föö
bår
fo
ba
fö
bå
"|awk ' {x=$0;gsub(/./,"x",x); if( length(x) >= 3 ) print $0 } ' 

Output

four
foo
bar
föö
bår