How do I make Perl DWIM with UTF-8?

The utf8 pragma and UTF-8 encoding layers on filehandles confuse me. For example, take this seemingly simple code:

use utf8; print qq[fü]; 

To be clear, a hex dump of "fü" in the source file is 66 c3 bc, which, if I'm not mistaken, is correct UTF-8.

The script prints 66 fc, which is not UTF-8 but the bare Unicode code point, or possibly Latin-1. Remove use utf8 and I get 66 c3 bc. This is the opposite of what I expect.

Now add an encoding layer to the filehandle.

 use utf8; binmode *STDOUT, ':encoding(utf8)'; print qq[fü]; 

Now I get 66 c3 bc. But remove use utf8 and I get 66 c3 83 c2 bc, which makes no sense to me.

What is the right way to make my code DWIM with UTF-8?

PS: My locale is set to "en_US.UTF-8" and I am using Perl 5.10.1.

perl utf-8
2 answers

use utf8; tells Perl that your source code is encoded in UTF-8, so string literals are decoded into characters. By adding

 binmode *STDOUT, ':encoding(utf8)'; print qq[fü]; 

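you ask that the output of the script also be encoded in UTF-8. Without that layer, Perl writes the character ü (U+00FC) as the single byte fc, which is the 66 fc you saw.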

If you wrote

 print "f\x{00FC}\n"; 

you would not need use utf8; at all, because the source would then contain only ASCII.
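The four outputs from the question can be reproduced without depending on how the source file is saved. This is a minimal sketch, assuming the core Encode module stands in for the :encoding(utf8) layer; hex_bytes is a hypothetical helper, not part of Perl.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a string as space-separated hex, one value per character.
sub hex_bytes {
    my ($string) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $string;
}

# With "use utf8", the literal "fü" becomes two characters: f and U+00FC.
my $chars  = "f\x{FC}";
# Without "use utf8", the same literal is three separate octets.
my $octets = "f\xC3\xBC";

# No output layer: each character below 0x100 goes out as one byte.
print hex_bytes($chars),  "\n";                     # 66 fc
print hex_bytes($octets), "\n";                     # 66 c3 bc

# A UTF-8 output layer encodes every character it is given.
print hex_bytes(encode('UTF-8', $chars)),  "\n";    # 66 c3 bc
print hex_bytes(encode('UTF-8', $octets)), "\n";    # 66 c3 83 c2 bc
```

The last line is the double encoding from the question: the octets c3 and bc are treated as the characters U+00C3 and U+00BC, and each is encoded again.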


use utf8; just indicates that your source code (including string literals) is in UTF-8. You also need to set the encoding of your input and output streams.

You probably want to set the variable PERL_UNICODE in your environment. I set it to SAL, which breaks down as follows:

  • S: STDIN, STDOUT, and STDERR are UTF-8
  • A: @ARGV is UTF-8
  • L: apply the above only under a UTF-8 locale

See PERL_UNICODE and the -C option in perlrun.
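To check whether the setting took effect, one option is to print the PerlIO layers on the standard handles. This is a rough sketch: layers_for is a hypothetical helper, while PerlIO::get_layers and ${^UNICODE} are core Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: the PerlIO layers on a filehandle, space-separated.
sub layers_for {
    my ($fh) = @_;
    return join ' ', PerlIO::get_layers($fh);
}

print 'STDIN:  ', layers_for(\*STDIN),  "\n";
print 'STDOUT: ', layers_for(\*STDOUT), "\n";
print 'STDERR: ', layers_for(\*STDERR), "\n";

# ${^UNICODE} holds the numeric form of the -C / PERL_UNICODE flags.
print 'flags:  ', ${^UNICODE}, "\n";
```

Running the script once with PERL_UNICODE unset and once with PERL_UNICODE=SAL should show whether a utf8 layer appears on the handles.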

You can also use the open pragma to set the default encoding layers.
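As a minimal sketch of that approach, assuming UTF-8 is wanted unconditionally for the standard handles and for files opened later in the same scope:

```perl
use strict;
use warnings;
use open qw(:std :utf8);   # UTF-8 layers on STDIN/STDOUT/STDERR and on later open() calls

# U+00FC is written as an escape so the example does not depend on the source encoding.
print "f\x{FC}\n";         # bytes written: 66 c3 bc 0a
```

With this in place the explicit binmode call from the question is no longer needed.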

If you do this in a module that you distribute to others, you probably want to

 use open ':locale'; 

so that it will not unexpectedly enable UTF-8 for people who are not using a UTF-8 locale.
