Perl "use encoding" pragma breaks UTF-8 strings

I have a problem with Perl and the encoding pragma.

(I use UTF-8 everywhere: on input, on output, and in the Perl scripts themselves. I never want to use any other encoding.)

But when I write

binmode(STDOUT, ':utf8'); use utf8; $r = "\x{ed}"; print $r; 

I see the string "í" (this is what I want: the Unicode character U+00ED). But when I add the use encoding pragma, like this

 binmode(STDOUT, ':utf8'); use utf8; use encoding 'utf8'; $r = "\x{ed}"; print $r; 

all I see is a box symbol. Why?

Also, when I add Data::Dumper and let Dumper print the string, like this

 binmode(STDOUT, ':utf8'); use utf8; use encoding 'utf8'; $r = "\x{ed}"; use Data::Dumper; print Dumper($r); 

I see that perl changed the string to "\x{fffd}". Why?

+7
2 answers

use encoding 'utf8' . Instead of interpreting \x{ed} as a U + 00ED code point, he interprets it as a single byte 237, and then tries to interpret it as UTF-8. Which, of course, fails, so it replaces it with the replacement character U + FFFD, literally "".

Just stick with use utf8 to indicate that your source is in UTF-8, and use binmode or the open pragma to specify the encoding for your filehandles.
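To see the failure mode in isolation, here is a minimal sketch that reproduces it with the core Encode module instead of the encoding pragma itself. The variable names are only for illustration, and the explicit decode call is my stand-in for what the pragma does to the literal, not the pragma's actual code path.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# As a character, "\x{ed}" is code point U+00ED; encoded as UTF-8 it is C3 AD.
my $char  = "\x{ed}";
my $bytes = encode('UTF-8', $char);
print "U+00ED encoded as UTF-8: ", unpack("H*", $bytes), "\n";

# Treated as a lone byte 0xED and decoded as UTF-8, it is malformed,
# so Encode (with the default CHECK of 0) substitutes U+FFFD.
my $lone_byte = "\xED";
my $decoded   = decode('UTF-8', $lone_byte);
printf "byte ED decoded as UTF-8: U+%04X\n", ord($decoded);
```

The second result matches the "\x{fffd}" that Data::Dumper showed in the question.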

+9

For your actual code, neither use encoding nor use utf8 is required for correct operation; the only thing it depends on is the encoding layer on STDOUT.

 binmode(STDOUT, ":utf8"); print "\xed"; 

is an equally valid complete program that does what you want.
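If you want to check that the layer alone does the work, the following sketch prints the same character to two in-memory filehandles, one with the :utf8 layer and one without, and compares the resulting bytes. The in-memory buffers are just my stand-in for STDOUT so that the bytes can be inspected.

```perl
use strict;
use warnings;

# Filehandle with the :utf8 layer, writing into an in-memory buffer.
open(my $with_layer, '>:utf8', \my $encoded) or die "open failed: $!";
print {$with_layer} "\xed";
close($with_layer) or die "close failed: $!";

# Same print, but with no encoding layer at all.
open(my $no_layer, '>', \my $raw) or die "open failed: $!";
print {$no_layer} "\xed";
close($no_layer) or die "close failed: $!";

print "with :utf8 layer: ", unpack("H*", $encoded), "\n";   # c3ad
print "without layer:    ", unpack("H*", $raw), "\n";       # ed
```

Neither use utf8 nor use encoding appears here, and the output through the layer is still valid UTF-8.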

use utf8 should only be used if you have UTF-8 in string literals in your program, for example, if you wrote

 my $r = "í"; 

then use utf8 will cause this string to be interpreted as the single character U+00ED instead of the two-byte sequence C3 AD.
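A quick way to observe the difference is to compare the length of the same literal with and without the pragma. This sketch assumes the script file itself is saved as UTF-8; the variable names are only for illustration.

```perl
use strict;
use warnings;

my ($chars, $bytes);

{
    use utf8;                 # literal is decoded: one character, U+00ED
    $chars = length("í");
}

{
    no utf8;                  # literal is taken as raw bytes: C3 AD
    $bytes = length("í");
}

print "with use utf8:    $chars\n";   # 1
print "without use utf8: $bytes\n";   # 2
```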

use encoding should never be used, especially by people who care about Unicode. If you want to change the encoding of STDIN/STDOUT, you should use -C, PERLUNICODE, or binmode them yourself, and if you want other filehandles to be opened with encoding layers automatically, you should use the open pragma.
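As a rough sketch of the open pragma route, the following writes U+00ED to a temporary file through a handle that gets its UTF-8 layer from the pragma, then reads the file back both raw and decoded. The temporary file and the variable names are my own scaffolding for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use open ':encoding(UTF-8)';   # default layer for open() calls in this scope

# Get a scratch file name; the handle from tempfile() itself is not needed.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close($tmp_fh);

# No layer given in the mode, so the pragma's default applies.
open(my $out, '>', $path) or die "open failed: $!";
print {$out} "\x{ed}";
close($out) or die "close failed: $!";

# An explicit :raw layer overrides the default, exposing the bytes on disk.
open(my $raw, '<:raw', $path) or die "open failed: $!";
my $bytes = do { local $/; <$raw> };
close($raw);

# Reading with the default layer decodes the bytes back to one character.
open(my $in, '<', $path) or die "open failed: $!";
my $text = do { local $/; <$in> };
close($in);

print "bytes on disk: ", unpack("H*", $bytes), "\n";   # c3ad
printf "read back:     U+%04X\n", ord($text);          # U+00ED
```

For the standard handles, running the script as perl -CS script.pl (or setting PERLUNICODE=S) applies UTF-8 to STDIN, STDOUT, and STDERR without any change to the source.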

+5