Perl ord and chr working with Unicode

To my horror, I just found out that chr does not work with Unicode, although it does something. The manual page is not clear.

Returns the character represented by that NUMBER in the character set. For example, chr(65) is "A" in either ASCII or Unicode, and chr(0x263a) is a Unicode smiley face.

Indeed, I can print a smiley using

 perl -e 'print chr(0x263a)' 

but things like chr(0x00C0) don't work. I see that my perl v5.10.1 is a little ancient, but when I paste different strange letters into the source code, everything is fine.

I tried funny things like use utf8 and use encoding 'utf8'. I didn't try funny things like use v5.12 and use feature 'unicode_strings', since they don't work with my version. I fooled around with Encode::decode, only to find out that I need no decoding, since I have no byte array to decode. I read much more documentation than ever before and found many interesting things, but nothing useful. This looks like a case of the Unicode Bug, but no usable solution is given. Moreover, I don't care about whole-string semantics; all I need is a trivial function.

So, how can I convert a number to a string consisting of one character corresponding to it, so that, for example, real_chr(0xC0) eq 'À' holds?


The first answer I received explains everything about IO, but I still don't understand why

 #!/usr/bin/perl -w
 use strict;
 use utf8;
 use encoding 'utf8';
 print chr(0x00C0) eq 'À' ? 'eq1' : 'ne1', " - ", chr(0x263a) eq '☺' ? 'eq1' : 'ne1', "\n";
 print 'À' =~ /\w/ ? "match1" : "no_match1", " - ", chr(0x00C0) =~ /\w/ ? "match2" : "no_match2", "\n";

prints

 ne1 - eq1
 match1 - no_match2

This means that the manually entered 'À' differs from chr(0x00C0). Moreover, the former is a word character (correct!), while the latter is not (but it should be!).

1 answer

First,

 perl -le'print chr(0x263A);' 

is a bug. Perl even tells you as much:

 Wide character in print at -e line 1. 

That does not qualify as "working". So, while they differ in how they fail to provide what you want, neither of the following gives you what you want:

 perl -le'print chr(0x263A);'
 perl -le'print chr(0x00C0);'

To correctly output the UTF-8 encoding of these Unicode code points, you need to tell Perl to encode the code points as UTF-8.

 $ perl -le'use open ":std", ":encoding(UTF-8)"; print chr(0x263A);'
 ☺
 $ perl -le'use open ":std", ":encoding(UTF-8)"; print chr(0x00C0);'
 À

Now about the why.

File handles can only transmit bytes, so unless you tell it otherwise, Perl file handles expect bytes. This means that the string you give to print cannot contain anything but bytes, or, in other words, it cannot contain characters above 255. The output is exactly what you provide:

 $ perl -e'print map chr, 0x00, 0x65, 0xC0, 0xF0' | od -t x1
 0000000 00 65 c0 f0
 0000004

This is useful. It is different from what you want, but that does not make it wrong. If you want something different, you just need to tell Perl what you want.

By adding an :encoding layer, the handle now expects a string of Unicode characters or, as I call it, "text". The layer tells Perl how to convert the text into bytes.

 $ perl -e'
      use open ":std", ":encoding(UTF-8)";
      print map chr, 0x00, 0x65, 0xC0, 0xF0, 0x263a;
   ' | od -t x1
 0000000 00 65 c3 80 c3 b0 e2 98 ba
 0000011
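The same conversion can be observed without a shell pipeline. The following is a minimal sketch, assuming an in-memory file handle as a stand-in for STDOUT and the core Encode module; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Text: a string of Unicode code points (the same values as in the od example).
my $text = join '', map { chr($_) } 0x00, 0x65, 0xC0, 0xF0, 0x263A;

# Print the text to an in-memory file handle that has a UTF-8 encoding layer.
my $buffer = '';
open(my $fh, '>:encoding(UTF-8)', \$buffer)
    or die "Cannot open in-memory handle: $!";
print {$fh} $text;
close($fh) or die "Cannot close in-memory handle: $!";

# $buffer now holds bytes; show them in hex, much like od -t x1 does.
my $hex = join ' ', map { sprintf '%02x', ord($_) } split //, $buffer;
print "$hex\n";    # 00 65 c3 80 c3 b0 e2 98 ba

# Encoding the text by hand produces the same bytes as the layer did.
my $same = $buffer eq encode('UTF-8', $text) ? 'same' : 'different';
print "$same\n";   # same
```

The layer does on every print what encode does explicitly, which is why the handle can accept characters above 255 once the layer is in place.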

You are right that chr does not know or care about Unicode. Like length, substr, ord and reverse, chr implements a basic string function, not a Unicode function. That does not mean it cannot be used to work with text strings. As you saw, the problem was not with chr, but with what you did with the string after it was created.

A character is an element of a string, and a character is a number. This means that a string is just a sequence of numbers. Whether you treat those numbers as Unicode code points (text), packed IP addresses or temperature measurements is entirely up to you and the functions to which you pass the strings.
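To make the "sequence of numbers" point concrete, here is a small sketch that builds one string and reads it two ways. The sample values are an assumption chosen for illustration; inet_ntoa comes from the core Socket module.

```perl
use strict;
use warnings;
use Socket qw(inet_ntoa);

# Build one four-character string from four numbers.
my $string = join '', map { chr($_) } 192, 168, 0, 1;

# Reading the numbers back gives exactly what went in.
my @numbers = map { ord($_) } split //, $string;
print "@numbers\n";        # 192 168 0 1

# Interpretation 1: the string is a packed IPv4 address.
my $as_ip = inet_ntoa($string);
print "$as_ip\n";          # 192.168.0.1

# Interpretation 2: the string is text, so each number is a code point.
my @code_points = map { sprintf 'U+%04X', $_ } @numbers;
print "@code_points\n";    # U+00C0 U+00A8 U+0000 U+0001
```

The string itself does not record which reading is intended; only the function receiving it decides.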

Here are a few examples of operators that assign meaning to the strings they receive as operands:

  • m// expects a string of Unicode code points.
  • connect expects a sequence of bytes representing a sockaddr_in structure.
  • print to a handle without an :encoding layer expects a sequence of bytes.
  • print to a handle with an :encoding layer expects a sequence of Unicode code points.
  • etc.

So, how can I convert a number to a string consisting of one character corresponding to it, so that, for example, real_chr(0xC0) eq 'À' holds?

chr(0xC0) eq 'À' does hold. Did you remember to tell Perl that you encoded your source code using UTF-8, by using use utf8;? If you did not tell Perl, it actually sees a two-character string on the RHS.
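The difference can be sketched without putting any non-ASCII literal in the source. The example below is an illustration under one assumption: the two bytes 0xC3 0x80 stand in for what a UTF-8 encoded source file contains between the quotes, and utf8::decode stands in for the decoding that use utf8; applies to literals.

```perl
use strict;
use warnings;

# The two bytes that a UTF-8 encoded source file holds for the letter U+00C0.
# Without "use utf8;", Perl treats the literal as these two characters.
my $without_utf8 = "\xC3\x80";

# With "use utf8;", Perl decodes the literal to the single character U+00C0.
# utf8::decode performs the same decoding in place on a copy.
my $with_utf8 = $without_utf8;
utf8::decode($with_utf8) or die "Not valid UTF-8";

printf "without use utf8: length %d, eq chr(0xC0): %s\n",
    length($without_utf8), (chr(0xC0) eq $without_utf8 ? 'yes' : 'no');
printf "with use utf8:    length %d, eq chr(0xC0): %s\n",
    length($with_utf8), (chr(0xC0) eq $with_utf8 ? 'yes' : 'no');
```

The first line reports length 2 and "no", the second length 1 and "yes", which matches the ne1 result seen when the comparison runs against an undecoded literal.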


Regarding the question you added:

There are problems with the encoding pragma. I recommend against using it. Instead, use

 use open ':std', ':encoding(UTF-8)'; 

This will fix one of the problems. Another problem you are facing is

 chr(0x00C0) =~ /\w/ 

This is a known bug that has intentionally been left unfixed for backward compatibility reasons. That is, unless you request a more recent version of the language as follows:

 use 5.014; # use 5.012; *might* suffice. 

A workaround that works as far back as 5.8:

 my $x = chr(0x00C0);
 utf8::upgrade($x);
 $x =~ /\w/
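Wrapped up as the trivial function the question asks for, the workaround might look like the sketch below. real_chr is the hypothetical name taken from the question, not a built-in, and the sketch assumes no unicode_strings feature is in effect.

```perl
use strict;
use warnings;

# real_chr is the hypothetical name from the question, not a built-in.
sub real_chr {
    my ($code_point) = @_;
    my $char = chr($code_point);
    utf8::upgrade($char);    # change the internal storage so Unicode rules apply
    return $char;
}

my $plain    = chr(0xC0);
my $upgraded = real_chr(0xC0);

# Upgrading changes only the internal representation, not the character.
print $plain eq $upgraded ? "same character\n" : "different characters\n";

# The upgraded string is matched with Unicode semantics.
print $upgraded =~ /\w/ ? "match\n" : "no_match\n";
```

Because utf8::upgrade leaves the string's characters untouched, real_chr can be used as a drop-in replacement for chr wherever the result is later treated as text.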

Source: https://habr.com/ru/post/924683/