What is the correct way to get a grapheme?

Why does this print a U and not a Ü?

    #!/usr/bin/env perl
    use warnings;
    use 5.014;
    use utf8;
    binmode STDOUT, ':utf8';
    use charnames qw(:full);

    my $string = "\N{LATIN CAPITAL LETTER U}\N{COMBINING DIAERESIS}";

    while ( $string =~ /(\X)/g ) {
        say $1;
    }

    # Output: U
4 answers

This works for me, although I have an older version of Perl, 5.012, on Ubuntu. My only change to your script: use 5.012;

    $ perl so.pl
    Ü

The code is correct.

You really need to go by the numbers here; do not trust what a "terminal" displays. Pass the output through the uniquote program, possibly with -x or -v, and see what it really contains.

Eyes deceive, and programs are even worse. Your terminal program is buggy and is lying to you. Normalization should not matter.

    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say "crème brûlée"'
    crème brûlée

    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say "crème brûlée"' | uniquote -x
    cr\x{E8}me br\x{FB}l\x{E9}e

    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFD "crème brûlée"'
    crème brûlée

    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFD "crème brûlée"' | uniquote -x
    cre\x{300}me bru\x{302}le\x{301}e

    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFC scalar reverse NFD "crème brûlée"'
    éel̂urb em̀erc

    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFC scalar reverse NFD "crème brûlée"' | uniquote -x
    \x{E9}el\x{302}urb em\x{300}erc

    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say scalar reverse NFD "crème brûlée"'
    éel̂urb em̀erc

    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say scalar reverse NFD "crème brûlée"' | uniquote -x
    e\x{301}el\x{302}urb em\x{300}erc
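If uniquote is not installed, the same "go by the numbers" check can be approximated in a few lines of Perl. The following is only a rough sketch of the escaping seen in the session above, not a reimplementation of uniquote; the helper name escape_non_ascii is made up for illustration, and the input is assumed to be an already decoded character string.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of uniquote): replace every non-ASCII
# character in a decoded string with a \x{HEX} escape of its code point.
sub escape_non_ascii {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('\\x{%X}', ord $1)/ge;
    return $text;
}

# Decomposed (NFD) spelling, written with explicit escapes so the source
# file's own encoding cannot interfere.
print escape_non_ascii("cre\x{300}me bru\x{302}le\x{301}e"), "\n";
# prints: cre\x{300}me bru\x{302}le\x{301}e
```

Because the output is pure ASCII, it reads the same on any terminal, which is the point of the exercise.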

Could it be the output that is wrong? That is easy to verify: replace the loop code as follows:

    my $counter;
    while ( $string =~ /(\X)/g ) {
        say ++$counter, ': ', $1;
    }

... and see how many times the regular expression matches. I suspect it will still match only once.

Alternatively, you can use this code:

    use Encode;

    sub codepoint_hex {
        sprintf "%04x", ord Encode::decode("UTF-8", shift);
    }

... and then print codepoint_hex($1) instead of just $1 inside the while loop.
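Since $1 in the question's script is already a character string rather than raw UTF-8 bytes, a variant that skips the decoding step and lists every code point in the match may be more informative. This is a minimal sketch under that assumption; the helper name codepoints_hex is hypothetical, and the test string is written with an explicit escape instead of \N{...} names.

```perl
use strict;
use warnings;

# Hypothetical helper: list every code point of a decoded character
# string, e.g. "U+0055 U+0308".
sub codepoints_hex {
    my ($chars) = @_;
    return join ' ', map { sprintf('U+%04X', ord $_) } split //, $chars;
}

my $string = "U\x{308}";    # LATIN CAPITAL LETTER U + COMBINING DIAERESIS
my $count  = 0;

while ( $string =~ /(\X)/g ) {
    printf "%d: %s\n", ++$count, codepoints_hex($1);
}
# prints: 1: U+0055 U+0308
```

A single line of output listing both code points shows that \X captured the whole grapheme, whatever the terminal chooses to draw.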


1) Apparently, your terminal cannot display extended characters. On my terminal, it prints:

    U¨

2) \X does not do what you think. It just matches the characters that belong together. If you use the string "fu\N{COMBINING DIAERESIS}r", your program will display:

    f
    u¨
    r

Note that the diacritical mark is not printed separately, but together with its base character.

3) To combine all related characters into one, use the Unicode::Normalize module:

    use Unicode::Normalize;

    my $string = "fu\N{COMBINING DIAERESIS}r";
    $string = NFC($string);

    while ( $string =~ /(\X)/g ) {
        say $1;
    }

This displays:

    f
    ü
    r
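To see what normalization does and does not change here, one can compare the number of code points with the number of \X matches before and after NFC. The sketch below assumes a decoded string; cluster_count is a made-up helper name.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: count the extended grapheme clusters (\X matches)
# in a decoded character string.
sub cluster_count {
    my ($chars) = @_;
    my $count = () = $chars =~ /\X/g;
    return $count;
}

my $decomposed = "fu\x{308}r";        # u followed by COMBINING DIAERESIS
my $composed   = NFC($decomposed);    # u + diaeresis becomes U+00FC

printf "decomposed: %d code points, %d clusters\n",
    length($decomposed), cluster_count($decomposed);
printf "composed:   %d code points, %d clusters\n",
    length($composed), cluster_count($composed);
# prints:
# decomposed: 4 code points, 3 clusters
# composed:   3 code points, 3 clusters
```

The cluster count is the same in both forms; only the number of code points inside the middle cluster differs, so normalizing mainly helps terminals that draw combining marks poorly.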
