What is the correct way to get a grapheme?

Question

What is the correct way to get a grapheme?

Why does this print U and not Ü?

    #!/usr/bin/env perl
    use warnings;
    use 5.014;
    use utf8;
    binmode STDOUT, ':utf8';
    use charnames qw(:full);

    my $string = "\N{LATIN CAPITAL LETTER U}\N{COMBINING DIAERESIS}";

    while ( $string =~ /(\X)/g ) {
        say $1;
    }

    # Output: U

+7

regex perl unicode grapheme

sid_com Feb 24 '12 at 10:10

4 answers

The code is correct.

You really need to play these things by the numbers; do not believe what a "terminal" displays. Pass the output through the uniquote program, possibly with -x or -v, and see what it is really doing.

Eyes deceive, and programs are even worse. Your terminal program is buggy and is lying to you. Normalization should not matter.

    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say "crème brûlée"'
    crème brûlée
    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say "crème brûlée"' | uniquote -x
    cr\x{E8}me br\x{FB}l\x{E9}e
    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFD "crème brûlée"'
    crème brûlée
    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFD "crème brûlée"' | uniquote -x
    cre\x{300}me bru\x{302}le\x{301}e
    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFC scalar reverse NFD "crème brûlée"'
    éel̂urb em̀erc
    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFC scalar reverse NFD "crème brûlée"' | uniquote -x
    \x{E9}el\x{302}urb em\x{300}erc
    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say scalar reverse NFD "crème brûlée"'
    éel̂urb em̀erc
    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say scalar reverse NFD "crème brûlée"' | uniquote -x
    e\x{301}el\x{302}urb em\x{300}erc
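If uniquote is not at hand, a similar check can be sketched in plain Perl. The helper below is hypothetical (the name `escape_non_ascii` is made up for this example and is not part of uniquote); it only mimics the `\x{...}` style of the output above by escaping every non-ASCII character, so the code points can be read without trusting the terminal.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of uniquote): replace every non-ASCII
# character with a \x{HEX} escape so the code points become visible.
sub escape_non_ascii {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('\\x{%X}', ord $1)/ge;
    return $text;
}

my $composed   = "cr\x{E8}me";    # precomposed e with grave (NFC form)
my $decomposed = "cre\x{300}me";  # e followed by COMBINING GRAVE ACCENT (NFD form)

print escape_non_ascii($composed),   "\n";   # prints cr\x{E8}me
print escape_non_ascii($decomposed), "\n";   # prints cre\x{300}me
```

Both strings may look identical on screen, but the escaped output shows which form is actually being emitted.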

+8

tchrist Feb 24 '12 at 12:02

May I suggest that it is the output that is wrong? This is easy to verify: replace the loop code with the following:

    my $counter;
    while ( $string =~ /(\X)/g ) {
        say ++$counter, ': ', $1;
    }

... and see how many times the regular expression matches. I suspect it will still match only once.

Alternatively, you can use this code:

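    use Encode;

    # Note: decode() expects a UTF-8 byte string, and ord() returns only
    # the first code point of the result.
    sub codepoint_hex {
        sprintf "%04x", ord Encode::decode("UTF-8", shift);
    }

... and then print codepoint_hex($1) instead of just $1 inside the while loop.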
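As a variation on the two checks above, here is a minimal sketch that counts the `\X` matches and lists every code point inside each match. The helper name `cluster_codepoints` is hypothetical, and the sketch assumes the input is already a decoded character string (as it is in the question's script), so no `Encode` call is involved.

```perl
use strict;
use warnings;

# Hypothetical helper: return one "U+XXXX U+YYYY ..." string per
# grapheme cluster matched by \X.
sub cluster_codepoints {
    my ($text) = @_;
    my @clusters;
    while ( $text =~ /(\X)/g ) {
        my $cluster = $1;
        push @clusters,
            join ' ', map { sprintf( 'U+%04X', ord($_) ) } split //, $cluster;
    }
    return @clusters;
}

my $string   = "U\x{308}";   # U followed by COMBINING DIAERESIS
my @clusters = cluster_codepoints($string);

# One line per match: the match number, then its code points.
printf "%d: %s\n", $_ + 1, $clusters[$_] for 0 .. $#clusters;
# prints "1: U+0055 U+0308"
```

A single numbered line containing both code points indicates that `\X` matched once and captured the base letter together with its combining mark.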

+1

raina77ow Feb 24 '12 at 10:49

1) Apparently, your terminal cannot display extended characters. On my terminal, it prints:

U¨

2) \X does not do what you think. It just selects the characters that belong together. If you use the string "fu\N{COMBINING DIAERESIS}r", your program will display:

    f
    u¨
    r

Note that the diacritical mark is not printed on its own line, but together with its base character.

3) To combine all related characters into one, use the Unicode::Normalize module:

    use Unicode::Normalize;

    my $string = "fu\N{COMBINING DIAERESIS}r";
    $string = NFC($string);

    while ( $string =~ /(\X)/g ) {
        say $1;
    }

Displayed:

    f
    ü
    r
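To see what NFC changes and what it does not, the following sketch compares the number of code points with the number of `\X` matches before and after normalization. The helper name `grapheme_count` is hypothetical; `NFC` comes from the core Unicode::Normalize module used above.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: count extended grapheme clusters with \X.
sub grapheme_count {
    my ($text) = @_;
    my $count = () = $text =~ /\X/g;
    return $count;
}

my $decomposed = "fu\x{308}r";      # f, u, COMBINING DIAERESIS, r
my $composed   = NFC($decomposed);  # f, U+00FC (precomposed u with diaeresis), r

printf "decomposed: %d code points, %d graphemes\n",
    length($decomposed), grapheme_count($decomposed);
printf "composed:   %d code points, %d graphemes\n",
    length($composed), grapheme_count($composed);
# prints "decomposed: 4 code points, 3 graphemes"
# prints "composed:   3 code points, 3 graphemes"
```

Normalization changes how many code points the string holds, but the number of `\X` matches stays the same in both forms.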

+1

Stamm Feb 24 '12 at 10:51

beerbajay · Accepted Answer · 2012-02-24T10:38:51+0000

This works for me, although I have an older version of perl, 5.012, on Ubuntu. My only change to your script: use 5.012;

    $ perl so.pl
    Ü
