Why should you recompose Unicode to NFC on output?

TomC recommends always decomposing Unicode text on input and recomposing it on output ( http://www.perl.com/pub/2012/04/perl-unicode-cookbook-always-decompose-and-recompose.html ).

The first part makes perfect sense to me, but I don't understand why he recommends recomposing on output. Potentially you save a small amount of space if your text is heavy in European accented characters, but you are just pushing the cost onto the next program's decomposition step.
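(For reference, the pattern he recommends boils down to something like this; a minimal sketch using the core Unicode::Normalize module:)

use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

binmode STDIN,  ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';

while (my $line = <STDIN>) {
    $line = NFD($line);    # decompose on input
    # ... all matching and processing happens on the NFD form ...
    print NFC($line);      # recompose on output
}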

Is there some other obvious reason that I am missing?

+8
perl unicode
5 answers

As Ven'Tatsu notes in a comment, there is software that can handle precomposed characters but not decomposed characters. Although the opposite is theoretically possible, I have never seen it in practice and would expect it to be rare.

To display a decomposed character at all, rendering software has to deal with combining diacritics. It is not enough to find them in the font: the renderer must also position the diacritic correctly, using size information for the base character. There are often problems with this, leading to poor rendering, especially if the diacritic is taken from a different font! The result is hardly ever better than simply displaying the typographer-designed glyph for a precomposed character such as "é".

(Rendering software could also analyze the situation and in effect map the decomposed character to the precomposed one, but that requires additional code.)
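In Perl, at least, that mapping is a single call to the core Unicode::Normalize module; a minimal sketch:

use Unicode::Normalize qw(NFC);

my $decomposed  = "e\x{0301}";        # "e" + COMBINING ACUTE ACCENT, two code points
my $precomposed = NFC($decomposed);   # the single precomposed character U+00E9 ("é")

# length($decomposed) == 2, length($precomposed) == 1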

+5

It's pretty simple: most tools have limited Unicode support, and they assume that characters are in NFC form.

For example, people commonly compare strings like this:

perl -CSDA -e'use utf8; if ($ARGV[0] eq "Éric") { ... }'

And, of course, the "É" there is in NFC form (since that is what almost everything produces), so this program only accepts arguments in NFC form.
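You can see the mismatch directly (a minimal sketch; Unicode::Normalize is used here only to manufacture the NFD input):

use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);

my $nfc = "Éric";     # the literal is NFC, as almost every editor produces
my $nfd = NFD($nfc);  # the same text, decomposed

print $nfc eq $nfd ? "equal\n" : "not equal\n";   # prints "not equal"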

+2

It simplifies things for text editors, since the end user expects one visible character to be one character, not several. It also avoids problems with systems that do not treat decomposed characters as "single" characters.

Other than that, I see no particular advantage.

0

You have to normalize so that all your data is in the same form anyway, so why not pick the potentially shorter one?
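For what it's worth, the size difference is easy to demonstrate (a quick sketch with the core Unicode::Normalize module):

use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFC NFD);

my $s = "déjà vu";
printf "NFC: %d code points\n", length(NFC($s));   # 7
printf "NFD: %d code points\n", length(NFD($s));   # 9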

As for imposing yet another decomposition on the next program: remember that you want to be strict in what you output, but liberal in what you accept. :)

0

Tom Christiansen is an active contributor on Stack Overflow and answers a lot of Perl questions. There is a good chance he will answer this one himself.

Certain character sequences, such as ff, can be represented as the two Unicode characters f and f, or as the single ligature character ﬀ (U+FB00). When you decompose your characters, things like ﬀ become two separate characters, which matters for sorting: you want them treated as two separate letters f when you sort.

When you recompose, the f and f go back to being the single ligature character, which matters for display (you want the text nicely typeset) and for editing (you want to edit it as one character).

Unfortunately, my theory falls apart for things like the Spanish ñ. It is represented as the single character U+00F1, and decomposes into U+006E (n) and U+0303 (combining tilde). Perl may have built-in logic to handle both kinds of composed characters.
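For what it's worth, a quick check of what the core Unicode::Normalize module actually does with ñ (a minimal sketch):

use utf8;
use Unicode::Normalize qw(NFD NFC);

my $nfd = NFD("\x{00F1}");          # "n" . "\x{0303}"  (n + combining tilde)
my $nfc = NFC($nfd);                # back to the single character U+00F1 ("ñ")
printf "%vX -> %vX\n", $nfd, $nfc;  # prints "6E.303 -> F1"

(So the round trip does work for ñ. Note that NFC will not recompose the ﬀ ligature, though: that is a compatibility mapping, handled only by NFKC/NFKD.)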

-3
