Unicode Precomposition and Decomposition with Delphi

The Wikipedia entry for Subversion contains a paragraph about problems with various Unicode encoding methods:

While Subversion stores file names as Unicode, it does not specify whether precomposition or decomposition is used for accented characters (such as é). Thus, files added by SVN clients running on some operating systems (such as OS X) use decomposed encoding, while clients running on other operating systems (such as Linux) use precomposed encoding, with the result that these accented characters are not displayed correctly when the local SVN client does not use the same encoding as the client that added the files.

While this describes a specific problem with the Subversion client implementation, I am not sure whether the underlying Unicode composition problem can also occur in ordinary Delphi applications. I assume the problem can only arise if Delphi applications can use both Unicode encoding forms (possibly in Delphi XE2). If so, what can Delphi developers do to avoid it?

3 answers

There is a minor display problem: many fonts used on Windows will not render the decomposed form ideally, with a single combined glyph for both the letter and its diacritic. Instead, the renderer falls back to drawing the base letter and then placing a separate diacritic mark on top of it, which usually produces a less visually pleasing, potentially misaligned grapheme.

However, this is not the issue the Subversion bug referenced by the wiki is about. It is in fact perfectly normal to check in file names to SVN that contain either composed or decomposed character sequences; SVN neither knows nor cares about composition, it simply uses the Unicode code points as-is. As long as the backend file system leaves file names in the same state in which they were put in, all is well.

Windows and Linux have file systems that are equally blind to composition. Mac OS X, unfortunately, is not. Both its HFS+ and UFS file systems perform "normalization to decomposition" before storing an incoming file name, so the file name you read back will not necessarily be the same sequence of Unicode code points you put in.

It is this [IMO: crazy] behavior that confuses SVN, and many other programs, when running on OS X. It is especially likely to bite because Apple happened to choose decomposed (NFD) as its normalization form, whereas most of the rest of the world uses composed (NFC) characters.

(And it's not even true NFD, but an incompatible Apple-specific variant. Joy.)

The best way to cope is, wherever possible, never to rely on the exact file names of stored files. If you only ever read a file with a given name, that's fine, as the name will be normalized to match the file system at that point. But if you read a directory listing and try to match the file names you find there against the file names you were expecting (which is what Subversion does), you will get mismatches.

To correctly match a file name, you will need to detect that you are running on OS X and manually normalize both the file name and your reference string to some normal form (NFC or NFD) before comparing them. You should not do this on other operating systems, which treat the two forms as distinct.
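The answer concerns Delphi, but the idea can be sketched in a few lines of Python with the standard `unicodedata` module (`filenames_equal` and the explicit `platform` parameter are illustrative choices, not a standard API):

```python
import unicodedata

def filenames_equal(a: str, b: str, platform: str) -> bool:
    """Compare two file names, compensating for HFS+-style
    decomposition only when running on macOS ("darwin")."""
    if platform == "darwin":
        # Bring both sides into one normal form before comparing;
        # NFD is used here because that is what HFS+ stores.
        a = unicodedata.normalize("NFD", a)
        b = unicodedata.normalize("NFD", b)
    # On other platforms the two forms are genuinely distinct names.
    return a == b

print(filenames_equal("caf\u00e9", "cafe\u0301", "darwin"))  # True
print(filenames_equal("caf\u00e9", "cafe\u0301", "linux"))   # False
```

In a Delphi application running on Windows Vista or later, the same normalization step can be performed with the Win32 `NormalizeString` API.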


AFAICT, both encodings should produce the same result when displayed, and both are valid Unicode, so I don't see a problem there. The rendering procedure should be able to handle both variants when decomposition is encountered: the single code point é is displayed as-is, while the decomposed sequence e plus a combining acute accent should likewise be rendered as é.

The problem, IMO, is not display but comparison, whether for equality (which fails if the two strings use different encodings) or lexically, i.e. for sorting. That is why you need to normalize to one encoding, as David says. Then there are no more ambiguities.
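The comparison problem can be shown in a few lines (Python used here for brevity; the same applies to Delphi strings):

```python
import unicodedata

precomposed = "caf\u00e9"   # 'café' with the single code point U+00E9
decomposed = "cafe\u0301"   # 'cafe' plus U+0301 COMBINING ACUTE ACCENT

# Both render identically, but raw equality fails.
print(precomposed == decomposed)   # False

# Normalizing both sides to one form (NFC here) restores equality
# and gives a consistent sort order.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)              # True
```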


The same problem can occur in any application that deals with text. How to avoid it depends on what operations the application performs, and there are no specific details in the question. Basically, I think you would address such problems by normalizing the text, which means using one preferred representation whenever you encounter an encoding ambiguity.
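A minimal sketch of that approach, normalizing all incoming text to one preferred form at the application boundary (Python's `unicodedata` used for illustration; `ingest` is a hypothetical helper name, not a library function):

```python
import unicodedata

def ingest(text: str) -> str:
    # Convert every string entering the application to one preferred
    # representation; NFC is the usual choice outside the Apple world.
    return unicodedata.normalize("NFC", text)

# Decomposed input collapses to the precomposed form.
print(ingest("e\u0301") == "\u00e9")  # True
```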

