Unicode normalization in Python: is it right to translate u'\xb4' to u' \u0301'?

Consider the following snippet:

    >>> import unicodedata
    >>> from unicodedata import normalize, name
    >>> normalize('NFKD', u'\xb4')
    u' \u0301'
    >>> normalize('NFKD', u'a\xb4a')
    u'a \u0301a'
    >>> normalize('NFKC', u'a\xb4a')
    u'a \u0301a'
    >>> name(u'\xb4'), name(u'\u0301')
    ('ACUTE ACCENT', 'COMBINING ACUTE ACCENT')

I am trying to figure out whether it is correct for u'\xb4' to be translated to u' \u0301'. Why does normalization turn the acute accent into a space plus a combining accent? Why is u'\xb4' translated at all?

The fileformat entry for this character shows that ACUTE ACCENT was formerly called SPACING ACUTE. I took this to mean that the accent advances the cursor on its own rather than waiting to be combined with the next character entered.

UPD: in case anyone is interested, here is a list of Unicode characters that have a space at the beginning after NFKC normalization: http://pastebin.com/Z99r5AK9
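A list like this can be regenerated with a short script. What follows is only a minimal sketch, assuming Python 3; the function name `nfkc_leading_space` and its `limit` parameter are made up for illustration, and this is not necessarily how the pastebin list was produced. The exact result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata


def nfkc_leading_space(limit=0x110000):
    """Return characters below `limit` whose NFKC form starts with a space.

    `limit` is a hypothetical convenience parameter so that a small range
    can be scanned quickly; the default covers every code point.
    """
    found = []
    for cp in range(limit):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        if ch == ' ':
            continue  # the plain space itself is not interesting
        if unicodedata.normalize('NFKC', ch).startswith(' '):
            found.append(ch)
    return found


# Scan only the Latin-1 range here to keep the output short.
for ch in nfkc_leading_space(0x100):
    print('U+%04X %s' % (ord(ch), unicodedata.name(ch, '<unnamed>')))
```

Calling `nfkc_leading_space()` without an argument scans the whole code space, which takes noticeably longer but should still finish in a few seconds.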

+8
python unicode
3 answers

The acute accent character has a compatibility decomposition into a space followed by a combining acute accent, as specified in the Unicode standard:

    >>> import unicodedata
    >>> unicodedata.decomposition(u'\xb4')
    '<compat> 0020 0301'

The character \u00B4 has a somewhat mixed history: the Unicode standard chose to treat it as space + combining accent, although in practice it was often used as a bare diacritic; see this discussion.
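To see what the `<compat>` tag means in practice, here is a small sketch, assuming Python 3 (where every `str` is Unicode and the `u` prefix is optional). It parses the decomposition field and then runs the character through all four normalization forms; only the compatibility forms (NFKC, NFKD) apply the mapping.

```python
import unicodedata

ch = '\u00b4'  # ACUTE ACCENT

# The decomposition field is an optional tag in angle brackets followed
# by hex code points; a tag marks a compatibility (not canonical) mapping.
tag, *codepoints = unicodedata.decomposition(ch).split()
decomposed = ''.join(chr(int(cp, 16)) for cp in codepoints)
print(tag, ascii(decomposed))  # <compat> ' \u0301'

# Canonical forms leave the character alone; compatibility forms replace it.
for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    print(form, ascii(unicodedata.normalize(form, ch)))
# NFC '\xb4'
# NFD '\xb4'
# NFKC ' \u0301'
# NFKD ' \u0301'
```

So if the goal is to keep u'\xb4' intact, normalizing with NFC or NFD instead of the K forms is one option.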

You could use \u02CA (MODIFIER LETTER ACUTE ACCENT) as an alternative; it does not involve a space and has no decomposition. It is classified as a letter instead, so your mileage may vary.
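The difference between the two characters can be checked directly. A minimal sketch, assuming Python 3; the variable names are only illustrative:

```python
import unicodedata

spacing = '\u00b4'   # ACUTE ACCENT
modifier = '\u02ca'  # MODIFIER LETTER ACUTE ACCENT

for ch in (spacing, modifier):
    print(
        'U+%04X' % ord(ch),
        unicodedata.name(ch),
        unicodedata.category(ch),              # 'Sk' vs 'Lm'
        repr(unicodedata.decomposition(ch)),   # '<compat> 0020 0301' vs ''
        ascii(unicodedata.normalize('NFKC', ch)),
    )
```

Because U+02CA has the category `Lm` (modifier letter), `str.isalpha()` returns `True` for it, which may matter if the text is later tokenized or validated.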

+11

Take a look at the Unicode Collation Algorithm. In particular, note that

Compatibility normalization (NFKC) reduces stand-alone accents to a combination of space + combining accent.
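The same folding applies to other stand-alone accents, not just the acute. A short sketch, assuming Python 3; the four characters are picked by hand from the Latin-1 range as examples:

```python
import unicodedata

# DIAERESIS, MACRON, ACUTE ACCENT, CEDILLA
standalone = ['\u00a8', '\u00af', '\u00b4', '\u00b8']

for ch in standalone:
    folded = unicodedata.normalize('NFKC', ch)
    # folded is a space followed by the matching combining mark
    print(unicodedata.name(ch), '->', ascii(folded), unicodedata.name(folded[1]))
```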

+4

In NFKD, accented characters are stored in a "decomposed" way: first the base character, and then the combining accent: u' \u0301'

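In NFKC, accented characters are stored in a "composed" way wherever a canonical precomposed code point exists. u'\xb4' (short for u'\u00b4') is only a compatibility equivalent of u' \u0301', so NFKC does not restore it; this is why the output in the question is u' \u0301' for both forms.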

Both represent a single accent, which can be thought of as an accent placed over a space character.
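For comparison, here is a small sketch (assuming Python 3) of how the composed and decomposed forms behave for a letter that has a canonical precomposed code point, next to the stand-alone accent from the question:

```python
import unicodedata
from unicodedata import normalize

# 'e' + COMBINING ACUTE ACCENT is canonically equivalent to U+00E9,
# so the C forms compose it and the D forms split it.
e_split = normalize('NFD', '\u00e9')
print(ascii(e_split))                     # 'e\u0301'
print(ascii(normalize('NFC', e_split)))   # '\xe9'
print(ascii(normalize('NFKC', e_split)))  # '\xe9'

# Space + COMBINING ACUTE ACCENT has no canonical composite, so NFKC
# leaves it alone and does not turn it back into U+00B4.
print(ascii(normalize('NFKC', ' \u0301')))  # ' \u0301'
print(ascii(normalize('NFKC', '\u00b4')))   # ' \u0301'
```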

+3