Unicode normalization in Python: is it right to translate u'\xb4' to u' \u0301'?

Question

Consider the following snippet:

    >>> import unicodedata
    >>> from unicodedata import normalize, name
    >>> normalize('NFKD', u'\xb4')
    u' \u0301'
    >>> normalize('NFKD', u'a\xb4a')
    u'a \u0301a'
    >>> normalize('NFKC', u'a\xb4a')
    u'a \u0301a'
    >>> name(u'\xb4'), name(u'\u0301')
    ('ACUTE ACCENT', 'COMBINING ACUTE ACCENT')

I am trying to figure out whether it is correct behavior to translate u'\xb4' to u' \u0301'. Why does it combine the acute accent with a space? Why does it translate u'\xb4' at all?

On fileformat we see that ACUTE ACCENT was formerly called SPACING ACUTE. I thought this meant that the cursor should advance rather than wait for the next character to be entered.

UPD: in case anyone is interested, here is a list of Unicode characters whose NFKC normalization starts with a space: http://pastebin.com/Z99r5AK9
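A list like this can be regenerated with a short scan over the code point range. The following is only a sketch, assuming Python 3 (where `chr` covers the full range); it is not the script that produced the pastebin, and the helper name `nfkc_starts_with_space` is made up for illustration. The exact result depends on the Unicode version bundled with your Python.

```python
import sys
import unicodedata


def nfkc_starts_with_space(limit=sys.maxunicode + 1):
    """Return code points below `limit` whose NFKC form begins with U+0020.

    Hypothetical helper: the space itself and lone surrogates are skipped.
    """
    found = []
    for cp in range(limit):
        if cp == 0x20 or 0xD800 <= cp <= 0xDFFF:
            continue
        if unicodedata.normalize('NFKC', chr(cp)).startswith(' '):
            found.append(cp)
    return found


# Print the matches in the Latin-1 range only, to keep the output short.
for cp in nfkc_starts_with_space(0x100):
    print('U+%04X %s' % (cp, unicodedata.name(chr(cp), '<unnamed>')))
```

Calling `nfkc_starts_with_space()` without an argument scans every code point, which takes noticeably longer.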

+8

python unicode

newtover Dec 19 '12 at 14:48

3 answers

Take a look at the Unicode Collation Algorithm. In particular, it notes that:

Compatibility normalization (NFKC) folds stand-alone accents to a combination of space + combining accent.
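To see that only the compatibility forms do this, one can run the stand-alone accent through all four normalization forms. This is a small sketch, assuming Python 3; the names `ACUTE` and `forms` are chosen here for illustration.

```python
import unicodedata

ACUTE = '\u00b4'  # ACUTE ACCENT, the stand-alone (spacing) accent

# Normalize the single character with each of the four forms.
forms = {form: unicodedata.normalize(form, ACUTE)
         for form in ('NFC', 'NFD', 'NFKC', 'NFKD')}

for form, result in forms.items():
    # Show the resulting code points rather than the glyphs.
    print(form, ['U+%04X' % ord(ch) for ch in result])
```

The canonical forms (NFC, NFD) leave U+00B4 untouched, while both compatibility forms (NFKC, NFKD) return U+0020 followed by U+0301.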

+4

borrible Dec 19 '12 at 14:56

In NFKD, accented characters are stored in decomposed form: first the character to be accented, then the combining accent: u' \u0301'

In NFKC, accented characters are stored in composed form: there is a dedicated Unicode code point, u'\xb4', which is shorthand for u'\u00b4'.

Both represent just a single accent, which can be thought of as an accent over a space character.

+3

glglgl Dec 19 '12 at 14:58

Martijn Pieters · Accepted Answer · 2012-12-19T14:51:05+0000

The acute accent character is defined as the combination of a space and a combining acute accent, as specified in the Unicode standard:

    >>> import unicodedata
    >>> unicodedata.decomposition(u'\xb4')
    '<compat> 0020 0301'
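The decomposition string can be split into its optional tag and its code points, which makes the difference between a compatibility mapping and a canonical one easier to see. This is a rough sketch, assuming Python 3; `parse_decomposition` is a hypothetical helper, not part of `unicodedata`.

```python
import unicodedata


def parse_decomposition(ch):
    """Split unicodedata.decomposition(ch) into (tag, decomposed string).

    The tag is None for canonical mappings, which carry no <...> prefix.
    """
    fields = unicodedata.decomposition(ch).split()
    tag = None
    if fields and fields[0].startswith('<'):
        tag = fields.pop(0).strip('<>')
    return tag, ''.join(chr(int(field, 16)) for field in fields)


compat = parse_decomposition('\u00b4')     # ACUTE ACCENT
canonical = parse_decomposition('\u00e9')  # LATIN SMALL LETTER E WITH ACUTE
print(compat, canonical)
```

U+00B4 carries the `compat` tag, so only NFKC and NFKD apply its mapping; U+00E9 has an untagged (canonical) mapping, which NFD applies as well.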

The character \u00B4 has a somewhat mixed history, but the Unicode standard chose to treat it as space + combining accent, even though it was often used as a bare diacritic; see this discussion.

You could use \u02CA (MODIFIER LETTER ACUTE ACCENT) as an alternative; it is not treated as a space and has no decomposition. It is classified as a letter instead, so your mileage may vary.
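The two characters can be compared directly with `unicodedata`. This is a minimal sketch, assuming Python 3; the `describe` helper is made up for illustration.

```python
import unicodedata


def describe(ch):
    """Collect the properties relevant here for a single character."""
    return {
        'name': unicodedata.name(ch),
        'category': unicodedata.category(ch),
        'decomposition': unicodedata.decomposition(ch),
        'nfkc': unicodedata.normalize('NFKC', ch),
    }


for ch in ('\u00b4', '\u02ca'):
    info = describe(ch)
    print('U+%04X' % ord(ch), info['name'], info['category'],
          repr(info['decomposition']), ascii(info['nfkc']))
```

U+00B4 is in category Sk (modifier symbol) and decomposes, whereas U+02CA is in category Lm (modifier letter) and survives NFKC unchanged. Being a letter, U+02CA will match `str.isalpha()`, which may or may not be what you want.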
