Python.lower doesn't seem to have the correct representation of all Unicode characters (Python 2.7)

I am trying to normalize latin unicode characters in my lowercase forms. I thought I could use the .lower function for a unicode object, but it seems that some parts of Unicode are not covered - in particular, this block has some characters that are not lowercase: 0xa7a0, 0xa7a2, 0xa7b6, 0xa79e, 0xa79a, 0xa790 all return the same character when .lower is called. I do not expect to see many of these characters, so it would be a waste of time to go through all the relevant blocks in order to fix those that are not being converted correctly. Is there a more complete Unicode lowercase function, or is it a locale problem?

+6
source share
1 answer

The latest version of Python 2 (2.7.11) uses version 5.2.0 of the Unicode character database , while the latest version of Python 3 (3.5.1) uses version 8.0.0. Earlier versions of Python 3 use UCD version 6.xx

The difference between the tables between 5.2.0 and 8.0.0 shows the following list of missing characters in the range (A720-A7FF) that interest you:

-A78D;LATIN CAPITAL LETTER TURNED H;Lu;0;L;;;;;N;;;;0265; -A78E;LATIN SMALL LETTER L WITH RETROFLEX HOOK AND BELT;Ll;0;L;;;;;N;;;;; -A78F;LATIN LETTER SINOLOGICAL DOT;Lo;0;L;;;;;N;;;;; -A790;LATIN CAPITAL LETTER N WITH DESCENDER;Lu;0;L;;;;;N;;;;A791; -A791;LATIN SMALL LETTER N WITH DESCENDER;Ll;0;L;;;;;N;;;A790;;A790 -A792;LATIN CAPITAL LETTER C WITH BAR;Lu;0;L;;;;;N;;;;A793; -A793;LATIN SMALL LETTER C WITH BAR;Ll;0;L;;;;;N;;;A792;;A792 -A794;LATIN SMALL LETTER C WITH PALATAL HOOK;Ll;0;L;;;;;N;;;;; -A795;LATIN SMALL LETTER H WITH PALATAL HOOK;Ll;0;L;;;;;N;;;;; -A796;LATIN CAPITAL LETTER B WITH FLOURISH;Lu;0;L;;;;;N;;;;A797; -A797;LATIN SMALL LETTER B WITH FLOURISH;Ll;0;L;;;;;N;;;A796;;A796 -A798;LATIN CAPITAL LETTER F WITH STROKE;Lu;0;L;;;;;N;;;;A799; -A799;LATIN SMALL LETTER F WITH STROKE;Ll;0;L;;;;;N;;;A798;;A798 -A79A;LATIN CAPITAL LETTER VOLAPUK AE;Lu;0;L;;;;;N;;;;A79B; -A79B;LATIN SMALL LETTER VOLAPUK AE;Ll;0;L;;;;;N;;;A79A;;A79A -A79C;LATIN CAPITAL LETTER VOLAPUK OE;Lu;0;L;;;;;N;;;;A79D; -A79D;LATIN SMALL LETTER VOLAPUK OE;Ll;0;L;;;;;N;;;A79C;;A79C -A79E;LATIN CAPITAL LETTER VOLAPUK UE;Lu;0;L;;;;;N;;;;A79F; -A79F;LATIN SMALL LETTER VOLAPUK UE;Ll;0;L;;;;;N;;;A79E;;A79E -A7A0;LATIN CAPITAL LETTER G WITH OBLIQUE STROKE;Lu;0;L;;;;;N;;;;A7A1; -A7A1;LATIN SMALL LETTER G WITH OBLIQUE STROKE;Ll;0;L;;;;;N;;;A7A0;;A7A0 -A7A2;LATIN CAPITAL LETTER K WITH OBLIQUE STROKE;Lu;0;L;;;;;N;;;;A7A3; -A7A3;LATIN SMALL LETTER K WITH OBLIQUE STROKE;Ll;0;L;;;;;N;;;A7A2;;A7A2 -A7A4;LATIN CAPITAL LETTER N WITH OBLIQUE STROKE;Lu;0;L;;;;;N;;;;A7A5; -A7A5;LATIN SMALL LETTER N WITH OBLIQUE STROKE;Ll;0;L;;;;;N;;;A7A4;;A7A4 -A7A6;LATIN CAPITAL LETTER R WITH OBLIQUE STROKE;Lu;0;L;;;;;N;;;;A7A7; -A7A7;LATIN SMALL LETTER R WITH OBLIQUE STROKE;Ll;0;L;;;;;N;;;A7A6;;A7A6 -A7A8;LATIN CAPITAL LETTER S WITH OBLIQUE STROKE;Lu;0;L;;;;;N;;;;A7A9; -A7A9;LATIN SMALL LETTER S WITH OBLIQUE STROKE;Ll;0;L;;;;;N;;;A7A8;;A7A8 -A7AA;LATIN CAPITAL LETTER H WITH HOOK;Lu;0;L;;;;;N;;;;0266; -A7AB;LATIN CAPITAL LETTER REVERSED OPEN E;Lu;0;L;;;;;N;;;;025C; -A7AC;LATIN CAPITAL LETTER SCRIPT G;Lu;0;L;;;;;N;;;;0261; -A7AD;LATIN CAPITAL LETTER L WITH BELT;Lu;0;L;;;;;N;;;;026C; -A7B0;LATIN CAPITAL LETTER TURNED K;Lu;0;L;;;;;N;;;;029E; -A7B1;LATIN CAPITAL LETTER TURNED T;Lu;0;L;;;;;N;;;;0287; -A7B2;LATIN CAPITAL LETTER J WITH CROSSED-TAIL;Lu;0;L;;;;;N;;;;029D; -A7B3;LATIN CAPITAL LETTER CHI;Lu;0;L;;;;;N;;;;AB53; -A7B4;LATIN CAPITAL LETTER BETA;Lu;0;L;;;;;N;;;;A7B5; -A7B5;LATIN SMALL LETTER BETA;Ll;0;L;;;;;N;;;A7B4;;A7B4 -A7B6;LATIN CAPITAL LETTER OMEGA;Lu;0;L;;;;;N;;;;A7B7; -A7B7;LATIN SMALL LETTER OMEGA;Ll;0;L;;;;;N;;;A7B6;;A7B6 -A7F7;LATIN EPIGRAPHIC LETTER SIDEWAYS I;Lo;0;L;;;;;N;;;;; -A7F8;MODIFIER LETTER CAPITAL H WITH STROKE;Lm;0;L;<super> 0126;;;;N;;;;; -A7F9;MODIFIER LETTER SMALL LIGATURE OE;Lm;0;L;<super> 0153;;;;N;;;;; -A7FA;LATIN LETTER SMALL CAPITAL TURNED M;Ll;0;L;;;;;N;;;;; 

Which includes all the characters that you mention in your question, and obviously much more.

Unfortunately, you cannot make python lower/upper use a newer version of UCD - although you can get an updated version of the unicodedata module by installing unicodedata2 . But that doesn’t really help you.

It seems like your only real choice is to flip your own lower/upper functions or switch to the latest version of Python 3.

+6
source

All Articles