Trademark character length in python 2.x

Question


Why does this return 3 in Python 2.x?

    >>> len('™')
    3

How can I quickly fix it so that the symbol is counted as a single character (as in Python 3.x)?

+4

python-2.7 encoding

Francesco della vedova Mar 08 '13 at 16:46


1 answer

Martijn Pieters · Accepted Answer · 2013-03-08T16:48:26+0000

Your terminal encoding is set to UTF-8. You are counting the bytes in the encoded character:

    >>> '™'
    '\xe2\x84\xa2'
    >>> len('™')
    3

Use unicode to count characters instead of bytes:

    >>> u'™'
    u'\u2122'
    >>> len(u'™')
    1

or decode from terminal encoding:

    >>> import sys
    >>> '™'.decode(sys.stdin.encoding)
    u'\u2122'
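One caveat with decoding from the terminal encoding: sys.stdin.encoding can be None, for example when input is piped rather than typed. A minimal sketch of a more defensive decode follows; the helper name decode_terminal_input and its fallback parameter are hypothetical, and falling back to UTF-8 is an assumption that may not suit every environment.

```python
import sys


def decode_terminal_input(raw, encoding=None, fallback='utf-8'):
    """Decode a byte string to a Unicode string.

    Hypothetical helper for illustration: uses the explicit encoding if
    given, else the terminal encoding, else the assumed fallback.
    """
    # sys.stdin.encoding may be None (e.g. piped input), so guard against it.
    chosen = encoding or getattr(sys.stdin, 'encoding', None) or fallback
    return raw.decode(chosen)


# The three UTF-8 bytes of the trademark sign decode to one character.
trademark = decode_terminal_input(b'\xe2\x84\xa2', encoding='utf-8')
print(len(trademark))  # 1
```

Passing the encoding explicitly, as in the usage line, avoids depending on how the script happens to be launched.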

In Python 3, strings are Unicode values, and the Python 2 str type has been renamed to bytes (your input is essentially the same as b'\xe2\x84\xa2' in Python 3).
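The difference between the two types can be seen by encoding the character explicitly. This is a small sketch rather than part of the original answer; it uses the \u2122 escape so that it does not depend on the source file encoding, and it should behave the same on Python 2.7 and Python 3.3+.

```python
text = u'\u2122'                # TRADE MARK SIGN as a Unicode string
encoded = text.encode('utf-8')  # the same character as UTF-8 bytes

print(len(text))     # 1 (characters)
print(len(encoded))  # 3 (bytes)

# Decoding the bytes gives back the original one-character string.
print(encoded.decode('utf-8') == text)  # True
```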

You may want to read up on Python and Unicode:

Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
