Odd behavior when iterating over a unicode string

Question

When I do this:

text = u"奥巴马讲话"
for c in text:
    print c

I get the expected result:

奥
巴
马
讲
话

But if I do this:

text = u"𤭢€"
for c in text:
    print c

I get:

€

I expected:

𤭢
€

Why is this? I think this has something to do with the following fact:

In [1]: u"𤭢".encode("utf8")
Out[1]: '\xf0\xa4\xad\xa2'

"𤭢" is encoded using 4 bytes.

How can I iterate over a unicode string that contains characters like this?

For example, u"𤭢𤭢𤭢𤭢𤭢𤭢".

python unicode

lessthanl0l Jul 21 '14 at 13:28

1 answer

ecatmur · Accepted Answer · 2014-07-21T13:33:51+0000

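𤭢 is outside the Basic Multilingual Plane; its code point is U+24B62. To process it correctly you need a Python build with sys.maxunicode == 1114111 (a "wide" build). On a "narrow" build (sys.maxunicode == 65535) the character is stored as a UTF-16 surrogate pair, so iterating over the string yields two half-characters that cannot be displayed on their own. See "Unicode in Python - UTF-16 only?" for more details.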

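Alternatively, upgrade to Python 3.3 or later, where strings are always indexed by code point. For working with UTF-16 surrogate pairs, see the related question on Unicode in Python 3.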
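If switching builds is not an option, one possible workaround is to join surrogate pairs by hand while iterating. The following is only a minimal sketch, not part of the original answer: the helper name iter_code_points is hypothetical, and it assumes that any high surrogate immediately followed by a low surrogate forms one character. It behaves the same on a wide build or on Python 3, where no pairs are present.

```python
import sys

# True on a narrow build, where non-BMP characters are stored as
# UTF-16 surrogate pairs (two code units per character).
IS_NARROW_BUILD = sys.maxunicode == 0xFFFF


def iter_code_points(text):
    """Yield one string per code point, joining UTF-16 surrogate pairs.

    Hypothetical helper for illustration only.
    """
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        # A high surrogate followed by a low surrogate encodes a single
        # character outside the Basic Multilingual Plane.
        if (u"\ud800" <= ch <= u"\udbff" and i + 1 < n
                and u"\udc00" <= text[i + 1] <= u"\udfff"):
            yield text[i:i + 2]
            i += 2
        else:
            yield ch
            i += 1
```

With this helper, the loop from the question can be written as "for c in iter_code_points(text):", and each c is then a complete character that can be printed.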
