Python - Reading Unicode characters from Emoji

Question

I have a Python 2.7 program that reads iOS text messages from a SQLite database. Text messages are unicode strings. In the following text message:

u'that\u2019s \U0001f63b'

The apostrophe is represented by \u2019, but the emoji is represented by \U0001f63b. I looked up the code point for the emoji in question, and it's \uf63b. I'm not sure where the 0001 is coming from. I don't know much about character encodings.

When I print the text, character by character, using:

 s = u'that\u2019s \U0001f63b'
 for c in s:
     print c.encode('unicode_escape')

The program produces the following output:

 t
 h
 a
 t
 \u2019
 s
  
 \ud83d
 \ude3b

How can I read these last characters correctly in Python? Am I using encode correctly here? Should I just strip out those 0001s before reading, or is there an easier, less clumsy way?

+8

python python-2.7 unicode emoji

Andrew LaPrise Jul 07 '15 at 10:16


2 answers

The last part of your confusion is most likely because you are running a so-called "narrow" Python build. On such a build, a single character in a unicode string cannot hold enough information to represent one emoji, so the emoji is stored as two UTF-16 surrogates. The best solution would be to upgrade to Python 3. Otherwise, try processing the UTF-16 surrogate pair yourself.
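
A minimal sketch of what processing a surrogate pair could look like, assuming the text is first encoded to UTF-16 so that the same code runs on a narrow Python 2.7 build and on Python 3. The helper names combine_surrogates and code_points are hypothetical, not part of any library.

```python
import struct


def combine_surrogates(high, low):
    """Merge a UTF-16 high/low surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def code_points(text):
    """Return the Unicode code points of text, pairing up surrogates."""
    # Encode to big-endian UTF-16 and unpack into 16-bit code units.
    data = text.encode('utf-16-be')
    units = struct.unpack('>%dH' % (len(data) // 2), data)
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # A high surrogate followed by a low one forms a single code point.
            result.append(combine_surrogates(unit, units[i + 1]))
            i += 2
        else:
            result.append(unit)
            i += 1
    return result


points = code_points(u'that\u2019s \U0001f63b')
print([hex(p) for p in points])
```

With the pair from the question, combine_surrogates(0xD83D, 0xDE3B) gives 0x1F63B, the emoji's actual code point.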

+3

Kupiakos Jul 07 '15 at 10:34


pvg · Accepted Answer · 2015-07-07T22:25:00+0000

I don't think you're using encode correctly, nor do you need to. What you have is a valid unicode string with one four-digit and one eight-digit escape sequence. Try this in the REPL on, say, OS X:

 >>> s = u'that\u2019s \U0001f63b'
 >>> print s
 that’s 😻

In Python 3, though, the emoji is a single character:

 Python 3.4.3 (default, Jul 7 2015, 15:40:07)
 >>> s = u'that\u2019s \U0001f63b'
 >>> s[-1]
 '😻'
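
As for the 0001 in the question: \u takes exactly four hex digits and \U takes exactly eight. The emoji's code point is U+1F63B, which lies above U+FFFF and so does not fit in four digits. A small sketch, assuming Python 3.3 or later, where strings always store full code points:

```python
import sys

s = u'that\u2019s \U0001f63b'

# Python 3.3+ always reports 0x10ffff; a narrow Python 2 build reports 0xffff.
print(hex(sys.maxunicode))

# The emoji counts as one character here, so the string has 8 characters.
print(len(s))

# The code point is 0x1F63B: the 1 is part of the value, and the
# leading zeros only pad it to the eight digits that \U requires.
emoji_code_point = ord(s[-1])
print(hex(emoji_code_point))

# u'\uf63b' is a different character (U+F63B, in the Private Use Area).
print(u'\uf63b' == s[-1])
```

So nothing needs to be stripped before reading; 0001f63b is simply the full code point written out in eight digits.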
