Python - Reading emoji Unicode characters

I have a Python 2.7 program that reads iOS text messages from a SQLite database. Text messages are unicode strings. In the following text message:

u'that\u2019s \U0001f63b' 

The apostrophe is represented as \u2019, but the emoji is represented as \U0001f63b. I looked up the code point for the emoji in question and found \uf63b. I'm not sure where the 0001 comes from. I don't know much about character encodings.

When I print the text character by character, using:

 s = u'that\u2019s \U0001f63b'
 for c in s:
     print c.encode('unicode_escape')

The program produces the following output:

 t
 h
 a
 t
 \u2019
 s
  
 \ud83d
 \ude3b

How can I read these last characters correctly in Python? Am I using encode correctly here? Should I just try to strip out the 0001 before reading, or is there an easier, less clumsy way?

+8
python unicode emoji
2 answers

I don't think you are using encode correctly, and you don't need it. What you have is a valid unicode string with one four-digit and one eight-digit escape sequence. Try this in a REPL on, say, OS X:

 >>> s = u'that\u2019s \U0001f63b'
 >>> print s
 that’s 😻

In Python 3, though:

 Python 3.4.3 (default, Jul 7 2015, 15:40:07)
 >>> s = u'that\u2019s \U0001f63b'
 >>> s[-1]
 '😻'
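
As for where the 0001 comes from, here is a small sketch of how the two escape forms differ. It assumes Python 3 (or a "wide" Python 2 build), where the emoji is a single character; the variable names are only illustrative.

```python
# Assumes Python 3 or a "wide" Python 2 build, where the emoji is one character.
apostrophe = u'\u2019'      # \u takes exactly four hex digits
emoji = u'\U0001f63b'       # \U takes exactly eight, so U+1F63B is zero-padded
not_the_emoji = u'\uf63b'   # only four digits: U+F63B, a different code point

print(hex(ord(apostrophe)))    # 0x2019
print(hex(ord(emoji)))         # 0x1f63b
print(emoji == not_the_emoji)  # False
```

So the 0001 is not something to strip: the emoji's code point is U+1F63B, which needs five hex digits, and \U pads it to eight.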
+17

The last part of your confusion is most likely because you are running a so-called "narrow" Python build. On a narrow build, a single character cannot hold a code point above U+FFFF, so the emoji is stored as two UTF-16 surrogates. The best solution would be to upgrade to Python 3. Otherwise, process the UTF-16 surrogate pair yourself.
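
If upgrading is not an option, one way to process the surrogate pair might look like the sketch below. The helper names combine_surrogates and code_points are hypothetical, not part of any library; the arithmetic is the standard UTF-16 decoding rule, and the code should run on both a narrow Python 2.7 build and Python 3.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one integer code point."""
    return 0x10000 + ((ord(high) - 0xD800) << 10) + (ord(low) - 0xDC00)


def code_points(text):
    """Yield integer code points from text, joining any surrogate pairs."""
    i = 0
    while i < len(text):
        ch = text[i]
        is_high = 0xD800 <= ord(ch) <= 0xDBFF
        if is_high and i + 1 < len(text) and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF:
            # High surrogate followed by a low surrogate: one code point.
            yield combine_surrogates(ch, text[i + 1])
            i += 2
        else:
            # Ordinary character (or an unpaired surrogate, passed through as-is).
            yield ord(ch)
            i += 1


# The two halves printed in the question, written out explicitly.
s = u'that\u2019s \ud83d\ude3b'
print([hex(cp) for cp in code_points(s)])
```

The last value printed is 0x1f63b, the emoji's real code point, rather than the two halves \ud83d and \ude3b.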

+3
