Python 3 source encodings of emojis

I'd like to print emojis from their Python 3 source ("src") encodings

I'm working on a project that analyzes Facebook message histories. In the downloaded raw .htm file I find many emojis shown as question-mark boxes, as happens when a value can't be displayed. If I copy these characters into the terminal as strings, I get values like \U000fe328. This is also the output I get when I run the .htm files through BeautifulSoup and print the data.

I searched for this string (and others), and consistently one of the only sites that comes up is iemoji.com, which, for the string above, lists it as "Python Src" on this page. I want to be able to print these strings as their respective emojis (after all, they were originally emojis when messaging). After looking around I found a mapping of src encodings on this page that matches strings like the one above to emoji names. Then I found those same emoji names in this Unicode list, which for the most part seems to map emoji names to Unicode code points. If I try printing those values, I get good output, as follows:

 >>> print(u'\U0001F624')
 😤

Is there any way to map these "Python src" encodings to their Unicode values? Chaining the two lists would work, except that the src mapping is missing about 50% of the Unicode values found in the Unicode list. And if I do end up having to do that, is there a good way to find the Python Src value of a given emoji? From my testing, emoji strings are equal to their Unicode escapes, for example '😤' == u'\U0001F624', but I can't find any such relation to \U000fe328.

1 answer

This has nothing to do with Python. An escape like \U000fe328 simply contains the hexadecimal representation of the code point, so this character is U+0FE328 (which is a private-use character).
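As a quick check of this point, here is a minimal sketch using only the standard `unicodedata` module (the variable names are illustrative). It shows that the escape is just the code point written in hex, and that the code point falls in the private-use category:

```python
import unicodedata

ch = "\U000fe328"  # the escape from the question

# The escape is only a spelling of the code point in hexadecimal.
code_point = ord(ch)
label = "U+{:06X}".format(code_point)

# General category "Co" means "Other, private use": no standard meaning.
category = unicodedata.category(ch)
is_private_use = category == "Co"

print(label, category, is_private_use)
```

Because private-use characters carry no standard meaning, calling `unicodedata.name(ch)` on one raises `ValueError` unless a default is supplied.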

These days, many emoji are assigned standard code points, for example 😤 is U+01F624, FACE WITH LOOK OF TRIUMPH.

Before they were assigned, different programs used different code points in the private-use range to represent emoji. Facebook apparently used the private-use character U+0FE328. The mapping from these code points to standard code points is arbitrary. Some of them may not have a standard equivalent at all.

So what you need to look for is a table that indicates which of these old assignments corresponds to which standard code point.
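Once such a table is available, applying it in Python could look roughly like the sketch below. The name `PUA_TO_STANDARD` and the function `normalize_emoji` are hypothetical, and the single entry shown (U+0FE328 to U+01F624) is an assumed pairing for illustration; a real table would have to be filled in from a trusted source.

```python
# Hypothetical, hand-filled table: private-use code point -> standard code point.
# Only one pair is listed here, and that pairing is an assumption.
PUA_TO_STANDARD = {
    0xFE328: 0x1F624,  # assumed to correspond to FACE WITH LOOK OF TRIUMPH
}


def normalize_emoji(text):
    """Replace known private-use emoji with their standard code points."""
    # str.translate accepts a dict keyed by code point (int).
    return text.translate(PUA_TO_STANDARD)


sample = "done \U000fe328"
print(normalize_emoji(sample))
```

Characters that are not in the table pass through unchanged, so private-use code points without a standard equivalent simply stay as they are.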

Here's php-emoji on GitHub, which seems to contain these mappings. But note that this is PHP code, and the characters are represented as UTF-8 bytes (for example, the character above would be "\xf3\xbe\x8c\xa8").
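To relate such a UTF-8 byte string to the code point Python reports, a small sketch along these lines may help (the variable names are illustrative):

```python
# The PHP string literal "\xf3\xbe\x8c\xa8" is four raw bytes.
php_bytes = b"\xf3\xbe\x8c\xa8"

# Decoding those bytes as UTF-8 yields a single private-use character.
decoded = php_bytes.decode("utf-8")
print(hex(ord(decoded)))

# Going the other way: encode a code point to compare with table entries.
encoded = "\U000fe328".encode("utf-8")
print(encoded == php_bytes)
```

The same decode step could be applied to each entry when converting a table stored in that byte form into code points usable from Python.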
