Python regex that matches the regional indicator character class

Flags in emoji are represented by a pair of regional indicator symbols. I would like to write a Python regex that inserts spaces between the flags in a string of emoji flags.

For example, this line contains two Brazilian flags:

u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7" 

Which looks like this: 🇧🇷🇧🇷

I would like to insert a space after each pair of regional indicator symbols. Something like this:

 re.sub(re.compile(u"([\U0001F1E6-\U0001F1FF][\U0001F1E6-\U0001F1FF])"), r"\1 ", u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7") 

This should produce:

 u"\U0001F1E7\U0001F1F7 \U0001F1E7\U0001F1F7 " 

But this code gives me an error:

 sre_constants.error: bad character range 

A hint (I think) about what is going wrong is the following, which shows that \U0001F1E7 is turning into two "characters" in the regular expression:

 re.search(re.compile(u"([\U0001F1E7])"), u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7").group(0) 

This leads to:

 u'\ud83c' 

Unfortunately, my understanding of Unicode is too weak for me to make further progress.

EDIT: I am using python 2.7.10 on Mac.

1 answer

I believe you are using Python 2.7 on Windows or Mac, which is a narrow (16-bit) Unicode build. Python 2 on Linux/glibc is typically a wide (32-bit) build with full Unicode, and Python 3.5 (in fact every version since 3.3) handles the full Unicode range on all platforms.

What you are seeing is a single code point split into a surrogate pair. Unfortunately, this also means that you cannot simply use a single character class for this task. It is still possible, however. The UTF-16 representation of U+1F1E6 (🇦) is \uD83C\uDDE6, and that of U+1F1FF (🇿) is \uD83C\uDDFF.

I don't have access to such a Python build myself, but you can try

 \uD83C[\uDDE6-\uDDFF] 

as a replacement for a single [\U0001F1E6-\U0001F1FF], so your whole regular expression becomes

 (\uD83C[\uDDE6-\uDDFF]\uD83C[\uDDE6-\uDDFF]) 

The character class fails because, on a narrow build, [\U0001F1E6-\U0001F1FF] is read as the four code units \uD83C \uDDE6 - \uD83C \uDDFF. The range therefore runs from the second half of the first surrogate pair (\uDDE6) to the first half of the second pair (\uD83C), which is invalid because the start of the range is greater than its end.
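The surrogate values quoted above can be checked with the standard UTF-16 arithmetic. This is only a sketch for illustration: `to_surrogates` is a hypothetical helper name, not something from the standard library.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

first = to_surrogates(0x1F1E6)  # REGIONAL INDICATOR SYMBOL LETTER A
last = to_surrogates(0x1F1FF)   # REGIONAL INDICATOR SYMBOL LETTER Z

# On a narrow build the class is read as: high, low, '-', high, low.
# So the range actually spans from the first low surrogate
# to the second high surrogate.
range_start = first[1]
range_end = last[0]
range_is_valid = range_start <= range_end

print([hex(v) for v in first + last])
print(range_is_valid)
```

Both endpoints share the same high surrogate, which is why the rewritten pattern can keep \uD83C fixed and put only the low surrogates in the class.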

However, this regular expression will not work on Linux; you need to use the original one there, since Python on Linux is a wide Unicode build by default.
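If the same script has to run on both kinds of build, one option is to choose the pattern at runtime by checking `sys.maxunicode`, which is 0xFFFF on narrow builds and 0x10FFFF on wide ones. The following is a minimal sketch under that assumption. The names `REGIONAL_INDICATOR`, `FLAG` and `space_flags` are made up for this example, and I could only try the wide branch myself.

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide build: each regional indicator is a single character.
    REGIONAL_INDICATOR = u"[\U0001F1E6-\U0001F1FF]"
else:
    # Narrow build: each regional indicator is a surrogate pair
    # with a fixed high surrogate and a varying low surrogate.
    REGIONAL_INDICATOR = u"\uD83C[\uDDE6-\uDDFF]"

# A flag is two consecutive regional indicator symbols.
FLAG = re.compile(u"(" + REGIONAL_INDICATOR + REGIONAL_INDICATOR + u")")

def space_flags(text):
    """Insert a space after every pair of regional indicator symbols."""
    return FLAG.sub(u"\\1 ", text)

spaced = space_flags(u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7")
```

The replacement string is written as u"\\1 " rather than with a raw-string prefix so that the same source is accepted by both Python 2 and Python 3.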


Alternatively, upgrade your Python to version 3.5 or later.
