Python regex that matches the regional indicator character class

Question

Flags in emoji are represented by a pair of regional indicator symbols. I would like to write a Python regex that inserts spaces between the flags in a string of emoji flags.

For example, this line contains two Brazilian flags:

u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7"

which renders like this: 🇧🇷🇧🇷

I would like to insert spaces between pairs of regional indicator symbols. Something like this:

 re.sub(re.compile(u"([\U0001F1E6-\U0001F1FF][\U0001F1E6-\U0001F1FF])"), r"\1 ", u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7")

This should produce:

 u"\U0001F1E7\U0001F1F7 \U0001F1E7\U0001F1F7 "

But this code gives me an error:

 sre_constants.error: bad character range

A hint (I think) about what is going wrong is the following, which shows that \U0001F1E7 turns into two "characters" in the regular expression:

 re.search(re.compile(u"([\U0001F1E7])"), u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7").group(0)

This returns:

 u'\ud83c'

Unfortunately, my understanding of Unicode is too weak for me to make further progress.

EDIT: I am using Python 2.7.10 on Mac.

python regex unicode

John Rauser Aug 23 '16 at 18:26

1 answer

Antti Haapala · Accepted Answer · 2016-08-23T18:32:00+0000

I believe you are using Python 2.7 on Windows or Mac, which ships as a narrow build that stores Unicode strings as 16-bit (UTF-16) code units. Linux/glibc builds typically use wide, 32-bit Unicode, and Python 3.5 has wide Unicode on all platforms.
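To find out which kind of build you have, you can inspect sys.maxunicode. This is only a small sketch; the helper name is_narrow_build is made up for illustration.

```python
import sys

def is_narrow_build():
    """Return True on a narrow (UTF-16) build.

    Narrow builds report sys.maxunicode == 0xFFFF; wide builds (and every
    Python 3.3+) report 0x10FFFF.
    """
    return sys.maxunicode == 0xFFFF

print("narrow build" if is_narrow_build() else "wide build")
```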

What you are seeing is a single code point split into a surrogate pair. Unfortunately, this also means that you cannot simply use a single character class for this task. It is still possible, though. The UTF-16 representation of U+1F1E6 (🇦) is \uD83C\uDDE6, and that of U+1F1FF (🇿) is \uD83C\uDDFF.

I don't have access to such a Python build myself, but you can try
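For reference, the surrogate values above follow from the standard UTF-16 arithmetic. A minimal sketch (the function name to_surrogate_pair is hypothetical, not part of any library):

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

for cp in (0x1F1E6, 0x1F1FF):
    high, low = to_surrogate_pair(cp)
    print("U+%X -> \\u%04X \\u%04X" % (cp, high, low))
```

Both ends of the regional indicator block share the same high surrogate, which is why a single low-surrogate range is enough.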

 \uD83C[\uDDE6-\uDDFF]

as a replacement for the single [\U0001F1E6-\U0001F1FF], so that your whole regular expression becomes

 (\uD83C[\uDDE6-\uDDFF]\uD83C[\uDDE6-\uDDFF])

The character class does not work because, on a narrow build, it tries to form a range from the second half of the first surrogate pair (\uDDE6) to the first half of the second surrogate pair (\uD83C). That fails because the start of the range is larger than its end.

However, this regular expression will not work on Linux; use the original one there, since Linux builds use wide Unicode by default.
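The failure can be reproduced on any build by spelling out the four code units a narrow build sees inside the brackets. This is an illustrative sketch; the helper compiles is a made-up name.

```python
import re

# What a narrow build sees for [\U0001F1E6-\U0001F1FF]: the class contains
# \uD83C, then the range \uDDE6-\uD83C, then \uDDFF.
narrow_view = u"[\uD83C\uDDE6-\uD83C\uDDFF]"

# The suggested replacement: a fixed high surrogate followed by a low-surrogate range.
split_form = u"\uD83C[\uDDE6-\uDDFF]"

def compiles(pattern):
    """Return True if the pattern compiles, False if re rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(narrow_view))  # the range \uDDE6-\uD83C runs backwards
print(compiles(split_form))
```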
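If the same script has to run on both kinds of build, one option is to pick the pattern at import time. This is a sketch under that assumption, and I could only check it on a wide build; the names RI, FLAG and space_flags are invented for the example.

```python
import re
import sys

if sys.maxunicode == 0xFFFF:
    # Narrow build: each regional indicator is a surrogate pair.
    RI = u"\uD83C[\uDDE6-\uDDFF]"
else:
    # Wide build: each regional indicator is a single character.
    RI = u"[\U0001F1E6-\U0001F1FF]"

# A flag is two consecutive regional indicator symbols.
FLAG = re.compile(u"(" + RI + RI + u")")

def space_flags(text):
    """Insert a space after every pair of regional indicator symbols."""
    return FLAG.sub(u"\\1 ", text)

brazil_twice = u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7"
print(repr(space_flags(brazil_twice)))
```

Note that this leaves a trailing space after the last flag, just like the substitution in the question; call .rstrip() on the result if that is unwanted.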

Also, upgrade your Windows Python to version 3.5 or higher.
