How to account for accent characters for regular expressions in Python?

Question

I am currently using re.findall to search and highlight words after the '#' character for hash tags in a string:

hashtags = re.findall(r'#([A-Za-z0-9_]+)', str1)

It searches str1 and finds all hashtags. This works; however, it does not take into account accented characters such as áéíóúñü¿.

If one of these characters is in str1, the match stops at the letter before it. For example, #yogenfrüz is saved as #yogenfr.
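
The truncation described above can be reproduced with a short script. This is a minimal sketch assuming Python 3; the sample string is made up for illustration.

```python
import re

# Illustrative input; this sample text is an assumption, not from the original post.
str1 = "Dessert at #yogenfrüz with #friends_2013"

# The ASCII-only character class stops at the first character
# outside A-Z, a-z, 0-9 and underscore.
hashtags = re.findall(r'#([A-Za-z0-9_]+)', str1)

print(hashtags)  # ['yogenfr', 'friends_2013']
```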

I need to take into account all accented letters, which vary across German, Dutch, French and Spanish, so that I can save hashtags such as #yogenfrüz.

How can I do this?

+8

python django regex hashtag non-ascii-characters

noahandthewhale Sep 06 '13 at 17:48

2 answers

You can also use

 import unicodedata
 output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')

How can I convert all those escape characters into their corresponding characters? For example, if there is a Unicode à, how can I convert it to a plain a? Suppose you have loaded your Unicode text into a variable called my_unicode. Normalizing à into a is this simple:

 import unicodedata
 output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')

Explicit example:

 >>> myfoo = u'àà'
 >>> myfoo
 u'\xe0\xe0'
 >>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
 'aa'

Check this answer, it really helped me: How to convert unicode characters to pure ascii without accents?
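
On Python 3 the same idea needs a decode step, because encode returns bytes rather than str. The sketch below assumes Python 3; strip_accents and the sample string are hypothetical names chosen for illustration. Note that this approach changes the saved hashtag (ü becomes u), and characters with no ASCII base letter, such as ß or ¿, are dropped entirely.

```python
import re
import unicodedata

def strip_accents(text):
    """Return text with accents removed (hypothetical helper)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    # Encoding to ASCII with 'ignore' drops the combining marks (and any
    # other non-ASCII character); decode turns the bytes back into str.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

str1 = "Dessert at #yogenfrüz"  # illustrative input
hashtags = re.findall(r'#([A-Za-z0-9_]+)', strip_accents(str1))
print(hashtags)  # ['yogenfruz']
```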

+2

Berk Feb 12 '17 at 19:41

Ibrahim najjar · Accepted Answer · 2013-09-06T17:52:15+0000

Try the following:

 hashtags = re.findall(r'#(\w+)', str1, re.UNICODE)

Regex101 demo

EDIT Check out the helpful comment below from Martijn Pieters.
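
As a sketch of how this behaves on Python 3, where str patterns are Unicode-aware by default and re.UNICODE is redundant for them: the helper name find_hashtags and the sample strings are assumptions for illustration. The NFC normalization step is an addition of this sketch, not part of the answer above. It covers input where ü arrives as u followed by a combining mark, which \w would otherwise not match.

```python
import re
import unicodedata

def find_hashtags(text):
    """Return hashtag names, accented letters included (hypothetical helper)."""
    # Compose each base letter + combining mark pair into a single code point.
    composed = unicodedata.normalize('NFC', text)
    # On Python 3, \w in a str pattern matches Unicode letters, digits and underscore.
    return re.findall(r'#(\w+)', composed)

print(find_hashtags("Dessert at #yogenfrüz and #café_2013"))
# ['yogenfrüz', 'café_2013']
```

Keep in mind that \w also accepts letters and digits from any other script, so this pattern is broader than "ASCII plus European accents".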
