How to account for accented characters in regular expressions in Python?

I am currently using re.findall to search and highlight words after the '#' character for hash tags in a string:

hashtags = re.findall(r'#([A-Za-z0-9_]+)', str1) 

It searches str1 and finds all hashtags. This works, but it does not handle accented characters such as áéíóúñü¿.

If one of these letters is in str1, the match stops at the letter before it. So, for example, #yogenfrüz is saved as #yogenfr.
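A minimal reproduction of the truncation described above, assuming Python 3. The sample string is hypothetical and is written with escapes so the accented letters are unambiguous.

```python
import re

# Hypothetical sample input; \u00fc is "ü" and \u00e9 is "é"
str1 = "Trying #yogenfr\u00fcz and #caf\u00e9 today"

# The ASCII-only character class stops at the first accented letter
hashtags = re.findall(r'#([A-Za-z0-9_]+)', str1)
print(hashtags)  # ['yogenfr', 'caf']
```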

I need to take into account all accented letters from German, Dutch, French and Spanish, so that I can save complete hashtags such as #yogenfrüz.

How can I do this?

+8
python django regex hashtag non-ascii-characters
2 answers

Try the following:

 hashtags = re.findall(r'#(\w+)', str1, re.UNICODE) 

Demo version of Regex101

EDIT Check out the helpful comment below from Martijn Pieters.
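Here is a sketch of the same idea for Python 3, where \w is already Unicode-aware for str patterns, so re.UNICODE is only needed on Python 2 with unicode strings. The helper name find_hashtags and the NFC normalization step are assumptions added for illustration, not part of the answer above. Normalizing first matters because \w does not match a standalone combining accent, so decomposed input would otherwise be cut short.

```python
import re
import unicodedata

def find_hashtags(text):
    """Return hashtag bodies, including accented letters (hypothetical helper)."""
    # Compose "u" + combining diaeresis into a single "ü" so \w can match it
    composed = unicodedata.normalize('NFC', text)
    # In Python 3, \w matches Unicode letters, digits and underscore by default
    return re.findall(r'#(\w+)', composed)

# \u00fc is "ü", \u00e9 is "é"
print(find_hashtags("Trying #yogenfr\u00fcz and #caf\u00e9 today"))
```

Note that \w accepts letters from every script plus digits and the underscore, which may be broader than you want for hashtags.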

+21

You can also use

 import unicodedata
 output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')

This comes from a question that asked how to convert accented characters to their plain equivalents, for example turning a Unicode à into a standard a. Suppose you have loaded your text into a variable called my_unicode; the normalize call above is all it takes to turn à into a.

Explicit example (Python 2 interpreter session):

 >>> myfoo = u'àà'
 >>> myfoo
 u'\xe0\xe0'
 >>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
 'aa'

Check this answer, it really helped me: How to convert unicode characters to pure ascii without accents?
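For Python 3, the same approach might look like the sketch below. The function name strip_accents is hypothetical, and the extra .decode('ascii') is there because encode returns bytes on Python 3.

```python
import unicodedata

def strip_accents(text):
    """Return text with accents removed (hypothetical helper)."""
    # NFD splits "ü" into "u" plus a combining diaeresis
    decomposed = unicodedata.normalize('NFD', text)
    # Encoding to ASCII with errors='ignore' drops the combining marks,
    # along with any other character that has no ASCII form (e.g. "¿")
    return decomposed.encode('ascii', 'ignore').decode('ascii')

# \u00fc is "ü"
print(strip_accents("#yogenfr\u00fcz"))  # #yogenfruz
```

Keep in mind that this changes the text: #yogenfrüz becomes #yogenfruz, so it only helps if losing the accents is acceptable.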

+2