Python 3 regex with diacritics and ligatures

Names in the form "Ceasar, Julius" must be split into first name Julius and surname Ceasar.

Names may contain diacritics (à, é, ...) and ligatures (æ, ø).

This code works fine in Python 3.3:

    import re

    def doesmatch(pat, str):
        try:
            yup = re.search(pat, str)
            print('Firstname {0} lastname {1}'.format(yup.group(2), yup.group(1)))
        except AttributeError:
            print('no match for {0}'.format(str))

    s = 'Révèrberë, Harry'
    t = 'Åapö, Renée'
    u = 'C3po, Robby'
    v = 'Mærsk, Efraïm'
    w = 'MacDønald, Ron'
    x = 'Sträßle, Mpopo'
    pat = r'^([^\d\s]+), ([^\d\s]+)'
    # matches any letter, diacritic or ligature, but not digits or punctuation inside the ()
    for i in s, t, u, v, w, x:
        doesmatch(pat, i)

Everything matches except u (I don't want to match digits in names), but I wonder if there is a better way than the non-digit, non-whitespace approach. More importantly, I would like to refine the pattern so that it distinguishes capitals from lowercase letters, including capital letters with diacritics and ligatures, preferably also using a regular expression. As if ([A-Z][a-z]+) would match accented and combined characters.

Is it possible?

(What I have looked at so far: Dive Into Python 3 on UTF-8 and Unicode; this Unicode regex tutorial (which I don't use); I think I don't need the new regex module, but I admit I have not read all of its documentation.)

1 answer

If you want to distinguish between uppercase and lowercase letters using the standard library module re, then I'm afraid you will have to manually build a character class of all the corresponding Unicode code points.
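One way to build such classes without typing them by hand is to let Python enumerate the code points. This is only a minimal sketch: the names build_class, UPPER, LOWER, NAME and CAPITALISED are made up for illustration, and it assumes the input uses precomposed characters (NFC), so that é is one code point rather than e plus a combining accent.

```python
import re
import sys

def build_class(predicate):
    """Build a regex character class from every code point that satisfies predicate."""
    chars = (chr(cp) for cp in range(sys.maxunicode + 1))
    return '[' + ''.join(re.escape(c) for c in chars if predicate(c)) + ']'

# Hypothetical names; built once because scanning every code point takes a moment.
UPPER = build_class(str.isupper)
LOWER = build_class(str.islower)

# One capital followed by one or more lowercase letters, for both parts of the name.
NAME = UPPER + LOWER + '+'
CAPITALISED = re.compile('^({0}), ({0})$'.format(NAME))

# Map each sample name to whether it fits the strict capital-then-lowercase shape.
samples = ('Åapö, Renée', 'Sträßle, Mpopo', 'MacDønald, Ron', 'C3po, Robby')
results = {name: bool(CAPITALISED.match(name)) for name in samples}
```

Note that a strict "one capital, then lowercase" pattern also rejects names with an inner capital such as MacDønald, so the exact shape of NAME would need adjusting to the names you expect.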

If you don't really need to distinguish case, use

 [^\W\d_] 

to match any Unicode letter. This character class matches any that is “not an alphanumeric character” (which matches an alphanumeric character), which is also not a number or underscore.
