In general, input validation should be done using a white list of valid characters if you can define such a thing for your use case. Then you simply drop everything that is not in the white list (or donβt reject the entry at all).
If you can define a set of valid characters, then you can use a regular expression to highlight everything else.
For example, let's say you know that a βcountryβ will only have English letters and spaces in uppercase, which you could cut out everything else, including your nasty unicode letter, such as:
>>> import re >>> country = u'FRANCE\ufeff' >>> clean_pattern = re.compile(u'[^AZ ]+') >>> clean_pattern.sub('', country) u'FRANCE'
If you cannot determine the set of valid characters, you have serious problems, because your task is to anticipate all the tens of thousands of possible unexpected Unicode characters that can be thrown at you - and more and more are added to the specifications as languages ββevolve over the years.
Nathan
source share