How to remove unicode "punctuation" from a Python string

Here's the problem: I have a Unicode string as input to a Python sqlite query. The query failed (a "like" query). It turns out that the string "FRANCE" does not have six characters, it has seven. The seventh is Unicode U+FEFF, a zero-width no-break space.

How can I catch this whole class of characters before running the query?

+6
python unicode punctuation
3 answers

You can use the character categories from Python's unicodedata module, which exposes the Unicode character database:

>>> import unicodedata
>>> unicodedata.category(u'a')
'Ll'
>>> unicodedata.category(u'.')
'Po'
>>> unicodedata.category(u',')
'Po'

The categories for punctuation characters start with "P", as you can see. So you need to filter the string character by character (for example with a list comprehension).
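A minimal sketch of such a filter, written for Python 3, where every str is already Unicode. The function name remove_punctuation is made up for illustration:

```python
import unicodedata

def remove_punctuation(text):
    """Drop every character whose Unicode category starts with 'P'."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )

print(remove_punctuation("Hello, world... «ok»!"))  # Hello world ok
```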


In your case:

>>> unicodedata.category(u'\ufeff')
'Cf'

This way you can whitelist characters based on their categories.
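As a rough sketch of category-based whitelisting in Python 3: the choice of allowed categories below (letters, decimal digits, plain spaces) is only an assumption for illustration, and the names ALLOWED_PREFIXES and keep_allowed are hypothetical.

```python
import unicodedata

# Assumed whitelist: letters (L*), decimal digits (Nd) and plain spaces (Zs).
ALLOWED_PREFIXES = ("L", "Nd", "Zs")

def keep_allowed(text, allowed=ALLOWED_PREFIXES):
    """Keep only characters whose category starts with an allowed prefix."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch).startswith(allowed)
    )

country = "FRANCE\ufeff"
cleaned = keep_allowed(country)
print(len(country), len(cleaned))  # 7 6
```

Note that apostrophes and hyphens are punctuation too, so a whitelist like this one would also strip them from names such as "Côte d'Ivoire"; adjust the allowed categories to your data.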

+10

In general, input validation should be done with a whitelist of allowed characters, if you can define such a thing for your use case. Then you simply drop everything that is not on the whitelist (or reject the input altogether).

If you can define a set of valid characters, then you can use a regular expression to strip out everything else.

For example, let's say you know that a "country" will only contain uppercase English letters and spaces. You could then strip out everything else, including your nasty Unicode character, like this:

>>> import re
>>> country = u'FRANCE\ufeff'
>>> clean_pattern = re.compile(u'[^A-Z ]+')
>>> clean_pattern.sub('', country)
u'FRANCE'
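If you would rather reject bad input than silently clean it, the same whitelist can be used as a validator. This is only a sketch for Python 3; the name validate_country and the "uppercase ASCII letters and spaces" rule are assumptions carried over from the example above:

```python
import re

# Assumed rule: a country name is uppercase ASCII letters and spaces only.
VALID_COUNTRY = re.compile(r"[A-Z ]+")

def validate_country(value):
    """Return value unchanged if it matches the whitelist, else raise."""
    if VALID_COUNTRY.fullmatch(value) is None:
        raise ValueError("unexpected characters in %r" % value)
    return value

print(validate_country("FRANCE"))  # FRANCE
try:
    validate_country("FRANCE\ufeff")
except ValueError as exc:
    print("rejected:", exc)
```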

If you cannot define a set of valid characters, you have a serious problem, because your task becomes anticipating all of the tens of thousands of possible unexpected Unicode characters that could be thrown at you, and more are added to the specification as languages evolve over the years.

+1

That character is also the byte order mark (BOM). Clean your strings first to get rid of it, using something like:

>>> f = u'France\ufeff'
>>> f
u'France\ufeff'
>>> print f
France
>>> f.replace(u'\ufeff', '')
u'France'
>>> f.strip(u'\ufeff')
u'France'
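The two calls are not equivalent, which the following Python 3 sketch tries to show: strip() only trims the ends of the string, while replace() removes the character wherever it occurs. The last part assumes the mark came from decoding UTF-8 bytes, in which case the standard "utf-8-sig" codec drops a leading BOM during decoding. The variable names are illustrative only.

```python
BOM = "\ufeff"

edge = "France" + BOM
middle = "Fra" + BOM + "nce"

# strip() only trims the ends; replace() removes every occurrence.
print(edge.strip(BOM) == "France")          # True
print(middle.strip(BOM) == "France")        # False
print(middle.replace(BOM, "") == "France")  # True

# If the mark comes from decoding UTF-8 bytes, the "utf-8-sig" codec
# drops a leading BOM, whereas plain "utf-8" keeps it as U+FEFF.
raw = b"\xef\xbb\xbfFrance"
print(raw.decode("utf-8") == BOM + "France")  # True
print(raw.decode("utf-8-sig") == "France")    # True
```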
0