How to remove unicode "punctuation" from a Python string

Question

How to remove unicode "punctuation" from a Python string

Here's the problem: I have a Unicode string as input to a Python SQLite query. The query (a LIKE match) failed. It turns out that the string "FRANCE" doesn't have 6 characters, it has seven. And the seventh is Unicode U+FEFF, a zero-width no-break space.

How can I catch a whole class of such characters before running the query?

+6

python unicode punctuation

Dave fultz Mar 24 '11 at 4:36


3 answers

In general, input validation should be done with a whitelist of valid characters, if you can define such a thing for your use case. Then you simply drop everything that is not on the whitelist (or reject the input altogether).

If you can define a set of valid characters, you can use a regular expression to strip out everything else.

For example, say you know that "country" will only ever contain uppercase English letters and spaces. You could then strip out everything else, including your nasty Unicode character, like this:

    >>> import re
    >>> country = u'FRANCE\ufeff'
    >>> clean_pattern = re.compile(u'[^A-Z ]+')
    >>> clean_pattern.sub('', country)
    u'FRANCE'
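The "reject the input" alternative mentioned above could be sketched roughly as follows, in Python 3. The helper name `validate_country` and the exact whitelist are illustrative assumptions, not part of the original answer:

```python
import re

# Assumed whitelist: uppercase ASCII letters and spaces only.
ALLOWED = re.compile(r"[A-Z ]+")

def validate_country(value):
    """Return value unchanged if every character is whitelisted, else raise ValueError."""
    if ALLOWED.fullmatch(value) is None:
        raise ValueError("unexpected characters in %r" % value)
    return value

print(validate_country("FRANCE"))  # FRANCE

try:
    validate_country("FRANCE\ufeff")
except ValueError:
    print("rejected")  # the trailing U+FEFF is not on the whitelist
```

Rejecting surfaces bad data at the point of entry, while stripping repairs it silently. Which one fits depends on where the input comes from.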

If you cannot define the set of valid characters, you have a serious problem: you would have to anticipate all of the tens of thousands of possible unexpected Unicode characters that could be thrown at you, and more are added to the standard over the years as it evolves.

+1

Nathan Mar 24 '11 at 4:56


That character is also the byte order mark (BOM). Clean your strings first to eliminate it, using something like:

    >>> f = u'France\ufeff'
    >>> f
    u'France\ufeff'
    >>> print f
    France
    >>> f.replace(u'\ufeff', '')
    u'France'
    >>> f.strip(u'\ufeff')
    u'France'
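Note that the two calls are not interchangeable: `strip` only removes the character from the ends of the string, while `replace` removes it everywhere. A short Python 3 sketch of the difference (the helper name `remove_bom` is just for illustration):

```python
BOM = "\ufeff"  # U+FEFF, zero-width no-break space / byte order mark

def remove_bom(text):
    """Remove every U+FEFF in text, wherever it occurs."""
    return text.replace(BOM, "")

sample = "\ufeffFRA\ufeffNCE\ufeff"

# strip() leaves the U+FEFF in the middle of the string untouched.
print(len(sample.strip(BOM)))   # 7
# replace() removes all three occurrences.
print(len(remove_bom(sample)))  # 6
print(remove_bom(sample))       # FRANCE
```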

0

jcomeau_ictx Mar 24 '11 at 4:42


Andreas Jung · Accepted Answer · 2011-03-24T04:45:33+0000

You can use the Unicode character categories exposed by Python's unicodedata module:

    >>> import unicodedata
    >>> unicodedata.category(u'a')
    'Ll'
    >>> unicodedata.category(u'.')
    'Po'
    >>> unicodedata.category(u',')
    'Po'

As you can see, the punctuation categories all begin with "P". So you need to filter the string character by character (for example with a list comprehension).
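A minimal Python 3 sketch of that character-by-character filter follows. The helper name `strip_categories` and the default prefixes are illustrative assumptions. One caveat: U+FEFF itself is reported as category "Cf" (format), not one of the "P" categories, so this sketch also drops the "C" categories in order to catch it:

```python
import unicodedata

def strip_categories(text, prefixes=("P", "C")):
    """Drop every character whose Unicode general category starts with one of prefixes."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith(prefixes)
    )

country = "FRANCE\ufeff"
print(unicodedata.category("\ufeff"))  # Cf
print(len(country))                    # 7

cleaned = strip_categories(country)
print(len(cleaned), cleaned)           # 6 FRANCE
```

Keep in mind that the "C" categories also cover control characters such as tabs and newlines, so for multi-line text you would probably want to pass a narrower tuple, for example `("P", "Cf")`.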
