How to find accented characters in a string in Python?

Question

I have a file with sentences, some of which are in Spanish and contain accented letters (like é) or special characters (like ¿). I need to be able to search for these characters in a sentence so that I can determine whether the sentence is in Spanish or English.

I have tried my best to get this working, but nothing has given the right result. Below is one of the approaches I tried, which clearly gives the wrong answer.

    sentence = ¿Qué tipo es el?  # in str format, received from standard open file method
    sentence = sentence.decode('latin-1')
    print 'é'.decode('latin-1') in sentence
    >>> False

I also tried using codecs.open(.., .., 'latin-1') to read in the file, but that did not help. Then I tried u'é'.encode('latin-1'), and that did not work either.

I am out of ideas. Any suggestions?

@icktoofay provided a solution. I ended up keeping the file decoding (using latin-1) but then used Python Unicode literals for the characters (u'é'). This required me to declare the source encoding at the top of the script. The final step was to use the unicodedata.normalize method to normalize both strings and then compare them. Thanks, guys, for the tips and great support.

python string unicode

user1411331 Nov 10 '12 at 20:22

2 answers

icktoofay · Answer 1 · 2012-11-10T20:24:57+0000

Use unicodedata.normalize on the string before checking.

Explanation

Unicode offers more than one way to represent some characters. For example, á can be represented as a single character, á, or as two characters: a, followed by "put a ´ on top of it." Normalizing a string converts it to one representation or the other. (Which representation you get depends on what you pass as the form parameter.)
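A minimal sketch of the two representations and of normalizing before a membership test. It is written for Python 3 (an assumption; the thread itself uses Python 2), where every str is already Unicode. The helper name contains_char is hypothetical and only for illustration.

```python
import unicodedata

# The same visible letter, stored two different ways.
composed = "\u00e1"      # á as a single code point
decomposed = "a\u0301"   # a followed by a combining acute accent

# The raw strings differ, so a naive comparison fails.
same_before = composed == decomposed

# NFC composes where possible; NFD decomposes.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
same_after_nfd = unicodedata.normalize("NFD", composed) == decomposed


def contains_char(text, char, form="NFC"):
    """Hypothetical helper: normalize both strings to the same form,
    then test for membership."""
    normalized_text = unicodedata.normalize(form, text)
    normalized_char = unicodedata.normalize(form, char)
    return normalized_char in normalized_text


# A sentence whose é is stored in decomposed form (e + combining accent).
sentence = "\u00bfQue\u0301 tipo es el?"
naive_hit = "\u00e9" in sentence
normalized_hit = contains_char(sentence, "\u00e9")

print(same_before, same_after_nfc, same_after_nfd)
print(naive_hit, normalized_hit)
```

Normalizing both the sentence and the character being searched for matters: if only one side is normalized, the two can still end up in different forms.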

Mark Tolonen · Answer 2 · 2012-11-11T19:37:49+0000

I suspect your terminal is using UTF-8, so 'é'.decode('latin-1') decodes the character incorrectly. Just use a Unicode literal instead: u'é'.

To handle Unicode properly in a script, declare the encoding of the script and of the data files, decode incoming data, and encode outgoing data. Use Unicode strings for text within the script.
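The suspected mismatch can be reproduced in a few lines. This is a sketch in Python 3 (an assumption; the question uses Python 2), and it simulates the byte string the terminal would have produced by encoding é as UTF-8 explicitly.

```python
# What a UTF-8 terminal or source file stores for é: two bytes.
utf8_bytes = "\u00e9".encode("utf-8")          # b'\xc3\xa9'

# Decoding those bytes as latin-1 yields two characters (Ã and ©), not é.
mis_decoded = utf8_bytes.decode("latin-1")

# Decoding with the encoding that was actually used round-trips correctly.
well_decoded = utf8_bytes.decode("utf-8")

sentence = "\u00bfQu\u00e9 tipo es el?"

print(len(mis_decoded), mis_decoded in sentence)    # wrong codec: no match
print(len(well_decoded), well_decoded in sentence)  # right codec: match
```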

Example (save script in UTF-8):

    # coding: utf8
    import codecs
    with codecs.open('input.txt', encoding='latin-1') as f:
        sentence = f.readline()
        if u'é' in sentence:
            print u'Found é'

Note that print implicitly encodes its output using the terminal's encoding.
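For readers on Python 3, a rough equivalent might look like the sketch below. It assumes the data file is latin-1, as in the question, and writes a temporary file so that it runs on its own. The helper name first_line_contains is hypothetical.

```python
import os
import tempfile


def first_line_contains(path, char, encoding="latin-1"):
    """Hypothetical helper: decode the file while reading it, then test
    the first line for the given character."""
    with open(path, encoding=encoding) as f:
        return char in f.readline()


# Create a small latin-1 encoded file to stand in for input.txt.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("\u00bfQu\u00e9 tipo es el?\n".encode("latin-1"))
    demo_path = tmp.name

found = first_line_contains(demo_path, "\u00e9")
os.remove(demo_path)

print(found)
```

In Python 3 the built-in open accepts an encoding argument directly, so codecs.open is not needed, and string literals are Unicode without the u prefix.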
