I have a file with sentences, some of which are in Spanish and contain accented letters (like é) or special characters (like ¿). I need to search each sentence for these characters so that I can determine whether the sentence is in Spanish or English.
I have tried my best to get this working, but with no luck. Below is one of the approaches I tried, which clearly gives the wrong answer.
    sentence = '¿Qué tipo es el?'  # in str format, received from the standard open() file method
    sentence = sentence.decode('latin-1')
    print 'é'.decode('latin-1') in sentence
    >>> False
I also tried using codecs.open(.., .., 'latin-1') to read in the file, but that did not help. Then I tried u'é'.encode('latin-1'), and that did not work either.
I am out of ideas. Any suggestions?
@icktoofay provided a solution. I ended up keeping the file decoding (using latin-1) but then used Python Unicode literals for the characters (u'é'). This required me to declare the source file encoding at the top of the script. The final step was to use the unicodedata.normalize method to normalize both strings and then compare them. Thanks, everyone, for the tips and the great support.
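For anyone landing here later, this is a minimal sketch of the approach described above, not the exact code I ended up with. It is written for Python 3, where every str is already Unicode; in Python 2 the literals would need the u'' prefix and the script would need a coding declaration on its first line. The names SPANISH_MARKERS, normalize and looks_spanish are made up for illustration, and the choice of the NFC form is an assumption (any form works as long as both strings use the same one).

```python
import unicodedata

# Hypothetical marker set: characters that appear in Spanish but not in
# plain English text. Escapes are used so the script does not depend on
# how the source file itself is encoded.
SPANISH_MARKERS = "\u00e1\u00e9\u00ed\u00f3\u00fa\u00fc\u00f1\u00bf\u00a1"  # áéíóúüñ¿¡


def normalize(text):
    """Return text in NFC form, so 'e' + combining accent becomes one code point."""
    return unicodedata.normalize("NFC", text)


def looks_spanish(sentence):
    """Return True if the sentence contains any of the marker characters."""
    normalized = normalize(sentence).lower()
    return any(normalize(ch) in normalized for ch in SPANISH_MARKERS)


# Bytes as they would come from a latin-1 encoded file, decoded to text.
raw = "\u00bfQu\u00e9 tipo es el?".encode("latin-1")
sentence = raw.decode("latin-1")
print(looks_spanish(sentence))

# The same word with a decomposed accent: 'e' followed by U+0301.
decomposed = "Que\u0301 tipo es el?"
print("\u00e9" in decomposed)       # plain comparison misses the accent
print(looks_spanish(decomposed))    # normalizing both sides finds it
```

The point of normalizing both sides is that é can be stored either as a single code point (U+00E9) or as e plus a combining accent (U+0301). The two look identical on screen, but a plain `in` test treats them as different strings.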
python string unicode
user1411331