How to search and replace UTF-8 special characters in Python?

I'm a Python beginner and I have a problem with utf-8.

I have a utf-8 string and I would like to replace all German umlauts with ASCII (in German, u-umlaut 'ü' can be rewritten as 'ue').

The u-umlaut has the Unicode code point 252, so I tried this:

 >>> str = unichr(252) + 'ber'
 >>> print repr(str)
 u'\xfcber'
 >>> print repr(str).replace(unichr(252), 'ue')
 u'\xfcber'

I expected the last line to be u'ueber'.

Ultimately, I want to replace all u-umlauts in the file with 'ue':

 import sys
 import codecs
 f = codecs.open(sys.argv[1], encoding='utf-8')
 for line in f:
     print repr(line).replace(unichr(252), 'ue')

Thanks for your help! (I am using Python 2.3.)

+4
3 answers

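repr(str) returns the quoted representation of str: the text you could type into Python as a literal to get the string back. The result is therefore a string that literally contains the characters \xfcber (a backslash, an x, an f, a c, and so on), not a string containing über, so there is no ü in it for replace to find.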
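To make the difference concrete, here is a minimal sketch written for Python 3, which is an assumption, since the question uses Python 2.3. In Python 3, chr() takes the place of unichr(), and ascii() is used here as a stand-in for what repr() does to a unicode string in Python 2 (Python 2 additionally prints a u prefix). The variable names are made up for illustration.

```python
# Python 3 sketch. ascii() stands in for Python 2's repr() on a unicode
# string: both escape non-ASCII characters instead of keeping them.
word = chr(252) + "ber"      # the four-character string with a real u-umlaut
quoted = ascii(word)         # nine characters: quote, backslash, x, f, c, b, e, r, quote

# The quoted text contains a backslash, 'x', 'f', 'c', but no u-umlaut,
# so replace() finds nothing to substitute and returns it unchanged.
unchanged = quoted.replace(chr(252), "ue")

# Replacing on the real string works as expected.
fixed = word.replace(chr(252), "ue")

print(quoted)
print(unchanged)
print(fixed)
```

The first two printed lines are identical, which is the behaviour the question ran into; only the last one contains the replacement.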

You can simply call str.replace(unichr(252), 'ue') on the string itself to replace ü with ue.

If you need the quoted version of the result (although I doubt that you do), you can wrap the whole expression in repr:

 repr(str.replace(unichr(252), 'ue')) 
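Applied to the file loop from the question, the same fix might look like the sketch below. It is written for Python 3 and uses io.open, which needs Python 2.6 or later; on Python 2.3 the question's codecs.open call plays the same role. The helper name replace_u_umlaut and the throwaway demo file are assumptions made for illustration only.

```python
import io
import os
import tempfile


def replace_u_umlaut(path):
    """Return the lines of a UTF-8 file with every u-umlaut replaced by 'ue'."""
    with io.open(path, encoding="utf-8") as f:
        # Replace on the decoded line itself, not on repr(line).
        return [line.replace(u"\xfc", u"ue") for line in f]


# Demo on a temporary file; the contents are made up for this example.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(demo_path, "w", encoding="utf-8") as f:
    f.write(u"\xfcber\nM\xfcnchen\n")

demo_lines = replace_u_umlaut(demo_path)
os.remove(demo_path)
print(demo_lines)
```

In a real script the path would come from sys.argv[1], as in the question.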
+8

I think it is simpler and more readable to write the Unicode character 'ü' directly instead of using unichr(252).

 >>> s = u'über'
 >>> s.replace(u'ü', 'ue')
 u'ueber'

There is no need to use the repr function: it produces the "Python view" of the string (the escaped literal), whereas you just want the readable string itself.

You will also need to include the following line at the beginning of the .py file, if it is not already present, to declare the file encoding:

 #-*- coding: UTF-8 -*- 

Added: Of course, the declared encoding must be the same as the actual encoding of the file. Please check this, because it can cause problems (for example, I had trouble with Eclipse on Windows, since it writes files as cp1252 by default). It should also match the system encoding, which can be UTF-8, Latin-1, or something else.
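To illustrate why a mismatch between the declared and the actual encoding causes trouble, here is a small Python 3 sketch (an assumption, since the thread uses Python 2) that encodes ü both ways and then decodes each result with the wrong codec. The variable names are made up for illustration.

```python
text = u"\xfc"                          # the u-umlaut
utf8_bytes = text.encode("utf-8")       # two bytes: 0xC3 0xBC
cp1252_bytes = text.encode("cp1252")    # one byte: 0xFC

# UTF-8 bytes read as cp1252 silently turn into two wrong characters.
misread = utf8_bytes.decode("cp1252")

# cp1252 bytes read as UTF-8 are not valid UTF-8 and raise an error.
try:
    cp1252_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(len(utf8_bytes), len(cp1252_bytes), len(misread), decode_failed)
```

Either way, a search for ü in the misread text would not find what the author typed.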


Also, do not use str as a variable name: it is the name of a built-in Python type, and reusing it hides that type. You may run into problems later.

(I tested this with Python 2.6; I think the result is the same in Python 2.3.)
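A short sketch of the kind of problem that shadowing can cause, written for Python 3 as an assumption; the function name shadowing_demo is hypothetical. The shadowing is kept inside a function so the built-in stays intact elsewhere.

```python
def shadowing_demo():
    str = u"\xfcber"      # this local name now hides the built-in str type
    try:
        str(42)           # tries to call the string, not the type
    except TypeError:
        return "TypeError"
    return "no error"


result = shadowing_demo()
print(result)
```

Picking a different name, such as s or word, avoids the issue.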

+7

You can sidestep the encoding of the source file and its problems entirely. Use Unicode character names (\N{...} escapes); then it is completely obvious what you are doing, and the code can be read and modified anywhere.

I don't know of a single language in which the only accented Latin letter is lowercase u with umlaut (also known as diaeresis), so I added code that loops over a translation table, on the assumption that you will need it.

 # coding: ascii
 translations = (
     (u'\N{LATIN SMALL LETTER U WITH DIAERESIS}', u'ue'),
     (u'\N{LATIN SMALL LETTER O WITH DIAERESIS}', u'oe'),
     # et cetera
 )
 test = u'M\N{LATIN SMALL LETTER O WITH DIAERESIS}ller von M\N{LATIN SMALL LETTER U WITH DIAERESIS}nchen'
 out = test
 for from_str, to_str in translations:
     out = out.replace(from_str, to_str)
 print out

output:

 Moeller von Muenchen 
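As a possible variation on the loop above (a different technique, not what this answer uses), the table can be keyed by code point and passed to the string's translate method, which makes all substitutions in one pass. This is a sketch for Python 3; the names UMLAUT_TABLE and transliterate are hypothetical, and the extra entries (capital letters and sharp s) are my assumption about what "et cetera" might cover.

```python
# Hypothetical table: code point of each character -> ASCII replacement.
UMLAUT_TABLE = {
    ord(u"\N{LATIN SMALL LETTER A WITH DIAERESIS}"): u"ae",
    ord(u"\N{LATIN SMALL LETTER O WITH DIAERESIS}"): u"oe",
    ord(u"\N{LATIN SMALL LETTER U WITH DIAERESIS}"): u"ue",
    ord(u"\N{LATIN CAPITAL LETTER A WITH DIAERESIS}"): u"Ae",
    ord(u"\N{LATIN CAPITAL LETTER O WITH DIAERESIS}"): u"Oe",
    ord(u"\N{LATIN CAPITAL LETTER U WITH DIAERESIS}"): u"Ue",
    ord(u"\N{LATIN SMALL LETTER SHARP S}"): u"ss",
}


def transliterate(text):
    """Replace each character listed in UMLAUT_TABLE with its ASCII spelling."""
    return text.translate(UMLAUT_TABLE)


sample = (u"M\N{LATIN SMALL LETTER O WITH DIAERESIS}ller von "
          u"M\N{LATIN SMALL LETTER U WITH DIAERESIS}nchen")
result = transliterate(sample)
print(result)
```

The explicit loop with replace remains the simpler choice when a replacement source is longer than one character, since translate works on single characters only.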
+5