How to remove accents in Python 3.5 and get a string, with unicodedata or another solution?

I am trying to get a string for use with Google geocoding. I checked a lot of threads, but I still run into a problem and I don't understand how to solve it.

I need addresse1 to be a string without any special characters. Addresse1 is, for example, "32 rue d'Athènes Paris France".

    addresse1 = collect.replace(' ', '+').replace('\n', '')
    addresse1 = unicodedata.normalize('NFKD', addresse1).encode('utf-8', 'ignore')

Here I got a string without any accents... Oh no, this is not a string, it is bytes. So I did what was suggested and decoded it:

 addresse1=addresse1.decode('utf-8') 

But then addresse1 is exactly the same as at the beginning... What should I do? What am I doing wrong? What don't I understand about Unicode? Or is there a better solution?
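To show concretely what I mean, here is a minimal reproduction of what I am seeing (with a literal string in place of my collect variable):

    import unicodedata

    addresse1 = "32 rue d'Athènes Paris France"
    encoded = unicodedata.normalize('NFKD', addresse1).encode('utf-8', 'ignore')
    print(encoded)                  # bytes, and the accent is still there (as e + a combining mark)
    print(encoded.decode('utf-8'))  # 32 rue d'Athènes Paris France -- back where I started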

Thanks,

Stéphane.

3 answers

With the third-party package unidecode:

    >>> unidecode.unidecode("32 rue d'Athènes Paris France")
    "32 rue d'Athenes Paris France"

    addresse1 = unicodedata.normalize('NFKD', addresse1).encode('utf-8', 'ignore')

You probably meant .encode('ascii', 'ignore'), to remove non-ASCII characters. UTF-8 can represent every character, so encoding to it doesn't eliminate anything, and an encode/decode round trip through it is a no-op.
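For example, a quick sketch of the difference after NFKD normalization:

    import unicodedata

    s = unicodedata.normalize('NFKD', "32 rue d'Athènes Paris France")
    s.encode('utf-8', 'ignore').decode('utf-8')    # accents survive the round trip
    s.encode('ascii', 'ignore').decode('ascii')    # "32 rue d'Athenes Paris France"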

is there a better solution?

It depends on what you are trying to do.

If you want to remove diacritical marks but keep all other non-ASCII characters, you can check unicodedata.category for each character after NFKD normalization and drop those in category M (combining marks), as sketched below.
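A minimal sketch of that approach:

    import unicodedata

    def strip_marks(s):
        # Decompose accented characters, then drop the combining marks (category 'M*')
        decomposed = unicodedata.normalize('NFKD', s)
        return ''.join(c for c in decomposed if not unicodedata.category(c).startswith('M'))

    strip_marks("32 rue d'Athènes Paris France")   # "32 rue d'Athenes Paris France"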

If you want to transliterate to ASCII, that becomes a language-specific question requiring custom replacements (for example, ö becomes oe in German, but not in Swedish).

If you just want to force a string into ASCII because non-ASCII characters in it make some code break, it is of course much better to fix that code to work properly with all Unicode characters than to mangle good data. The letter è is not encodable in ASCII, but neither are 99.9989% of all characters, so that hardly makes it "special". Code that only supports ASCII is lame.

The Google geocoding API works fine with Unicode, so there is no obvious reason you would need to do any of this.

ETA:

    url2 = 'maps.googleapis.com/maps/api/geocode/json?address=' + addresse1
    ...

Also, you need to URL-encode any data you put into a URL. This is not just about Unicode: the above will also break for many ASCII punctuation characters. Use urllib.quote to encode a single string or urllib.urlencode to convert multiple parameters:

    params = dict(address=addresse1.encode('utf-8'), key=googlekey)
    url2 = '...?' + urllib.urlencode(params)

(In Python 3, these are urllib.parse.quote and urllib.parse.urlencode, and they handle UTF-8 automatically, so you don't need to encode manually.)
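For example, in Python 3 (reusing the googlekey variable from above; the full endpoint URL is assumed here):

    from urllib.parse import urlencode

    params = {'address': "32 rue d'Athènes Paris France", 'key': googlekey}
    url2 = 'https://maps.googleapis.com/maps/api/geocode/json?' + urlencode(params)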

    data2 = urllib.request.urlopen(url2).read().decode('utf-8')
    data3 = json.loads(data2)

json.loads accepts byte strings, so you should be safe omitting the UTF-8 decode. In any case, json.load will read directly from the file-like object, so you don't need to load the data into a string at all:

 data3 = json.load(urllib.request.urlopen(url2)) 

You can use Python's translate() method. Here's an example copied from tutorialspoint.com:

    #!/usr/bin/python
    from string import maketrans   # Required to call maketrans function.

    intab = "aeiou"
    outtab = "12345"
    trantab = maketrans(intab, outtab)

    str = "this is string example....wow!!!"
    print str.translate(trantab)

Its output is:

th3s 3s str3ng 2x1mpl2....w4w!!!

This way you can specify which characters you want to replace more easily than with replace().
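Note that the example above is Python 2. In Python 3 there is no string.maketrans; maketrans is a static method on str, so the equivalent would be:

    intab = "aeiou"
    outtab = "12345"
    trantab = str.maketrans(intab, outtab)

    s = "this is string example....wow!!!"
    print(s.translate(trantab))   # th3s 3s str3ng 2x1mpl2....w4w!!!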

