Python Encoding / Decoding Issues

Question

How do I decode strings like weren\xe2\x80\x99t back to readable text, so that the word really is weren’t and not "weren\xe2\x80\x99t"? For instance:

    print "\xe2\x80\x9cThings"
    string = "\xe2\x80\x9cThings"
    print string.decode('utf-8')
    print string.encode('ascii', 'ignore')

Output:

    â€œThings
    “Things
    Things

But I actually want to get "Things, with a plain ASCII double quote.

or

    print "weren\xe2\x80\x99t"
    string = "weren\xe2\x80\x99t"
    print string.decode('utf-8')
    print string.encode('ascii', 'ignore')

Output:

    werenâ€™t
    weren’t
    werent

But I actually want to get weren't, with a plain ASCII apostrophe.

How can I do it?
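For reference, the three outputs shown in the question can be reproduced with the short sketch below. It is written for Python 3 (the question uses Python 2.7), and it assumes that the garbled first line comes from a console that reads the UTF-8 bytes as Windows-1252; the question does not state the console encoding, so that part is a guess.

```python
# Python 3 sketch; `raw` is a hypothetical name for the bytes in the question.
raw = b"weren\xe2\x80\x99t"

# Assumption: the console decodes the UTF-8 bytes as Windows-1252,
# which turns the three bytes of U+2019 into three separate characters.
mojibake = raw.decode("cp1252")

# Decoding with the correct codec recovers the curly apostrophe U+2019.
text = raw.decode("utf-8")

# Ignoring non-ASCII characters drops the apostrophe completely.
stripped = text.encode("ascii", "ignore").decode("ascii")

# ascii() is used so the output is safe on any console encoding.
print(ascii(mojibake))  # 'weren\xe2\u20ac\u2122t'
print(ascii(text))      # 'weren\u2019t'
print(stripped)         # werent
```

None of the three gives the plain ASCII apostrophe, which is why an explicit character mapping is needed.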

+5

python python-2.7 encoding ascii non-ascii-characters

Brana Jan 17 '15 at 5:24

2 answers

You should provide a translation table that maps Unicode characters to other Unicode characters (the targets must be within the ASCII range if you then want to encode the result as ASCII):

    # Map the right single quotation mark (U+2019) to an ASCII apostrophe.
    uni2ascii = {ord('\xe2\x80\x99'.decode('utf-8')): ord("'")}
    # Strings are immutable, so keep the result of the conversion.
    yourstring = yourstring.decode('utf-8').translate(uni2ascii).encode('ascii')
    print(yourstring)  # prints: weren't
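The snippet above targets Python 2, where a plain string literal is a byte string. A minimal Python 3 sketch of the same idea follows; it assumes the input arrives as UTF-8 bytes, and the names `raw`, `text` and `ascii_bytes` are made up for illustration.

```python
# Python 3 sketch of the same translation-table approach.
raw = b"weren\xe2\x80\x99t"  # hypothetical input: UTF-8 encoded bytes

# str.maketrans builds a table keyed by code point; here the right single
# quotation mark (U+2019) is mapped to an ASCII apostrophe.
uni2ascii = str.maketrans({"\u2019": "'"})

# Decode first, then translate the text, then encode to ASCII.
text = raw.decode("utf-8").translate(uni2ascii)
ascii_bytes = text.encode("ascii")  # raises UnicodeEncodeError if anything non-ASCII remains

print(text)  # weren't
```

Encoding with the default strict error handling is a deliberate choice here: any character missing from the table fails loudly instead of being dropped silently.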

+1

Oliver W. Jan 17 '15 at 12:58

Brana · Accepted Answer · 2015-01-18T00:47:13+0000

I have mapped the most common odd characters, so this is a fairly complete answer based on Oliver W.'s answer.

This function is by no means perfect, but it is a good place to start. More character definitions can be found here:

http://utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=string
http://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&number=128&names=-&utf8=string-literal

...

    # Python 2: `text` is a UTF-8 encoded byte string.
    def unicodetoascii(text):
        uni2ascii = {
            # Quotation marks
            ord('\xe2\x80\x99'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\x9c'.decode('utf-8')): ord('"'),
            ord('\xe2\x80\x9d'.decode('utf-8')): ord('"'),
            ord('\xe2\x80\x9e'.decode('utf-8')): ord('"'),
            ord('\xe2\x80\x9f'.decode('utf-8')): ord('"'),
            ord('\xe2\x80\x98'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\x9b'.decode('utf-8')): ord("'"),
            # Accented letter
            ord('\xc3\xa9'.decode('utf-8')): ord('e'),
            # Hyphens and dashes
            ord('\xe2\x80\x93'.decode('utf-8')): ord('-'),
            ord('\xe2\x80\x92'.decode('utf-8')): ord('-'),
            ord('\xe2\x80\x94'.decode('utf-8')): ord('-'),
            ord('\xe2\x80\x90'.decode('utf-8')): ord('-'),
            ord('\xe2\x80\x91'.decode('utf-8')): ord('-'),
            # Primes
            ord('\xe2\x80\xb2'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb3'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb4'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb5'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb6'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb7'.decode('utf-8')): ord("'"),
            # Superscript operators and parentheses
            ord('\xe2\x81\xba'.decode('utf-8')): ord("+"),
            ord('\xe2\x81\xbb'.decode('utf-8')): ord("-"),
            ord('\xe2\x81\xbc'.decode('utf-8')): ord("="),
            ord('\xe2\x81\xbd'.decode('utf-8')): ord("("),
            ord('\xe2\x81\xbe'.decode('utf-8')): ord(")"),
        }
        # Decode to unicode, replace the mapped characters, then encode as ASCII.
        return text.decode('utf-8').translate(uni2ascii).encode('ascii')

    print unicodetoascii("weren\xe2\x80\x99t")
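One limitation of the function above is that any character missing from the table makes the final ASCII encoding step raise an error. The sketch below is a Python 3 variant with a deliberately shortened table and an optional fallback for unmapped characters. The names `UNI2ASCII` and `unicode_to_ascii` and the `errors` parameter are assumptions made for illustration, not part of the original answer.

```python
# Python 3 sketch with a shortened, illustrative mapping table.
UNI2ASCII = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single quotation marks
    "\u201c": '"', "\u201d": '"',   # double quotation marks
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u00e9": "e",                  # e with acute accent
})


def unicode_to_ascii(raw, errors="replace"):
    """Decode UTF-8 bytes and return an ASCII-only str.

    Characters not covered by UNI2ASCII are handled according to `errors`:
    "replace" substitutes "?", "ignore" drops them, "strict" raises.
    """
    text = raw.decode("utf-8").translate(UNI2ASCII)
    return text.encode("ascii", errors).decode("ascii")


print(unicode_to_ascii(b"weren\xe2\x80\x99t"))
print(unicode_to_ascii(b"\xe2\x80\x9cThings\xe2\x80\x9d"))
# The euro sign is not in the table, so it falls back to "?".
print(unicode_to_ascii(b"caf\xc3\xa9 \xe2\x82\xac5"))
```

Passing `errors="strict"` restores the behaviour of the original function, which can be useful while the table is still being filled in.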
