Encoding error while deserializing a JSON object from Google

As an exercise, I built a small script that requests the Google Suggest JSON API. The code is pretty simple:

 import json
 import urllib

 query = 'a'
 url = "http://clients1.google.co.jp/complete/search?hl=ja&q=%s&json=t" % query
 response = urllib.urlopen(url)
 result = json.load(response)

This fails with:

 UnicodeDecodeError: 'utf8' codec can't decode byte 0x83 in position 0: invalid start byte

If I call read() on the response object, this is what I get:

 '["a",["amazon","ana","au","apple","adobe","alc","\x83A\x83}\x83]\x83\x93","\x83A\x83\x81\x83u\x83\x8d","\x83A\x83X\x83N\x83\x8b","\x83A\x83\x8b\x83N"],["","","","","","","","","",""]]' 

So the error occurs when Python tries to decode the string. It only happens with google.co.jp and Japanese. I tried the same code with other country/language combinations and did not hit the problem: deserializing the object works fine.

  • I checked the response headers and they always indicate utf-8 as the response encoding.
  • I checked the JSON string with the online parser (http://json.parser.online.fr/) and again everything seems fine.

Any ideas on how to solve this problem? What makes json.load() choke?

Thanks in advance.

2 answers

The response headers ( print response.headers ) contain the following:

 Content-Type: text/javascript; charset=Shift_JIS 

Note the charset: the response is declared as Shift_JIS, not UTF-8.

If you pass this encoding to json.load , it works:

 result = json.load(response, encoding='shift_jis') 
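
The call above relies on Python 2; recent Python 3 versions of json.load no longer accept an encoding argument. The same idea can be sketched there by decoding the bytes yourself before parsing. This is only a minimal sketch under assumptions: loads_with_charset is a hypothetical helper (not part of any library), the sample bytes are a shortened copy of the read() output from the question, and the header value is the one shown above.

```python
import json
from email.message import Message


def loads_with_charset(raw, content_type, default="utf-8"):
    """Decode raw bytes with the charset named in a Content-Type value, then parse JSON."""
    # Hypothetical helper: let the stdlib header parser pull out the charset parameter.
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or default
    return json.loads(raw.decode(charset))


# Shortened copy of the bytes returned by read() in the question.
raw = b'["a",["amazon","\x83A\x83}\x83]\x83\x93"],["",""]]'
result = loads_with_charset(raw, "text/javascript; charset=Shift_JIS")
suggestion = result[1][1]  # the katakana string アマゾン ("Amazon")
```

Decoding before parsing also keeps the JSON parser away from raw Shift_JIS bytes, whose second byte can coincide with ASCII characters such as ] or }.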

Regardless of what the spec says, the string "\x83A\x83}\x83]\x83\x93" is not UTF-8.

My guess is that it is one of ["cp932", "shift_jis", "shift_jis_2004", "shift_jisx0213"]; try decoding it as each of them.
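
Trying the candidates can be sketched as a small loop. This is an illustration in Python 3, not a definitive detector: try_decodings is a hypothetical helper name, and "utf-8" is added to the list only to show it failing on these bytes.

```python
def try_decodings(raw, candidates):
    """Map each candidate encoding name to the decoded text, or None if decoding fails."""
    outcome = {}
    for name in candidates:
        try:
            outcome[name] = raw.decode(name)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding.
            outcome[name] = None
    return outcome


sample = b"\x83A\x83}\x83]\x83\x93"  # one of the suggestions from the question
candidates = ["utf-8", "cp932", "shift_jis", "shift_jis_2004", "shift_jisx0213"]
outcome = try_decodings(sample, candidates)
```

Several candidates can succeed on the same bytes, so a successful decode narrows the choice but does not prove which encoding the server used; the charset in the response headers is the more reliable source.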
