UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-6: invalid data

How does Unicode work in Python 2? I just do not understand it.

Here I download data from the server and parse it as JSON.

Traceback (most recent call last):
  File "/usr/local/lib/python2.6/dist-packages/eventlet-0.9.12-py2.6.egg/eventlet/hubs/poll.py", line 92, in wait
    readers.get(fileno, noop).cb(fileno)
  File "/usr/local/lib/python2.6/dist-packages/eventlet-0.9.12-py2.6.egg/eventlet/greenthread.py", line 202, in main
    result = function(*args, **kwargs)
  File "android_suggest.py", line 60, in fetch
    suggestions = suggest(chars)
  File "android_suggest.py", line 28, in suggest
    return [i['s'] for i in json.loads(opener.open('https://market.android.com/suggest/SuggRequest?json=1&query='+s+'&hl=de&gl=DE').read())]
  File "/usr/lib/python2.6/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.6/json/decoder.py", line 319, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.6/json/decoder.py", line 336, in raw_decode
    obj, end = self._scanner.iterscan(s, **kw).next()
  File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
    rval, next_pos = action(m, context)
  File "/usr/lib/python2.6/json/decoder.py", line 217, in JSONArray
    value, end = iterscan(s, idx=end, context=context).next()
  File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
    rval, next_pos = action(m, context)
  File "/usr/lib/python2.6/json/decoder.py", line 183, in JSONObject
    value, end = iterscan(s, idx=end, context=context).next()
  File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
    rval, next_pos = action(m, context)
  File "/usr/lib/python2.6/json/decoder.py", line 155, in JSONString
    return scanstring(match.string, match.end(), encoding, strict)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-6: invalid data

Thank you!

EDIT: the following string throws the error: '[{"t":"q","s":"abh\xf6ren"}]'. \xf6 must be decoded to ö (abhören).

+50
python unicode
May 30 '11 at 20:28
8 answers

The string you are trying to parse as JSON is not encoded in UTF-8. Most likely, it is encoded in ISO-8859-1. Try the following:

 json.loads(unicode(opener.open(...), "ISO-8859-1")) 

This will handle any umlauts that end up in the JSON message.
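For illustration, a minimal sketch using the exact payload from the question's EDIT (the variable names are mine):

 import json

 # The byte \xf6 is a bare Latin-1 "ö"; it is not valid UTF-8, which is why
 # json.loads() fails when it assumes UTF-8.
 raw = '[{"t":"q","s":"abh\xf6ren"}]'           # byte string as received
 # raw.decode("utf-8")                          # would raise UnicodeDecodeError
 data = json.loads(unicode(raw, "ISO-8859-1"))
 print repr(data[0]['s'])                       # u'abh\xf6ren', i.e. "abhören"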

You must read Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). Hopefully that clarifies some of the issues that come up around Unicode.

+82
May 31 '11 at 16:16

My solution is a bit ridiculous. I never thought it could be as simple as saving with the right codec. I am using Notepad++ (v5.6.8). I keep the entire localized dictionary in a separate file, and I had not noticed that I originally saved it with the ANSI codec. The fix was under the "Encoding" menu in Notepad++: I selected "Encode in UTF-8 without BOM" and saved the file. It works brilliantly.

+6
Jun 11 '18

The error you see means that the data you receive from the remote end is not valid JSON. JSON (per the spec) is basically UTF-8, but it can also be UTF-16 or UTF-32 (in either big-endian or little-endian byte order). The exact error you see means some of the data was not valid UTF-8 (and not UTF-16 or UTF-32 either, as those would produce different errors).

Perhaps you should examine the actual response you get from the far end instead of blindly passing the data to json.loads(). Right now you are reading all the data from the response into a string and assuming it is JSON. Instead, check the Content-Type of the response. Make sure the webserver actually claims to be giving you JSON, and not, for example, an error message that is not JSON.

(Also, after checking the response, use json.load() instead of json.loads(): pass it the file-like object returned by opener.open(), rather than reading all the data into a string and passing that string to json.loads().)
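A rough sketch of that approach; the URL, query value, and header handling here are illustrative, not taken from the original script:

 import json
 import urllib2

 response = urllib2.urlopen('https://market.android.com/suggest/SuggRequest?json=1&query=abh&hl=de&gl=DE')

 # Inspect what the server claims to be sending before assuming JSON/UTF-8.
 content_type = response.info().gettype()        # e.g. 'application/json'
 charset = response.info().getparam('charset')   # e.g. 'ISO-8859-1', or None
 print content_type, charset

 if content_type == 'application/json':
     data = json.load(response)   # parse straight from the file-like object
 else:
     print "Unexpected response:", response.read()[:200]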

+4
May 30 '11 at 21:07

The solution of changing the encoding to Latin-1 / ISO-8859-1 fixes a problem I observed with html2text.py when it is called on tex4ht output. I use this combination for automated word counts on LaTeX documents: tex4ht converts them to HTML, and then html2text.py strips them down to plain text for counting via wc -w. Whenever, for example, a German umlaut came in through a bibliography database entry, this process would fail because html2text.py would complain with, for example:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 32243-32245: invalid data

These errors would be especially difficult to track down later, and after all you do want to keep the umlauts in your references section. A simple change inside html2text.py from

 data = data.decode(encoding)

to

data = data.decode ("ISO-8859-1")

solves this problem; if you call the script with the HTML file as the first parameter, you can also pass the encoding as a second parameter and spare yourself that modification.
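A hedged standalone sketch of the same idea, in case you would rather not touch html2text.py at all: try UTF-8 first and fall back to ISO-8859-1 so umlauts from the bibliography do not abort the word count. The file handling and the choice of encodings are illustrative:

 import sys

 def read_text(path, fallback="ISO-8859-1"):
     data = open(path, "rb").read()
     try:
         return data.decode("utf-8")
     except UnicodeDecodeError:
         return data.decode(fallback)

 if __name__ == "__main__":
     text = read_text(sys.argv[1])
     print len(text.split())   # rough word count, similar to wc -w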

+3
Aug 21 '13 at 23:16

Just in case someone runs into the same problem: I use vim with YouCompleteMe, and ycmd failed to start with this error message. What I did: export LC_CTYPE="en_US.UTF-8", and the problem was gone.

+1
Apr 10 '14 at 11:30

Paste this into your command line:

 export LC_CTYPE="en_US.UTF-8" 
+1
Jun 04 '16 at 19:13

unicode(urllib2.urlopen(url).read(), 'utf8') - this should work if the response really is UTF-8.

urlopen().read() returns bytes, and you should decode them into unicode strings. It would also be useful to look at the patch from http://bugs.python.org/issue4733
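A minimal sketch of that decode-then-parse pattern, tying it back to the question; the URL is a placeholder, and the encoding must match what the server actually sends:

 import json
 import urllib2

 url = 'https://example.com/suggest.json'              # placeholder URL
 body = unicode(urllib2.urlopen(url).read(), 'utf-8')  # only if the server really sends UTF-8
 suggestions = json.loads(body)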

0
May 30 '11 at 20:30

In your android_suggest.py, unpack that monstrous single-line return statement into one step at a time. Log repr(string_passed_to_json.loads) somewhere so it can be inspected after the exception occurs. Eyeball the results. If the problem is not obvious, edit your question to show the repr.
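One way the one-liner could be unpacked, assuming opener is defined as in the original script; the intermediate names and the print are illustrative:

 import json

 def suggest(chars):
     url = ('https://market.android.com/suggest/SuggRequest?json=1&query='
            + chars + '&hl=de&gl=DE')
     raw = opener.open(url).read()
     print repr(raw)                 # or log it somewhere; inspect after a crash
     parsed = json.loads(raw)        # this is the call that currently fails
     return [item['s'] for item in parsed]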

0
May 30 '11 at 21:30


