urllib2 response does not match the declared encoding

Question


When I open the URL and read the response, the content is unreadable. The Content-Type header says it is encoded as utf-8, so I tried converting it to unicode with unicode(), and it raised UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128).

.encode("utf-8") raises UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128)

.decode("utf-8") raises UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte.

I tried everything I could think of (I'm not so good at encodings)

I would be glad if I could make it work. Thanks.

+4

python utf-8 character-encoding urllib2

thabubble Feb 25 '12 at 16:07


2 answers

The header may be incorrect. Check out chardet.

EDIT: Thinking more about this, my money is on the content being gzipped. I believe some of Python's various URL modules and classes decompress it automatically, while others do not.
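One quick way to test the gzip guess is to look at the first two bytes of the body: every gzip stream starts with the magic number 0x1f 0x8b, which would explain why the errors above complain about byte 0x8b at position 1. The following is a minimal sketch in Python 3; looks_gzipped and the sample data are hypothetical names made up for illustration, not part of urllib2.

```python
import gzip

# First two bytes of every gzip stream (the gzip "magic number").
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(raw):
    """Return True if the raw body starts with the gzip magic number.

    Hypothetical helper for illustration only.
    """
    return raw[:2] == GZIP_MAGIC


# Build a small compressed sample instead of fetching a real URL.
sample = gzip.compress("some utf-8 text".encode("utf-8"))

print(looks_gzipped(sample))         # compressed body
print(looks_gzipped(b"plain text"))  # uncompressed body
```

If the check is true for the bytes returned by read(), the body has to be decompressed before any decoding is attempted.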

0

Ben Feb 25 '12 at 16:18


Vanuan · Accepted Answer · 2012-11-20T23:25:00+0000

This is a common mistake. The server sends a gzipped stream.

You must unpack it first:

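    import gzip
    import StringIO

    response = opener.open(self.__url, data)
    if response.info().get('Content-Encoding') == 'gzip':
        # The body is gzip-compressed: wrap the raw bytes in a file-like
        # object and decompress them before doing any decoding.
        buf = StringIO.StringIO(response.read())
        gzip_f = gzip.GzipFile(fileobj=buf)
        content = gzip_f.read()
    else:
        content = response.read()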
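The snippet above is Python 2. On Python 3, urllib2 became urllib.request and the in-memory buffer for bytes is io.BytesIO, so the same idea might look roughly like the sketch below. read_body and its parameters are hypothetical names chosen for illustration, and the charset default of utf-8 is an assumption taken from the question.

```python
import gzip
import io


def read_body(raw, content_encoding=None, charset="utf-8"):
    """Decompress a raw response body if it is gzipped, then decode it.

    Hypothetical helper: `raw` is the bytes from response.read() and
    `content_encoding` is the value of the Content-Encoding header.
    """
    if content_encoding == "gzip":
        # Wrap the bytes in a file-like object so GzipFile can read them.
        with gzip.GzipFile(fileobj=io.BytesIO(raw)) as gz:
            raw = gz.read()
    return raw.decode(charset)


# Simulate a gzipped utf-8 response body instead of hitting the network.
compressed = gzip.compress("naïve café".encode("utf-8"))
text = read_body(compressed, content_encoding="gzip")
print(text)
```

With a real response from urllib.request, raw would come from response.read() and content_encoding from response.headers.get("Content-Encoding").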
