URL open encoding

I have the following code for urllib and BeautifulSoup:

 getSite = urllib.urlopen(pageName)            # open current site
 getSitesoup = BeautifulSoup(getSite.read())   # read the site content
 print getSitesoup.originalEncoding
 for value in getSitesoup.find_all('link'):    # extract all <link> tags
     defLinks.append(value.get('href'))

Result:

 /usr/lib/python2.6/site-packages/bs4/dammit.py:231: UnicodeWarning: Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER. "Some characters could not be decoded, and were " 

And when I try to read the site, I get:

  7 e    0*"I߷ G H    F      9-      ;  E YÞBs         㔶? 4i   )     ^W     `w Ke  %  *9 .'OQB   V  @     ]   (P  ^  q $ S5   tT* Z 

BeautifulSoup works with Unicode internally; it will try to decode the response as UTF-8 by default.
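The warning in your output comes from exactly this situation: when bytes are not valid in the codec being used, Python substitutes U+FFFD (the REPLACEMENT CHARACTER). A minimal stdlib sketch (Python 3 syntax, with made-up sample bytes) of what is happening under the hood:

```python
# UTF-16 bytes for the string "Hi" (with a little-endian BOM) --
# a stand-in for whatever bytes the server actually returned
raw = b'\xff\xfeH\x00i\x00'

# Decoding with the wrong codec and errors='replace' yields U+FFFD
# for every invalid byte, which is what bs4's warning describes
text = raw.decode('utf-8', errors='replace')
assert '\ufffd' in text

# The same bytes decode cleanly once the right codec is used
assert raw.decode('utf-16') == 'Hi'
```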

It looks like the site you are trying to download uses a different encoding; it may be UTF-16, for example:

 >>> print u""" 7 e    0*"I߷ G H    F      9-      ;  E YÞBs         㔶? 4i   )     ^W     `w Ke  %  *9 .'OQB   V  @     ]   (P  ^  q $ S5   tT* Z""".encode('utf-8').decode('utf-16-le') 뿯㞽뿯施뿯붿뿯붿⨰䤢럟뿯䞽뿯䢽뿯붿뿯붿붿뿯붿뿯붿뿯㦽붿뿯붿뿯붿뿯㮽뿯붿붿썙䊞붿뿯붿뿯붿뿯붿뿯붿铣㾶뿯㒽붿뿯붿붿뿯붿뿯붿坞뿯붿뿯붿뿯悽붿敋뿯붿붿뿯⪽붿✮兏붿뿯붿붿뿯䂽뿯붿뿯붿뿯嶽뿯붿뿯⢽붿뿯庽뿯붿붿붿㕓뿯붿뿯璽⩔뿯媽 

It could be mac_cyrillic:

 >>> print u""" 7 e    0*"I߷ G H    F      9-      ;  E YÞBs         㔶? 4i   )     ^W     `w Ke  %  *9 .'OQB   V  @     ]   (P  ^  q $ S5   tT* Z""".encode('utf-8').decode('mac_cyrillic') њљ7њљeњљњљњљњљ0*"IЈњљGњљHњљњљњљњљFњљњљњљњљњљњљ9-њљњљњљњљњљњљ;њљњљEњљY√Bsњљњљњљњљњљњљњљњљњљґ?њљ4iњљњљњљ)њљњљњљњљњљ^Wњљњљњљњљњљ`wњљKeњљњљ%њљњљ*9њљ.'OQBњљњљњљVњљњљ@њљњљњљњљњљ]њљњљњљ(Pњљњљ^њљњљqњљ$њљS5њљњљњљtT*њљZ 

But I have too little information about which site you are loading, and none of the encodings I tried produces readable text. :-)

You will need to decode the response bytes before passing them to BeautifulSoup:

 getSite = urllib.urlopen(pageName).read().decode('utf-16')

Typically, a website declares the encoding it used in the Content-Type response header (e.g. text/html; charset=utf-16 or similar).
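As a rough illustration, the charset parameter can be pulled out of such a header value with plain string handling; the function name below is made up for this sketch:

```python
def charset_from_content_type(content_type, default='utf-8'):
    """Extract the charset parameter from a Content-Type header value.

    Falls back to `default` when no charset is declared.
    A hypothetical helper, not part of urllib or bs4.
    """
    # Parameters follow the media type, separated by semicolons,
    # e.g. "text/html; charset=utf-16"
    for part in content_type.split(';')[1:]:
        key, _, value = part.strip().partition('=')
        if key.lower() == 'charset' and value:
            return value.strip('"\'').lower()
    return default
```

You could then decode the response body with the returned codec name before handing the text to BeautifulSoup.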


The page is in UTF-8, but the server sends it to you in a compressed format:

 >>> print getSite.headers['content-encoding']
 gzip

You will need to decompress the data before running it through Beautiful Soup. I got an error using zlib.decompress() on the data, but writing the data to a file and reading it back with gzip.open() worked fine. The reason is that a gzip stream wraps the raw DEFLATE data in a header and trailer that zlib.decompress() does not expect with its default settings; passing wbits=16 + zlib.MAX_WBITS tells zlib to accept the gzip wrapper.


I ran into the same problem, and as Leonard said, it was due to the compressed format.

This link solved it for me; it says to add ('Accept-Encoding', 'gzip,deflate') to the request headers. For instance:

 opener = urllib2.build_opener()
 opener.addheaders = [('Referer', referer),
                      ('User-Agent', uagent),
                      ('Accept-Encoding', 'gzip,deflate')]
 usock = opener.open(url)
 url = usock.geturl()
 data = decode(usock)
 usock.close()
 return data

Where the decode() function is defined as follows:

 import gzip
 import zlib
 import StringIO

 def decode(page):
     encoding = page.info().get("Content-Encoding")
     if encoding in ('gzip', 'x-gzip', 'deflate'):
         content = page.read()
         if encoding == 'deflate':
             data = StringIO.StringIO(zlib.decompress(content))
         else:
             data = gzip.GzipFile(fileobj=StringIO.StringIO(content))
         page = data.read()
     return page
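For readers on Python 3, where StringIO and urllib2 no longer exist, an equivalent sketch operating directly on the response bytes might look like this (the function name is made up, and the deflate branch assumes a zlib-wrapped stream, which is the common case):

```python
import gzip
import io
import zlib

def decode_body(body, encoding):
    """Decompress an HTTP response body (bytes) according to the
    value of its Content-Encoding header. A Python 3 sketch."""
    if encoding in ('gzip', 'x-gzip'):
        # Wrap the bytes in a file-like object for GzipFile
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    if encoding == 'deflate':
        # Assumes a zlib-wrapped DEFLATE stream; some servers send
        # raw DEFLATE, which would need zlib.decompress(body, -zlib.MAX_WBITS)
        return zlib.decompress(body)
    return body
```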
