URL open encoding

I have the following code for urllib and BeautifulSoup:

 getSite = urllib.urlopen(pageName)            # open current site
 getSitesoup = BeautifulSoup(getSite.read())   # read the site content
 print getSitesoup.originalEncoding
 for value in getSitesoup.find_all('link'):    # extract all <link> tags
     defLinks.append(value.get('href'))

Result:

 /usr/lib/python2.6/site-packages/bs4/dammit.py:231: UnicodeWarning: Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER. "Some characters could not be decoded, and were " 

And when I try to read the site, I get:

  7 e    0*"I߷ G H    F      9-      ;  E YÞBs         㔶? 4i   )     ^W     `w Ke  %  *9 .'OQB   V  @     ]   (P  ^  q $ S5   tT* Z 

BeautifulSoup works with Unicode internally; it will try to decode the response as UTF-8 by default.
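The warning in your output comes from exactly this situation: when bytes are not valid in the codec being used, Python substitutes U+FFFD (the REPLACEMENT CHARACTER). A minimal stdlib sketch (Python 3 syntax, with made-up sample bytes) of what is happening under the hood:

```python
# UTF-16 bytes for the string "Hi" (with a little-endian BOM) --
# a stand-in for whatever bytes the server actually returned
raw = b'\xff\xfeH\x00i\x00'

# Decoding with the wrong codec and errors='replace' yields U+FFFD
# for every invalid byte, which is what bs4's warning describes
text = raw.decode('utf-8', errors='replace')
assert '\ufffd' in text

# The same bytes decode cleanly once the right codec is used
assert raw.decode('utf-16') == 'Hi'
```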

It looks like the site you are trying to download uses a different encoding; it may be UTF-16, for example:

 >>> print u""" 7 e    0*"I߷ G H    F      9-      ;  E YÞBs         㔶? 4i   )     ^W     `w Ke  %  *9 .'OQB   V  @     ]   (P  ^  q $ S5   tT* Z""".encode('utf-8').decode('utf-16-le') 뿯㞽뿯施뿯붿뿯붿⨰䤢럟뿯䞽뿯䢽뿯붿뿯붿붿뿯붿뿯붿뿯㦽붿뿯붿뿯붿뿯㮽뿯붿붿썙䊞붿뿯붿뿯붿뿯붿뿯붿铣㾶뿯㒽붿뿯붿붿뿯붿뿯붿坞뿯붿뿯붿뿯悽붿敋뿯붿붿뿯⪽붿✮兏붿뿯붿붿뿯䂽뿯붿뿯붿뿯嶽뿯붿뿯⢽붿뿯庽뿯붿붿붿㕓뿯붿뿯璽⩔뿯媽 

It could be mac_cyrillic:

 >>> print u""" 7 e    0*"I߷ G H    F      9-      ;  E YÞBs         㔶? 4i   )     ^W     `w Ke  %  *9 .'OQB   V  @     ]   (P  ^  q $ S5   tT* Z""".encode('utf-8').decode('mac_cyrillic') њљ7њљeњљњљњљњљ0*"IЈњљGњљHњљњљњљњљFњљњљњљњљњљњљ9-њљњљњљњљњљњљ;њљњљEњљY√Bsњљњљњљњљњљњљњљњљњљґ?њљ4iњљњљњљ)њљњљњљњљњљ^Wњљњљњљњљњљ`wњљKeњљњљ%њљњљ*9њљ.'OQBњљњљњљVњљњљ@њљњљњљњљњљ]њљњљњљ(Pњљњљ^њљњљqњљ$њљS5њљњљњљtT*њљZ 

But I have too little information about which site you are loading, and none of the encodings I tried produces readable text. :-)

You will need to decode the response bytes before passing them to BeautifulSoup:

 getSite = urllib.urlopen(pageName).read().decode('utf-16')

Typically, a website declares the encoding it used in the Content-Type response header (e.g. text/html; charset=utf-16 or similar).
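As a rough illustration, the charset parameter can be pulled out of such a header value with plain string handling; the function name below is made up for this sketch:

```python
def charset_from_content_type(content_type, default='utf-8'):
    """Extract the charset parameter from a Content-Type header value.

    Falls back to `default` when no charset is declared.
    A hypothetical helper, not part of urllib or bs4.
    """
    # Parameters follow the media type, separated by semicolons,
    # e.g. "text/html; charset=utf-16"
    for part in content_type.split(';')[1:]:
        key, _, value = part.strip().partition('=')
        if key.lower() == 'charset' and value:
            return value.strip('"\'').lower()
    return default
```

You could then decode the response body with the returned codec name before handing the text to BeautifulSoup.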


The page is in UTF-8, but the server sends it to you in a compressed format:

 >>> print getSite.headers['content-encoding']
 gzip

You will need to decompress the data before running it through Beautiful Soup. I got an error using zlib.decompress() on the data, but writing the data to a file and reading it back with gzip.open() worked fine. The reason is that a gzip stream wraps the raw DEFLATE data in a header and trailer that zlib.decompress() does not expect with its default settings; passing wbits=16 + zlib.MAX_WBITS tells zlib to accept the gzip wrapper.


I ran into the same problem, and as Leonard said, it was due to the compressed format.

This link solved it for me; it says to add ('Accept-Encoding', 'gzip,deflate') to the request headers. For instance:

 opener = urllib2.build_opener()
 opener.addheaders = [('Referer', referer),
                      ('User-Agent', uagent),
                      ('Accept-Encoding', 'gzip,deflate')]
 usock = opener.open(url)
 url = usock.geturl()
 data = decode(usock)
 usock.close()
 return data

Where the decode() function is defined as follows:

 import gzip
 import zlib
 import StringIO

 def decode(page):
     encoding = page.info().get("Content-Encoding")
     if encoding in ('gzip', 'x-gzip', 'deflate'):
         content = page.read()
         if encoding == 'deflate':
             data = StringIO.StringIO(zlib.decompress(content))
         else:
             data = gzip.GzipFile(fileobj=StringIO.StringIO(content))
         page = data.read()
     return page
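For readers on Python 3, where StringIO and urllib2 no longer exist, an equivalent sketch operating directly on the response bytes might look like this (the function name is made up, and the deflate branch assumes a zlib-wrapped stream, which is the common case):

```python
import gzip
import io
import zlib

def decode_body(body, encoding):
    """Decompress an HTTP response body (bytes) according to the
    value of its Content-Encoding header. A Python 3 sketch."""
    if encoding in ('gzip', 'x-gzip'):
        # Wrap the bytes in a file-like object for GzipFile
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    if encoding == 'deflate':
        # Assumes a zlib-wrapped DEFLATE stream; some servers send
        # raw DEFLATE, which would need zlib.decompress(body, -zlib.MAX_WBITS)
        return zlib.decompress(body)
    return body
```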
