I am trying to identify and save all the headers on a particular site, and I keep receiving what I believe is an encoding error.
Website: http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm
Current code:

import urllib
from bs4 import BeautifulSoup

holder = {}
url = urllib.urlopen('http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm').read()
soup = BeautifulSoup(url, 'lxml')
head1 = soup.find_all(['h1', 'h2', 'h3'])
print head1
holder["key"] = head1
Print Output:
[<h3>\u73af\u5883\u6c61\u67d3\u6700\u5c0f\u5316 \u8d44\u6e90\u5229\u7528\u6700\u5927\u5316</h3>, <h1>\u5929\u6d25\u6ee8\u6d77\u65b0\u533a\uff1a\u697c\u5728\u666f\u4e2d \u5382\u5728\u7eff\u4e2d</h1>, <h2></h2>]
I'm fairly sure these are Unicode escape sequences, but I couldn't figure out how to get Python to display the actual characters.
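As a sanity check, those \uXXXX sequences appear to be just Python 2's repr of unicode strings inside a list; the characters themselves are intact. A minimal sketch (using a couple of the code points from the output above, printed on their own rather than inside a list):

```python
# Two strings built from the escaped code points shown in the output above.
# Printing a unicode string directly (not inside a list) shows the characters,
# assuming the terminal supports UTF-8.
s = u'\u73af\u5883\u6c61\u67d3'
print(s)
```

Printing the list calls repr() on each element, which is what produces the escapes; printing each string individually does not.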
I tried to find the answer elsewhere. The clearest related question was this one:

Problems with Python and BeautifulSoup encoding

which suggested adding

soup = BeautifulSoup.BeautifulSoup(content.decode('utf-8','ignore'))

but that raised AttributeError: type object 'BeautifulSoup' has no attribute 'BeautifulSoup'. Removing the extra '.BeautifulSoup' instead produced a RuntimeError (maximum recursion depth exceeded while calling a Python object).
How can I get BeautifulSoup (or Python) to display these as actual characters? The same thread also suggested:
html = urllib2.urlopen("http://www.515fa.com/che_1978.html")
content = html.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(content)
but that did not solve the problem either.
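In case it helps frame the question, here is a standard-library-only sketch of the same task (Python 3, html.parser in place of BeautifulSoup, and a small hypothetical HTML snippet standing in for the downloaded page, since the live URL may not be reachable):

```python
from html.parser import HTMLParser

class HeaderCollector(HTMLParser):
    """Collect the text found inside <h1>/<h2>/<h3> tags."""
    def __init__(self):
        super().__init__()
        self.headers = []       # collected header strings
        self._in_header = False

    def handle_starttag(self, tag, attrs):
        if tag in ('h1', 'h2', 'h3'):
            self._in_header = True
            self.headers.append('')   # start a new header entry

    def handle_endtag(self, tag):
        if tag in ('h1', 'h2', 'h3'):
            self._in_header = False

    def handle_data(self, data):
        if self._in_header:
            self.headers[-1] += data  # accumulate text inside the header

# Hypothetical snippet using two of the code points from the output above:
html = u'<h1>\u5929\u6d25</h1><h3>\u73af\u5883</h3>'
parser = HeaderCollector()
parser.feed(html)
print(parser.headers)
```

Under Python 3, printing these strings (or iterating the list and printing each one) displays the CJK characters directly rather than the escape sequences.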