Parsing html tags with Python

Question

I have been given a URL and I want to extract the contents of the <BODY> tag from the page at that URL. I am using Python 3. I stumbled upon sgmllib, but it is not available for Python 3.

Can someone help me with this? Can I use HTMLParser for this?

Here is what I tried:

import urllib.request
f=urllib.request.urlopen("URL")
s=f.read()

from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print("Encountered   some data:", data)

parser = MyHTMLParser()
parser.feed(s)

This gives me an error: TypeError: Can't convert 'bytes' object to str implicitly

+5

python-3.x

gsb Feb 01 '12 at 20:08

2 answers

If you look at your variable s, its type is bytes:

>>> type(s)
<class 'bytes'>

parser.feed expects a str (Unicode text) as its argument, so decode the bytes first:

>>> x = s.decode('utf-8')
>>> type(x)
<class 'str'>
>>> parser.feed(x)

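Alternatively, x = str(s, 'utf-8') performs the same conversion. Plain str(s) is not equivalent: it returns the repr of the bytes object, including the b'...' prefix, rather than the decoded text.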
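The decode step above hard-codes UTF-8, which is only a guess about the page. As a rough sketch, the charset can be passed in when it is known and UTF-8 used otherwise. The helper name decode_html is made up for this illustration, and the commented usage assumes the response object exposes headers.get_content_charset(), as urllib responses normally do.

```python
from typing import Optional


def decode_html(raw: bytes, charset: Optional[str] = None) -> str:
    """Decode a raw HTTP body (hypothetical helper, not part of any library).

    Falls back to UTF-8 when no charset is supplied.
    """
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising
    return raw.decode(charset or "utf-8", errors="replace")


# With the question's code this might be used as follows (needs a real URL):
#   f = urllib.request.urlopen("URL")
#   x = decode_html(f.read(), f.headers.get_content_charset())
#   parser.feed(x)

print(type(decode_html(b"<p>caf\xc3\xa9</p>")))
```

Replacing bad bytes rather than raising is a design choice for scraping; drop errors="replace" if a wrong charset should fail loudly.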

+4

RanRag Feb 01 '12 at 20:16

pycoder112358 · Accepted Answer · 2012-02-01T20:51:47+0000

To fix this, change line 3 to:

s = str(f.read(), "utf-8")  # decode the bytes; assumes the page is UTF-8

The page is returned as bytes, and the bytes must be converted to a string before being passed to the parser.
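Once the parser receives a str, the original goal of pulling out the contents of <BODY> could be approached roughly like this. It is only a sketch: the class name BodyTextParser, the helper body_text, and the sample markup are invented for the illustration, and it collects text only, not the nested tags.

```python
from html.parser import HTMLParser


class BodyTextParser(HTMLParser):
    """Collects the text found between <body> and </body> (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BODY> arrives as "body"
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Keep only non-blank text that appears inside the body
        if self.in_body and data.strip():
            self.chunks.append(data.strip())


def body_text(html):
    """Return the body text of an already-decoded HTML string."""
    parser = BodyTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up sample page standing in for the decoded download
sample = "<html><head><title>T</title></head><BODY><p>Hello</p> <p>world</p></BODY></html>"
print(body_text(sample))
```

Text inside <script> or <style> elements within the body would be collected too, so a real page may need extra filtering.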
