Abort HTMLParser Processing in Python

Question

When using the HTMLParser class in Python, is it possible to abort processing from within a handle_* method? I get all the data I need early in the document, so continuing to parse the rest seems wasteful. Below is an example that extracts the description metadata from a document.

    from HTMLParser import HTMLParser  # Python 2; in Python 3: from html.parser import HTMLParser

    class MyParser(HTMLParser):

        # HTMLParser calls handle_starttag for opening tags
        def handle_starttag(self, tag, attrs):
            in_meta = False
            if tag == 'meta':
                for attr in attrs:
                    if attr[0].lower() == 'name' and attr[1].lower() == 'description':
                        in_meta = True
                    if attr[0].lower() == 'content':
                        print(attr[1])
                        # Would like to tell the parser to stop now,
                        # since I have all the data that I need

+4

python html parsing

Michael mior Jan 2 '09 at 7:37

3 answers

If you use pyparsing's scanString method, you have more control over how far you go through the input string. For your example, we create an expression that matches the <meta> tag and attach a parse action that only accepts tags with name="description". This code assumes that you have read the HTML page into the htmlsrc variable:

    from pyparsing import makeHTMLTags, withAttribute

    # makeHTMLTags creates both open and closing tags, only care about the open tag
    metaTag = makeHTMLTags("meta")[0]
    metaTag.setParseAction(withAttribute(name="description"))

    try:
        # scanString is a generator that returns each match as it is found
        # in the input; next() pulls only the first match
        tokens, startloc, endloc = next(metaTag.scanString(htmlsrc))

        # attributes can be accessed like object attributes if they are
        # valid Python names
        print(tokens.content)

        # if the attribute name clashes with a Python keyword, or is
        # otherwise unsuitable as an identifier, use dict-like access instead
        print(tokens["content"])

    except StopIteration:
        print("no matching meta tag found")

+1

Paulmcg Jan 2 '09 at 23:29

Extending @Shylent's answer, here is my solution:

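    from html.parser import HTMLParser  # Python 3; in Python 2: from HTMLParser import HTMLParser

    class MyParser(HTMLParser):
        boolean_flag = False

        def handle_starttag(self, tag, attrs):
            # for example:
            self.boolean_flag = (tag == "sometag" and ("id", "someid") in attrs)

        def handle_endtag(self, tag):
            pass

        def handle_data(self, data):
            # the first text seen inside the matching tag ends the parse
            if self.boolean_flag:
                raise DataParsedException(data)

    class DataParsedException(Exception):
        # carries the extracted data out of feed()
        def __init__(self, data):
            self.data = data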

Usage:

    # assumes html holds the page as bytes and vars is a list of results
    parser = MyParser()
    try:
        parser.feed(html.decode())
    except DataParsedException as dataParsed:
        vars.append(dataParsed.data)

This does the job.

0

Yekhezkel yovel May 02 '16 at 6:57

shylent · Accepted Answer · 2010-01-02T07:46:49+0000

You can throw an exception and wrap your .feed() call in a try block.
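
A minimal sketch of the exception approach, applied to the meta description case from the question. It assumes Python 3, where the class lives in html.parser. The names StopParsing, MetaDescriptionParser, and extract_description are hypothetical and introduced only for illustration.

```python
from html.parser import HTMLParser  # Python 3 location of HTMLParser


class StopParsing(Exception):
    """Hypothetical exception used only to unwind out of feed()."""


class MetaDescriptionParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)  # attribute names arrive lowercased
        # attribute values can be None for valueless attributes, hence "or"
        if (attr_map.get("name") or "").lower() == "description":
            self.description = attr_map.get("content")
            raise StopParsing  # abort: nothing else in the document is needed


def extract_description(html_text):
    """Return the content of the description meta tag, or None if absent."""
    parser = MetaDescriptionParser()
    try:
        parser.feed(html_text)
    except StopParsing:
        pass  # expected: raised as soon as the description was found
    return parser.description
```

Because the exception propagates straight out of feed(), none of the handlers run for the markup that follows the matching tag. The result is stored on the parser instance before raising, so the caller does not need to read it from the exception.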

You can also call self.reset() when you decide that you are done. I haven't actually tried it, but according to the documentation ("Reset the instance. Loses all unprocessed data.") it is exactly what you need.
