Abort HTMLParser Processing in Python

When using the HTMLParser class in Python, is it possible to abort processing from within a handle_* method? I get all the data I need early in the document, so continuing to parse the rest seems like a waste. The following example extracts a document's meta description.

    # Python 3; on Python 2 this was `from HTMLParser import HTMLParser`
    from html.parser import HTMLParser

    class MyParser(HTMLParser):
        # The hook is named handle_starttag; a method called handle_start is never invoked
        def handle_starttag(self, tag, attrs):
            in_meta = False
            if tag == 'meta':
                for attr in attrs:
                    if attr[0].lower() == 'name' and attr[1].lower() == 'description':
                        in_meta = True
                    if attr[0].lower() == 'content':
                        print(attr[1])
                        # Would like to tell the parser to stop now,
                        # since I have all the data that I need
3 answers

You can throw an exception and wrap your .feed() call in a try block.
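Applied to the meta description case from the question, the exception approach could look roughly like this. It is only a sketch, assuming Python 3; StopParsing, MetaDescriptionParser, find_description and the tags_seen counter are made-up names for illustration, not part of html.parser.

```python
from html.parser import HTMLParser


class StopParsing(Exception):
    """Hypothetical sentinel exception, used only to leave feed() early."""


class MetaDescriptionParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.description = None
        self.tags_seen = 0  # counts start tags, to show that parsing stops

    def handle_starttag(self, tag, attrs):
        self.tags_seen += 1
        if tag != "meta":
            return
        attr_map = dict(attrs)  # attribute names arrive lowercased
        if (attr_map.get("name") or "").lower() == "description":
            self.description = attr_map.get("content")
            raise StopParsing  # propagates out of feed(); the rest is skipped


def find_description(html_text):
    """Return (description, number of start tags seen before stopping)."""
    parser = MetaDescriptionParser()
    try:
        parser.feed(html_text)
    except StopParsing:
        pass  # expected: we already have what we need
    return parser.description, parser.tags_seen


SAMPLE = (
    "<html><head>"
    '<meta name="description" content="A short summary">'
    "</head><body><p>one</p><p>two</p></body></html>"
)
description, tags_seen = find_description(SAMPLE)
print(description, tags_seen)
```

Because the exception leaves feed() partway through, the parser instance should be treated as finished afterwards; create a new one for the next document.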

You can also call self.reset() when you decide that you are done (I haven't actually tried it, but according to the documentation, "Reset the instance. Loses all unprocessed data." is exactly what you need).


With pyparsing's scanString method you have more control over how far you advance through the input string. For your example, we create an expression that matches the <meta> tag and attach a parse action that accepts the tag only if it has name="description". This code assumes the HTML page has been read into the variable htmlsrc:

    from pyparsing import makeHTMLTags, withAttribute

    # makeHTMLTags creates both opening and closing tags; we only care about the opening tag
    metaTag = makeHTMLTags("meta")[0]
    metaTag.setParseAction(withAttribute(name="description"))

    try:
        # scanString is a generator that returns each match as it is found
        # in the input
        tokens, startloc, endloc = next(metaTag.scanString(htmlsrc))

        # attributes can be accessed like object attributes if they are
        # valid Python names
        print(tokens.content)

        # if the attribute name clashes with a Python keyword, or is
        # otherwise unsuitable as an identifier, use dict-like access instead
        print(tokens["content"])
    except StopIteration:
        print("no matching meta tag found")

Extending @Shylent's answer, here is my solution:

    from html.parser import HTMLParser

    class MyParser(HTMLParser):
        boolean_flag = False

        def handle_starttag(self, tag, attrs):
            # for example:
            self.boolean_flag = (tag == "sometag" and ("id", "someid") in attrs)

        def handle_endtag(self, tag):
            pass

        def handle_data(self, data):
            # The text right after the matching start tag is what we want
            if self.boolean_flag:
                raise DataParsedException(data)

    class DataParsedException(Exception):
        def __init__(self, data):
            self.data = data

Usage:

    try:
        parser.feed(html.decode())
    except DataParsedException as dataParsed:
        vars.append(dataParsed.data)

This does the job.
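For anyone who wants to try this end to end, here is a self-contained sketch of the same idea, assuming Python 3. The sample page, the "span"/"price" values standing in for "sometag"/"someid", and the first_match helper are hypothetical and only there to make it runnable.

```python
from html.parser import HTMLParser


class DataParsedException(Exception):
    """Carries the extracted text out of feed()."""

    def __init__(self, data):
        super().__init__(data)
        self.data = data


class MyParser(HTMLParser):
    boolean_flag = False

    def handle_starttag(self, tag, attrs):
        # Hypothetical target: <span id="price">
        self.boolean_flag = (tag == "span" and ("id", "price") in attrs)

    def handle_data(self, data):
        if self.boolean_flag:
            raise DataParsedException(data)


def first_match(html_bytes):
    """Return the text of the first matching element, or None if absent."""
    parser = MyParser()
    try:
        parser.feed(html_bytes.decode())
        parser.close()  # flush any buffered text when nothing matched
    except DataParsedException as data_parsed:
        return data_parsed.data
    return None


PAGE = b'<div><span id="price">42 EUR</span><p>ignored</p></div>'
result = first_match(PAGE)
print(result)
```

Returning None for the no-match case keeps the caller from having to know about the exception at all.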
