Python HTMLParser - stop parsing

Question

I am using the HTMLParser class from Python's html.parser module. I am looking for a single tag, and once it is found it makes sense to stop parsing. Is that possible? I tried calling close(), but I'm not sure that is the way to go.

    from html.parser import HTMLParser

    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            login_form = False
            if tag == "form":
                print("finished")
                self.close()  # calling close() from inside a handler

However, this seems to cause runaway recursion, ending in:

      File "/usr/lib/python3.4/re.py", line 282, in _compile
        p, loc = _cache[type(pattern), pattern, flags]
    RuntimeError: maximum recursion depth exceeded in comparison

python dom html

dev-null May 17, '15 at 8:45

1 answer

Constance · Answer 1 · 2018-03-20T13:35:04+0000

According to the docs, the close() method does this:

Force processing of all buffered data as if it were followed by an end-of-file mark.

You are still inside handle_starttag and the parser has not finished working through its buffer, so you definitely do not want to force processing of all the buffered data at that point; that is why you get stuck in recursion. You cannot stop the parser from inside one of its own handlers.

Judging by its description, reset() looks closer to what you want:

Reset the instance. Loses all unprocessed data.

but it cannot be called from within the code that the parser itself is calling either, so it also ends in recursion.

It looks like you have two options:

1. Raise an exception (for example, StopIteration) and catch it around your call to the parser. Depending on what else you do while parsing, this may preserve the information you need. You may need to add some checks to make sure that no files are left open.

2. Use a simple flag (True / False) to indicate whether parsing has been aborted. At the very start of handle_starttag, just return if the flag is set. The parser will still go through all the HTML tags, but it will do nothing for any of them. Obviously, if you also implement handle_endtag, it needs to check the flag as well. You can return the flag to its normal state either when the <html> tag is received or by overriding the feed method.
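
Both options can be sketched roughly as follows. This is only an illustration, not a definitive implementation. The names StopParsing, RaisingParser, FlagParser and contains_form are hypothetical, and a custom exception is used in place of the StopIteration mentioned above. The flag variant assumes each call to feed() receives a whole document, because the flag is cleared on every call.

```python
from html.parser import HTMLParser


class StopParsing(Exception):
    """Hypothetical exception used only to abort parsing early."""


class RaisingParser(HTMLParser):
    """Option 1: raise from the handler; the caller catches it."""

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            raise StopParsing


def contains_form(html):
    """Return True as soon as a <form> start tag is seen."""
    parser = RaisingParser()
    try:
        parser.feed(html)
        parser.close()  # safe here: we are outside any handler
    except StopParsing:
        return True
    return False


class FlagParser(HTMLParser):
    """Option 2: keep parsing, but ignore everything once done is set."""

    def __init__(self):
        super().__init__()
        self.done = False
        self.seen = []  # start tags handled before the flag was set

    def feed(self, data):
        # Assumption: one feed() call per document, so clear the flag here.
        self.done = False
        super().feed(data)

    def handle_starttag(self, tag, attrs):
        if self.done:
            return  # aborted: do nothing for the remaining tags
        self.seen.append(tag)
        if tag == "form":
            self.done = True


if __name__ == "__main__":
    page = "<html><body><form><input></form><p>x</p></body></html>"
    print(contains_form(page))
    flagged = FlagParser()
    flagged.feed(page)
    print(flagged.seen)
```

With the exception variant the parser is abandoned mid-buffer, so it is probably simplest to create a fresh parser for the next document instead of reusing the interrupted one. The flag variant still walks the entire input, which costs time on large pages but leaves the parser in a consistent state.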
