HTML parser for GAE

I usually use lxml for my HTML parsing needs, but it is not available on Google App Engine. The obvious alternative is BeautifulSoup, but I find that it chokes too easily on malformed HTML. I am currently testing libxml2dom and getting better results.

Which pure-Python HTML parser have you found works best? My priority is the ability to handle bad HTML, ahead of speed.

2 answers

From the Beautiful Soup documentation:

Version 3.1.0 of Beautiful Soup is significantly worse on real HTML than version 3.0.8.

So it may help to use the earlier version. That is what the author himself recommends:

You can pretend Beautiful Soup version 3.1.0 was never released. Version 3.0.8 still works fine in Python 2.3 through 2.6.
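
If you want to make sure that a 3.1.x release does not end up in your deployment by accident, a small version guard may help. This is only a rough sketch: the helper names `parse_version` and `is_pre_3_1` are made up for illustration, and it assumes a purely numeric dotted version string (Beautiful Soup 3 exposes one as `BeautifulSoup.__version__`).

```python
def parse_version(version):
    """Turn a dotted version string such as '3.0.8' into a tuple of ints.

    Assumes every component is numeric; anything else raises ValueError.
    """
    return tuple(int(part) for part in version.split("."))


def is_pre_3_1(version):
    """Return True for versions below 3.1 (e.g. 3.0.8), False otherwise."""
    return parse_version(version) < (3, 1)


# Hypothetical usage inside an App Engine handler module:
#
#   import BeautifulSoup
#   if not is_pre_3_1(BeautifulSoup.__version__):
#       raise ImportError("Beautiful Soup 3.0.x expected, found %s"
#                         % BeautifulSoup.__version__)
```

Comparing tuples rather than raw strings avoids surprises with multi-digit components such as 3.0.10.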
