I usually use lxml for my HTML parsing needs, but it is not available on Google App Engine. The obvious alternative is BeautifulSoup, but I find that it chokes too easily on malformed HTML. I am currently testing libxml2dom and have had better results.
Which pure Python HTML parser have you found works best? My priority is the ability to handle bad HTML over speed.