I use BeautifulSoup and urllib2 to download and parse HTML pages. The problem is malformed HTML. Although BeautifulSoup copes well with malformed markup, it is still not as good as Firefox.
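For context, my current setup looks roughly like the sketch below. It is only an illustration: it assumes Python 3, where `urllib.request` takes the place of `urllib2`, and the `bs4` package. The function names and the sample markup are made up for this example.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def parse_html(markup):
    """Parse (possibly malformed) HTML into a BeautifulSoup tree."""
    # "html.parser" is the parser that ships with Python; bs4 can use others.
    return BeautifulSoup(markup, "html.parser")


def fetch_and_parse(url):
    """Download a page and hand the raw bytes to BeautifulSoup."""
    with urlopen(url) as response:
        return parse_html(response.read())


if __name__ == "__main__":
    # Unclosed <p> and <b> tags: the kind of markup that causes trouble.
    broken = "<p>one<p>two<b>bold"
    soup = parse_html(broken)
    print(soup.prettify())
```

How the unclosed tags end up nested depends on the underlying parser, and that difference from what a browser would build is exactly what I am trying to avoid.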
Given that Firefox and WebKit are more up to date and more robust at processing HTML, I think it would be ideal to use one of them to build and normalize the DOM tree of a page, and then manipulate that tree from Python.
However, I cannot find Python bindings for either of them. Can anyone suggest a way?
I came across some solutions that launch a headless Firefox process and control it from Python, but is there a more Pythonic solution available?
python html parsing
user90147