Using Gecko/Firefox or WebKit for HTML parsing in Python

I use BeautifulSoup and urllib2 to load and parse HTML pages. The problem is malformed HTML: although BeautifulSoup does a reasonable job with malformed markup, it is still not as good as Firefox.
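
For reference, a minimal sketch of that approach, assuming Python 2 with BeautifulSoup 3 installed (the URL is just a placeholder):

    # Fetch with urllib2, repair and parse with BeautifulSoup 3.
    import urllib2
    from BeautifulSoup import BeautifulSoup

    html = urllib2.urlopen("http://example.com/").read()
    soup = BeautifulSoup(html)   # BeautifulSoup attempts its own repair of bad markup
    print soup.find("title")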

Given that Firefox and WebKit are more up-to-date and robust at processing HTML, I think it would be ideal to use one of them to build and normalize the DOM tree of a page, and then manipulate the result from Python.

However, I cannot find Python bindings for this. Can anyone suggest a way?

I came across some solutions that launch a headless Firefox process and drive it from Python, but is there a more lightweight, Pythonic solution?

+6
python html parsing
3 answers

Perhaps pywebkitgtk will do what you need.
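
As a rough, untested sketch of how that could look: load the page in a WebKit view and read the browser-normalized DOM back out as HTML. The document.title round-trip is a workaround rather than an official DOM accessor in pywebkitgtk, and the URL is a placeholder:

    # Let WebKit parse and repair the page, then read the normalized DOM back as HTML.
    import gtk
    import webkit

    view = webkit.WebView()
    window = gtk.Window()
    window.add(view)   # some setups need the view embedded in a window before it will load

    def on_load_finished(view, frame):
        # Serialise the repaired DOM into the title so Python can read it back,
        # since pywebkitgtk exposes no direct DOM accessor.
        view.execute_script("document.title = document.documentElement.innerHTML;")
        html = frame.get_title()
        print html[:200]
        gtk.main_quit()

    view.connect("load-finished", on_load_finished)
    view.open("http://example.com/")
    gtk.main()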

+1

See http://wiki.python.org/moin/WebBrowserProgramming

There are many options; I maintain that page so that I don't have to repeat myself here.

You should look at pyjamas-desktop: see the uitest example, because that is the trick we use to get a copy of the page's HTML "out", so that the Python-to-JavaScript compiler can be tested by comparing the resulting page after each unit test.

Each of the browser engines supported and used by pyjamas-desktop gives access to the "innerHTML" property of the document's body element (and, hell, a lot more).

Bottom line: it's trivial to do what you want, but you need to know where to look and how to do it.
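
To make the test-by-comparison idea concrete, here is a rough sketch; it is not pyjamas-desktop API. get_rendered_html() is a hypothetical helper (in a real setup it would come from a browser engine, e.g. the pywebkitgtk trick in the first answer), and reference.html is a placeholder file name:

    # Compare the browser-normalized HTML of a page against a stored reference
    # after each unit test, roughly as described above.
    import unittest
    import urllib2

    def get_rendered_html(url):
        # Hypothetical stand-in: swap in a browser-engine backend here.
        return urllib2.urlopen(url).read()

    class PageOutputTest(unittest.TestCase):
        def test_page_matches_reference(self):
            rendered = get_rendered_html("http://example.com/app")
            expected = open("reference.html").read()
            self.assertEqual(rendered.strip(), expected.strip())

    if __name__ == "__main__":
        unittest.main()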


+1

You may like PyWebkitDFB from http://www.gnu.org/software/pythonwebkit/

0
