DOMDocument interface for Python lxml

I am writing a small application that needs access to a DOM representation of a basic HTML page. lxml is really wonderful, but I could not find such an interface in it. Does anyone know whether one exists, or whether another tool provides this?

4 answers

According to the lxml documentation, you can parse the document with lxml and then use lxml.sax to replay the parsed tree as SAX events. Python's xml.dom.pulldom can consume those events and build a DOM object. Following the documentation, the code may look like this:

from xml.dom.pulldom import SAX2DOM
import lxml.sax

# tree is an lxml element or ElementTree that has already been parsed
handler = SAX2DOM()
lxml.sax.saxify(tree, handler)
dom = handler.document
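
The snippet above assumes an existing tree. A minimal end-to-end sketch, assuming lxml is installed and using a made-up HTML string (the variable names are illustrative, not from the documentation), might look like this:

```python
from io import StringIO
from xml.dom.pulldom import SAX2DOM

import lxml.sax
from lxml import etree

# Hypothetical sample page, not taken from the original question.
html = "<html><body><h1 id='top'>Hello</h1><p>World</p></body></html>"

# Parse with lxml's HTML parser, then replay the tree as SAX events
# into the DOM builder from the standard library.
tree = etree.parse(StringIO(html), etree.HTMLParser())
handler = SAX2DOM()
lxml.sax.saxify(tree, handler)
dom = handler.document

# From here on, only standard xml.dom calls are used.
h1 = dom.getElementsByTagName("h1")[0]
heading_id = h1.getAttribute("id")
heading_text = h1.firstChild.data
print(heading_id, heading_text)
```

The resulting object is an ordinary xml.dom.minidom document, so it is a copy of the lxml tree: changes made to one are not reflected in the other.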

There is an example of parsing HTML on the lxml site:

>>> from lxml import etree
>>> from io import StringIO

>>> broken_html = "<html><head><title>test<body><h1>page title</h3>"

>>> parser = etree.HTMLParser()
>>> tree   = etree.parse(StringIO(broken_html), parser)

>>> result = etree.tostring(tree.getroot(), encoding="unicode",
...                         pretty_print=True, method="html")
>>> print(result)
<html>
  <head>
    <title>test</title>
  </head>
  <body>
    <h1>page title</h1>
  </body>
</html>

You can then navigate the tree with tree.find, tree.findall, tree.iter, or tree.xpath:

>>> list(tree.getroot())
[<Element head at 0x4f4ad38>, <Element body at 0x4f4ad80>]

>>> tree.getroot().find('body')
<Element body at 0x4f4ad80>
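
As a rough illustration of how those four methods differ, here is a small sketch. It assumes lxml is installed; the HTML string and variable names are invented for the example.

```python
from io import StringIO

from lxml import etree

# Hypothetical sample page, used only for illustration.
html = "<html><body><h1>Title</h1><p class='a'>one</p><p>two</p></body></html>"
tree = etree.parse(StringIO(html), etree.HTMLParser())
root = tree.getroot()

# find: first direct child matching the tag (or None)
body = root.find("body")

# findall: every direct child of body matching the tag
paragraphs = [p.text for p in body.findall("p")]

# iter: walk the root and all of its descendants in document order
tags = [el.tag for el in root.iter()]

# xpath: full XPath 1.0 expression, here returning text nodes
classed = root.xpath("//p[@class='a']/text()")

print(body.tag, paragraphs, tags, classed)
```

find and findall accept only the limited ElementPath syntax, while xpath is the one to reach for when predicates or text() selection are needed.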

If you need a DOM from the Python standard library, you can build one from the same tree:

>>> import lxml.sax
>>> from xml.dom.pulldom import SAX2DOM
>>> handler = SAX2DOM()
>>> lxml.sax.saxify(tree, handler)

>>> dom = handler.document
>>> print(dom.firstChild.localName)

That said, you may find the lxml API more convenient than dom/minidom.


I have used minidom (in particular, example 19.7.2) in several projects where a DOM view was required.

It turned out to be useful for parsing XML configuration files and for cleaning up poorly written HTML. I would like to give you some confidence in minidom, because it has been such a useful tool in practice!
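
For the configuration-file case mentioned above, a minimal sketch using only the standard library could look like the following. The XML content and the element and attribute names are hypothetical.

```python
from xml.dom import minidom

# Hypothetical configuration file, inlined as a string for illustration.
config_xml = """<config>
  <server host="localhost" port="8080"/>
  <option name="debug">true</option>
</config>"""

dom = minidom.parseString(config_xml)

# Attributes are read with the standard DOM getAttribute call.
server = dom.getElementsByTagName("server")[0]
host = server.getAttribute("host")
port = int(server.getAttribute("port"))

# Element text lives in a child text node, reachable via firstChild.data.
options = {
    node.getAttribute("name"): node.firstChild.data
    for node in dom.getElementsByTagName("option")
}
print(host, port, options)
```

Note that minidom.parseString expects well-formed markup, so broken HTML generally has to be repaired by another parser before minidom can load it.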
