Which of lxml and libxml2 is better for parsing malformed HTML in Python?

Which one is better and more useful for malformed HTML? I cannot find documentation on how to use libxml2.

Thanks.

+7
4 answers

On the libxml2 page, you can see this note:

Note that some Python purists dislike the default set of Python bindings; rather than complaining, I suggest they have a look at lxml, the more Pythonic bindings for libxml2 and libxslt, and check the mailing list.

and on the lxml page this one:

The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API.

So with lxml you get exactly the same functionality, but with a Pythonic API compatible with the ElementTree library in the standard library (which means the standard library documentation is useful for learning how to use lxml). Therefore, lxml is preferable to libxml2 (even though the underlying implementation is the same).
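To illustrate that compatibility, here is a minimal sketch (assuming lxml is installed) showing that familiar ElementTree-style calls work unchanged on lxml's trees:

```python
from lxml import etree

# Parse a small, well-formed document with lxml.
root = etree.fromstring('<root><item>hello</item></root>')

# The ElementTree-style API works as documented in the standard library:
print(root.find('item').text)  # find() and .text behave like xml.etree.ElementTree
```

So code written against `xml.etree.ElementTree` can usually switch to lxml by changing only the import.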

Edit: That said, as the other answers explain, BeautifulSoup is the best choice for parsing invalid HTML. It is also worth noting that if you have lxml installed, BeautifulSoup will use it, as described in the documentation for the new version:

If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser.
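You can also request a parser explicitly, so behaviour does not silently change depending on what happens to be installed. A small sketch using Python's built-in parser (the broken markup is just an illustration):

```python
from bs4 import BeautifulSoup

broken = '<html><p>first<p>second'  # two unclosed <p> tags

# Naming the parser explicitly avoids depending on Beautiful Soup's
# "best installed parser" default.
soup = BeautifulSoup(broken, 'html.parser')
print(len(soup.find_all('p')))  # both <p> tags are recovered
```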

In any case, even though BeautifulSoup uses lxml under the hood, it can parse broken HTML that you cannot parse with lxml directly. For example:

    >>> lxml.etree.fromstring('<html>')
    ...
    XMLSyntaxError: Premature end of data in tag html line 1, line 1, column 7

But:

    >>> bs4.BeautifulSoup('<html>', 'lxml')
    <html></html>

Finally, note that lxml also provides an interface to the old version of BeautifulSoup, as follows:

    >>> lxml.html.soupparser.fromstring('<html>')
    <Element html at 0x13bd230>

So, at the end of the day, you are likely to be using both lxml and BeautifulSoup. The only thing you need to choose is which API you like best.

+13

Try BeautifulSoup instead. It is designed to parse poorly structured HTML.

http://pypi.python.org/pypi/BeautifulSoup

http://lxml.de/elementsoup.html

+2

BeautifulSoup is useful for parsing HTML. You can look at its examples and see how it compares with the alternatives.

+1

lxml is the one that is usually recommended; in particular lxml.html, if I remember correctly.

I believe it uses libxml2 under the hood, but falls back to BeautifulSoup if the HTML is especially nasty; don't take my word for it, though, check the site! (http://lxml.de/)
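As a small illustration of that leniency (a sketch, assuming lxml is installed), lxml.html accepts markup that the strict XML parser would reject:

```python
import lxml.html

# lxml.html uses libxml2's forgiving HTML parser, so unclosed tags are repaired.
root = lxml.html.fromstring('<html><p>broken')
print(root.findtext('.//p'))  # the dangling <p> is closed for us
```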

0
