Parsing HTML in Python: lxml or BeautifulSoup? Which one is better for what purpose?

As far as I can tell, the two main HTML parsing libraries in Python are lxml and BeautifulSoup. I chose BeautifulSoup for the project I'm working on, but I didn't choose it for any particular reason other than finding its syntax a bit easier to learn and understand. But I see that many people seem to prefer lxml, and I've heard that lxml is faster.

So I wonder what are the advantages of one over the other? When do I want to use lxml and when should I use BeautifulSoup? Are there any other libraries worth considering?

+50
python html-parsing lxml beautifulsoup
Dec 17 '09 at 2:08
7 answers

For starters, BeautifulSoup is no longer actively maintained, and the author at one point even recommended alternatives such as lxml.

Quote from the linked page:

Version 3.1.0 of Beautiful Soup does significantly worse on real-world HTML than version 3.0.8 does. The most common problems are handling tags incorrectly, "malformed start tag" errors, and "bad end tag" errors. This page explains what happened, how the problem will be addressed, and what you can do right now.

This page was originally written in March 2009. Since then, the 3.2 series has been released, replacing 3.1, and development of the 4.x series has begun. This page will remain up for historical purposes.

TL;DR

Use 3.2.0 instead.

+22
Dec 17 '09 at 14:13

Pyquery provides a jQuery selector interface for Python (using lxml under the hood).

http://pypi.python.org/pypi/pyquery

This is really awesome, I don't use anything else.
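
For a feel of the jQuery-style syntax, here is a minimal sketch (the markup and selectors are made up for illustration):

    from pyquery import PyQuery as pq

    # CSS selectors and chained calls, just like in jQuery
    doc = pq('<div><p class="intro">Hello</p><p>World</p></div>')
    print(doc('p.intro').text())  # -> Hello
    print(doc('p').eq(1).text())  # -> World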

+25
Dec 17 '09 at 18:48

In summary, lxml is positioned as a lightning-fast, production-quality HTML and XML parser that, by the way, also includes a soupparser module to fall back on BeautifulSoup's functionality. BeautifulSoup is a one-person project, designed to save you time by quickly extracting data out of poorly-formed HTML or XML.

The lxml documentation says that both parsers have advantages and disadvantages. For this reason, lxml provides soupparser so you can switch back and forth. Quoting:

BeautifulSoup uses a different parsing approach. It is not a real HTML parser but uses regular expressions to dive through tag soup. It is therefore more forgiving in some cases and less good in others. It is not uncommon that lxml/libxml2 parses and fixes broken HTML better, but BeautifulSoup has superior support for encoding detection. It very much depends on the input which parser works better.

In the end they say:

The downside of using this parser is that it is much slower than the HTML parser of lxml. So if performance matters, you might want to consider using soupparser only as a fallback for certain cases.

If I understand them correctly, it means that the soup parser is more robust: it can deal with a "soup" of malformed tags by using regular expressions, whereas lxml is more straightforward and just parses things and builds a tree as you would expect. I assume this also applies to BeautifulSoup itself, not just to the soupparser for lxml.
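
To illustrate the fallback idea from the quote above, here is a rough sketch (the function name and error handling are my own, not from the lxml docs; note that lxml's own HTML parser recovers from most broken markup by itself, so the fallback rarely fires):

    from lxml import etree
    import lxml.html
    from lxml.html import soupparser

    def parse_html(source):
        # Fast path: lxml's own HTML parser, which recovers from most bad markup.
        try:
            return lxml.html.fromstring(source)
        except etree.ParserError:
            # Slow path: the BeautifulSoup-based soupparser as a fallback.
            return soupparser.fromstring(source)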

They also show how to benefit from BeautifulSoup's encoding detection while still parsing quickly with lxml:

    >>> from BeautifulSoup import UnicodeDammit

    >>> def decode_html(html_string):
    ...     converted = UnicodeDammit(html_string, isHTML=True)
    ...     if not converted.unicode:
    ...         raise UnicodeDecodeError(
    ...             "Failed to detect encoding, tried [%s]",
    ...             ', '.join(converted.triedEncodings))
    ...     # print converted.originalEncoding
    ...     return converted.unicode

    >>> root = lxml.html.fromstring(decode_html(tag_soup))

(Same source: http://lxml.de/elementsoup.html ).

In the words of the creator of BeautifulSoup:

That's it! Have fun! I wrote Beautiful Soup to save everybody time. Once you get used to it, you should be able to wrangle data out of poorly-designed websites in just a few minutes. Send me email if you have any comments, run into problems, or want me to know about your project that uses Beautiful Soup.

  --Leonard 

Quote from the Beautiful Soup documentation.

Hope this is clear now. BeautifulSoup is a brilliant one-man project designed to save you time extracting data out of poorly-designed websites. The goal is to save you time right now, to get the job done; not necessarily to save you time in the long term, and definitely not to optimize the performance of your software.

Also, from the lxml website:

lxml has been downloaded from the Python Package Index more than two million times and is also available directly in many package distributions, e.g. for Linux or MacOS-X.

And, from Why lxml?:

The C libraries libxml2 and libxslt have huge benefits: ... Standards-compliant ... Full-featured ... fast. fast! FAST! ... lxml is a new Python binding for libxml2 and libxslt ...

+13
23 Oct '13 at 18:25

Don't use BeautifulSoup on its own. Use lxml.soupparser: then you're sitting on top of the power of lxml, and you still get the good bits of BeautifulSoup, which is dealing with really broken and crappy HTML.
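
A minimal sketch of what that looks like (assuming both lxml and BeautifulSoup are installed; the sample markup is made up):

    from lxml.html import soupparser

    # soupparser hands the tag soup to BeautifulSoup, then builds an
    # ordinary lxml tree that you can query with XPath as usual.
    root = soupparser.fromstring('<p>Some<b>broken<p>markup')
    print(root.xpath('//p//text()'))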

+11
Dec 17 '09 at 14:24

I have used lxml with great success for parsing HTML. It seems to do a good job of handling "soupy" HTML, too. I would highly recommend it.

Here is a quick test I had lying around to try processing some ugly HTML:

    import unittest
    from StringIO import StringIO
    from lxml import etree

    class TestLxmlStuff(unittest.TestCase):
        bad_html = """
        <html>
          <head><title>Test!</title></head>
          <body>
            <h1>Here a heading
            <p>Here some text
            <p>And some more text
            <b>Bold!</b></i>
            <table>
              <tr>row
              <tr><td>test1
              <td>test2
              </tr>
              <tr>
              <td colspan=2>spanning two
            </table>
          </body>
        </html>"""

        def test_soup(self):
            """Test lxml parsing of really bad HTML"""
            parser = etree.HTMLParser()
            tree = etree.parse(StringIO(self.bad_html), parser)
            self.assertEqual(len(tree.xpath('//tr')), 3)
            self.assertEqual(len(tree.xpath('//td')), 3)
            self.assertEqual(len(tree.xpath('//i')), 0)
            # print(etree.tostring(tree.getroot(), pretty_print=False, method="html"))

    if __name__ == '__main__':
        unittest.main()
+5
Dec 17 '09 at 14:19

For sure, I would use EHP. It is faster than lxml, much more elegant, and simpler to use.

Check it out: https://github.com/iogf/ehp

    from ehp import *

    data = '''<html> <body> <em> Hello world. </em> </body> </html>'''

    html = Html()
    dom = html.feed(data)
    for ind in dom.find('em'):
        print ind.text()

Output:

 Hello world. 
+1
Mar 20 '16 at 10:03

A somewhat outdated speed comparison can be found here, which clearly recommends lxml, as the speed differences seem drastic.

0
Dec 08


