BeautifulSoup and lxml.html - what to choose?

I am working on a project that will involve parsing HTML.

After searching, I found two possible options: BeautifulSoup and lxml.html

Are there any reasons to prefer one over the other? I used lxml for XML some time ago and I feel I would be more comfortable with it; however, BeautifulSoup seems to be much more common.

I know that I should use one that works for me, but I was looking for personal experience with both.

+21
python lxml beautifulsoup
Feb 11
4 answers

The simple answer, imo, is that if you trust your source to be well-formed, go with the lxml solution. Otherwise, BeautifulSoup all the way.

Edit:

This answer is now three years old; it is worth mentioning, as Jonathan Vanasco notes in the comments, that BeautifulSoup4 now supports using lxml as its internal parser, so you can get lxml's advanced features and speed through the BeautifulSoup interface without much of a performance hit, if you wish (although I still reach straight for lxml myself; maybe it's just force of habit :)).
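To illustrate that point, here is a minimal sketch of selecting lxml as BeautifulSoup 4's underlying parser (the markup and the `intro` class are made up for the example; it assumes the beautifulsoup4 and lxml packages are installed):

```python
# Use lxml as the backing parser for BeautifulSoup 4.
from bs4 import BeautifulSoup

html = "<html><body><p class='intro'>Hello</p></body></html>"
# The second argument names the parser; "lxml" selects the fast
# libxml2-based parser instead of Python's built-in html.parser.
soup = BeautifulSoup(html, "lxml")
print(soup.find("p", class_="intro").get_text())
```

The only change from plain BeautifulSoup usage is the `"lxml"` argument; the rest of the BeautifulSoup API stays the same.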

+28
Feb 11 '11 at 8:51

To summarize: lxml is positioned as a lightning-fast, production-quality HTML and XML parser that, by the way, also includes a soupparser module to fall back on BeautifulSoup's functionality. BeautifulSoup is a one-person project, designed to save you time by quickly extracting data from poorly-formed HTML or XML.

The lxml documentation says that both parsers have advantages and disadvantages. For this reason, lxml provides soupparser so that you can switch back and forth. Quoting:

BeautifulSoup uses a different parsing approach. It is not a real HTML parser but uses regular expressions to dive through tag soup. It is therefore more forgiving in some cases and less good in others. It is not uncommon that lxml/libxml2 parses and fixes broken HTML better, but BeautifulSoup has superior support for encoding detection. It very much depends on the input which parser works better.

At the end they say:

The downside of using this parser is that it is much slower than the HTML parser of lxml. So if performance matters, you might want to consider using soupparser only as a fallback for certain cases.

If I understand them correctly, this means that the soup parser is more robust: it can deal with a "soup" of malformed tags by using regular expressions, whereas lxml is more straightforward and just parses things and builds a tree as you would expect. I assume this applies to BeautifulSoup itself as well, not just to the soupparser for lxml.
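As a small sketch of that fallback, here is how lxml's soupparser can be used on broken markup (the tag soup string is invented for the example; it assumes the lxml and beautifulsoup4 packages are installed):

```python
# Parse broken HTML via BeautifulSoup, but get back an lxml element tree.
from lxml.html import soupparser

tag_soup = "<html><body><p>Unclosed paragraph<td>stray cell</body>"
root = soupparser.fromstring(tag_soup)
# The result is a normal lxml element, so XPath etc. work on it.
print(root.tag)
```

You keep lxml's API (`xpath()`, `cssselect()`, ...) while letting BeautifulSoup do the forgiving parsing.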

They also show how to take advantage of BeautifulSoup's encoding detection while still parsing quickly with lxml:

 >>> import lxml.html
 >>> from BeautifulSoup import UnicodeDammit

 >>> def decode_html(html_string):
 ...     converted = UnicodeDammit(html_string, isHTML=True)
 ...     if not converted.unicode:
 ...         raise UnicodeDecodeError(
 ...             "Failed to detect encoding, tried [%s]",
 ...             ', '.join(converted.triedEncodings))
 ...     # print converted.originalEncoding
 ...     return converted.unicode

 >>> root = lxml.html.fromstring(decode_html(tag_soup))

(Same source: http://lxml.de/elementsoup.html ).

In the words of the creator of BeautifulSoup:

That's it! Have fun! I wrote Beautiful Soup to save everybody time. Once you get used to it, you should be able to wrangle data out of poorly-designed websites in just a few minutes. Send me an email if you have any comments, run into problems, or want me to know about your project that uses Beautiful Soup.

  --Leonard 

Quoted from the Beautiful Soup documentation.

I hope this is clear now. Soup is a brilliant one-person project designed to save you time extracting data from poorly-designed websites. The goal is to save you time right now, to get the job done; not necessarily to save you time in the long term, and definitely not to optimize the performance of your software.

Also, from the lxml site:

lxml has been downloaded from the Python Package Index more than two million times and is also available directly in many package distributions, e.g. for Linux or MacOS-X.

And, from Why lxml? ,

The C libraries libxml2 and libxslt have huge benefits: ... Standards-compliant ... Full-featured ... fast. fast! FAST! ... lxml is a new Python binding for libxml2 and libxslt ...

+14
23 Oct '13 at 17:48

Use both? lxml for DOM manipulation, BeautifulSoup for parsing:

http://lxml.de/elementsoup.html
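That combination can be sketched roughly like this (the markup is invented for the example; it assumes lxml and beautifulsoup4 are installed): BeautifulSoup, via lxml's soupparser, repairs the broken markup, and lxml's XPath does the DOM work.

```python
# Parse messy HTML with the BeautifulSoup fallback, then query with XPath.
from lxml.html import soupparser

broken = "<div><a href='/one'>one<a href='/two'>two</div>"
root = soupparser.fromstring(broken)
links = [a.get("href") for a in root.xpath("//a")]
print(sorted(links))
```

The forgiving parser handles the unclosed tags; the lxml tree gives you fast, full-featured DOM manipulation afterwards.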

+2
Feb 11 '11

lxml is great. But parsing your input as HTML is useful only if the DOM structure actually helps you find what you are looking for.

Can you use ordinary string functions or regular expressions? For many HTML parsing tasks, treating your input as a string rather than as an HTML document is, counterintuitively, much simpler.
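For example, if all you need is the link targets from a predictable snippet, a one-line regular expression can stand in for a full parse (the snippet and pattern are invented for illustration; this approach only works when the markup is as regular as you expect):

```python
# Extract href values with a plain regex, skipping DOM construction entirely.
import re

html = '<a href="/page1">First</a> <a href="/page2">Second</a>'
hrefs = re.findall(r'href="([^"]+)"', html)
print(hrefs)  # ['/page1', '/page2']
```

The usual caveat applies: as soon as attribute order, quoting style, or nesting starts to vary, a real parser becomes the simpler tool again.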

0
Feb 11 '11 at 11:30
