How to use Cleaner, lxml.html without returning div tag?

Question

I have this code:

    from lxml.html.clean import Cleaner

    evil = "<script>malignus script</script><b>bold text</b><i>italic text</i>"
    cleaner = Cleaner(remove_unknown_tags=False, allow_tags=['p', 'br', 'b'], page_structure=True)
    print(cleaner.clean_html(evil))

I expected to get the following:

 <b>bold text</b>italic text

But instead, I get the following:

 <div><b>bold text</b>italic text</div>

Is there an option to remove the wrapping div tag?

+8

python lxml.html

Allan veloso Jan 29 '14 at 2:28

2 answers

Cleaner always wraps the result in an element. A good solution is to parse the HTML manually and pass the resulting document object to the cleaner; the result is then also a document object, and you can call text_content() on it to extract the text from the root.

    from lxml.html import document_fromstring
    from lxml.html.clean import Cleaner

    evil = "<script>malignus script</script><b>bold text</b><i>italic text</i>"
    doc = document_fromstring(evil)
    cleaner = Cleaner(remove_unknown_tags=False, allow_tags=['p', 'br', 'b'], page_structure=True)
    print(cleaner.clean_html(doc).text_content())

It can also be done as a single liner.
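Keep in mind that text_content() returns plain text only, so the b tag from the expected output is lost along with the wrapper. A small sketch of that behaviour, assuming the cleaned result is the div shown in the question (it is parsed here directly, without running Cleaner):

```python
import lxml.html

# Assumption: this string stands in for the cleaner's output from the question.
cleaned = lxml.html.fromstring("<div><b>bold text</b>italic text</div>")

# text_content() concatenates the text of the element and all of its
# descendants, dropping every tag along the way.
text_only = cleaned.text_content()
print(text_only)
```

If the b tag has to survive, the children of the root need to be serialized instead of calling text_content().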

0

cmc Jan 16 '19 at 12:11

Hugh bothwell · Accepted Answer · 2014-01-29T02:36:14+0000

lxml expects your HTML to have a tree structure, i.e. a single root node. If it does not have one, lxml adds it.
