
How do I use lxml.html's Cleaner without getting a wrapping div tag?

I have this code:

    from lxml.html.clean import Cleaner

    evil = "<script>malignus script</script><b>bold text</b><i>italic text</i>"
    cleaner = Cleaner(remove_unknown_tags=False, allow_tags=['p', 'br', 'b'], page_structure=True)
    print(cleaner.clean_html(evil))

I expected to get the following:

 <b>bold text</b>italic text 

But instead, I get the following:

 <div><b>bold text</b>italic text</div> 

Is there an option to remove the wrapping div tag?

python lxml.html
2 answers

lxml expects your HTML to have a tree structure, i.e. a single root node. If the input does not have one, lxml adds one.
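To see the added root for yourself, here is a minimal sketch that uses only `lxml.html.fromstring` on a fragment with two top-level elements. The exact wrapper tag depends on the input and on what the Cleaner does afterwards, so the snippet prints it rather than assuming a value.

```python
from lxml import html

# Two sibling elements, so the fragment has no single root of its own.
root = html.fromstring("<b>bold text</b><i>italic text</i>")

# The tag of the wrapper element that lxml supplied as the root.
print(root.tag)

# The original top-level elements are now children of that wrapper.
print([child.tag for child in root])
```

The wrapper is what ends up serialised around the cleaned markup in the question.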


The Cleaner always wraps the result in an element. One solution is to parse the HTML yourself and pass the resulting document object to the cleaner; the result is then also a document object, and you can call text_content() on it to extract the text from the root. Note that text_content() returns text only, so the remaining <b> markup is dropped as well.

    from lxml.html import document_fromstring
    from lxml.html.clean import Cleaner

    evil = "<script>malignus script</script><b>bold text</b><i>italic text</i>"
    doc = document_fromstring(evil)
    cleaner = Cleaner(remove_unknown_tags=False, allow_tags=['p', 'br', 'b'], page_structure=True)
    print(cleaner.clean_html(doc).text_content())

It can also be written as a one-liner.
