How to use lxml.html's Cleaner without getting a wrapping div tag?
I have this code:

    from lxml.html.clean import Cleaner

    evil = "<script>malignus script</script><b>bold text</b><i>italic text</i>"
    cleaner = Cleaner(remove_unknown_tags=False, allow_tags=['p', 'br', 'b'], page_structure=True)
    print(cleaner.clean_html(evil))

I expected to get the following:

    <b>bold text</b>italic text

But instead, I get the following:

    <div><b>bold text</b>italic text</div>

Is there an option to remove the wrapping div tag?
lxml expects your HTML to have a tree structure, i.e. a single root node. If the input does not have one, lxml adds one.
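The wrapping can be seen without the Cleaner at all. The following is a small sketch, assuming only that lxml is installed; the variable names are illustrative. It parses a fragment with two top-level elements and inspects what comes back.

```python
from lxml import html

# A fragment with two top-level elements and no common parent.
fragment = "<b>bold text</b><i>italic text</i>"

# fromstring() has to return a single element, so for a multi-element
# fragment it hands back a wrapper that holds the parsed elements.
root = html.fromstring(fragment)
children = [child.tag for child in root]

print(root.tag)   # tag of the wrapper element lxml supplied
print(children)   # tags of the elements that were in the input
```

The wrapper is not part of the input; it exists only so that the result is a proper tree.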
The cleaner always wraps the result in an element. A good solution is to parse the HTML yourself and pass the resulting document object to the cleaner. The result is then also a document object, and you can call text_content() on it to extract the text from the root. Note that text_content() returns only the text, so the <b> markup is dropped along with the wrapper.
    from lxml.html import document_fromstring
    from lxml.html.clean import Cleaner

    evil = "<script>malignus script</script><b>bold text</b><i>italic text</i>"

    # Parse first, so the cleaner receives (and returns) a document object.
    doc = document_fromstring(evil)
    cleaner = Cleaner(remove_unknown_tags=False, allow_tags=['p', 'br', 'b'], page_structure=True)

    # text_content() returns the text of the cleaned tree without any tags.
    print(cleaner.clean_html(doc).text_content())

It can also be done as a one-liner.