Libxml Cleaner adds unwanted <p> tag to HTML snippets

Question

I am trying to sanitize user input to prevent XSS injection using the libxml HTML cleaner. When I enter a line like this:

Normal text <b>Bold text</b>

I get this instead:

<p>Normal text <b>Bold text</b></p>

I want to get rid of the <p> tag that surrounds my entire input.

Here is the function that currently does the cleanup:

from lxml.html import clean

cleaner = clean.Cleaner(
    scripts = True,
    javascript = True,
    allow_tags = None,
)

def sanitize_html(html):
    return cleaner.clean_html(html)

On an unrelated note, the code above contains the line allow_tags = None, where I am trying to remove all HTML tags. Does libxml have a whitelist feature that lets me allow only certain tags?

python parsing libxml2

Wylie Jun 23 '11 at 2:46

1 answer

Sean · Accepted Answer · 2011-06-23T07:09:53+0000

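Bare text cannot stand alone in an HTML document without an enclosing element, so libxml wraps it in a <p> tag. You can strip the wrapper after cleaning: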

import re

def sanitize_html(html):
    cleaned_html = cleaner.clean_html(html)
    # Strip the leading <p> and the trailing </p> that the cleaner adds.
    return re.sub(r'</p>$', '', re.sub(r'^<p>', '', cleaned_html))

If libxml2 always adds the wrapper, the regular expressions are overkill, and a slice is enough:

return cleaned_html[3:-4]     # single slice operation
return cleaned_html[3:][:-4]  # equivalent, using two slices
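
Both the regular expressions and the slices assume the cleaner wrapped the output in exactly one <p>...</p> pair. A minimal sketch of a more defensive version follows. The helper name strip_outer_p is hypothetical and not part of lxml; it only works on the string the cleaner returns and leaves anything it does not recognise untouched.

```python
def strip_outer_p(cleaned_html):
    """Remove one enclosing <p>...</p> pair, but only if it is really there.

    Hypothetical helper: operates on the cleaner's string output and
    does not call lxml itself.
    """
    text = cleaned_html.strip()
    if text.startswith("<p>") and text.endswith("</p>"):
        inner = text[len("<p>"):-len("</p>")]
        # A </p> inside means the first <p> closes early, e.g.
        # "<p>a</p><p>b</p>", so there is no single wrapper to strip.
        if "</p>" not in inner:
            return inner
    return text
```

Assuming the cleaner from the question, this could be used as strip_outer_p(cleaner.clean_html(html)). A <p> tag that carries attributes is deliberately left alone, since it is unlikely to be a wrapper the cleaner added.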
