Removing broken tags and poorly formatted HTML from some text

I have a huge database of scraped messages that I embed in a website. However, many people try to use HTML in their forum posts and often get it wrong. Because of this, stray tags such as <strike> <b> </strike> </div> </b> are always present in the posts, and they eventually break the page layout once I put 15 posts on a forum page.

Right now I just append every possible closing tag to each message so that any open tag gets closed. Is there a better way to do this, without parsing the text myself and manually removing each unclosed tag? For looooong forum posts, that is an expensive operation for a web application.

3 answers

Check out HTML Tidy

There is also a Python wrapper library: μTidylib

Alternatively there is HTML Cleanup


Beautiful Soup does a decent job cleaning HTML.
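
A minimal sketch of how that might look, assuming the `beautifulsoup4` package is installed and using Python's built-in `html.parser` backend. The function name `repair_with_bs4` and the sample post are made up for illustration; other parser backends may repair the same markup differently.

```python
from bs4 import BeautifulSoup


def repair_with_bs4(raw_html: str) -> str:
    """Return raw_html re-serialized with balanced tags (hypothetical helper)."""
    # Parsing builds a tree; unclosed tags are closed while the tree is
    # built, and closing tags with no matching opener are ignored.
    soup = BeautifulSoup(raw_html, "html.parser")
    # Serializing the tree back to text emits a matching close tag
    # for every element that was opened.
    return str(soup)


if __name__ == "__main__":
    broken_post = "<b>bold <strike>struck</b> text </div>"
    print(repair_with_bs4(broken_post))
```

Each post is parsed once, so the cleaned text could be stored back in the database rather than repaired on every page view.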


Take a look at lxml.

