Removing HTML from a string in Python

Question


How do I remove all HTML from a string in Python? For example, how can I turn:

blah blah <a href="blah">link</a>

into

 blah blah link

Thanks!

+6

python string

user29772 Feb 28 '09 at 10:39


9 answers

When your regex solution hits the wall, try this super-easy (and reliable) BeautifulSoup program.

    from BeautifulSoup import BeautifulSoup

    html = "<a> Keep me </a>"
    soup = BeautifulSoup(html)
    text_parts = soup.findAll(text=True)
    text = ''.join(text_parts)
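If installing BeautifulSoup is not an option, the same idea (parse the markup, keep only the text nodes) can be sketched with the standard library's `html.parser` on Python 3. This is a minimal sketch, not a replacement for BeautifulSoup; the names `TextExtractor` and `strip_tags` are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper name)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


print(strip_tags('blah blah <a href="blah">link</a>'))
```

Note that the contents of `<script>` and `<style>` elements also arrive through `handle_data`, so they would need to be filtered separately if the input can contain them.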

+18

Triptych Mar 01 '09 at 2:00


There is also a small library called stripogram that can be used to remove some or all of the HTML tags.

You can use it as follows:

    from stripogram import html2text, html2safehtml

    # Only allow <b>, <a>, <i>, <br>, and <p> tags
    clean_html = html2safehtml(original_html, valid_tags=("b", "a", "i", "br", "p"))

    # Don't process <img> tags, just strip them out. Use an indent of 4 spaces
    # and a page that's 80 characters wide.
    text = html2text(original_html, ignore_tags=("img",), indent_width=4, page_width=80)

So, if you just want to remove all HTML, you pass valid_tags = () to the first function.

You can find it here.

+10

MrTopf Mar 01 '09 at 14:45


Regexes, BeautifulSoup, and html2text do not work if an attribute value has a '>' in it. See: Is ">" (U+003E GREATER-THAN SIGN) allowed inside an HTML element's attribute value?

A solution based on an HTML/XML parser can help in such cases; for example, the stripogram library suggested by @MrTopf works.

Here's the ElementTree solution:

    # from xml.etree import ElementTree as etree  # stdlib
    from lxml import etree

    str_ = 'blah blah <a href="blah">link</a> END'
    root = etree.fromstring('<html>%s</html>' % str_)
    print(''.join(root.itertext()))  # lxml or ElementTree 1.3+

Output:

 blah blah link END
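To make the '>' problem concrete, here is a small Python 3 sketch using the standard-library ElementTree instead of lxml, next to the simple tag-stripping regex. The helper name `strip_tags_xml` and the `tricky` sample string are made up for illustration, and the approach assumes the fragment is well-formed XML; typical real-world HTML (an unclosed `<br>`, for instance) would raise a `ParseError`.

```python
import re
from xml.etree import ElementTree as etree  # standard library


def strip_tags_xml(fragment):
    # Wrap the fragment in a root element so it parses as one XML document.
    root = etree.fromstring('<html>%s</html>' % fragment)
    return ''.join(root.itertext())


# Hypothetical input with a '>' inside a quoted attribute value.
tricky = 'blah blah <a title=">">link</a> END'

# The parser knows the '>' belongs to the attribute value.
print(strip_tags_xml(tricky))

# The regex stops at the first '>' and leaves part of the tag behind.
print(re.sub('<[^>]*>', '', tricky))
```

The first call prints the clean text, while the regex version leaves `">` in the output because it treats the '>' inside the attribute as the end of the tag.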

+5

jfs Mar 01 '09 at 20:42


Try Beautiful Soup. Drop everything except the text.

+3

George V. Reilly Feb 28 '09 at 10:52


html2text will do something like this.

+2

Rexe Mar 01 '09 at 18:38


I just wrote this because I needed it. It uses html2text and takes a file path, although I would prefer a URL. The output of html2text is stored in TextFromHtml2Text.text: print it, save it, feed it to your favorite canary.

    import html2text


    class TextFromHtml2Text:

        def __init__(self, url=''):
            if url == '':
                raise TypeError("Needs a URL")
            self.text = ""
            self.url = url
            self.html = ""
            self.gethtmlfile()
            self.maytheswartzbewithyou()

        def gethtmlfile(self):
            # Despite the name, self.url is a local file path.
            with open(self.url) as f:
                self.html = f.read()

        def maytheswartzbewithyou(self):
            self.text = html2text.html2text(self.html)

+1

David Kent Snyder Jun 29 '12 at 17:41


A simple way:

    def remove_html_markup(s):
        tag = False
        quote = False
        out = ""

        for c in s:
            if c == '<' and not quote:
                tag = True
            elif c == '>' and not quote:
                tag = False
            elif (c == '"' or c == "'") and tag:
                quote = not quote
            elif not tag:
                out = out + c

        return out

The idea is explained here: http://youtu.be/2tu9LTDujbw

You can see how it works here: http://youtu.be/HPkNPcYed9M?t=35s
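One thing to watch with a single quote flag: a tag that mixes quote characters, such as `<a title="it's">`, toggles the flag on the apostrophe and the function never notices the end of the tag. A possible variant, sketched here as an assumption rather than as the course's own solution, remembers which quote character opened the attribute value:

```python
def remove_html_markup(s):
    in_tag = False
    quote_char = None  # quote character that opened the current attribute value
    out = []

    for c in s:
        if quote_char is not None:
            # Inside a quoted attribute value: only the matching quote ends it.
            if c == quote_char:
                quote_char = None
        elif in_tag:
            if c == '>':
                in_tag = False
            elif c in ('"', "'"):
                quote_char = c
        elif c == '<':
            in_tag = True
        else:
            out.append(c)

    return ''.join(out)


print(remove_html_markup('<a title="it\'s > here">link</a>'))
```

Quotes that appear in ordinary text outside of tags are still copied to the output unchanged.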

PS: If you are interested in the class (about smart debugging using Python), here is the link: http://www.udacity.com/overview/Course/cs259/CourseRev/1 . It's free!

You're welcome! :)

+1

Medeiros Jan 22 '13 at 17:31


    >>> import re
    >>> s = 'blah blah <a href="blah">link</a>'
    >>> q = re.compile(r'<.*?>', re.IGNORECASE)
    >>> re.sub(q, '', s)
    'blah blah link'

0

riza Feb 28 '09 at 23:23


Luke woodward · Accepted Answer · 2009-02-28T22:43:17+0000

You can use a regex to remove all tags:

    >>> import re
    >>> s = 'blah blah <a href="blah">link</a>'
    >>> re.sub('<[^>]*>', '', s)
    'blah blah link'
