Python code to remove HTML tags from a string
I have a text like this:
    text = """<div> <h1>Title</h1> <p>A long text........ </p> <a href=""> a link </a> </div>"""

Using pure Python, without an external module, I want to get this:

    >>> print remove_tags(text)
    Title A long text..... a link

I know that I can do this using lxml.html.fromstring(text).text_content(), but I need to achieve the same thing in pure Python, using only built-ins or the standard library, for 2.6+.
How can I do this?
Using regex
Using regular expressions, you can strip everything inside <>:
    import re

    def cleanhtml(raw_html):
        # Non-greedy match of anything between '<' and the next '>'.
        cleanr = re.compile(r'<.*?>')
        cleantext = re.sub(cleanr, '', raw_html)
        return cleantext

Some HTML text also contains entities that are not enclosed in angle brackets, such as '&nbsp;'. If so, you can extend the regex like this:

    cleanr = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

This link contains more information about this.
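The extended regex above deletes entities outright. If the goal is to keep the characters they stand for, one option is to strip the tags first and then decode what is left. The following is only a sketch: it assumes Python 3.4 or later, where html.unescape exists (it is not available in the 2.6+ range the question mentions), and clean_html_keep_entities is a made-up name for illustration.

```python
import html
import re

TAG_RE = re.compile(r'<.*?>')

def clean_html_keep_entities(raw_html):
    """Strip tags, then decode entities such as &amp; into plain characters.

    Hypothetical helper; assumes Python 3.4+ for html.unescape.
    """
    # Remove the tags first, so that a decoded '&lt;' is never mistaken for a tag.
    without_tags = TAG_RE.sub('', raw_html)
    return html.unescape(without_tags)

print(clean_html_keep_entities('<p>Fish &amp; chips</p>'))
```

Decoding after stripping matters: done the other way round, text like &lt;hot&gt; would first become <hot> and then be removed as if it were a tag.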
Using BeautifulSoup
You can also use the additional BeautifulSoup package to extract all the raw text.
When calling BeautifulSoup, you need to specify the parser explicitly. I recommend "lxml", as mentioned in other answers; it is much more robust than the default "html.parser" (the one available without additional installation).
    from bs4 import BeautifulSoup

    cleantext = BeautifulSoup(raw_html, "lxml").text

But this still relies on external libraries, so I recommend the first solution.
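For a parser-based approach that stays inside the standard library, the "html.parser" module mentioned above can also be used directly. This is a minimal sketch rather than a definitive implementation: it assumes Python 3.4 or later (in Python 2 the class lives in a module named HTMLParser), and TextExtractor is a made-up class name.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags themselves (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &amp;.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.parts.append(data)

def remove_tags(text):
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)

sample = '<div> <h1>Title</h1> <p>A long text</p> <a href=""> a link </a> </div>'
# Collapse runs of whitespace left behind by the removed tags.
print(' '.join(remove_tags(sample).split()))
```

Because this is a real tokenizer, a > inside a quoted attribute value does not end the tag early. Note that the contents of script and style elements are reported as text too, so they would need separate handling.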
Python has several XML modules built in. The simplest one, for the case where you already have a string containing the full HTML, is xml.etree, which works (somewhat) similarly to the lxml example you mention:

    import xml.etree.ElementTree

    def remove_tags(text):
        # fromstring() expects well-formed markup and fails on malformed input.
        return ''.join(xml.etree.ElementTree.fromstring(text).itertext())

A regex is another option. Note that it is not perfect: given something like <a title=">">, the pattern stops at the first > (the one inside the attribute value) and leaves "> behind in the output. However, it is about the closest you can get in non-library Python without a really complex function:
    import re

    TAG_RE = re.compile(r'<[^>]+>')

    def remove_tags(text):
        return TAG_RE.sub('', text)

However, as mentioned, xml.etree is available in the Python standard library, so you could probably just adapt it to work like your existing lxml version:
    import xml.etree.ElementTree

    def remove_tags(text):
        return ''.join(xml.etree.ElementTree.fromstring(text).itertext())

There is a simple way to do this that would work in any C-like language. The style is not Pythonic, but it works in pure Python:
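    def remove_html_markup(s):
        tag = False    # True while inside <...>
        quote = False  # True while inside a quoted attribute value
        out = ""

        for c in s:
            if c == '<' and not quote:
                tag = True
            elif c == '>' and not quote:
                tag = False
            elif (c == '"' or c == "'") and tag:
                # Either quote character toggles the state, so mixed quotes
                # such as title="it's" are not handled.
                quote = not quote
            elif not tag:
                out = out + c

        return out

The idea, based on a simple finite-state machine, is explained in detail here: http://youtu.be/2tu9LTDujbw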
You can see how it works here: http://youtu.be/HPkNPcYed9M?t=35s
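The loop above toggles quote on either quote character, so an apostrophe inside a double-quoted attribute value ends the quoted region early and the rest of the input is swallowed. A possible variant, offered only as a sketch and not as the version from the videos, remembers which character opened the quote. The variable names in_tag and open_quote are assumptions, and the code is written for Python 3.

```python
def remove_html_markup(s):
    in_tag = False
    open_quote = None  # the quote character that opened the current attribute value
    out = []

    for c in s:
        if open_quote is not None:
            # Inside a quoted attribute value: only the same character closes it.
            if c == open_quote:
                open_quote = None
        elif in_tag:
            if c == '>':
                in_tag = False
            elif c in ('"', "'"):
                open_quote = c
        elif c == '<':
            in_tag = True
        else:
            # Outside any tag: keep the character, quotes included.
            out.append(c)

    return ''.join(out)

print(remove_html_markup('<a title="it\'s">x</a>'))
```

Collecting characters in a list and joining once at the end also avoids building a new string on every iteration.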
PS: If you are interested in the class (about smart debugging with Python), here is the link: http://www.udacity.com/overview/Course/cs259/CourseRev/1 . It's free!
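    def remove_strings(text, sep=' '):
        # Collect the pieces of text found between tags and join them with sep.
        # A loop replaces the recursion and the module-level global, so repeated
        # calls do not share state and long inputs cannot hit the recursion limit.
        pieces = []
        pos = 0
        while True:
            start = text.find('<', pos)
            if start == -1:
                # No more tags: keep whatever text is left.
                pieces.append(text[pos:])
                break
            # Keep the text in front of this tag, including text before the first tag.
            pieces.append(text[pos:start])
            end = text.find('>', start)
            if end == -1:
                # Unclosed tag: drop the rest of the input.
                break
            pos = end + 1
        return sep.join(p for p in pieces if p != '')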