Python code to remove HTML tags from a string

I have a text like this:

text = """<div> <h1>Title</h1> <p>A long text........ </p> <a href=""> a link </a> </div>""" 

Using pure Python, without an external module, I want to get this:

    >>> print remove_tags(text)
    Title A long text..... a link

I know that I can do this using lxml.html.fromstring(text).text_content(), but I need to achieve the same thing in pure Python, using only built-ins or the standard library, for Python 2.6+.

How can I do this?

+108
python string html xml parsing
Mar 12 2018-12-12T00:
5 answers

Using regex

Using regular expressions, you can clear everything inside <> :

    import re

    def cleanhtml(raw_html):
        cleanr = re.compile('<.*?>')
        cleantext = re.sub(cleanr, '', raw_html)
        return cleantext

Some HTML text may also contain entities that are not enclosed in angle brackets, such as '&nbsp;'. If so, you can extend the regex like this:

 cleanr = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});') 
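
As a rough sketch of how the extended pattern behaves (the sample string and the name CLEAN_RE are my own, not from the answer), note that matched entities are deleted outright rather than decoded into the characters they stand for:

```python
import re

# Matches either a tag or a named/decimal/hex character entity.
CLEAN_RE = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

def cleanhtml(raw_html):
    """Remove tags and entities; entities are dropped, not decoded."""
    return CLEAN_RE.sub('', raw_html)

# Hypothetical sample input, chosen to contain three kinds of entity.
sample = '<p>Fish&nbsp;&amp;&#160;chips</p>'
print(cleanhtml(sample))  # Fishchips

# The text from the question; whitespace between tags is left in place.
text = '<div> <h1>Title</h1> <p>A long text........ </p> <a href=""> a link </a> </div>'
print(' '.join(cleanhtml(text).split()))
```

If you need the entity text preserved (for example '&amp;' turned into '&'), this pattern alone will not do that.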

This link contains more information about this.

Using BeautifulSoup

You can also use the third-party BeautifulSoup package to extract all the raw text.

When calling BeautifulSoup, you need to specify the parser explicitly. I recommend "lxml", as mentioned in other answers; it is much more robust than the default "html.parser" (the one available without additional installation).

    from bs4 import BeautifulSoup

    cleantext = BeautifulSoup(raw_html, "lxml").text

But this still requires an external library, so I recommend the first solution.

+193
Oct 19 '12 at 21:26

Python has several XML modules built in. The simplest one for the case where you already have a string containing the full HTML is xml.etree, which works (somewhat) similarly to the lxml example you mention:

    import xml.etree.ElementTree

    def remove_tags(text):
        return ''.join(xml.etree.ElementTree.fromstring(text).itertext())
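
One caveat worth illustrating: xml.etree is an XML parser, so it only accepts well-formed input. A minimal sketch, assuming Python 3 and using sample strings of my own (well_formed and not_xml are illustrative names):

```python
import xml.etree.ElementTree as ET

def remove_tags(text):
    """Concatenate all text nodes of a well-formed XML/XHTML fragment."""
    return ''.join(ET.fromstring(text).itertext())

well_formed = '<div> <h1>Title</h1> <p>A long text</p> <a href=""> a link </a> </div>'
print(remove_tags(well_formed))

# <br> without a closing slash is valid HTML but not valid XML,
# so the parser rejects it instead of returning text.
not_xml = '<div>line one<br>line two</div>'
try:
    remove_tags(not_xml)
except ET.ParseError as exc:
    print('not well-formed:', exc)
```

So this approach suits markup you know is XML-clean; for arbitrary HTML from the web it may raise instead.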
+37
Mar 12 2018-12-12T00:

Note that this is not perfect: given something like <a title=">">, the regex stops at the first > and leaves "> behind. However, it is about the closest you can get in pure Python, without a library, short of a really complex function:

    import re

    TAG_RE = re.compile(r'<[^>]+>')

    def remove_tags(text):
        return TAG_RE.sub('', text)
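
To make the limitation concrete, here is a small sketch (the two input strings are my own examples) showing the ordinary case and the quoted '>' case side by side:

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

# Ordinary markup is stripped as expected.
print(remove_tags('<p>plain <b>markup</b></p>'))  # plain markup

# A '>' inside an attribute value ends the match too early,
# so the tail of the opening tag leaks into the output.
print(remove_tags('<a title=">">link</a>'))       # ">link
```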

However, as another answer mentions, xml.etree is available in the Python standard library, so you could probably just adapt it to serve like your existing lxml version:

    import xml.etree.ElementTree

    def remove_tags(text):
        return ''.join(xml.etree.ElementTree.fromstring(text).itertext())
+27
Mar 12 2018-12-12T00:

There is a simple way to do this that would work in any C-like language. The style is not Pythonic, but it works in pure Python:

    def remove_html_markup(s):
        tag = False
        quote = False
        out = ""

        for c in s:
            if c == '<' and not quote:
                tag = True
            elif c == '>' and not quote:
                tag = False
            elif (c == '"' or c == "'") and tag:
                quote = not quote
            elif not tag:
                out = out + c

        return out

The idea, based on a simple finite state machine, is explained in detail here: http://youtu.be/2tu9LTDujbw

You can see how it works here: http://youtu.be/HPkNPcYed9M?t=35s
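
For illustration only, here is a hedged variant of the same state-machine idea. It is not the code from the videos: the names in_tag and quote_char are my own, and the one change is that it remembers which quote character opened an attribute value, so an apostrophe inside a double-quoted value does not flip the state.

```python
def remove_html_markup(s):
    """Strip tags with a small state machine (illustrative variant)."""
    in_tag = False      # True while between '<' and the matching '>'
    quote_char = None   # quote that opened the current attribute value, if any
    out = []

    for c in s:
        if quote_char is not None:
            # Inside a quoted attribute value: only the same quote ends it.
            if c == quote_char:
                quote_char = None
        elif in_tag:
            if c == '>':
                in_tag = False
            elif c in '"\'':
                quote_char = c
        elif c == '<':
            in_tag = True
        else:
            out.append(c)

    return ''.join(out)

print(remove_html_markup('<a title=">">link</a>'))     # link
print(remove_html_markup('<a title="it\'s">x</a>'))    # x
print(remove_html_markup('"quoted" <b>text</b>'))      # "quoted" text
```

Like the original, this treats every '<' in running text as the start of a tag, so it is a sketch for simple markup rather than a full HTML parser.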

PS - If you are interested in the class (about smart debugging with Python), here is the link: http://www.udacity.com/overview/Course/cs259/CourseRev/1 . It's free!

+5
Jan 22 '13 at 17:27
    # Module-level accumulator; it is not reset between calls.
    temp = ''
    s = ' '

    def remove_strings(text):
        global temp
        if text == '':
            return temp
        start = text.find('<')
        end = text.find('>')
        # No tag delimiters left: append the remaining text and stop.
        if start == -1 and end == -1:
            temp = temp + text
            return temp
        # Skip past the current tag and keep the text before the next one.
        newstring = text[end+1:]
        fresh_start = newstring.find('<')
        if newstring[:fresh_start] != '':
            temp += s + newstring[:fresh_start]
        remove_strings(newstring[fresh_start:])
        return temp
-6
Feb 25 '13 at 9:39