Removing HTML image tags and everything in between from a string

I have seen some questions about removing HTML tags from strings, but I am still a bit unclear on how my particular case should be handled.

I have seen that many posts advise against using regular expressions to process HTML, but I suspect my case may justify a reasonable circumvention of this rule.

I am trying to parse PDF files, and I was able to successfully convert each page from my sample PDF file to a UTF-32 text string. When images appear, an HTML-style tag is inserted that contains the name and location of the image (which is saved elsewhere).

In a separate part of my application, I need to get rid of these image tags. Since we are dealing only with image tags, I suspect that using a regular expression may be warranted.

My question is twofold:

  • Should I use regex to remove these tags or should I use an HTML parser like BeautifulSoup?
  • Which regex or BeautifulSoup construct should I use? In other words, how do I code this?

For clarity, tags are structured as <img src="/path/to/file"/>

Thanks!

+7
3 answers

I would say that in your case it is acceptable to use a regular expression. Something like this should work:

    import re

    def remove_html_tags(data):
        # Non-greedy match: strips every <...> tag, not just images.
        p = re.compile(r'<.*?>')
        return p.sub('', data)

I found this snippet here (http://love-python.blogspot.com/2008/07/strip-html-tags-using-python.html)

Edit: a version that only removes elements of the form <img ... />:

    def remove_img_tags(data):
        p = re.compile(r'<img.*?/>')
        return p.sub('', data)
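As a quick check against the tag format given in the question, here is a minimal sketch. The sample string and the name `page_text` are made up for illustration. It also swaps `.*?` for `[^>]*`, so a match can never run past the closing `>` of the tag, and it accepts `<img ...>` without the trailing slash; treat it as one possible variant rather than the pattern from the linked snippet.

```python
import re

# Matches <img ...> and <img .../>; [^>]* keeps the match inside a single tag.
# \b stops the pattern from matching longer names that merely start with "img".
IMG_TAG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)

def remove_img_tags(data):
    """Return data with every <img ...> tag removed."""
    return IMG_TAG.sub('', data)

# Hypothetical page text, using the tag format from the question.
page_text = 'Figure 1 <img src="/path/to/file"/> shows the layout.'
print(remove_img_tags(page_text))
```

Note that the whitespace on both sides of the tag is left alone, so the result contains two consecutive spaces where the tag used to be.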
+8

Since this text contains only image tags, a regex is probably fine. For anything more involved, you're better off using a bona fide HTML parser. Fortunately, Python provides one! The example below is fairly bare-bones; to be fully functional it would have to handle many more corner cases. (In particular, XHTML-style empty tags, which end with a slash as in <... />, are not handled properly here.)

    >>> from HTMLParser import HTMLParser
    >>>
    >>> class TagDropper(HTMLParser):
    ...     def __init__(self, tags_to_drop, *args, **kwargs):
    ...         HTMLParser.__init__(self, *args, **kwargs)
    ...         self._text = []
    ...         self._tags_to_drop = set(tags_to_drop)
    ...     def clear_text(self):
    ...         self._text = []
    ...     def get_text(self):
    ...         return ''.join(self._text)
    ...     def handle_starttag(self, tag, attrs):
    ...         if tag not in self._tags_to_drop:
    ...             self._text.append(self.get_starttag_text())
    ...     def handle_endtag(self, tag):
    ...         self._text.append('</{0}>'.format(tag))
    ...     def handle_data(self, data):
    ...         self._text.append(data)
    ...
    >>> td = TagDropper([])
    >>> td.feed('A line of text\nA line of text with an <img url="foo"> tag\nAnother line of text with a <br> tag\n')
    >>> print td.get_text()
    A line of text
    A line of text with an <img url="foo"> tag
    Another line of text with a <br> tag

And to drop the img tags ...

    >>> td = TagDropper(['img'])
    >>> td.feed('A line of text\nA line of text with an <img url="foo"> tag\nAnother line of text with a <br> tag\n')
    >>> print td.get_text()
    A line of text
    A line of text with an  tag
    Another line of text with a <br> tag
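The session above uses the Python 2 module name `HTMLParser`; in Python 3 the same class lives in `html.parser`. The following is a rough Python 3 sketch of the same idea, not the answer's original code. It additionally overrides `handle_startendtag`, so XHTML-style tags such as `<img src="/path/to/file"/>` are dropped as well, and it echoes entity references back unchanged. The helper `drop_tags` is a hypothetical convenience wrapper added for illustration.

```python
from html.parser import HTMLParser

class TagDropper(HTMLParser):
    """Re-emit the input, leaving out the tags named in tags_to_drop."""

    def __init__(self, tags_to_drop):
        # convert_charrefs=False routes entities such as &amp; to the
        # handle_entityref/handle_charref hooks so they can be echoed as-is.
        super().__init__(convert_charrefs=False)
        self._text = []
        self._tags_to_drop = set(tags_to_drop)

    def get_text(self):
        return ''.join(self._text)

    def handle_starttag(self, tag, attrs):
        if tag not in self._tags_to_drop:
            self._text.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style empty tags such as <img src="..."/>.
        if tag not in self._tags_to_drop:
            self._text.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self._tags_to_drop:
            self._text.append('</{0}>'.format(tag))

    def handle_data(self, data):
        self._text.append(data)

    def handle_entityref(self, name):
        self._text.append('&{0};'.format(name))

    def handle_charref(self, name):
        self._text.append('&#{0};'.format(name))


def drop_tags(text, tags_to_drop):
    """Hypothetical helper: run TagDropper over text and return the result."""
    parser = TagDropper(tags_to_drop)
    parser.feed(text)
    parser.close()
    return parser.get_text()


print(drop_tags('An <img src="/path/to/file"/> tag and a <br> tag\n', ['img']))
```

Comments, doctype declarations, and processing instructions are still silently discarded by this sketch, so it remains suitable only for simple input like the text described in the question.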
+3

My solution:

    import re

    def remove_HTML_tag(tag, string):
        # Remove opening tags such as <tag ...>, then closing tags such as </tag>.
        string = re.sub(r"<\b(" + tag + r")\b[^>]*>", r"", string)
        return re.sub(r"<\/\b(" + tag + r")\b[^>]*>", r"", string)
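To illustrate what this approach does and does not remove, here is a small sketch on a made-up sample string. It is a lightly adapted variant, not the code above verbatim: the name `remove_html_tag` is lowercased, and the tag name is passed through `re.escape` on the assumption that callers supply a literal tag name rather than a pattern.

```python
import re

def remove_html_tag(tag, text):
    """Remove opening and closing tags named `tag`, keeping the text between them."""
    name = re.escape(tag)  # assumption: tag is a literal name, not a regex
    # \b after the name keeps "b" from matching "<br>" or "<body>".
    text = re.sub(r"<" + name + r"\b[^>]*>", "", text)
    return re.sub(r"</" + name + r"\b[^>]*>", "", text)

# Hypothetical sample text for illustration.
sample = 'Some <b>bold</b> text<br> and an <img src="/path/to/file"/> tag'
print(remove_html_tag("b", sample))
print(remove_html_tag("img", sample))
```

Only the tags themselves are removed; any text between an opening and a closing tag stays in the string, which is what the question needs for self-contained image tags.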
0
