Processing an HTML file using Python

I want to remove all tags from an HTML file, and I am using Python's re module for this. For example, given the string <h1>Hello World!</h1>, I want to keep only "Hello World!". To remove the tags, I used re.sub('<.*>', '', string). The result is an empty string, because the regex matches from the first opening angle bracket to the last closing one and removes everything in between. How do I solve this problem?

5 answers

You can make the match non-greedy: '<.*?>'

You also need to be careful: HTML is a tricky beast, and real-world markup can easily trip up your regular expressions.
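A minimal sketch of the difference, using only the standard re module. The variable names and the "tricky" sample string are made up for illustration; the last case shows one way a regex can go wrong on real markup.

```python
import re

html = "<h1>Hello World!</h1>"

# Greedy: .* runs to the last '>' in the string, so everything is removed.
greedy = re.sub(r"<.*>", "", html)

# Non-greedy: .*? stops at the first '>', so each tag is removed on its own.
lazy = re.sub(r"<.*?>", "", html)

# A negated character class gives the same result on this input.
negated = re.sub(r"<[^>]*>", "", html)

# Where this breaks down: a '>' inside an attribute value ends the match early.
tricky = '<a title="a > b">link</a>'
broken = re.sub(r"<.*?>", "", tricky)

print(repr(greedy))   # ''
print(repr(lazy))     # 'Hello World!'
print(repr(negated))  # 'Hello World!'
print(repr(broken))   # ' b">link'
```

The last result still contains part of the tag, which is why the answers recommend an HTML parser for anything beyond simple, well-behaved input.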


Parse the HTML with BeautifulSoup, then get only the text.
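A rough sketch of what that might look like, assuming the third-party beautifulsoup4 package is installed (imported as bs4) and using Python's built-in "html.parser" backend. The sample markup and variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = "<h1>Hello World!</h1><p>Some <b>bold</b> text.</p>"

# Parse with the parser that ships with Python, so no extra backend is needed.
soup = BeautifulSoup(html, "html.parser")

# get_text() concatenates every text node in document order.
text = soup.get_text()

# A separator plus strip=True trims each piece and joins them with spaces.
spaced = soup.get_text(separator=" ", strip=True)

print(text)    # Hello World!Some bold text.
print(spaced)  # Hello World! Some bold text.
```

To process a file rather than a string, read the file contents first and pass them to BeautifulSoup in the same way.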


Alternatively, use lxml instead of BeautifulSoup:

import lxml.html

# mystring holds the HTML source as a string
print(lxml.html.fromstring(mystring).text_content())



Use Beautiful Soup to process the HTML!

You may not strictly need it here, but you should learn how to use it. It will also help you in the future.
