Processing HTML file using Python

Question

I want to remove all tags from an HTML file. For this, I used Python's re module. For example, given the string <h1>Hello World!</h1>, I want to keep only "Hello World!". To remove the tags, I used re.sub('<.*>', '', string). For obvious reasons, the result is an empty string (the regex matches from the first opening angle bracket to the last closing one and removes everything in between). How do I solve this problem?

0

python regex

PaulDaviesC Oct 08 '11 at 3:34

5 answers

Parse the HTML with BeautifulSoup, then get only the text.
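The parse-then-extract-text idea can be sketched as follows. This sketch deliberately uses the standard library's html.parser instead of BeautifulSoup so that it runs without third-party packages; the names TextExtractor and strip_tags are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves never reach here.
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of `html` with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_tags("<h1>Hello World!</h1>"))  # Hello World!
```

Note that this simple version also keeps the contents of script and style elements, since the parser reports them as text. With BeautifulSoup installed, the equivalent call would be roughly BeautifulSoup(html, "html.parser").get_text().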

+1

Sunjay Varma Oct 08 '11 at 3:36

Read about greedy versus non-greedy matching: http://docs.python.org/release/2.6/howto/regex.html#greedy-versus-non-greedy

Off-topic: see lxml, http://lxml.de/

+1

akonsu Oct 08 '11 at 3:39

Use a parser, either lxml or BeautifulSoup:

import lxml.html
# mystring holds the HTML source; text_content() returns the text without tags
print(lxml.html.fromstring(mystring).text_content())

+1

Marco Mariani Oct 08 '11 at 3:55

Use Beautiful Soup to parse the HTML!

You may not need it for this task, but it is worth learning and will help you in the future.

0

varunl Oct 08 '11 at 6:22

Ned Batchelder · Accepted Answer · 2011-10-08T03:38:55+0000

You can make the match non-greedy: '<.*?>'
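A minimal sketch of the difference, using the string from the question: the greedy pattern consumes everything from the first < to the last >, while adding ? after the * makes the match stop at the first >. The variable names are only for illustration.

```python
import re

html = "<h1>Hello World!</h1>"

# Greedy: .* runs to the LAST '>' in the string, so the whole string matches.
greedy = re.sub(r"<.*>", "", html)

# Non-greedy: .*? stops at the FIRST '>', so each tag is matched separately.
lazy = re.sub(r"<.*?>", "", html)

print(repr(greedy))  # ''
print(repr(lazy))    # 'Hello World!'
```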

You also need to be careful: HTML is a tricky beast and can thwart your regular expressions.
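One way HTML can trip up the pattern, as a small illustrative case: a > inside an attribute value ends the match too early, so part of the tag leaks into the result. The sample string is made up for this sketch.

```python
import re

# The '>' inside the title attribute is legal HTML but fools the regex.
tricky = '<a title="a > b">link</a>'

result = re.sub(r"<.*?>", "", tricky)
print(repr(result))  # ' b">link'
```

Comments, script blocks and CDATA sections cause similar surprises, which is why the other answers recommend a real parser.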
