Python regex

Question


str1 = 'abdk3<h1>The content we need</h1>aaaaabbb<h2>The content we need2</h2>'

We need the contents inside the h1 tag and the h2 tag.

What is the best way to do this? Thanks for the help!

+1

python regex

user469652 Nov 15 '10 at 7:36


2 answers

First tip: DON'T USE REGULAR EXPRESSIONS FOR HTML / XML PARSING!

Now that we've got that out of the way, I suggest you look at Beautiful Soup. Other SGML/XML/HTML parsers are available for Python. However, this one is a favorite for dealing with the messy "tag soup" that most of us find in the real world. It does not require the input to conform to standards or to be well-formed. If your browser can manage to render it, then Beautiful Soup can probably manage to parse it.
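As an illustration of one of those other parsers, here is a minimal sketch that uses `html.parser` from the Python 3 standard library instead of Beautiful Soup. The class name `HeadingCollector` and its attributes are made up for this example, and it assumes the headings are not nested inside each other.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Hypothetical helper: collect the text inside <h1> and <h2> elements."""

    def __init__(self):
        super().__init__()
        self.current = None   # heading tag we are currently inside, if any
        self.headings = []    # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in ("h1", "h2"):
            self.current = tag
            self.headings.append((tag, ""))

    def handle_data(self, data):
        # Only keep text that appears while we are inside a heading.
        if self.current is not None:
            tag, text = self.headings[-1]
            self.headings[-1] = (tag, text + data)

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None


str1 = 'abdk3<h1>The content we need</h1>aaaaabbb<h2>The content we need2</h2>'
collector = HeadingCollector()
collector.feed(str1)
collector.close()
print(collector.headings)
```

Because the parser normalizes tag case and hands attributes over separately, the same class should also cope with input such as `<H1 class="title">...</H1>` without changes.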

(Still tempted to use regular expressions for this task? Thinking "it can't be that bad, I just want to extract exactly what is in the <h1>...</h1> and <h2>...</h2> containers" and "I'll never need to handle any other corner cases"? That way lies madness. The code you write based on this line of reasoning will be fragile. It will work just well enough to pass your tests, and then it will get worse and worse every time you need to fix "just one more thing." Seriously, import a real parser and use it.)
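To make the fragility concrete, here is a small sketch. The pattern and the sample strings are made up for illustration; only the first sample has the shape of the string in the question.

```python
import re

# The "obvious" pattern for the question's input.
naive = re.compile(r"<h1>(.*?)</h1>")

samples = [
    '<h1>The content we need</h1>',           # same shape as the question
    '<h1 class="title">With attribute</h1>',  # attribute on the opening tag
    '<H1>Upper case</H1>',                    # different tag case
    '<h1>Split over\ntwo lines</h1>',         # newline inside the element
]

results = [naive.findall(s) for s in samples]
for sample, found in zip(samples, results):
    print(repr(sample), "->", found)
```

Only the first sample matches. Each of the others needs "just one more" tweak to the pattern (optional attributes, a case-insensitive flag, a flag that lets `.` match newlines), and real pages tend to keep producing new cases.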

+2

Jim Dennis Nov 15 '10 at 7:48


Chris Morgan · Accepted Answer · 2010-11-15T07:47:05+0000

The best way, if it should scale at all, is something like BeautifulSoup.

    >>> from BeautifulSoup import BeautifulSoup
    >>> soup = BeautifulSoup('abdk3<h1>The content we need</h1>aaaaabbb<h2>The content we need2</h2>')
    >>> soup.h1
    <h1>The content we need</h1>
    >>> soup.h1.text
    u'The content we need'
    >>> soup.h2
    <h2>The content we need2</h2>
    >>> soup.h2.text
    u'The content we need2'

This can also be done with a regex, but the above is probably closer to what you want. A larger example of what you want would help; without knowing exactly what you need to parse, it is hard to help properly.
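For completeness, a regex version might look roughly like the sketch below. The pattern is my own assumption about what "the contents inside the h1 tag and the h2 tag" means, and it is only reasonable for input as simple as the string in the question: it does not handle nested headings, comments, or malformed markup.

```python
import re

str1 = 'abdk3<h1>The content we need</h1>aaaaabbb<h2>The content we need2</h2>'

# Group 1 is the tag name (h1 or h2); \1 requires the matching closing tag.
# [^>]* allows optional attributes; DOTALL lets the content span newlines.
heading_re = re.compile(
    r"<(h[12])\b[^>]*>(.*?)</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)

matches = heading_re.findall(str1)
print(matches)
# [('h1', 'The content we need'), ('h2', 'The content we need2')]
```

With two groups in the pattern, `findall` returns a list of `(tag, content)` tuples in the order they appear in the string.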
