Need to find text with RegEx and BeautifulSoup

Question

I am trying to parse a website in order to pull out some data that is stored in the body, for example:

<body> <b>INFORMATION</b> Hookups: None Group Sites: No Station: No <b>Details</b> Ramp: Yes </body>

I would like to use BeautifulSoup4 and RegEx to pull the values for Hookups and Group Sites and so on, but I'm new to both bs4 and RegEx. I tried the following to get the value of Hookups:

 soup = BeautifulSoup(open('doc.html'))
 hookups = soup.find_all(re.compile("Hookups:(.*)Group"))

But the search returns an empty list.

python python-2.7 regex web-scraping beautifulsoup

bcoop713 May 07 '13 at 14:02

1 answer

Explosion pills · Accepted Answer · 2013-05-07T14:22:21+0000

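When you pass a pattern as the first argument, BeautifulSoup's find_all matches it against tag names, not against text, which is why your search comes back empty. If the HTML is simple, a plain regex is enough to get what you need. Otherwise, use find_all to locate the tags and then read their .text.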

 import re
 re.findall(r"Hookups:\s*(.*?)\s*Group", open('doc.html').read())
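To pull every field rather than just Hookups, the same idea can be wrapped in a small helper. This is only a sketch: `extract_fields` and the `LABELS` list are hypothetical names, the labels are assumed to be known in advance, and the sample string stands in for the contents of doc.html.

```python
import re

# Sample markup from the question; a real script would read doc.html instead.
html = ("<body> <b>INFORMATION</b> Hookups: None Group Sites: No "
        "Station: No <b>Details</b> Ramp: Yes </body>")

# Assumption: the field labels are known up front.
LABELS = ["Hookups", "Group Sites", "Station", "Ramp"]

def extract_fields(markup, labels=LABELS):
    """Return a {label: value} dict for each label found in the markup."""
    # A value ends at the next known label, the next tag, or the end of input.
    stop = "|".join(re.escape(label) + ":" for label in labels) + r"|<|$"
    fields = {}
    for label in labels:
        pattern = re.escape(label) + r":\s*(.*?)\s*(?=" + stop + ")"
        match = re.search(pattern, markup)
        if match:
            fields[label] = match.group(1)
    return fields

print(extract_fields(html))
```

The lazy `(.*?)` plus the lookahead keeps each value from running into the next label, which a bare `(.*)` would do when several fields share one line.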

You can also search the text inside tags by passing the text argument, as of BeautifulSoup 4.2:

 soup.find_all(text=re.compile("Hookups:(.*)Group"))
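Note that a text search hands back the matching text nodes whole, not the regex capture groups, so the value still has to be extracted afterwards. A minimal sketch, assuming the sample markup from the question and the built-in "html.parser" backend (newer BeautifulSoup releases prefer the name `string` for this argument):

```python
import re
from bs4 import BeautifulSoup

# Sample markup from the question; a real script would read doc.html instead.
html = ("<body> <b>INFORMATION</b> Hookups: None Group Sites: No "
        "Station: No <b>Details</b> Ramp: Yes </body>")

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"Hookups:\s*(.*?)\s*Group")

# Each result is an entire text node that contains a match.
nodes = soup.find_all(text=pattern)

# Re-apply the pattern to each node to get the captured value.
hookups = [pattern.search(node).group(1) for node in nodes]
print(hookups)
```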
