Need to find text with RegEx and BeautifulSoup

I am trying to parse a website in order to pull out some data that is stored in the body, for example:

<body> <b>INFORMATION</b> Hookups: None Group Sites: No Station: No <b>Details</b> Ramp: Yes </body> 

I would like to use BeautifulSoup4 and RegEx to pull the values for Hookups, Group Sites, and so on, but I'm new to both bs4 and RegEx. I tried the following to get the value of Hookups:

 soup = BeautifulSoup(open('doc.html'))
 hookups = soup.find_all(re.compile("Hookups:(.*)Group"))

But the search returns an empty list.

python regex web-scraping beautifulsoup
1 answer

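BeautifulSoup's find_all matches tags by default: a regex passed as its first argument is tested against tag names, not against the text inside them, which is why your search comes back empty. If the HTML is simple, you can get what you need with pure regex. Otherwise, you can use find_all to locate the relevant tags and then read their .text.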

 import re
 re.findall(r"Hookups:\s*(.*?)\s*Group Sites:", open('doc.html').read())
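If the markup sits on a single line, as in the sample from the question, a greedy `(.*)` runs to the end of that line, so each value needs an explicit stopping point. Here is a minimal sketch of pulling every label at once, assuming the label names are known in advance. The `html` string is the sample from the question, and `LABELS` is a hypothetical list you would adjust to the real page:

```python
import re

# Sample markup copied from the question; a real script would read doc.html instead.
html = ("<body> <b>INFORMATION</b> Hookups: None Group Sites: No "
        "Station: No <b>Details</b> Ramp: Yes </body>")

# Assumed set of labels to extract (hypothetical, not taken from the real site).
LABELS = ["Hookups", "Group Sites", "Station", "Ramp"]

names = "|".join(re.escape(label) for label in LABELS)
# A value runs lazily from "Label:" until the next label, the next tag, or the end of input.
pattern = re.compile(r"(%s):\s*(.*?)\s*(?=(?:%s):|<|$)" % (names, names))

# findall returns (label, value) pairs, which convert directly into a dict.
values = dict(pattern.findall(html))
print(values)
```

The lookahead leaves the next label unconsumed, so `findall` can pick it up as the start of the following match.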

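You can also search the text of the document, rather than its tags, by passing the regex as the text argument, as of BeautifulSoup 4.2 (Beautiful Soup 4.4.0 and later also accept the same filter under the name string). This returns the matching text nodes themselves, so the capture group still has to be applied to each result.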

 soup.find_all(text=re.compile("Hookups:(.*)Group"))
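To get from the matching text node to the actual value, the same pattern can be applied again to each result. This is a rough sketch, assuming the sample markup from the question and the built-in `html.parser` backend. It uses the `string` keyword, which is the newer name for `text` in Beautiful Soup 4.4.0 and later; on older versions, swap in `text=`:

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question; a real script would read doc.html instead.
html = ("<body> <b>INFORMATION</b> Hookups: None Group Sites: No "
        "Station: No <b>Details</b> Ramp: Yes </body>")
soup = BeautifulSoup(html, "html.parser")

# Lazy capture so the value stops at the whitespace before "Group".
pattern = re.compile(r"Hookups:\s*(.*?)\s*Group")

# Filtering by string yields whole text nodes that match, not the regex groups.
nodes = soup.find_all(string=pattern)

# Re-run the pattern on each node to extract the captured value.
hookups = [pattern.search(node).group(1) for node in nodes]
print(hookups)
```

Passing the parser name explicitly also avoids the warning BeautifulSoup emits when it has to guess which parser to use.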