BeautifulSoup.get_text() is not specific enough for my HTML parsing

Given the HTML below, I want to output only the h1 text, without "Details about", which is the text of the span nested inside the h1.

My current output gives:

    Details about  New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

I would like to get:

    New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

Here is the HTML I'm working with:

 <h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1> 

Here is my current code:

    for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
        print(line.get_text())

Note: I don't want to simply trim the string, because I would like this code to be reusable. Ideally, the code would leave out any text that is enclosed in the span.

2 answers

You can use extract() to remove all span tags:

    for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
        # Remove every nested <span> from the tree before reading the text
        for s in line('span'):
            s.extract()
        print(line.get_text())
        # => New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
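
Note that extract() modifies the parsed tree, so the span is gone from soup afterwards. If the tree is needed intact later, one possible variation is to strip the spans from a copy of the tag instead. The following is only a sketch under that assumption; the helper name text_without and its unwanted parameter are made up for illustration and are not part of Beautiful Soup.

```python
import copy

from bs4 import BeautifulSoup

html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = BeautifulSoup(html, "html.parser")


def text_without(tag, unwanted="span"):
    """Hypothetical helper: text of `tag` with all `unwanted` descendants removed.

    Works on a copy, so the original parse tree is left untouched.
    """
    clone = copy.copy(tag)  # copying a Tag also copies its contents
    for child in clone.find_all(unwanted):
        child.extract()  # detach the unwanted tag from the copy only
    return clone.get_text(strip=True)


h1 = soup.find("h1", attrs={"itemprop": "name"})
title = text_without(h1)
print(title)
```

Because the removal happens on the copy, h1.get_text() on the original tag still includes "Details about".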

One solution is to check whether each child of the h1 contains HTML when converted to a string, and to skip it if it does:

    from bs4 import BeautifulSoup

    html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
    soup = BeautifulSoup(html, 'html.parser')

    for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
        for content in line.contents:
            # Skip any child that parses to at least one tag
            if BeautifulSoup(str(content), 'html.parser').find() is not None:
                continue
            print(content)

Another solution (which I prefer) is to check whether each child is an instance of bs4.element.Tag:

    import bs4

    html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
    soup = bs4.BeautifulSoup(html, 'html.parser')

    for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
        for content in line.contents:
            # Skip nested tags; only plain strings are printed
            if isinstance(content, bs4.element.Tag):
                continue
            print(content)
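
Since the question asks for something reusable, the same idea can be wrapped in a small function. This is a minimal sketch rather than a definitive implementation: the name own_text is hypothetical, and it keeps the direct children that are NavigableString instances instead of skipping Tag instances, which should amount to the same thing for this markup.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def own_text(tag):
    """Hypothetical helper: join only the strings that are direct children of `tag`.

    Text inside nested tags (such as the span here) is ignored.
    """
    parts = [child for child in tag.children if isinstance(child, NavigableString)]
    return "".join(parts).strip()


html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = BeautifulSoup(html, "html.parser")

titles = [own_text(h1) for h1 in soup.find_all("h1", attrs={"itemprop": "name"})]
print(titles)
```

One caveat: Comment is a subclass of NavigableString, so HTML comments sitting directly inside the tag would be included as well. That does not matter for the markup in the question.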