BeautifulSoup get_text() is not specific enough for my HTML parsing

Question

Given the HTML below, I want to output only the h1's own text, without "Details about", which is the text of a span nested inside the h1.

My current output gives:

 Details about  New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

I would like:

 New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

Here is the HTML I'm working with:

 <h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>

Here is my current code:

 for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
     print(line.get_text())

Note: I don't want to simply trim the string, because I would like this code to be reusable. Ideally, the code would print only the text from a chosen part of the element.

python html regex beautifulsoup

Rorschach Jul 16 '15 at 18:57

2 answers

One solution is to re-parse each child of the h1 and skip it if it contains an HTML tag:

 from bs4 import BeautifulSoup
 html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
 soup = BeautifulSoup(html, 'html.parser')
 for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
     for content in line.contents:
         # Re-parse the child; find() returns a tag only if the child contains markup.
         if BeautifulSoup(str(content), 'html.parser').find() is not None:
             continue
         print(content)

Another solution (which I prefer) is to skip every child that is an instance of bs4.element.Tag:

 import bs4
 html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
 soup = bs4.BeautifulSoup(html, 'html.parser')
 for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
     for content in line.contents:
         # Child tags (such as the span) are skipped; plain text nodes are printed.
         if isinstance(content, bs4.element.Tag):
             continue
         print(content)
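Since the question asks for something reusable, the same idea can be wrapped in a small function. This is only a sketch: `own_text` is a hypothetical helper name, not part of BeautifulSoup, and it assumes bs4 4.4.0 or later, where `find_all` accepts `string=True` (older releases call that argument `text`). Note that `string=True` also matches comment nodes, so this assumes the element contains none.

```python
from bs4 import BeautifulSoup

HTML = (
    '<h1 class="it-ttl" itemprop="name" id="itemTitle">'
    '<span class="g-hdn">Details about &nbsp;</span>'
    "New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black"
    "</h1>"
)


def own_text(tag):
    """Return only the text that is a direct child of `tag` (hypothetical helper)."""
    # string=True matches text nodes only; recursive=False ignores text
    # nested inside child tags such as the <span>.
    pieces = tag.find_all(string=True, recursive=False)
    return "".join(pieces).strip()


soup = BeautifulSoup(HTML, "html.parser")
titles = [own_text(h1) for h1 in soup.find_all("h1", attrs={"itemprop": "name"})]
print(titles[0])
```

Because the function looks only at direct text children, it does not need to know which nested tags to skip, and it leaves the parsed tree unchanged.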

dm295 Jul 16 '15 at 21:18

Wiktor Stribiżew · Accepted Answer · 2015-07-16T22:23:17+0000

You can use extract() to remove all span tags:

 for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
     # Detach every <span> from the tree so get_text() no longer sees its text.
     for s in line('span'):
         s.extract()
     print(line.get_text())
 # => New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
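One thing to keep in mind is that extract() modifies the parsed tree, so the span is gone for any later code that uses the same soup. If that matters, a possible variant is to work on a copy of the element. The sketch below assumes bs4 4.4.0 or later, where `copy.copy()` on a tag is supported; `text_without` is a hypothetical helper name, and it uses `decompose()` instead of `extract()` because the removed tags are not needed afterwards.

```python
import copy

from bs4 import BeautifulSoup

HTML = (
    '<h1 class="it-ttl" itemprop="name" id="itemTitle">'
    '<span class="g-hdn">Details about &nbsp;</span>'
    "New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black"
    "</h1>"
)


def text_without(tag, *names):
    """Text of `tag` with descendants named in `names` removed (hypothetical helper)."""
    clone = copy.copy(tag)  # detached copy, so the original tree stays intact
    for unwanted in clone.find_all(list(names)):
        unwanted.decompose()  # drop the tag and its text from the copy only
    return clone.get_text().strip()


soup = BeautifulSoup(HTML, "html.parser")
h1 = soup.find("h1", attrs={"itemprop": "name"})
title = text_without(h1, "span")
print(title)
```

The tag names to drop are passed as arguments, for example `text_without(h1, "span", "b")`, so the same function can be reused for other elements.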
