BeautifulSoup: an easy way to get content without HTML tags

Question

I use this code to find all the interesting links on the page:

soup.findAll('a', href=re.compile(r'^notizia\.php\?idn=\d+'))

And it does its job very well. Unfortunately, inside the a tag there are many nested tags, such as font, b and various others. I would like to get just the textual content, without any other HTML tags.

Link example:

 <A HREF="notizia.php?idn=1134" OnMouseOver="verde();" OnMouseOut="blu();"><FONT CLASS="v12"><B>03-11-2009:&nbsp;&nbsp;<font color=green>CCS Ingegneria Elettronica-Sportello studenti ed orientamento</B></FONT></A>

Of course, this is ugly (and the markup is not always the same!), and I would like to get:

 03-11-2009: CCS Ingegneria Elettronica-Sportello studenti ed orientamento

The documentation says to pass the text=True argument to findAll, but then it ignores my regular expression. Why? How can I solve this?

+7

python html-parsing beautifulsoup html-content-extraction

Andrea Ambu Nov 17 '09 at 23:38


2 answers

Would you be interested in a pyparsing take on the problem?

 from pyparsing import makeHTMLTags, SkipTo, anyOpenTag, anyCloseTag, ParseException

 htmlsrc = """<A HREF="notizia.php?idn=1134" OnMouseOver="verde();" OnMouseOut="blu();"><FONT CLASS="v12"><B>03-11-2009:&nbsp;&nbsp;<font color=green>CCS Ingegneria Elettronica-Sportello studenti ed orientamento</B></FONT></A>"""

 # create pattern to find interesting <A> tags
 aStart, aEnd = makeHTMLTags("A")

 def matchInterestingHrefsOnly(t):
     # reject any <A> tag whose href is not a notizia.php link
     if not t.href.startswith("notizia.php?"):
         raise ParseException("not interested...")

 aStart.setParseAction(matchInterestingHrefsOnly)
 patt = aStart + SkipTo(aEnd)("body") + aEnd

 # create pattern to strip HTML tags, and convert HTML entities
 stripper = anyOpenTag.suppress() | anyCloseTag.suppress()

 def stripTags(s):
     s = stripper.transformString(s)
     s = s.replace("&nbsp;", " ")
     return s

 for match in patt.searchString(htmlsrc):
     print(stripTags(match.body))

Prints:

 03-11-2009: CCS Ingegneria Elettronica-Sportello studenti ed orientamento

This is actually pretty impervious to HTML vagaries, as it accounts for the presence or absence of attributes, upper or lower case, and so on.

+2

PaulMcG Nov 18 '09 at 0:45


Jonathan Feinberg · Accepted Answer · 2009-11-18 00:04

I used this:

 def textOf(soup):
     return u''.join(soup.findAll(text=True))

So...

 texts = [textOf(n) for n in soup.findAll('a', href=re.compile(r'^notizia\.php\?idn=\d+'))]
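Depending on the BeautifulSoup version and how it is configured to handle entities, the joined string may still contain `&nbsp;` entities or non-breaking space characters rather than plain spaces. Below is a minimal sketch of one way to tidy that up, assuming Python 3 and only the standard library. `clean_text` is a hypothetical helper name, not part of BeautifulSoup.

```python
import html


def clean_text(raw):
    """Decode leftover HTML entities and collapse runs of whitespace."""
    # html.unescape turns "&nbsp;" into the non-breaking space "\xa0"
    decoded = html.unescape(raw)
    # str.split() with no arguments also treats "\xa0" as whitespace,
    # so joining the pieces leaves single ordinary spaces
    return " ".join(decoded.split())


raw = ("03-11-2009:&nbsp;&nbsp;CCS Ingegneria Elettronica-"
       "Sportello studenti ed orientamento")
print(clean_text(raw))
```

It could be applied to the list above as `[clean_text(t) for t in texts]`. In the newer bs4 package, `tag.get_text()` should be able to play the role of `textOf`.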
