BeautifulSoup: an easy way to get HTML-free content

I use this code to find all the interesting links on the page:

soup.findAll('a', href=re.compile(r'^notizia\.php\?idn=\d+'))

And it does its job very well. Unfortunately, inside the a tag there are many nested tags, such as font, b and various other things. I would like to get just the textual content, without any HTML tags.

Link example:

 <A HREF="notizia.php?idn=1134" OnMouseOver="verde();" OnMouseOut="blu();"><FONT CLASS="v12"><B>03-11-2009:&nbsp;&nbsp;<font color=green>CCS Ingegneria Elettronica-Sportello studenti ed orientamento</B></FONT></A> 

Of course, this is ugly (and the markup is not always the same!), and I would like to get:

 03-11-2009: CCS Ingegneria Elettronica-Sportello studenti ed orientamento 

The documentation says to pass text=True to findAll, but then it ignores my regular expression. Why? How can I solve this?

+7
python html-parsing beautifulsoup html-content-extraction
Nov 17 '09 at 23:38
2 answers

I used this:

 def textOf(soup): return u''.join(soup.findAll(text=True)) 

So...

 texts = [textOf(n) for n in soup.findAll('a', href=re.compile(r'^notizia\.php\?idn=\d+'))]
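
For comparison, the same idea (join every text node found under each matching link) can be sketched with nothing but the Python 3 standard library. This is a stand-in built on html.parser, not BeautifulSoup's API. The names LinkTextCollector, link_texts and HREF_RE are made up for this sketch, and collapsing whitespace at the end is an assumption about the desired output.

```python
import re
from html.parser import HTMLParser

# Same pattern as in the question, as a raw string with the dot escaped.
HREF_RE = re.compile(r'^notizia\.php\?idn=\d+')


class LinkTextCollector(HTMLParser):
    """Collect the text inside every <a> whose href matches HREF_RE."""

    def __init__(self):
        # convert_charrefs=True decodes entities, so &nbsp; arrives as '\xa0'
        super().__init__(convert_charrefs=True)
        self.texts = []      # one cleaned string per matching link
        self._parts = None   # text chunks of the link being read, else None

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling us
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            if HREF_RE.match(href):
                self._parts = []

    def handle_data(self, data):
        # Nested tags (font, b, ...) are skipped; only their text is kept
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._parts is not None:
            # str.split() treats '\xa0' as whitespace, so runs collapse
            self.texts.append(' '.join(''.join(self._parts).split()))
            self._parts = None


def link_texts(html_source):
    parser = LinkTextCollector()
    parser.feed(html_source)
    parser.close()
    return parser.texts


sample = ('<A HREF="notizia.php?idn=1134" OnMouseOver="verde();" '
          'OnMouseOut="blu();"><FONT CLASS="v12"><B>03-11-2009:&nbsp;&nbsp;'
          '<font color=green>CCS Ingegneria Elettronica-Sportello studenti '
          'ed orientamento</B></FONT></A>')
print(link_texts(sample))
```

Links whose href does not match the pattern are skipped, so only the interesting ones end up in the result.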
+12
Nov 18 '09 at 0:04

Interested in a pyparsing take on the problem?

 from pyparsing import makeHTMLTags, SkipTo, anyOpenTag, anyCloseTag, ParseException

 htmlsrc = """<A HREF="notizia.php?idn=1134" OnMouseOver="verde();" OnMouseOut="blu();"><FONT CLASS="v12"><B>03-11-2009:&nbsp;&nbsp;<font color=green>CCS Ingegneria Elettronica-Sportello studenti ed orientamento</B></FONT></A>"""

 # create pattern to find interesting <A> tags
 aStart,aEnd = makeHTMLTags("A")
 def matchInterestingHrefsOnly(t):
     if not t.href.startswith("notizia.php?"):
         raise ParseException("not interested...")
 aStart.setParseAction(matchInterestingHrefsOnly)

 patt = aStart + SkipTo(aEnd)("body") + aEnd

 # create pattern to strip HTML tags, and convert HTML entities
 stripper = anyOpenTag.suppress() | anyCloseTag.suppress()
 def stripTags(s):
     s = stripper.transformString(s)
     s = s.replace("&nbsp;"," ")
     return s

 for match in patt.searchString(htmlsrc):
     print(stripTags(match.body))

Prints:

 03-11-2009: CCS Ingegneria Elettronica-Sportello studenti ed orientamento 

This is actually pretty impervious to HTML vagaries, as it takes into account the presence or absence of attributes, upper or lower case, and so on.
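
The stripTags helper in this answer converts only &nbsp;. If other entities turn up in the link text, a minimal sketch using Python 3's standard html.unescape could take over that one step. The name convert_entities and the sample string are made up for illustration, and mapping the no-break space to a plain space is an assumption about the desired output.

```python
import html


def convert_entities(s):
    # html.unescape decodes named and numeric character references alike
    s = html.unescape(s)
    # &nbsp; decodes to U+00A0 (no-break space); map it to a plain space
    return s.replace('\xa0', ' ')


print(convert_entities("03-11-2009:&nbsp;&nbsp;CCS &amp; Sportello"))
```

Each &nbsp; becomes one space, matching what the s.replace("&nbsp;"," ") line does, while &amp; and similar entities are decoded as well.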

+2
Nov 18 '09 at 0:45


