I use this code to find all the interesting links on the page:
soup.findAll('a', href=re.compile(r'^notizia\.php\?idn=\d+'))
It does its job very well. Unfortunately, the a tag contains many nested tags, such as font, b and others. I would like to get just the text content, without any HTML tags.
Link example:
<A HREF="notizia.php?idn=1134" OnMouseOver="verde();" OnMouseOut="blu();"><FONT CLASS="v12"><B>03-11-2009: <font color=green>CCS Ingegneria Elettronica-Sportello studenti ed orientamento</B></FONT></A>
Of course, this is ugly (and the markup is not always the same!), and I would like to get:
03-11-2009: CCS Ingegneria Elettronica-Sportello studenti ed orientamento
The documentation says to pass text=True to findAll, but then my regular expression is ignored. Why? How can I solve this?
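One possible direction, as a minimal sketch: keep the href filter and collect only the text nodes inside each matched link. With BeautifulSoup that would presumably mean joining link.findAll(text=True) for every tag returned by the findAll call on href. The sketch below shows the same idea with Python 3's standard-library html.parser instead of BeautifulSoup, so it runs without extra packages. LinkTextCollector is a hypothetical helper name, not part of any library.

```python
import re
from html.parser import HTMLParser

# Same pattern as in the question, written as a raw string.
LINK_RE = re.compile(r'^notizia\.php\?idn=\d+')


class LinkTextCollector(HTMLParser):
    """Hypothetical helper: gathers the plain text inside matching <a> tags."""

    def __init__(self):
        super().__init__()
        self.texts = []      # one string per matching link
        self._parts = None   # text fragments of the link being read, or None

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names.
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            if LINK_RE.match(href):
                self._parts = []

    def handle_data(self, data):
        # Only keep text while we are inside a matching link.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._parts is not None:
            self.texts.append(''.join(self._parts).strip())
            self._parts = None


html = ('<A HREF="notizia.php?idn=1134" OnMouseOver="verde();" OnMouseOut="blu();">'
        '<FONT CLASS="v12"><B>03-11-2009: <font color=green>CCS Ingegneria '
        'Elettronica-Sportello studenti ed orientamento</B></FONT></A>')

collector = LinkTextCollector()
collector.feed(html)
collector.close()
print(collector.texts)
```

Nested tags such as font and b are simply skipped, because only text nodes are recorded, so the varying markup inside the link should not matter.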
python html-parsing beautifulsoup html-content-extraction
Andrea Ambu, Nov 17 '09 at 23:38