I would also recommend BeautifulSoup, ideally with an approach like the one in the answer to this question, which I will copy here for those who do not want to follow the link:
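    import re
    import BeautifulSoup  # BeautifulSoup 3 module

    soup = BeautifulSoup.BeautifulSoup(html)  # html: the page source as a string
    texts = soup.findAll(text=True)  # every text node, including comments

    def visible(element):
        # Skip text inside tags that are not rendered in the page body.
        if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
            return False
        # Skip HTML comments; DOTALL lets '.' match newlines in multi-line comments.
        elif re.match('<!--.*-->', str(element), re.DOTALL):
            return False
        return True

    visible_texts = filter(visible, texts)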
I tried this on this page, for example, and it worked pretty well.
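If BeautifulSoup is not available, the same filtering idea can be approximated with the standard library alone. The following is only a rough sketch under that assumption: it swaps BeautifulSoup for Python 3's `html.parser`, and the names `HIDDEN_TAGS`, `VisibleTextParser`, and `visible_texts` are made up for illustration. It also drops whitespace-only text nodes, which the snippet above keeps.

```python
from html.parser import HTMLParser

# Tags whose text is not rendered in the page body (same list as above,
# minus BeautifulSoup's synthetic '[document]' node).
HIDDEN_TAGS = {"style", "script", "head", "title"}


class VisibleTextParser(HTMLParser):
    """Hypothetical helper: collects text that is not inside a hidden tag."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # greater than 0 while inside any hidden tag
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        # Comments never arrive here: HTMLParser routes them to
        # handle_comment, which does nothing by default.
        if self.hidden_depth == 0 and data.strip():
            self.texts.append(data.strip())


def visible_texts(html):
    """Return the stripped, non-empty visible text nodes of an HTML string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts
```

Calling `visible_texts(html)` on a page source string returns a list of text fragments in document order. Unlike BeautifulSoup, this parser does not repair badly nested markup, so an unclosed `<head>` or `<script>` would hide everything after it.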
Justin peel