I ran into the same problem trying to parse rendered HTML. BS doesn't seem to be the ideal package for this. @Del gives a great html2text solution.
In a different SO question, "BeautifulSoup get_text does not strip all tags and JavaScript", @Helge mentioned using nltk. Unfortunately, nltk seems to be discontinuing this method.
I tried both html2text and nltk.clean_html and was surprised by the timing results, so I thought they warranted an answer for posterity. Of course, the speed difference may depend heavily on the contents of the data ...
Answer from @Helge (nltk):

    import nltk
    %timeit nltk.clean_html(html)
    # was returning 153 us per loop
It worked really well, returning a string with the rendered HTML text. This nltk function was faster than html2text, roughly 20 times on this input (153 us vs. 3.09 ms per loop), although html2text is perhaps more robust.
Answer above from @del:

    betterHTML = html.decode(errors='ignore')
    %timeit html2text.html2text(betterHTML)
    # 3.09 ms per loop
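The `%timeit` lines above are IPython magics, so they only run inside IPython or a notebook. As a rough sketch, a similar comparison can be made in a plain script with the standard `timeit` module. Everything named here is an assumption for illustration: `stdlib_text` is a hypothetical stand-in built on `html.parser` (it is not html2text or nltk), `time_extractors` is a made-up helper, and `sample_html` is a toy input, so the numbers will not match the ones quoted above. html2text is timed only if it happens to be installed.

```python
import timeit
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical stand-in extractor: keeps text outside <script>/<style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only chunks and anything inside script/style.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def stdlib_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def time_extractors(html, extractors, number=200, repeat=3):
    """Return {name: best observed seconds per call} for each callable."""
    results = {}
    for name, func in extractors.items():
        runs = timeit.repeat(lambda f=func: f(html), number=number, repeat=repeat)
        results[name] = min(runs) / number
    return results


sample_html = (
    "<html><head><script>var x = 1;</script></head>"
    "<body><p>Hello <b>world</b></p></body></html>"
)

extractors = {"stdlib_text": stdlib_text}
try:
    import html2text  # optional third-party package
    extractors["html2text"] = html2text.html2text
except ImportError:
    pass

timings = time_extractors(sample_html, extractors)
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} us per call")
```

Taking the minimum over several repeats follows the usual `timeit` advice, since slower runs mostly reflect other activity on the machine. As noted above, results will vary a lot with the size and structure of the HTML being parsed.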
Paul, Nov 05 '13 at 17:53