Extract readable text from HTML using Python?

Question

I know about utilities like html2text and BeautifulSoup, but they also extract the JavaScript and mix it into the text, which makes the two hard to separate.

    htmlDom = BeautifulSoup(webPage)
    htmlDom.findAll(text=True)

Alternatively:

    from stripogram import html2text
    extract = html2text(webPage)

Both also extract all the JavaScript on the page, which is undesirable.

I just want to extract the readable text that you could copy from the page in a browser.

+4

python html text-extraction

demos Jul 03 '10 at 17:59

4 answers

You can remove the script tags with BeautifulSoup, with something like:

    for script in soup("script"):
        script.extract()

See "Removing elements" in the BeautifulSoup documentation.
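
With the current bs4 package on Python 3 (an assumption, since the answer above was written for an older BeautifulSoup), the same idea might be sketched as follows. The function name visible_text and the decision to drop style tags as well are illustrative choices, not part of the answer.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return the text of `html` with <script> and <style> contents removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Calling the soup is shorthand for find_all(); a list matches any of the names.
    for tag in soup(["script", "style"]):
        tag.decompose()  # drop the tag and everything inside it from the tree
    # Join the remaining strings, then collapse runs of whitespace.
    return " ".join(soup.get_text(separator=" ").split())


if __name__ == "__main__":
    page = (
        "<html><head><script>var x = 1;</script></head>"
        "<body><p>Hello <b>world</b></p></body></html>"
    )
    print(visible_text(page))
```

Note that this modifies the parsed tree in place, so parse the page again if you still need the scripts afterwards.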

+1

jkyle Jul 03 '10 at 18:35

Using BeautifulSoup, something like this:

    def _extract_text(t):
        if not t:
            return ""
        if isinstance(t, (unicode, str)):
            return " ".join(filter(None, t.replace("\n", " ").split(" ")))
        if t.name.lower() == "br":
            return "\n"
        if t.name.lower() == "script":
            return "\n"
        return "".join(extract_text(c) for c in t)

    def extract_text(t):
        return '\n'.join(x.strip() for x in _extract_text(t).split('\n'))

    print extract_text(htmlDom)
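
The snippet above is Python 2 (it relies on unicode and the print statement). A rough Python 3 port, assuming the bs4 package, could look like the sketch below. It differs from the original in a few places, all of them my own assumptions: it collapses whitespace without stripping each string, so words on either side of an inline tag stay separated; it returns nothing for script tags instead of a newline; and it also skips style tags, comments and the doctype.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import Comment, Doctype, NavigableString


def _extract_text(node):
    """Recursively collect text, turning <br> into a newline and skipping scripts."""
    if isinstance(node, (Comment, Doctype)):
        return ""  # not visible in a browser
    if isinstance(node, NavigableString):
        # Collapse any run of whitespace (including newlines) to one space.
        return re.sub(r"\s+", " ", node)
    if node.name == "br":
        return "\n"
    if node.name in ("script", "style"):
        return ""
    return "".join(_extract_text(child) for child in node.children)


def extract_text(node):
    """Return the collected text with each line stripped of surrounding spaces."""
    lines = (line.strip() for line in _extract_text(node).split("\n"))
    return "\n".join(lines)


if __name__ == "__main__":
    html = "<p>Hello <b>world</b><br>second   line</p><script>alert(1)</script>"
    print(extract_text(BeautifulSoup(html, "html.parser")))
```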

0

Forrest Voight Jul 03 '10 at 18:32

Try:

http://code.google.com/p/boilerpipe/

http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/

0

saravanan Feb 07 '12 at 18:38

Alex Martelli · Accepted Answer · 2010-07-03T18:39:25+0000

If you want to avoid extracting the contents of the script tag using BeautifulSoup,

 nonscripttags = htmlDom.findAll(lambda t: t.name != 'script', recursive=False)

will do this for you, getting the root's immediate children that are not script tags (and a separate htmlDom.findAll(recursive=False, text=True) will get the strings that are immediate children of the root). You need to do this recursively; for example, as a generator:

    def nonScript(tag):
        return tag.name != 'script'

    def getStrings(root):
        for s in root.childGenerator():
            if hasattr(s, 'name'):      # then it's a tag
                if s.name == 'script':  # skip it!
                    continue
                for x in getStrings(s):
                    yield x
            else:                       # it's a string!
                yield s

I use childGenerator (instead of findAll) so that I can just walk all the children in order and do my own filtering.
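
For reference, here is how the same generator might be written for Python 3 with the bs4 package (an assumption; the answer targets the older BeautifulSoup API). It uses the .children iterator where the answer uses childGenerator, and checks isinstance(child, Tag) rather than hasattr(s, 'name'), since I am not certain that strings lack a name attribute in every bs4 version. The name get_strings and the skipping of comments are my additions.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, Tag


def get_strings(root):
    """Yield the strings under `root` in document order, skipping <script> subtrees."""
    for child in root.children:
        if isinstance(child, Tag):
            if child.name == "script":
                continue  # skip the tag and everything inside it
            yield from get_strings(child)
        elif not isinstance(child, Comment):
            yield str(child)  # a plain text node


if __name__ == "__main__":
    soup = BeautifulSoup(
        "<div>one<script>var hidden = 1;</script><p>two</p>three</div>",
        "html.parser",
    )
    print(list(get_strings(soup)))  # ['one', 'two', 'three']
```

Because it is a generator, the caller decides how to join the pieces, for example "".join(get_strings(soup)) for the raw text or " ".join(s.strip() for s in get_strings(soup) if s.strip()) for a whitespace-normalised version.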
