Extract readable text from HTML using Python?

I know about utilities like html2text, BeautifulSoup, etc., but the problem is that they also extract javascript and add it to the text, which makes it difficult to separate them.

    htmlDom = BeautifulSoup(webPage)
    htmlDom.findAll(text=True)

As an alternative

    from stripogram import html2text
    extract = html2text(webPage)

Both of them also extract all the JavaScript on the page, which is undesirable.

I only need the text that you could select and copy from the page in a browser.

4 answers

If you want to avoid extracting the contents of the script tag using BeautifulSoup,

 nonscripttags = htmlDom.findAll(lambda t: t.name != 'script', recursive=False) 

will do this for you, returning the root's immediate children that are not script tags (a separate htmlDom.findAll(recursive=False, text=True) returns the strings that are immediate children of the root). You need to do this recursively; for example, as a generator:

    def nonScript(tag):
        return tag.name != 'script'

    def getStrings(root):
        for s in root.childGenerator():
            if hasattr(s, 'name'):      # then it's a tag
                if s.name == 'script':  # skip it!
                    continue
                for x in getStrings(s):
                    yield x
            else:                       # it's a string!
                yield s

I use childGenerator (instead of findAll) so that I can walk all the children in one pass and do my own filtering.
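The same idea of walking the document and skipping everything inside script tags can also be sketched without BeautifulSoup. The following is only a minimal sketch, assuming Python 3 and the standard-library html.parser module. The names VisibleTextParser and visible_text are made up for illustration, and skipping style as well as script is an added assumption, not part of the answer above.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}  # assumption: style is skipped too

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text of an HTML string, joined with single spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = "<html><head><script>var x = 1;</script></head><body><p>Hello <b>world</b></p></body></html>"
print(visible_text(page))
```

Unlike BeautifulSoup, this parser builds no tree, so it suits one-off extraction better than further DOM queries.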


You can remove the script tags with BeautifulSoup, something like:

    for script in soup("script"):
        script.extract()

See "Removing elements" in the BeautifulSoup documentation.
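With the current bs4 package, the remove-then-extract approach might look like the sketch below. It assumes Python 3 with bs4 installed. decompose() is used in place of extract() because the removed tags are not needed afterwards, and dropping style tags is an extra assumption. strip_scripts is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def strip_scripts(html):
    """Return the text of an HTML document without script/style content."""
    soup = BeautifulSoup(html, "html.parser")
    # Calling the soup object is shorthand for find_all(); remove each match in place.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join the remaining text nodes with single spaces, trimming each one.
    return soup.get_text(separator=" ", strip=True)


page = "<html><head><script>var x = 1;</script></head><body><p>Hello <b>world</b></p></body></html>"
print(strip_scripts(page))
```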


Using BeautifulSoup, something like this:

    def _extract_text(t):
        if not t:
            return ""
        if isinstance(t, (unicode, str)):
            return " ".join(filter(None, t.replace("\n", " ").split(" ")))
        if t.name.lower() == "br":
            return "\n"
        if t.name.lower() == "script":
            return "\n"
        return "".join(extract_text(c) for c in t)

    def extract_text(t):
        return '\n'.join(x.strip() for x in _extract_text(t).split('\n'))

    print extract_text(htmlDom)

Source: https://habr.com/ru/post/1314632/

