Best way to cut everything except text from a web page?

I want to take an html page and just extract the plain text on this page. Does anyone know a good way to do this in python?

I want to strip literally everything and leave only the article text and any other text between the tags. JS, CSS, etc. gone.

thanks!

+6
python
6 answers

The first answer here does not remove the contents of <style> or <script> tags if they are present on the page. This may come closer:

    import re

    def stripTags(text):
        # Remove <script>...</script> and <style>...</style> blocks first so their
        # contents disappear, then strip any remaining tags. re.DOTALL lets the
        # patterns match blocks that span multiple lines.
        scripts = re.compile(r'<script.*?/script>', re.DOTALL)
        css = re.compile(r'<style.*?/style>', re.DOTALL)
        tags = re.compile(r'<.*?>', re.DOTALL)
        text = scripts.sub('', text)
        text = css.sub('', text)
        text = tags.sub('', text)
        return text
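For example, applied to a small made-up snippet (just an illustration, not from the original answer):

    html = "<html><head><style>p { color: red; }</style></head>" \
           "<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"
    print(stripTags(html))  # prints: Hello world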
+5

You can try the pretty excellent Beautiful Soup:

 f = open("my_source.html","r") s = f.read() f.close() soup = BeautifulSoup.BeautifulSoup(s) txt = soup.body.getText() 
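If the page comes from the network rather than a local file, the same approach works on the fetched string. A minimal sketch, assuming Python 2 with BeautifulSoup 3 (which is what the BeautifulSoup.BeautifulSoup/getText API above implies) and a placeholder URL:

    import urllib2
    import BeautifulSoup

    # Fetch the page and extract the body text; the URL is only a placeholder.
    html = urllib2.urlopen("http://example.com/").read()
    soup = BeautifulSoup.BeautifulSoup(html)
    print soup.body.getText()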

But be careful: anything you get from any parsing attempt will be error prone. Bad HTML, bad parsing, and just generally unexpected output. If your source documents are well known and well formed, you should be fine, or at least able to work around their quirks; but if they are just random pages found on the Internet, expect all kinds of strange and wonderful outliers.

+4

From here:

    import re

    def remove_html_tags(data):
        p = re.compile(r'<.*?>')
        return p.sub('', data)

As the article notes, the re module must be imported in order to use regular expressions (the import is shown above).
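For instance (a quick illustration with made-up input):

    print(remove_html_tags('<p>Hello <b>world</b>!</p>'))
    # prints: Hello world!

    print(remove_html_tags('<script>var x = 1;</script><p>Hi</p>'))
    # prints: var x = 1;Hi  -- the script body survives, which is the limitation noted in the first answer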

+3

Consider the lxml.html module. However, removing CSS and JavaScript requires a bit of massaging:

    from lxml import html

    def stripsource(page):
        source = html.fromstring(page)
        # Drop <style>, <script> and comment nodes so their contents never
        # show up in the extracted text.
        for item in source.xpath("//style|//script|//comment()"):
            item.getparent().remove(item)
        for line in source.itertext():
            if line.strip():
                yield line

The resulting strings can simply be joined, but that may lose significant word boundaries if there is no whitespace around tags that would normally render as breaks.
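One way to keep those boundaries is to join on a space rather than on the empty string. A small sketch (the filename is just a placeholder):

    page = open("my_source.html").read()
    text = " ".join(stripsource(page))
    print(text)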

You can also iterate over only the <body> element, depending on your requirements.

+2

I would also recommend BeautifulSoup, but with something like the answer to this question, which I will copy here for those who do not want to click through:

    import re
    import BeautifulSoup

    soup = BeautifulSoup.BeautifulSoup(html)
    texts = soup.findAll(text=True)

    def visible(element):
        # Skip text nodes that live inside non-visible elements, and HTML comments.
        if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
            return False
        elif re.match('<!--.*-->', str(element)):
            return False
        return True

    visible_texts = filter(visible, texts)

I tried this on this very page, for example, and it worked pretty well.
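To turn the filtered pieces into a single block of plain text, you can then just join them (a sketch in the same Python 2 style as above):

    plain_text = ' '.join(t.strip() for t in visible_texts if t.strip())
    print plain_text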

+2

This was the simplest and easiest solution I found for stripping CSS and JavaScript:

    ''.join(BeautifulSoup(content).findAll(
        text=lambda text: text.parent.name != "script" and text.parent.name != "style"))
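For example, reading a local file (a sketch; the filename is a placeholder, and note that text from <head>, such as the <title>, is still included since only script and style parents are filtered out):

    from BeautifulSoup import BeautifulSoup

    content = open("my_source.html").read()
    text = ''.join(BeautifulSoup(content).findAll(
        text=lambda text: text.parent.name != "script" and text.parent.name != "style"))
    print text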


+1
