Best way to cut everything except text from a web page?

I want to take an html page and just extract the plain text on this page. Does anyone know a good way to do this in python?

I want to strip literally everything and leave only the article text and any other text between the tags. JS, CSS, etc. gone.

thanks!

+6
python
6 answers

The first answer here does not remove the contents of <style> or <script> tags if they are present on the page. This may come closer:

    import re

    def stripTags(text):
        # Remove <script>...</script> and <style>...</style> blocks first so their
        # contents disappear, then strip any remaining tags. re.DOTALL lets the
        # patterns match blocks that span multiple lines.
        scripts = re.compile(r'<script.*?/script>', re.DOTALL)
        css = re.compile(r'<style.*?/style>', re.DOTALL)
        tags = re.compile(r'<.*?>', re.DOTALL)
        text = scripts.sub('', text)
        text = css.sub('', text)
        text = tags.sub('', text)
        return text
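For example, applied to a small made-up snippet (just an illustration, not from the original answer):

    html = "<html><head><style>p { color: red; }</style></head>" \
           "<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"
    print(stripTags(html))  # prints: Hello world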
+5

You can try the pretty excellent Beautiful Soup:

 f = open("my_source.html","r") s = f.read() f.close() soup = BeautifulSoup.BeautifulSoup(s) txt = soup.body.getText() 
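If the page comes from the network rather than a local file, the same approach works on the fetched string. A minimal sketch, assuming Python 2 with BeautifulSoup 3 (which is what the BeautifulSoup.BeautifulSoup/getText API above implies) and a placeholder URL:

    import urllib2
    import BeautifulSoup

    # Fetch the page and extract the body text; the URL is only a placeholder.
    html = urllib2.urlopen("http://example.com/").read()
    soup = BeautifulSoup.BeautifulSoup(html)
    print soup.body.getText()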

But be careful: anything you get from any parsing attempt will be error prone. Bad HTML, bad parsing, and just generally unexpected output. If your source documents are well known and well formed, you should be fine, or at least able to work around their quirks; but if they are just random pages found on the Internet, expect all kinds of strange and wonderful outliers.

+4

From here:

    import re

    def remove_html_tags(data):
        p = re.compile(r'<.*?>')
        return p.sub('', data)

As the article notes, the re module must be imported in order to use regular expressions (the import is shown above).
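For instance (a quick illustration with made-up input):

    print(remove_html_tags('<p>Hello <b>world</b>!</p>'))
    # prints: Hello world!

    print(remove_html_tags('<script>var x = 1;</script><p>Hi</p>'))
    # prints: var x = 1;Hi  -- the script body survives, which is the limitation noted in the first answer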

+3

Consider the lxml.html module. However, removing CSS and JavaScript requires a bit of massaging:

    from lxml import html

    def stripsource(page):
        source = html.fromstring(page)
        # Drop <style>, <script> and comment nodes so their contents never
        # show up in the extracted text.
        for item in source.xpath("//style|//script|//comment()"):
            item.getparent().remove(item)
        for line in source.itertext():
            if line.strip():
                yield line

The resulting strings can simply be joined, but that may lose significant word boundaries if there is no whitespace around tags that would normally render as breaks.
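One way to keep those boundaries is to join on a space rather than on the empty string. A small sketch (the filename is just a placeholder):

    page = open("my_source.html").read()
    text = " ".join(stripsource(page))
    print(text)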

You can also iterate over only the <body> element, depending on your requirements.

+2

I would also recommend BeautifulSoup, but with something like the answer to this question, which I will copy here for those who do not want to click through:

    import re
    import BeautifulSoup

    soup = BeautifulSoup.BeautifulSoup(html)
    texts = soup.findAll(text=True)

    def visible(element):
        # Skip text nodes that live inside non-visible elements, and HTML comments.
        if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
            return False
        elif re.match('<!--.*-->', str(element)):
            return False
        return True

    visible_texts = filter(visible, texts)

I tried this on this very page, for example, and it worked pretty well.
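To turn the filtered pieces into a single block of plain text, you can then just join them (a sketch in the same Python 2 style as above):

    plain_text = ' '.join(t.strip() for t in visible_texts if t.strip())
    print plain_text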

+2

This was the simplest and easiest solution I found for stripping CSS and JavaScript:

    ''.join(BeautifulSoup(content).findAll(
        text=lambda text: text.parent.name != "script" and text.parent.name != "style"))
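For example, reading a local file (a sketch; the filename is a placeholder, and note that text from <head>, such as the <title>, is still included since only script and style parents are filtered out):

    from BeautifulSoup import BeautifulSoup

    content = open("my_source.html").read()
    text = ''.join(BeautifulSoup(content).findAll(
        text=lambda text: text.parent.name != "script" and text.parent.name != "style"))
    print text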


+1
