Extract news article content from saved .html pages

Question


I am reading text from html files and doing some analysis. These .html files are news articles.

The code:

    html = open(filepath,'r').read()
    raw = nltk.clean_html(html)
    raw.unidecode(item.decode('utf8'))

Now I just want the content of the article, not the rest of the text, such as advertisements, headlines, etc. How can I do this with reasonable accuracy in Python?

I know of tools such as Jsoup (a Java API) and boilerpipe, but I want to do this in Python. I found some techniques using bs4, but they are limited to one type of page, and I have news pages from numerous sources. There is also a dearth of sample code.

I am looking for something like this http://www.psl.cs.columbia.edu/wp-content/uploads/2011/03/3463-WWWJ.pdf in python.

EDIT: For better understanding, please include sample code that extracts the content of the following link: http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general

+7

python urllib2 bs4

Abhishek bhatia May 20, '15 at 17:03


5 answers

Newspaper is becoming increasingly popular. I have only used it superficially, but it looks good. It is Python 3 only.

The quickstart only shows loading from a URL, but you can load from an HTML string using:

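    import newspaper

    # Load the HTML into a string from the saved file
    # (`filepath` is the path to the .html page, as in the question)
    with open(filepath, 'r') as f:
        html = f.read()

    article = newspaper.Article('')  # a string is required as the `url` argument but is not used
    article.set_html(html)
    article.parse()      # extract the article from the supplied HTML
    text = article.text  # main body text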

+5

Harry Nov 22 '16 at 23:21


Try something like this by visiting the page directly:

    ## Import modules
    from bs4 import BeautifulSoup
    import urllib2

    ## Grab the page
    url = 'http://www.example.com'
    req = urllib2.Request(url)
    page = urllib2.urlopen(req)
    content = page.read()
    page.close()

    ## Prepare
    soup = BeautifulSoup(content)

    ## Parse (a table, for example)
    for link in soup.find_all("table", {"class": "myClass"}):
        # ...do something...
        pass

If you want to load a saved file instead, just replace the part where you grab the page with reading the file. More details here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
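As a minimal sketch of that substitution, assuming the pages were saved as UTF-8 (the helper name `load_saved_html` is hypothetical and not part of any library), the file can be read with the standard library and the resulting string used wherever `page.read()` was used:

```python
import io


def load_saved_html(filepath, encoding="utf-8"):
    """Read a saved .html page and return its markup as text.

    Assumption: the file is UTF-8 unless another encoding is passed in.
    Undecodable bytes are replaced instead of raising an error, since
    saved pages from many sources are often inconsistently encoded.
    """
    with io.open(filepath, "r", encoding=encoding, errors="replace") as handle:
        return handle.read()
```

The returned string can then be passed to the parser, for example `soup = BeautifulSoup(load_saved_html("article.html"))`, where `article.html` is a placeholder path.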

+3

datasci May 20, '15 at 17:17


You can use htmllib or HTMLParser to parse your HTML file:

    from HTMLParser import HTMLParser

    # create a subclass and override the handler methods
    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print "Encountered a start tag:", tag

        def handle_endtag(self, tag):
            print "Encountered an end tag :", tag

        def handle_data(self, data):
            print "Encountered some data  :", data

    # instantiate the parser and feed it some HTML
    parser = MyHTMLParser()
    parser.feed('<html><head><title>Test</title></head>'
                '<body><h1>Parse me!</h1></body></html>')

Sample code taken from HTMLParser page
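The same handler methods can be adapted to collect text instead of printing it. The following is only a sketch, written for Python 3 (where the module is `html.parser`); the names `VisibleTextParser` and `visible_text` are made up for illustration. It gathers every text node except the contents of `<script>` and `<style>`, so it removes markup but does not by itself separate the article from menus or advertisements:

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return the collected text nodes of an HTML string, one per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Calling `visible_text(html)` on the string read from a saved page gives plain text that can then be filtered further.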

+1

Maoknight May 20, '15 at 17:09


There are many ways to do HTML scraping in Python. As stated in other answers, tool #1 is BeautifulSoup, but there are others:

Here are some useful resources:

There is no universal way to find the content of an article. HTML5 has an article tag that hints at the body of the text, and it is possible to tailor the page cleanup to specific publishing systems, but there is no general way to get the exact location of the text. (Theoretically, a machine could infer the page structure from several articles that share the same structure but differ in content, but this is probably out of scope here.)
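For pages that do use the HTML5 article tag, a rough sketch with the standard library (Python 3) could look like the following. The names `ArticleTagParser` and `article_text` are invented for this example, and the approach simply returns an empty string for pages that have no `<article>` element, so it is a starting point and not a general solution:

```python
from html.parser import HTMLParser


class ArticleTagParser(HTMLParser):
    """Collect text that appears inside <article> elements only."""

    IGNORED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._article_depth = 0  # > 0 while inside <article>
        self._ignored_depth = 0  # > 0 while inside <script>/<style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self._article_depth += 1
        elif tag in self.IGNORED:
            self._ignored_depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self._article_depth > 0:
            self._article_depth -= 1
        elif tag in self.IGNORED and self._ignored_depth > 0:
            self._ignored_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self._article_depth > 0 and self._ignored_depth == 0:
            self.chunks.append(text)


def article_text(html):
    """Return the text inside <article> tags, or '' if the page has none."""
    parser = ArticleTagParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Text outside the `<article>` element, such as navigation or footer blocks, is dropped; an empty result signals that a site-specific selector or a dedicated extraction library is needed for that page.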

Also, web scraping with Python may be relevant.

Pyquery example for NYT:

    from pyquery import PyQuery as pq

    url = 'http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general'
    d = pq(url=url)
    text = d('.story-content').text()

+1

Roman susi May 20, '15 at 17:31


oxymor0n · Accepted Answer · 2015-05-20T18:00:56+0000

Python has libraries for this:

Since you mentioned Java, there is a Python wrapper for boilerpipe that lets you use it directly inside a Python script: https://github.com/misja/python-boilerpipe

If you want to use pure-Python libraries, there are two options:

https://github.com/buriy/python-readability

and

https://github.com/grangier/python-goose

Of the two, I prefer Goose, but be aware that recent versions sometimes fail to extract text for no apparent reason (my recommendation is to use version 1.0.22).

EDIT: here is a sample code using Goose:

    from goose import Goose
    from requests import get

    response = get('http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general')
    extractor = Goose()
    article = extractor.extract(raw_html=response.content)
    text = article.cleaned_text
