Python has libraries for this:
Since you mentioned Java, there is a Python wrapper for boilerpipe that lets you use it directly from a Python script: https://github.com/misja/python-boilerpipe
If you want to use pure Python libraries, there are two options:
https://github.com/buriy/python-readability
and
https://github.com/grangier/python-goose
Of the two, I prefer Goose, but keep in mind that recent versions sometimes fail to extract any text for no apparent reason (I recommend using version 1.0.22).
EDIT: here is some sample code using Goose:
    from goose import Goose
    from requests import get

    response = get('http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general')
    extractor = Goose()
    article = extractor.extract(raw_html=response.content)
    text = article.cleaned_text
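Because Goose can occasionally come back with no text, one possible approach is to try Goose first and fall back to python-readability when the result is blank. The following is only a rough sketch: the helper names `goose_text`, `readability_text` and `extract_with_fallback` are made up for illustration, and the readability part assumes that `from readability import Document` and `Document(html).summary()` are available in your installed version, and that lxml is present to strip the returned HTML down to text.

```python
def goose_text(html):
    # Assumption: python-goose is installed. Imported lazily so this
    # module still loads when the library is missing.
    from goose import Goose
    return Goose().extract(raw_html=html).cleaned_text


def readability_text(html):
    # Assumption: python-readability and lxml are installed.
    # summary() returns the main content as HTML, so the tags are stripped here.
    from readability import Document
    import lxml.html
    summary_html = Document(html).summary()
    return lxml.html.fromstring(summary_html).text_content()


def extract_with_fallback(html, extractors):
    """Return the first non-blank text produced by the extractor callables."""
    for extract in extractors:
        try:
            text = extract(html)
        except Exception:
            # A missing library or a parser error just means "try the next one".
            continue
        if text and text.strip():
            return text.strip()
    return ""
```

With the response from the sample above, this could be called as `extract_with_fallback(response.content, [goose_text, readability_text])`. An empty string means that none of the extractors found any text.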
oxymor0n