Given a standard HTML file containing CSS links, image links, etc., how can I extract only the useful text? By useful, I mean text that is relevant to the page. In the case of StackOverflow, that would be the text of the questions and answers. For a news site, it would be the body of the story.
One heuristic I could use is to decide whether a piece of text is a sentence or not: look for sequences of words that start with a capital letter and end with a full stop (crude, but something to start with).
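To make the idea concrete, here is a minimal sketch of that capital-letter-to-full-stop heuristic. The object name `SentenceHeuristic`, the `minWords` threshold, and the regex-based tag stripping are assumptions made for illustration only. Every tag is treated as a line break, so a sentence containing inline markup such as a link gets cut, and abbreviations like "e.g." end a sentence early.

```scala
object SentenceHeuristic {
  // Crude clean-up: drop <script>/<style> blocks, then turn every remaining tag into a line break.
  private val scriptOrStyle = "(?is)<(script|style)\\b.*?</\\1>".r
  private val anyTag = "(?s)<[^>]*>".r

  // A candidate sentence: an uppercase letter, then anything except
  // sentence punctuation or a line break, then a full stop.
  private val sentence = "\\p{Lu}[^.!?\\n]*\\.".r

  def stripTags(html: String): String =
    anyTag.replaceAllIn(scriptOrStyle.replaceAllIn(html, "\n"), "\n")

  // minWords is a hypothetical threshold to discard very short matches.
  def sentences(html: String, minWords: Int = 4): List[String] =
    sentence.findAllIn(stripTags(html))
      .map(_.trim.replaceAll("\\s+", " "))
      .filter(_.split(" ").length >= minWords)
      .toList
}
```

For example, calling `SentenceHeuristic.sentences` on `<ul><li>Home</li><li>About us</li></ul><p>The extractor keeps only full sentences. Menus are dropped</p>` keeps only "The extractor keeps only full sentences.", because the menu items and the last fragment never reach a full stop.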
What are the alternatives?
Update: boilerpipe, suggested by @ Wanai Jayaraman, seems to work well. I had to add the following Maven dependencies for boilerpipe:
<dependency>
<groupId>xerces</groupId>
<artifactId>xercesImpl</artifactId>
<version>2.11.0</version>
</dependency>
<dependency>
<groupId>net.sourceforge.nekohtml</groupId>
<artifactId>nekohtml</artifactId>
<version>1.9.21</version>
</dependency>
Code (Scala) for extracting text:
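import de.l3s.boilerpipe.extractors.ArticleExtractor

// Read the whole HTML file into one string, closing the file even if reading fails
val source = scala.io.Source.fromFile("c:\\news1.html")
val html = try source.mkString finally source.close()
// ArticleExtractor returns the main article text with the surrounding boilerplate removed
println(ArticleExtractor.INSTANCE.getText(html))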