Given an HTML file, extract only the meaningful text

Given a standard HTML file containing CSS links, image links, etc., how can I extract only the useful text? By useful, I mean text that is relevant to the page. In the case of StackOverflow, that is the text of the questions and answers. For a news site, it is the body of the story.

One algorithm I could use is to decide what is and is not a sentence: search for sequences of words that begin with a capital letter and end with a full stop (crude, but something to start with).
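The sentence heuristic described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the object name `SentenceHeuristic`, the `minWords` threshold, and the regex-based tag stripping are all hypothetical choices, not part of the question. Tags are treated as fragment boundaries, so a sentence interrupted by inline markup such as `<b>` would be split and likely missed.

```scala
object SentenceHeuristic {
  // Matches whole <script>...</script> and <style>...</style> blocks.
  private val scriptOrStyle = "(?is)<(script|style)\\b.*?</\\1\\s*>".r
  // Matches any remaining tag.
  private val anyTag = "(?s)<[^>]*>".r
  // A candidate sentence: a capital letter, then anything up to the first full stop.
  private val sentence = "[A-Z][^.!?]*\\.".r

  // minWords is an assumed threshold that filters out short fragments.
  def extractSentences(html: String, minWords: Int = 4): List[String] = {
    // Collapse whitespace first so that newlines can mark tag boundaries.
    val flat = html.replaceAll("\\s+", " ")
    val withoutCode = scriptOrStyle.replaceAllIn(flat, "\n")
    val fragments = anyTag.replaceAllIn(withoutCode, "\n").split("\n").toList
    for {
      fragment <- fragments
      normalized = fragment.trim
      candidate <- sentence.findAllIn(normalized).toList
      if candidate.split(" ").length >= minWords
    } yield candidate.trim
  }
}
```

Navigation bars and link labels usually lack a terminating full stop, so they tend to fall out, while abbreviations such as "Dr." will produce false splits.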

What are the alternatives?

Update: Boilerpipe, suggested by @ Wanai Jayaraman, seems to work well. I had to add the following Maven dependencies for Boilerpipe:

<dependency>
    <groupId>xerces</groupId>
    <artifactId>xercesImpl</artifactId>
    <version>2.11.0</version>
</dependency>

<dependency>
    <groupId>net.sourceforge.nekohtml</groupId>
    <artifactId>nekohtml</artifactId>
    <version>1.9.21</version>
</dependency>

Code (Scala) for extracting text:

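  // Read the whole HTML file into a string, closing the file even on failure.
  val source = scala.io.Source.fromFile("c:\\news1.html")
  val html = try source.mkString finally source.close()

  // ArticleExtractor returns the main article text with boilerplate removed.
  println(de.l3s.boilerpipe.extractors.ArticleExtractor.INSTANCE.getText(html))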

Try the Boilerpipe ArticleExtractor.

JSoup is another option.


HTML-, , id, .., .text(), HTML, .HTML. , . , .
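Selecting an element by its id and calling `.text()` on it can be sketched with Jsoup as below. This is a minimal sketch, assuming the `org.jsoup:jsoup` artifact is on the classpath; the object name `JsoupText`, the `contentId` parameter, and the list of tags removed up front are hypothetical choices for illustration.

```scala
import org.jsoup.Jsoup

object JsoupText {
  // Returns the text of the element with the given id, or the text of the
  // whole body when no such element exists.
  def extractText(html: String, contentId: String): String = {
    val doc = Jsoup.parse(html)
    // Drop elements that rarely hold readable content (an assumed list).
    doc.select("script, style, nav, footer").remove()
    // getElementById returns null when the id is absent, hence the Option.
    Option(doc.getElementById(contentId))
      .map(_.text())
      .getOrElse(doc.body().text())
  }
}
```

This only works well when the id of the main content container is known in advance, which is why it suits a single known site better than arbitrary pages.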
