Algorithm for reading the actual content of news articles and ignoring the "noise" on the page?

I am looking for an algorithm (or some other method) to read the actual content of news articles on websites and ignore everything else on the page. In a nutshell, I read the Google News RSS feed programmatically, and I want to extract the actual content of the articles it links to. In my first attempt, I took the URLs from the RSS feed, followed them, and scraped the HTML from each page. This very clearly produced a lot of “noise”: HTML tags, headers, navigation, and so on. Basically, everything that is not related to the actual content of the article.

Now, I understand that this is an extremely difficult problem to solve; in theory it would require a parser for every website. What I'm interested in is an algorithm (I'd even settle for an idea) for maximizing the actual content I get when I load an article and reducing the amount of noise.

A few additional notes:

  • HTML scraping is just the first thing I tried. I am not sold on it being the best way to do this.
  • I don’t want to write a parser for every website I come across; I need to handle the unpredictability of whatever Google serves through the RSS feed.
  • I know that whatever algorithm I end up with will not be perfect, but I'm interested in the best possible solution.

Any ideas?

+5
9

RSS- Readability, , . Javascript, , . , .

+2

, , , , , . .

+3

templatemaker ( Google). , , , . , .

diff , , . , , - , () .
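One way to read the diff idea, as a rough sketch and not a description of what templatemaker itself does: fetch two articles from the same site, treat the lines that appear in both as site template, and keep only the lines unique to the page of interest. The function name `strip_shared_template` is hypothetical, and the sketch assumes both pages have already been reduced to one text block per line.

```python
import difflib


def strip_shared_template(page_a: str, page_b: str) -> str:
    """Return the lines of page_a that are not shared, in order, with page_b."""
    # Normalise whitespace and drop blank lines so that layout differences
    # do not hide otherwise identical template lines.
    a_lines = [line.strip() for line in page_a.splitlines() if line.strip()]
    b_lines = [line.strip() for line in page_b.splitlines() if line.strip()]

    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    unique = []
    for tag, a_start, a_end, _b_start, _b_end in matcher.get_opcodes():
        # "equal" runs occur on both pages, so they are treated as template.
        # "insert" runs exist only on page_b and are irrelevant here.
        if tag in ("replace", "delete"):
            unique.extend(a_lines[a_start:a_end])
    return "\n".join(unique)
```

Comparing against more than one sibling page would likely make this more robust, since a single pair can share article-specific lines by coincidence (for example, a "related stories" box).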

+2

, robots.txt, , XML:

  • , . , "", " " " " . , , .

  • . . node , .

  • . , , , , , , javascript .

  • , (.. , ).

  • . - h1, h2 , , , , , - .

  • , (- ), ( copyright) . - , , , ( , , .)

+1
+1

, Boilerpipe.

, , . wiki:

Java "" (, ) -.

- , , , - #, : NBoilerpipe.

+1

( ) , :

, RSS- , , DOM. DOM ( DIV? ?) snip. .

, XML (HtmlAgilityPack ), () <p> Linq2Xml:

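            // Assumes `document` is an XDocument holding well-formed XHTML.
            // Needs: using System.Linq; using System.Xml.Linq;
            XNamespace xhtml = "http://www.w3.org/1999/xhtml";
            string articleText = string.Join(
                "\r\n\r\n",
                document
                    .Descendants(xhtml + "p")
                    // Concatenate every text node found inside each <p>.
                    .Select(p => string.Concat(
                        p.DescendantNodes()
                         .OfType<XText>()
                         .Select(t => t.Value)))
                    // Skip paragraphs that contain no text at all.
                    .Where(text => text.Length > 0));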

, , , , .
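The same paragraph-only idea can also be sketched for pages that are not well-formed XHTML. Below is a minimal version in Python using the standard library's `html.parser`. The names `ParagraphText` and `extract_paragraphs` are hypothetical, and the sketch assumes the article text lives in `<p>` elements, which will not hold for every site.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collects text found inside <p> elements and ignores everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_paragraph = False
        self.current = []      # text pieces of the paragraph being read
        self.paragraphs = []   # finished, non-empty paragraphs

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # HTML allows an unclosed <p>; a new <p> ends the previous one.
            self.flush()
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.current.append(data)

    def flush(self):
        # Join the pieces and collapse runs of whitespace into single spaces.
        text = " ".join("".join(self.current).split())
        if text:
            self.paragraphs.append(text)
        self.current = []


def extract_paragraphs(html: str) -> str:
    """Return the text of all <p> elements, separated by blank lines."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up a trailing paragraph that was never closed
    return "\n\n".join(parser.paragraphs)
```

Text in navigation, headers and footers is dropped as long as it sits outside `<p>` elements; anything a site places inside a `<p>` (captions, bylines, inline scripts) still comes through and would need further filtering.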

0

, , , , . , . , ..

, , , ? , (.. )?

0

You might want to look at Latent Dirichlet Allocation, an IR technique for generating topics from the text data you have. This should help you reduce the noise and get more precise information about what the page is about.

0