I am looking for an algorithm (or some other method) to extract the actual content of news articles on websites and ignore everything else on the page. In a nutshell, I read the RSS feed from Google News programmatically, and I want to cut out the actual content of the articles it links to. On my first attempt, I took the URLs from the RSS feed, followed them, and simply stripped the HTML from each page. This very clearly left a lot of "noise": leftover markup, headers, navigation, and so on; in short, everything that is not part of the actual article content.
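To make that concrete, here is roughly what my first attempt does (a minimal sketch; Python, the stdlib calls, and the feed URL are just for illustration, and the tag-stripping regexes are deliberately crude):

```python
import re
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://news.google.com/rss"  # example feed URL

def fetch(url: str) -> str:
    """Download a URL and return its body as text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

# Collect article URLs from the RSS <item><link> elements.
rss = ET.fromstring(fetch(FEED_URL))
links = [item.findtext("link") for item in rss.iter("item")]

for link in links[:5]:
    html = fetch(link)
    # Crude "clearing" of the HTML: drop scripts/styles, then all tags.
    text = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    print(text[:200])  # still full of navigation, footer text, etc.
```

The output is exactly the problem: article text mixed indistinguishably with everything else on the page.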
Now, I understand that this is an extremely difficult problem to solve in general; a perfect solution would essentially require a dedicated parser for each website. What I am interested in is an algorithm (even a heuristic one) that maximizes the actual content I keep when I load an article and minimizes the amount of noise.
A few additional notes:
- HTML scraping was just my first attempt; I am not sold that it is the best way to do this.
- I don't want to write a parser for every website I come across; I need the flexibility to accept whatever articles Google News serves up through the RSS feed.
- I know that no algorithm will ultimately be perfect, but I'm interested in the best possible solution.
Any ideas?