I am looking for an algorithm (or some other method) to extract the actual content of news articles on websites and ignore everything else on the page. In a nutshell, I read the RSS feed from Google News programmatically, and I want to cut out the actual content of the articles it links to. On my first attempt, I took the URLs from the RSS feed, followed them, and simply stripped the HTML from each page. This very clearly left a lot of "noise": leftover markup, headers, navigation, and so on; in short, everything that is not part of the actual article content.
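To make that concrete, here is roughly what my first attempt does (a minimal sketch; Python, the stdlib calls, and the feed URL are just for illustration, and the tag-stripping regexes are deliberately crude):

```python
import re
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://news.google.com/rss"  # example feed URL

def fetch(url: str) -> str:
    """Download a URL and return its body as text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

# Collect article URLs from the RSS <item><link> elements.
rss = ET.fromstring(fetch(FEED_URL))
links = [item.findtext("link") for item in rss.iter("item")]

for link in links[:5]:
    html = fetch(link)
    # Crude "clearing" of the HTML: drop scripts/styles, then all tags.
    text = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    print(text[:200])  # still full of navigation, footer text, etc.
```

The output is exactly the problem: article text mixed indistinguishably with everything else on the page.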
Now, I understand that this is an extremely difficult problem to solve in general; a perfect solution would essentially require a dedicated parser for each website. What I am interested in is an algorithm (even a heuristic one) that maximizes the actual content I keep when I load an article and minimizes the amount of noise.
A few additional notes:
- HTML scraping was just my first attempt; I am not sold that it is the best way to do this.
- I don't want to write a parser for every website I come across; I need the flexibility to accept whatever articles Google News serves up through the RSS feed.
- I know that no algorithm will ultimately be perfect, but I'm interested in the best possible solution.
Any ideas?