I'm working on an algorithm that, given an HTML file, tries to pick the parent element that most likely contains the bulk of the page's content text. For example, it should select the "content" div in the following HTML:
<html>
<body>
<div id="header">This is the header we don't care about</div>
<div id="content">This is the <b>Main Page</b> content. It is the longest block of text in this document and should be chosen as most likely being the important page content.</div>
</body>
</html>
I've come up with a few ideas, such as traversing the HTML document tree down to its leaves, summing up the length of the text, and only looking at the other text a parent has if the parent gives us more content than its children do.
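To make that idea concrete, here is a rough sketch of how I imagine it could look in Python, using only the standard library's html.parser. It is one possible reading of the heuristic, not a proven algorithm: it sums the text length under every element, then walks down from the root and keeps descending only while a single child holds more than half of its parent's text. The names Node, TreeBuilder, find_main_container and the dominance threshold of 0.5 are all made up for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay open on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text is not visible page content.
IGNORED_TAGS = {"script", "style"}


class Node:
    """Hypothetical minimal tree node; not part of any library."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.own_text = 0    # characters of text directly inside this element
        self.total_text = 0  # own_text plus the text of all descendants


class TreeBuilder(HTMLParser):
    """Builds a simple element tree and records text lengths."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.stack[-1])
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack[-1].tag in IGNORED_TAGS:
            return
        # Collapse whitespace so indentation between tags counts as nothing.
        self.stack[-1].own_text += len(" ".join(data.split()))


def compute_totals(node):
    """Fill in total_text bottom-up (the 'sum from the leaves' step)."""
    node.total_text = node.own_text + sum(compute_totals(c) for c in node.children)
    return node.total_text


def find_main_container(html, dominance=0.5):
    """Descend while one child holds more than `dominance` of its parent's text.

    The 0.5 default is an assumed threshold, not a tuned value.
    """
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    compute_totals(builder.root)

    node = builder.root
    while node.children:
        best = max(node.children, key=lambda c: c.total_text)
        if best.total_text <= dominance * node.total_text:
            break  # the parent's own or combined text matters more; stop here
        node = best
    return node
```

Fed the sample document above, this should walk html, then body, then stop at the "content" div, because the nested b element holds only a small share of that div's text. On real pages the threshold would probably need tuning, and navigation-heavy layouts may well fool it.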
Has anyone tried something similar, or does anyone know of an algorithm that could be applied here? It doesn't have to be perfect; as long as it can guess the container that holds most of the page's content text (for articles or blog posts, for example), that would be great.