Extract the largest block of text from an HTML document

I'm working on an algorithm that, given an HTML file, tries to pick the parent element that most likely contains the bulk of the page's content text. For example, it should select the "content" div in the following HTML:

    <html>
      <body>
        <div id="header">This is the header we don't care about</div>
        <div id="content">This is the <b>Main Page</b> content. it is the longest block of text in this document and should be chosen as most likely being the important page content.</div>
      </body>
    </html>

I have come up with a few ideas, such as traversing the HTML document tree down to its leaves, adding up the length of the text, and only counting the other text a parent has if the parent gives us more content than its children do.

Has anyone ever tried something like this, or does anyone know of an algorithm that can be applied? It doesn't have to be bulletproof; as long as it can guess the container that holds most of the page's content text (for articles or blog posts, for example), that would be great.

+4
5 answers

You could write something that looks for contiguous blocks of text, disregarding formatting tags where necessary. You can do this with a DOM parser: traverse the tree while keeping track of the immediate parent (because that is your output).

Start from the parent nodes and traverse the tree. For each child node that is only formatting, keep the “count” running within the current block. What you count is the characters of content.

Once you find the block with the most content, walk back up the tree to its parent to get your answer.

I think your solution depends on how you traverse the DOM and how you keep track of the nodes you are scanning.

What language are you using? Any other details about your project? There may be language-specific or package-specific tools you could use.
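As a rough sketch of the traversal described above: the code below assumes the document has already been parsed into plain objects of the hypothetical shape `{ tag, id, children }`, where each child is either a string or another node. The `FORMATTING` list and the function names are illustrative assumptions, not part of any library, and characters are counted without whitespace.

```javascript
// Hypothetical node shape: { tag: 'div', id: '...', children: [string | node, ...] }
// Tags treated as "only formatting" (an assumed, incomplete list).
const FORMATTING = new Set(['a', 'b', 'i', 'u', 'em', 'strong', 'span', 'code']);

// Counts the non-whitespace characters that belong to `node` itself,
// looking through formatting-only children. Every non-formatting
// descendant is scored separately and recorded in `out`.
function ownTextLength(node, out) {
  let count = 0;
  for (const child of node.children) {
    if (typeof child === 'string') {
      count += child.replace(/\s/g, '').length;
    } else if (FORMATTING.has(child.tag)) {
      count += ownTextLength(child, out); // keep counting within this block
    } else {
      out.push({ node: child, length: ownTextLength(child, out) });
    }
  }
  return count;
}

// Returns { node, length } for the block with the most text of its own.
function largestTextBlock(root) {
  const blocks = [];
  blocks.push({ node: root, length: ownTextLength(root, blocks) });
  return blocks.reduce((best, b) => (b.length > best.length ? b : best));
}

// The example page from the question, as a hand-built tree.
const page = {
  tag: 'body',
  children: [
    { tag: 'div', id: 'header', children: ["This is the header we don't care about"] },
    {
      tag: 'div',
      id: 'content',
      children: [
        'This is the ',
        { tag: 'b', children: ['Main Page'] },
        ' content. It is the longest block of text in this document.',
      ],
    },
  ],
};

const best = largestTextBlock(page);
console.log(best.node.id); // prints "content"
```

Because each block is credited only with its own text, `body` scores zero here and does not win simply for wrapping everything else.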

+1

One word: Boilerpipe

+9

Here's roughly how I would approach this:

    // Get an array of all elements (body is used as the parent here, but you could use whatever)
    var elms = document.body.getElementsByTagName('*');
    var nodes = Array.prototype.slice.call(elms, 0);

    // Get inline elements out of the way (incomplete list)
    nodes = nodes.filter(function (elm) {
      return !/^(a|br?|hr|code|i(ns|mg)?|u|del|em|s(trong|pan))$/i.test(elm.nodeName);
    });

    // Sort elements by most text first
    nodes.sort(function (a, b) {
      if (a.textContent.length == b.textContent.length) return 0;
      if (a.textContent.length > b.textContent.length) return -1;
      return 1;
    });

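Using ancestry functions like a.compareDocumentPosition(b), you can also sink elements during the sort (or after it), so that an ancestor does not outrank the descendant that holds nearly all of its text. How far you take this depends on how complex this thing needs to be.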
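One possible way to do the "after" variant, sketched without a real DOM so it runs anywhere: walk the sorted list and step down from the current pick to a descendant whenever that descendant still holds most of the pick's text. The `{ name, text, parent }` objects, `isAncestor`, `pickContainer` and the 0.8 ratio are all assumptions made for illustration. In a browser, `isAncestor(a, b)` could be replaced by testing `a.compareDocumentPosition(b) & Node.DOCUMENT_POSITION_CONTAINED_BY`.

```javascript
// Hypothetical stand-in for DOM elements: { name, text, parent }.
// True if `a` is an ancestor of `b`.
function isAncestor(a, b) {
  for (let n = b.parent; n; n = n.parent) {
    if (n === a) return true;
  }
  return false;
}

// `sorted` is ordered by most text first (as in the sort above).
// Starting from the top entry, move to a descendant whenever it keeps
// at least `ratio` of the current pick's text length.
function pickContainer(sorted, ratio = 0.8) {
  let best = sorted[0];
  for (const node of sorted.slice(1)) {
    if (isAncestor(best, node) && node.text.length >= ratio * best.text.length) {
      best = node;
    }
  }
  return best;
}

// Tiny hand-built example: body wraps a short header and a long content block.
const body = { name: 'body', text: 'x'.repeat(300), parent: null };
const header = { name: 'header', text: 'x'.repeat(40), parent: body };
const content = { name: 'content', text: 'x'.repeat(260), parent: body };

console.log(pickContainer([body, content, header]).name); // prints "content"
```

The ratio is a tuning knob: set it too high and the outermost wrapper wins again, set it too low and the pick can slide down into a single long paragraph.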

+5

You will also have to define the level at which you want to select the node. In your example, the body node contains even more text than the content div. So you have to spell out what exactly the “parent element” is.

+1

I can also say that word banks are a big help: any list of common “advertisey” words, such as twitter and click, as well as runs of several capitalized nouns in a row. Having a POS tagger can improve accuracy. For news sites, a list of all the known major cities in the world can help separate article text from the rest. In fact, you can almost scrape a page without even looking at the HTML.
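A minimal sketch of the word-bank idea, working on plain text only. The word list, the function names and the 0.15 threshold are made-up placeholders; a real list would be much longer and tuned to the sites in question, and this sketch does not attempt POS tagging or city lists.

```javascript
// Hypothetical, deliberately tiny bank of "advertisey" words.
const AD_WORDS = new Set(['twitter', 'click', 'subscribe', 'share', 'follow']);

// Fraction of the words in `text` that appear in the word bank.
// Blocks with no words at all score 1 so that they get filtered out.
function adScore(text) {
  const words = text.toLowerCase().match(/[a-z']+/g) || [];
  if (words.length === 0) return 1;
  const hits = words.filter((w) => AD_WORDS.has(w)).length;
  return hits / words.length;
}

// Keeps only the blocks whose score is at or below `maxScore`.
function likelyContent(blocks, maxScore = 0.15) {
  return blocks.filter((text) => adScore(text) <= maxScore);
}

const blocks = [
  'Click here to follow us on Twitter',
  'The city council met on Tuesday to discuss the new transit plan.',
];

console.log(likelyContent(blocks)); // keeps only the second block
```

A score like this is probably most useful combined with a length-based ranking rather than on its own, since short legitimate paragraphs can contain a bank word by chance.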

0
