How do crawlers process text from a web page?

There are standard methods such as the DOM for selectively analyzing an HTML page, but I wonder how crawlers (small to large) determine where the main text to analyze is located.

The main text that will be analyzed to capture its keywords is mixed in with menus, sidebars, headers, and so on. How does the crawler know to skip keywords that come from menus and sidebars?

I am working on a small PHP project to capture keywords from various HTML pages, and I don't know how to avoid picking up keywords from this side content. Can someone describe, or at least hint at, how to distinguish the main content from the rest of the HTML page?

+4
2 answers

Sidebars, menus, and footers are usually repeated on every page throughout a site, while the actual content is usually unique to each page. You can use this as a guide to separate the actual content from the boilerplate.
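For instance, here is a minimal PHP sketch of that idea (the URLs are placeholders): it pulls the text of block-level elements from two pages of the same site and keeps only the blocks that do not repeat across them.

```php
<?php
// Minimal sketch of the "repeated blocks are boilerplate" heuristic.
// Assumption: the two URLs are placeholders for any two pages of the
// same site; in practice you would compare several pages.

function extractBlocks(string $html): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);                 // suppress warnings on messy HTML
    $xpath  = new DOMXPath($doc);
    $blocks = [];
    foreach ($xpath->query('//div | //p | //li') as $node) {
        $text = preg_replace('/\s+/', ' ', trim($node->textContent));
        if ($text !== '') {
            $blocks[] = $text;
        }
    }
    return $blocks;
}

$pageA = extractBlocks(file_get_contents('https://example.com/article-1'));
$pageB = extractBlocks(file_get_contents('https://example.com/article-2'));

// Blocks that appear verbatim on both pages are almost certainly menus,
// sidebars, or footers; keep only what is unique to the first page.
$boilerplate = array_intersect($pageA, $pageB);
$mainContent = array_diff($pageA, $boilerplate);

print_r($mainContent);
```

Note that a div's textContent also includes the text of its nested blocks, so a real implementation would pick one level of block to compare; the sketch only shows the shape of the idea.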

Crawlers also use sophisticated algorithms to analyze the text on the page and estimate its weight as content, and they usually do not share their secrets.

There is no quick and easy way; crawler developers have to come up with their own methods and combine them to get an overall picture of the page's contents.

+2

If the content is textual, you can assume that the main content of the page is wherever the text density is relatively high.

In other words, as far as a search engine is concerned, the main content of the page sits inside the DOM elements, mostly divs, where the amount of literal text, allowing for purely formatting tags such as p, em, b, and so on, is at or above some threshold.
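As a rough illustration of that density idea (this is my own sketch, not a documented search engine technique, and the URL is a placeholder), you can score each div by how much literal text it holds relative to its markup and pick the best-scoring one:

```php
<?php
// Rough text-density heuristic: a div whose literal text is long relative
// to its overall markup is a likely candidate for the main content.

function textDensity(DOMElement $el, DOMDocument $doc): float
{
    $textLength = strlen(trim($el->textContent));
    $htmlLength = strlen($doc->saveHTML($el));   // length of the element's markup
    return $htmlLength > 0 ? $textLength / $htmlLength : 0.0;
}

$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents('https://example.com/some-page'));

$best = null;
$bestScore = 0.0;
foreach ($doc->getElementsByTagName('div') as $div) {
    // Weight density by text length so tiny but dense nodes do not win.
    $score = textDensity($div, $doc) * strlen(trim($div->textContent));
    if ($score > $bestScore) {
        $bestScore = $score;
        $best = $div;
    }
}

echo $best !== null ? trim($best->textContent) : 'no candidate found';
```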

I would start with the following logic (a rough PHP sketch follows the list):

Get all the tags used on the web page.

Note the DOM elements whose content consists only of literal text and formatting tags such as p, em, b, li, and ul, plus anchor tags.

Discard divs that contain only anchor tags, on the assumption that they exist purely for navigation.

From what remains, select the DOM elements where the amount of literal text exceeds a certain threshold.

This threshold varies from site to site; you can take it as the average amount of literal text found in the divs with the most literal text across the pages of a site that share a specific URL structure.

The algorithm should learn and refine these values as it runs.
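Here is a hedged PHP sketch of the logic above, assuming DOMDocument, an invented list of formatting tags, and a hard-coded 200-character threshold (the URL and the numbers are placeholders; as noted, the threshold should really be learned per site):

```php
<?php
// Sketch of the listed steps: keep divs made only of text and formatting
// tags, drop navigation-only divs, then apply a literal-text threshold.

const FORMATTING_TAGS = ['p', 'em', 'b', 'strong', 'i', 'ul', 'li', 'br', 'span', 'a'];

// True if every descendant element of $el is a "formatting" tag.
function hasOnlyFormattingChildren(DOMElement $el): bool
{
    foreach ($el->getElementsByTagName('*') as $child) {
        if (!in_array(strtolower($child->tagName), FORMATTING_TAGS, true)) {
            return false;
        }
    }
    return true;
}

// Treat the div as navigation if the bulk of its text sits inside anchors.
function isNavigationOnly(DOMElement $el): bool
{
    $total = strlen(trim($el->textContent));
    if ($total === 0) {
        return true;
    }
    $inAnchors = 0;
    foreach ($el->getElementsByTagName('a') as $a) {
        $inAnchors += strlen(trim($a->textContent));
    }
    return $inAnchors / $total > 0.8;        // 0.8 is an arbitrary cut-off
}

$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents('https://example.com/some-page'));

$threshold  = 200;                           // characters of literal text, tune per site
$candidates = [];

foreach ($doc->getElementsByTagName('div') as $div) {
    $textLength = strlen(trim($div->textContent));
    if ($textLength >= $threshold
        && hasOnlyFormattingChildren($div)
        && !isNavigationOnly($div)) {
        $candidates[$div->getNodePath()] = $textLength;
    }
}

arsort($candidates);                         // strongest candidates first
print_r($candidates);
```

Collecting the same statistic over many pages with the same URL structure would give you the per-site average the last step talks about.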

0
