Scraping only the main content of a web page (ignoring the header, footer and sidebars)

Question

I am familiar with scraping and with using XPath in PHP to parse the DOM and get what I want from a page. I would like some suggestions on how to programmatically ignore the header, footer and sidebars of a page and extract only the main content.

The catch is that there is no specific target site, so I can't simply ignore specific ids such as #header and #footer, because every page is written a little differently.

I know that Google does this, so I know it must be possible; I just don't know where to start.

Thanks!

php xpath screen-scraping

deweydb Mar 26 '13 at 17:15

2 answers

Fabian schmengler · Answer 1 · 2013-03-31T11:51:42+0000

There is no definite way to determine it, but you can get reasonable results with heuristic methods. A suggestion:

Scrape two or more pages from the same website and compare them block by block, starting at the top level and descending a few levels until the blocks are sufficiently similar. The comparison should not use == but a similarity index, for example the percentage returned by similar_text. Blocks above a certain percentage of similarity are most likely a header, footer, or menu. You will have to find out by experience which threshold is useful.
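The comparison described above can be sketched roughly as follows. This is only a minimal illustration, not the answerer's actual code: the helper names `topLevelBlocks` and `boilerplateFlags` are hypothetical, the 80% threshold is an arbitrary starting value, and it compares only the direct children of `<body>` instead of descending several levels.

```php
<?php
// Hypothetical helper: returns the whitespace-normalized text of each
// direct child element of <body>.
function topLevelBlocks(string $html): array
{
    // Real-world markup is rarely well-formed; collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $blocks = [];
    foreach ($xpath->query('/html/body/*') as $node) {
        $blocks[] = trim(preg_replace('/\s+/', ' ', $node->textContent));
    }
    return $blocks;
}

// Hypothetical helper: for each block of page A, reports whether some block
// of page B is at least $threshold percent similar (likely header/footer/menu).
function boilerplateFlags(string $htmlA, string $htmlB, float $threshold = 80.0): array
{
    $blocksB = topLevelBlocks($htmlB);
    $flags = [];
    foreach (topLevelBlocks($htmlA) as $i => $textA) {
        $best = 0.0;
        foreach ($blocksB as $textB) {
            // similar_text() writes the similarity percentage into $percent.
            similar_text($textA, $textB, $percent);
            $best = max($best, $percent);
        }
        $flags[$i] = $best >= $threshold;
    }
    return $flags;
}
```

Blocks flagged `true` are candidates to discard; the remaining blocks are candidates for the main content. The threshold still has to be tuned per site, as noted above.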

thevikas · Answer 2 · 2013-03-31T07:59:38+0000

There is no small or quick way to scrape content from a web page. I have done a lot of it, and there is no simple rule. Earlier, in the days of HTML 3 and table-based design, content was identified differently, and site design itself was limited. Screen size was limited, so the menu was often at the top and there was no room for right or left panels. Then came the era of panels built with tables. Now is the time of floated content. On top of that, we even use overflow: hidden, so it is harder still to rely on something like the number of words in the body.

When an HTML file is written, the markup is never labeled as content or menu. Sometimes you can infer this from class names, but that is not universal. Content gets its size and position from CSS, so a parser alone cannot determine the body of the page. If you use an embedded HTML viewer and DHTML / JS to measure block sizes after rendering, there may be a way to do it, but it will never be universal. My suggestion is to write your own parser and improve it with every case you encounter.
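As a starting point for such a parser, the class-name hint mentioned above could look something like the sketch below. The function name `stripByNameHints` and the list of hint words are assumptions made for illustration, the match is case-sensitive, and, as said, it will not work on every site (a hint such as "menu" can also match legitimate content).

```php
<?php
// Hypothetical helper: removes elements whose id or class contains one of
// the hint words, then returns the remaining text of <body>.
// The hint words are assumed to be plain identifiers without quotes.
function stripByNameHints(
    string $html,
    array $hints = ['header', 'footer', 'sidebar', 'nav', 'menu']
): string {
    libxml_use_internal_errors(true); // tolerate malformed markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    // Build one XPath predicate such as:
    // contains(@id, 'header') or contains(@class, 'header') or ...
    $conditions = [];
    foreach ($hints as $hint) {
        $conditions[] = "contains(@id, '$hint')";
        $conditions[] = "contains(@class, '$hint')";
    }
    $query = '//body//*[' . implode(' or ', $conditions) . ']';

    // Copy the matches into an array before modifying the tree.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    return trim(preg_replace('/\s+/', ' ', $body->textContent));
}
```

Each new site that breaks the heuristic would then add to, or override, the hint list.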

As for Google, it has built programs for most combinations of HTML designs. But even for Google, I think a universal parser is impossible.
