There is no small or quick way to extract the clean content from a web page. I have done this a lot, and there is no simple rule. Back in the days of HTML 3 and table-based design, content was identified differently, and site design itself was limited. Screens were small, so the menu was usually at the top and there was no room for left or right panels. Then came the era of panels laid out with tables. Now we have floating content, and we even use overflow: hidden, so it's even harder to know things like the number of words in the body.
When you write an HTML file, nothing in the markup marks a block as content or menu. Sometimes you can infer this from class names, but that is not universal. Content gets its size and position from CSS, so a parser alone cannot determine the body of the page. If you use an embedded HTML viewer and use DHTML / JS to measure the block sizes after rendering, there may be some way to do it, but it will never be universal. My suggestion is to write your own parser and improve it with every case you encounter.
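To give an idea of what such a hand-grown parser might start from, here is a minimal sketch using only Python's standard `html.parser`. It is not a universal solution, just the class-name heuristic described above plus a word count. The hint lists (`CONTENT_HINTS`, `MENU_HINTS`), the set of block tags, the scoring weights, and the names `BlockScorer` and `guess_main_text` are all my own assumptions for illustration, and you would expect to adjust them for every new site.

```python
from html.parser import HTMLParser

# Hypothetical hint lists: real sites use many other names, extend per case.
CONTENT_HINTS = ("content", "article", "post", "main")
MENU_HINTS = ("menu", "nav", "sidebar", "footer", "header")
BLOCK_TAGS = {"div", "td", "section", "article", "main", "nav", "aside"}
SKIP_TAGS = {"script", "style"}


class BlockScorer(HTMLParser):
    """Collects the text of each block element and scores it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []      # open blocks: [tag, hint_score, text_parts]
        self.blocks = []     # finished blocks: (score, text)
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            # Look for hints in the class and id attributes only.
            names = " ".join(
                (value or "").lower()
                for key, value in attrs
                if key in ("class", "id")
            )
            hint = 0
            if any(h in names for h in CONTENT_HINTS):
                hint += 50  # arbitrary weight, tune per site
            if any(h in names for h in MENU_HINTS):
                hint -= 50
            self.stack.append([tag, hint, []])

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS and any(t == tag for t, _, _ in self.stack):
            # Pop until the matching block, tolerating unclosed inner blocks.
            while self.stack:
                if self._finish_block() == tag:
                    break

    def handle_data(self, data):
        # Text is credited to the innermost open block only.
        if self.skip_depth == 0 and self.stack:
            self.stack[-1][2].append(data)

    def _finish_block(self):
        tag, hint, parts = self.stack.pop()
        text = " ".join(" ".join(parts).split())  # normalize whitespace
        if text:
            # Score = class/id hint plus the number of words in the block.
            self.blocks.append((hint + len(text.split()), text))
        return tag

    def flush_open(self):
        # Score any blocks the document never closed.
        while self.stack:
            self._finish_block()


def guess_main_text(html):
    """Return the text of the highest-scoring block, or '' if none found."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()
    parser.flush_open()
    if not parser.blocks:
        return ""
    return max(parser.blocks, key=lambda block: block[0])[1]
```

You would call `guess_main_text(page_source)` on the raw HTML string. Because it never looks at CSS or the rendered layout, it will guess wrong on plenty of pages, which is exactly why it has to be improved case by case.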
As for Google, they have written programs for most combinations of HTML designs. But even for Google, I think creating a universal parser is impossible.