If the content is textual, you can assume that the main content of the page is where the word density is relatively high.
In other words, the main content of the page, the part that matters to search engines, sits inside DOM elements (mostly divs) where the number of literals, counted together with the text inside tags such as p, em, and b that exist only to format text, is at or above a threshold value.
I would start with the following logic:
Get all the tags used on the web page.
Note the DOM elements whose content consists only of literals and formatting tags such as p, em, b, li, and ul, as well as anchor tags.
Leave out divs that contain only anchor tags, on the assumption that they exist purely for navigation.
From what remains, select the DOM elements whose literal count exceeds a certain threshold.
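Here is a minimal sketch of those steps, not a finished extractor. It uses Python's standard html.parser; the names DivScanner and main_content_candidates, the exact FORMATTING set, and counting whitespace-separated words as "literals" are all my own assumptions for illustration. Text is credited only to the innermost enclosing div, and a div that contains anything other than text and formatting tags is skipped.

```python
from html.parser import HTMLParser

# Assumed set of "formatting" tags; extend it to suit the site.
FORMATTING = {"p", "em", "b", "i", "strong", "li", "ul", "ol", "br", "span", "a"}


class DivScanner(HTMLParser):
    """Collects (div id, word count) for divs made of text and formatting tags."""

    def __init__(self):
        super().__init__()
        self.stack = []        # one record per currently open div
        self.anchor_depth = 0  # > 0 while inside an <a> element
        self.candidates = []   # filled as divs close

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.stack:
                # A div that wraps another div is a container, not a text block.
                self.stack[-1]["pure"] = False
            self.stack.append({"id": dict(attrs).get("id"), "words": 0,
                               "anchor_words": 0, "pure": True})
        elif tag == "a":
            self.anchor_depth += 1
        elif tag not in FORMATTING and self.stack:
            self.stack[-1]["pure"] = False

    def handle_endtag(self, tag):
        if tag == "a":
            self.anchor_depth = max(0, self.anchor_depth - 1)
        elif tag == "div" and self.stack:
            div = self.stack.pop()
            # Keep divs that hold some text that is not purely link text.
            if div["pure"] and div["words"] > div["anchor_words"]:
                self.candidates.append((div["id"], div["words"]))

    def handle_data(self, data):
        if not self.stack:
            return
        count = len(data.split())
        self.stack[-1]["words"] += count
        if self.anchor_depth:
            self.stack[-1]["anchor_words"] += count


def main_content_candidates(html, threshold):
    """Return (div id, word count) pairs whose word count exceeds threshold."""
    scanner = DivScanner()
    scanner.feed(html)
    scanner.close()
    return [(div_id, n) for div_id, n in scanner.candidates if n > threshold]
```

Calling main_content_candidates(page_html, threshold) returns the divs in the order they close in the document. Real pages with malformed markup would probably need a more forgiving parser than this.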
This threshold varies from website to website. You can take it as the average of the literal counts found in the div with the most literals on each page, computed over all pages of a site that share a specific URL structure.
The algorithm has to learn this value as it runs.
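One way to let the threshold be learned while crawling is a running average that is updated once per page. This is only a sketch under my own assumptions: the class name RunningThreshold is hypothetical, and each page is represented by a list of per-div literal counts obtained however you like. You would keep one instance per URL structure.

```python
from dataclasses import dataclass


@dataclass
class RunningThreshold:
    """Running mean of the largest per-div literal count seen on each page."""

    pages_seen: int = 0
    value: float = 0.0

    def update(self, div_counts):
        """Fold one page into the average and return the current threshold."""
        if not div_counts:
            # A page with no candidate divs tells us nothing; skip it.
            return self.value
        best = max(div_counts)
        self.pages_seen += 1
        # Incremental mean: no need to store counts from earlier pages.
        self.value += (best - self.value) / self.pages_seen
        return self.value
```

For example, feeding it pages whose densest divs hold 200 and then 400 literals moves the threshold from 200.0 to 300.0.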