How can I get the main image of a blog post or news article?

I have a news aggregator, Newzupp, that I want to change. Right now I only display the news headlines and link each one to its URL.

I plan to make it more visual by showing images with the headlines instead of headlines alone. I want to know how to get the main image of each article (similar to what Google News does).

One way I can think of is to scrape all the images and display one that points to the same article. But I do not think that will be effective. Is there any other way to do this?


I found a solution for it.

  • Fetch the content of the URL [HTML / XML]
  • Scrape the content using hpricot
  • Find all elements tagged "img"
  • Work out, site by site, which one is the main image. [For example, the 6th image in the case of the Wired.com RSS feed]
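
The first three steps above can be sketched as follows. This is a minimal illustration in Python using only the standard library rather than hpricot; the `ImageCollector` class, the `collect_images` helper, and the sample markup are assumptions made for the example, not part of the original solution.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collects the absolute URL of every <img> tag, in document order."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the article URL.
                self.images.append(urljoin(self.base_url, src))


def collect_images(html, base_url):
    parser = ImageCollector(base_url)
    parser.feed(html)
    return parser.images


# Hypothetical page content; in practice it would be downloaded from
# the article URL, e.g. with urllib.request.urlopen.
sample_html = """
<html><body>
  <img src="/static/logo.png">
  <h1>Headline</h1>
  <img src="http://example.com/photos/lead.jpg">
</body></html>
"""

images = collect_images(sample_html, "http://example.com/news/1")
print(images)
```

Picking the main image out of the resulting list is still the site-specific part, as the last step says.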

I still think this is very inefficient. I would like to know how services like Google News scrape sites / blogs and display relevant images.

4 answers

Perhaps you can filter / sort by the size of the image or by its position in the DOM hierarchy (i.e. closer to the top of the body / immediately after the h1 tag).
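
A rough sketch of that idea in Python, standard library only. It assumes the page declares `width` and `height` attributes on its images, which many pages do not; without them the real size is only known after downloading the file. The `ImageRanker` class and the ordering rule (after the first h1, then larger, then earlier) are illustrative choices, not an established algorithm.

```python
from html.parser import HTMLParser


def _to_int(value):
    # Missing or non-numeric attributes (e.g. "100%") count as 0.
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


class ImageRanker(HTMLParser):
    """Records each <img> with its declared area and whether it follows the first <h1>."""

    def __init__(self):
        super().__init__()
        self.seen_h1 = False
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.seen_h1 = True
        elif tag == "img":
            a = dict(attrs)
            if not a.get("src"):
                return
            self.candidates.append({
                "src": a["src"],
                "area": _to_int(a.get("width")) * _to_int(a.get("height")),
                "after_h1": self.seen_h1,
                "position": len(self.candidates),
            })


def best_image(html):
    parser = ImageRanker()
    parser.feed(html)
    if not parser.candidates:
        return None
    # Prefer images after the headline, then larger ones, then earlier ones.
    ranked = sorted(
        parser.candidates,
        key=lambda c: (not c["after_h1"], -c["area"], c["position"]),
    )
    return ranked[0]["src"]


# Hypothetical markup for illustration.
sample_html = """
<body>
  <img src="logo.png" width="120" height="40">
  <h1>Headline</h1>
  <img src="icon.gif" width="16" height="16">
  <img src="lead.jpg" width="640" height="360">
</body>
"""
print(best_image(sample_html))
```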

What about a blacklist of ad hosts from which you would ignore images?

Since, generally speaking, advertisements are hosted elsewhere, while story-related images are hosted on the same domain, perhaps you can filter the page for those images that have the same base URL as the site itself.
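
Both filters could look roughly like this in Python. The `AD_HOSTS` entries are made-up placeholders, and `_base_domain` is a deliberately naive assumption that keeps only the last two host labels, so it gives the wrong answer for suffixes such as "co.uk".

```python
from urllib.parse import urlparse

# Hypothetical blacklist; a real one would be much longer.
AD_HOSTS = {"ads.example.net", "tracker.example.org"}


def _host(url):
    return (urlparse(url).hostname or "").lower()


def _base_domain(host):
    # Naive: "img.example.com" -> "example.com".
    return ".".join(host.split(".")[-2:])


def filter_images(image_urls, article_url):
    site = _base_domain(_host(article_url))
    kept = []
    for url in image_urls:
        host = _host(url)
        if host in AD_HOSTS:
            continue  # known ad server
        if _base_domain(host) != site:
            continue  # hosted elsewhere, probably not part of the story
        kept.append(url)
    return kept


urls = [
    "http://ads.example.net/banner.gif",
    "http://img.example.com/lead.jpg",
    "http://other.org/x.png",
    "http://example.com/a.jpg",
]
print(filter_images(urls, "http://www.example.com/news/1"))
```

One caveat with the same-domain rule: sites that serve their own photos from a separate CDN domain would lose their story images too, so that check may need per-site exceptions.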

Why not just convert all the scraped images (using hpricot / nokogiri) to square thumbnails (using rmagick or similar, or simply resizing them on the server side) and group them in a single DIV just below the body of the story? You could then use a lightbox with a slideshow to show the actual images only when the user clicks on them. That way it looks more graphic and does not spoil the look of your site. Finding the most relevant image is harder.
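
The only arithmetic in the square-thumbnail step is choosing which part of the image to keep. A small sketch, assuming a centered crop is acceptable; the `square_crop_box` helper is hypothetical and the actual cropping and resizing would be done by whatever image library is in use.

```python
def square_crop_box(width, height):
    """Returns (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


print(square_crop_box(640, 360))  # (140, 0, 500, 360)
print(square_crop_box(300, 500))  # (0, 100, 300, 400)
```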

You can also try looking for Open Graph tags on the page. Most news sites use the og:image property to specify the main image of the article.

Example:

 <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" /> 
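
Reading that tag could be sketched like this in Python with the standard library. The `OpenGraphImageParser` class and `find_og_image` helper are names invented for this example; the code returns the first og:image it finds and `None` when the page has none, in which case one of the heuristics from the other answers would still be needed.

```python
from html.parser import HTMLParser


class OpenGraphImageParser(HTMLParser):
    """Remembers the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.og_image is not None:
            return
        a = dict(attrs)
        if a.get("property") == "og:image" and a.get("content"):
            self.og_image = a["content"]


def find_og_image(html):
    parser = OpenGraphImageParser()
    parser.feed(html)
    return parser.og_image


sample = (
    '<head><meta property="og:image" '
    'content="http://ia.media-imdb.com/images/rock.jpg" /></head>'
)
print(find_og_image(sample))
```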

Source: https://habr.com/ru/post/1314655/

