Retrieving the *relevant* image from a web page

I run several news aggregation websites built on Twitter, and I plan to add images from the articles I find on Twitter.

If I load the page and collect the images from its <img> tags, I get a bunch of images, and not all of them relate to the article: buttons, badges, ads, and so on get captured too. How can I extract the image that accompanies the article? I know a solution exists, because Facebook's link sharer does this pretty well.

Mithun

Duplicate: How to find and extract the "main" image on a website

+7
html parsing
4 answers

This question is old, but the following may help future readers.

You can use the urlmeta API: https://urlmeta.org/

It is very simple to use, and the result is exactly what we need.

API usage example:

 <?php
 $url = "http://timesofindia.indiatimes.com/business/india-business/Raghuram-Rajan-not-fit-to-be-RBI-Governor-Subramanian-Swamy/articleshow/52236298.cms";
 // URL-encode the target so it survives being passed as a query parameter
 $result = file_get_contents('https://api.urlmeta.org/?url=' . urlencode($url));
 // Decode the JSON response into an associative array
 $array = json_decode($result, true);
 print_r($array['meta']['image']);
 ?>

This gives you the result you need.

+3

Download all the images from the page and blacklist any that come from ad servers. Then find a heuristic that will give you the correct image ...

I think something like:

  • Highest resolution: += 5 pts
  • Largest file size: += 10 pts
  • JPEG: += 2 pts

then take the image with the most points and discard the rest.

Probably works for most sites.

(It takes some tweaking of the heuristics.)
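The point scheme above could be sketched roughly as follows in Python. This is only an illustration under assumptions: the `Candidate` record, the `pick_main_image` function, and the `AD_HOSTS` blocklist are hypothetical names, and the image dimensions and file sizes are assumed to have been collected already when the images were downloaded.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical blocklist; real ad-server hosts would have to be collected separately.
AD_HOSTS = {"ads.example.com", "adserver.example.net"}


@dataclass
class Candidate:
    url: str
    width: int
    height: int
    size_bytes: int


def pick_main_image(candidates):
    """Return the highest-scoring non-ad candidate, or None if none remain."""
    # Drop every image served from a blocklisted host.
    pool = [c for c in candidates if urlparse(c.url).hostname not in AD_HOSTS]
    if not pool:
        return None

    max_pixels = max(c.width * c.height for c in pool)
    max_bytes = max(c.size_bytes for c in pool)

    def score(c):
        points = 0
        if c.width * c.height == max_pixels:
            points += 5   # highest resolution
        if c.size_bytes == max_bytes:
            points += 10  # largest file size
        if urlparse(c.url).path.lower().endswith((".jpg", ".jpeg")):
            points += 2   # JPEG
        return points

    # Keep the image with the most points and discard the rest.
    return max(pool, key=score)
```

The weights are taken directly from the list above; they are a starting point and would likely need adjusting per site.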

+7

I came up with a solution that is a bit hacky, but it works for me. Here is what I do to get the thumbnails.

  • Say the title of the page I find is "this is the title".
  • I use this title as the query to the Google Image API and take the first thumbnail it returns.

In practice, it works very well in most cases. See for yourself: http://cricketfresh.in

Mithun

PS: I think this is a good enough answer. Credit goes to anyone who comes up with a more elegant one.

+3

I would guess that Facebook has a link extractor for each of the sites it supports, something like id="content" → first img.

I guess I was wrong. Facebook appears to use the Open Graph protocol to determine which image (og:image) and which metadata to use.
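For pages that declare an og:image tag, reading it could look something like the Python sketch below, which uses only the standard library. The names `OgImageParser` and `find_og_image` are hypothetical, and the sketch assumes the page HTML has already been fetched into a string. It is not a description of how Facebook itself does it.

```python
from html.parser import HTMLParser


class OgImageParser(HTMLParser):
    """Collects the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        # Only look at <meta> tags, and stop once an image has been found.
        if tag != "meta" or self.og_image is not None:
            return
        attr = dict(attrs)
        if attr.get("property") == "og:image" and attr.get("content"):
            self.og_image = attr["content"]


def find_og_image(html):
    """Return the og:image URL declared in the HTML string, or None."""
    parser = OgImageParser()
    parser.feed(html)
    return parser.og_image
```

When a page has no og:image tag, the function returns None, and one of the fallbacks from the other answers (such as scoring the <img> candidates) would still be needed.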

+1
