How to find and extract the "main" image on a website

I need help solving a problem. I need a program that, given a site, finds and extracts its "main" image, that is, the one that represents the site. (This is sometimes the largest or the first picture, but not always.)

How do I approach this? Are there any libraries that could help me with this? Thanks!

+6
java html
5 answers

OPTION 1

You can check out Goose. It does something similar to what Pocket and Readability do, i.e. it attempts to extract the main article from a web page using a set of heuristics. Apparently it can also extract the main image of that article, but it is a bit hit and miss, so 60% of the time it works every time.

It used to be a Java project, but it was rewritten in Scala.

From the README file:

Goose will try to extract the following information:

  • The main text of the article
  • The main image of the article
  • Any Youtube / Vimeo movies embedded in an article
  • Meta description
  • Meta tags
  • Publication Date

Try it here: http://jimplush.com/blog/goose


OPTION 2

You can use a Java wrapper (e.g. GhostDriver) to drive a headless browser such as PhantomJS. Then load the website and find the largest img element. This GhostDriver test case shows how to query the DOM for elements and get their rendered size.


OPTION 3

Use a library like jsoup to parse the HTML. Then read the src attribute of every img tag, request each image URL you find, and compare the image sizes. The largest image will most likely be the main image of the website.
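
The size comparison in Option 3 could look roughly like the sketch below. It uses only the JDK and assumes you have already collected absolute image URLs (with jsoup this might be done via `document.select("img[src]")` and `absUrl("src")`). The class `LargestImagePicker` and its methods are hypothetical names chosen for illustration, and "largest" is taken here to mean largest pixel area, which is only one possible reading.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import javax.imageio.ImageIO;

// Hypothetical helper: ranks candidate images by pixel area.
final class LargestImagePicker {

    /** Pixel area of an image, computed as long to avoid int overflow. */
    static long area(int width, int height) {
        return (long) width * height;
    }

    /** Returns the URL with the largest recorded area, or empty if there are no candidates. */
    static Optional<String> pickLargest(Map<String, Long> areaByUrl) {
        return areaByUrl.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    /** Downloads each image and records its pixel area; unreadable URLs are skipped. */
    static Map<String, Long> measure(List<String> imageUrls) {
        Map<String, Long> areaByUrl = new LinkedHashMap<>();
        for (String url : imageUrls) {
            try {
                BufferedImage image = ImageIO.read(URI.create(url).toURL());
                if (image != null) { // null means no registered reader understood the data
                    areaByUrl.put(url, area(image.getWidth(), image.getHeight()));
                }
            } catch (IOException | IllegalArgumentException e) {
                // Network failure, undecodable data, or a malformed URL: ignore this candidate.
            }
        }
        return areaByUrl;
    }
}
```

A call such as `LargestImagePicker.pickLargest(LargestImagePicker.measure(urls))` then yields the best candidate, if any. Downloading every image in full is slow on image-heavy pages, so in practice you may want to skip tiny icons or cap the number of requests.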

+9

Another solution is to first look for the meta tags used for social network sharing. If they are present, you are in luck; otherwise you can fall back on the other solutions.

 <meta property="og:image" content="http://www.example.com/image.jpg"/>
 <meta name="twitter:image" content="http://www.example.com/image.jpg">
 <meta itemprop="image" content="http://www.example.com/image.jpg">

If you use jsoup, the code looks like this:

  String imageUrlOpenGraph = document.select("meta[property=og:image]").stream()
      .findFirst()
      .map(doc -> doc.attr("content").trim())
      .orElse(null);

  String imageUrlTwitter = document.select("meta[name=twitter:image]").stream()
      .findFirst()
      .map(doc -> doc.attr("content").trim())
      .orElse(null);

  String imageUrlGooglePlus = document.select("meta[itemprop=image]").stream()
      .findFirst()
      .map(doc -> doc.attr("content").trim())
      .orElse(null);
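
The three lookups above can be combined into a single fallback chain. The following is a minimal sketch using only the JDK, assuming the three strings were obtained as shown above. The class `MetaImageChooser` and its methods are hypothetical helpers, not part of jsoup. It also resolves the chosen value against the page URL, on the assumption that some sites put a relative path in the content attribute.

```java
import java.net.URI;
import java.util.Optional;

// Hypothetical helper: picks one image URL from the social meta tag values.
final class MetaImageChooser {

    /** Returns the first candidate that is neither null nor blank, trimmed. */
    static Optional<String> firstNonBlank(String... candidates) {
        for (String candidate : candidates) {
            if (candidate != null && !candidate.trim().isEmpty()) {
                return Optional.of(candidate.trim());
            }
        }
        return Optional.empty();
    }

    /** Resolves a possibly relative image URL against the URL of the page it came from. */
    static String toAbsolute(String pageUrl, String imageUrl) {
        return URI.create(pageUrl).resolve(imageUrl).toString();
    }

    /** Picks the meta image in priority order: Open Graph, then Twitter, then itemprop. */
    static Optional<String> choose(String pageUrl, String openGraph, String twitter, String itemprop) {
        return firstNonBlank(openGraph, twitter, itemprop)
                .map(imageUrl -> toAbsolute(pageUrl, imageUrl));
    }
}
```

With the variables from the snippet above, the call would be `MetaImageChooser.choose(pageUrl, imageUrlOpenGraph, imageUrlTwitter, imageUrlGooglePlus)`, where `pageUrl` is the address the document was fetched from. The priority order is a judgment call; an empty result is the signal to try one of the other approaches.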
+2

This requires artificial intelligence, namely computer vision. The topic is too big to fit in an answer. This link may help.

If you are a mathematician with experience in probability and Bayes' rule, you could take a course called Image Processing and Computer Vision.

If you are looking for affordable software to use, check out this ...

This Stack Overflow post may help ...

There is also software called Moodstocks, which may help.

0

You can use a service such as Embedly. Among many other things, it lets you extract the main image of any page, and it is especially suited to articles. You can try it here.

0

ImageResolver can do this for you without any server-side interaction, except for a small proxy script.

0