How to find and extract the "main" image on a website

Question

I need help solving a problem. I need a program that, given a website, finds and extracts its "main" image, that is, the one that represents the site. (Sometimes this is the largest or the first picture, but not always.)

How do I approach this? Are there any libraries that could help me with this? Thanks!

+6

java html

nodwj Aug 16 '13 at 7:49

5 answers

Another solution is to first look for the meta tags that sites publish for social network sharing. If they are present, you are in luck; otherwise you can fall back on the other solutions.

    <meta property="og:image" content="http://www.example.com/image.jpg"/>
    <meta name="twitter:image" content="http://www.example.com/image.jpg">
    <meta itemprop="image" content="http://www.example.com/image.jpg">

If you use jsoup, the code looks like this:

    String imageUrlOpenGraph = document.select("meta[property=og:image]").stream()
            .findFirst()
            .map(doc -> doc.attr("content").trim())
            .orElse(null);

    String imageUrlTwitter = document.select("meta[name=twitter:image]").stream()
            .findFirst()
            .map(doc -> doc.attr("content").trim())
            .orElse(null);

    String imageUrlGooglePlus = document.select("meta[itemprop=image]").stream()
            .findFirst()
            .map(doc -> doc.attr("content").trim())
            .orElse(null);
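Any of the three values above can be null or empty, and a site may put a relative path in the content attribute. A minimal sketch of how the candidates could be combined, using only the standard library: it takes them in priority order, skips the unusable ones, and resolves the first good one against the page URL. The class name MetaImageChooser and the method choose are hypothetical helpers for illustration, not part of jsoup.

```java
import java.net.URI;
import java.util.Optional;

// Hypothetical helper for illustration; not part of jsoup.
class MetaImageChooser {

    /**
     * Returns the first non-blank candidate, resolved against the page URL.
     * Candidates are expected in priority order and may be null.
     */
    static Optional<String> choose(String pageUrl, String... candidates) {
        URI base = URI.create(pageUrl);
        for (String candidate : candidates) {
            if (candidate == null || candidate.trim().isEmpty()) {
                continue; // tag missing, or its content attribute was empty
            }
            try {
                // resolve() keeps absolute URLs as they are and expands relative ones
                return Optional.of(base.resolve(candidate.trim()).toString());
            } catch (IllegalArgumentException e) {
                // malformed value in the content attribute; try the next candidate
            }
        }
        return Optional.empty();
    }
}
```

With the variables from the snippet above, a call might look like MetaImageChooser.choose(pageUrl, imageUrlOpenGraph, imageUrlTwitter, imageUrlGooglePlus), where pageUrl is assumed to be the address the document was fetched from. An empty result means none of the tags was usable and one of the other approaches is needed.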

+2

mmx73 Jan 27 '16 at 11:52

This requires artificial intelligence, namely computer vision. The topic is too big to fit in an answer. This link may help.

If you are a mathematician with experience in probability and Bayes' rule, you could simply take a course called Image Processing and Computer Vision.

If you are looking for readily available software to use, check out this ...

This Stack Overflow question may help ...

There is also software called Moodstocks, which may help.

0

Anshu Dwibhashi Aug 16 '13 at 7:54

You can use a service such as Embedly. Among much other data, it lets you extract the main image of any page. It is especially suited to articles. You can try it here.

0

lex82 Jan 30 '14 at 20:57

ImageResolver can do this for you without any server-side work, apart from a small proxy script.

0

Ma'moon al-akash Sep 26 '16 at 7:18

mqchen · Accepted Answer · 2013-08-16T08:00:39+0000

OPTION 1

You can check out Goose. It does something similar to what Pocket and Readability do, i.e. it attempts to extract the main article from a given web page using a set of heuristics. Apparently it can also extract the main image from that article, but it is a bit hit and miss, so 60% of the time it works every time.

It used to be a Java project, but it has since been rewritten in Scala.

From the README:

Goose will try to extract the following information:
The main text of the article
The main image of the article
Any Youtube / Vimeo movies embedded in an article
Meta description
Meta tags
Publication Date

Try it here: http://jimplush.com/blog/goose

OPTION 2

You can use a Java wrapper (e.g. GhostDriver) to drive a headless browser such as PhantomJS. Then load the website and find the largest img element. This GhostDriver test case shows how to query the DOM for elements and get their rendered size.

OPTION 3

Use a library like jsoup to help you parse the HTML. Then read the src attribute of every img tag. Request each image URL you find and compare the dimensions of the images. The one with the largest dimensions will most likely be the main image of the website.
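The comparison step can be sketched with the standard library alone. This is a rough illustration, assuming the list of absolute image URLs has already been collected (for example with jsoup) and that "largest" means largest pixel area. The class LargestImagePicker and its methods are hypothetical names; ImageIO.read is the standard javax.imageio call.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import javax.imageio.ImageIO;

// Hypothetical helper for illustration.
class LargestImagePicker {

    /** Downloads each URL; images that cannot be fetched or decoded are skipped. */
    static Map<String, BufferedImage> load(List<String> imageUrls) {
        Map<String, BufferedImage> images = new LinkedHashMap<>();
        for (String url : imageUrls) {
            try {
                // ImageIO.read returns null when no registered reader understands the format
                BufferedImage image = ImageIO.read(URI.create(url).toURL());
                if (image != null) {
                    images.put(url, image);
                }
            } catch (IOException | IllegalArgumentException e) {
                // unreachable, malformed, or non-absolute URL; ignore this candidate
            }
        }
        return images;
    }

    /** Returns the URL whose image has the largest pixel area, if there is one. */
    static Optional<String> largest(Map<String, BufferedImage> imagesByUrl) {
        String bestUrl = null;
        long bestArea = 0;
        for (Map.Entry<String, BufferedImage> entry : imagesByUrl.entrySet()) {
            BufferedImage image = entry.getValue();
            if (image == null) {
                continue; // tolerate callers that store failed loads as null
            }
            long area = (long) image.getWidth() * image.getHeight();
            if (area > bestArea) {
                bestArea = area;
                bestUrl = entry.getKey();
            }
        }
        return Optional.ofNullable(bestUrl);
    }
}
```

A call such as LargestImagePicker.largest(LargestImagePicker.load(urls)) would then give the candidate URL. Note that this downloads every image in full just to measure it, which can be slow on image-heavy pages.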
