How do large sites capture thumbnails from a link?

When you share a link on a major site such as Digg or Facebook, the site generates a thumbnail by capturing an image from the linked page. How do they capture images from a web page? Does it involve loading the entire page (e.g. with cURL) and parsing it (e.g. with preg_match)? To me that method seems slow and unreliable. Do they have a more practical approach?

PS: I think there should be a practical way to crawl a page quickly, skipping some parts (such as CSS and JavaScript) in order to extract the src attributes. Any ideas?
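
Roughly, the approach I mean is the one sketched below (shown in Python rather than cURL/preg_match just to keep it short; the URL is a placeholder):

    # Minimal sketch of "download the whole page, then pull out the src attributes".
    # Standard library only; a real crawler would need error handling and more.
    import re
    import urllib.request

    def extract_image_urls(page_url):
        # Fetch the raw HTML (the cURL step).
        with urllib.request.urlopen(page_url, timeout=10) as response:
            html = response.read().decode("utf-8", errors="replace")

        # Crude regex scan for <img ... src="..."> (the preg_match step).
        # Relative URLs are not resolved here.
        return re.findall(r'<img[^>]+src=["\']([^"\']+)["\']', html, re.IGNORECASE)

    print(extract_image_urls("http://example.com/"))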

+4
4 answers

They usually look for an image on the page and scale it on their own servers. Reddit's scraper code shows a lot of what they do; the Scraper class should give you some good ideas on how to handle this.
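
The "scale it on their servers" step could look roughly like the sketch below (Pillow is used here purely as an illustration; it is not necessarily what Reddit uses):

    # Fetch a candidate image and shrink it to a thumbnail on your own server.
    # Requires Pillow (pip install Pillow); illustration only.
    import io
    import urllib.request

    from PIL import Image

    def make_thumbnail(image_url, max_size=(120, 120), out_path="thumb.jpg"):
        with urllib.request.urlopen(image_url, timeout=10) as response:
            data = response.read()

        img = Image.open(io.BytesIO(data))
        img.thumbnail(max_size)                  # in place, keeps aspect ratio
        img.convert("RGB").save(out_path, "JPEG")
        return out_path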

+2

JohnD's answer shows that Reddit uses embed.ly as part of its Python solution. In fact, embed.ly does the hard part of finding the image, and it is free for up to 10,000 requests per month.
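
A request to embed.ly can be as simple as the sketch below. It assumes the oEmbed endpoint (http://api.embed.ly/1/oembed) and a thumbnail_url field in the JSON reply; YOUR_API_KEY is a placeholder, and embed.ly's current API docs are the authority on both.

    # Ask embed.ly for a page's thumbnail via its oEmbed endpoint (assumed here).
    import json
    import urllib.parse
    import urllib.request

    def embedly_thumbnail(page_url, api_key="YOUR_API_KEY"):
        query = urllib.parse.urlencode({"key": api_key, "url": page_url})
        url = "http://api.embed.ly/1/oembed?" + query
        with urllib.request.urlopen(url, timeout=10) as response:
            data = json.load(response)
        return data.get("thumbnail_url")         # None if no image was found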

+1

Usually they use a tool like webkit2png, which renders the page off-screen with WebKit and saves a screenshot as a PNG.
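
On the server that usually just means shelling out to the tool, roughly as below; the exact options and output file names depend on the webkit2png version installed, so only the bare invocation is shown.

    # Render a page to PNG by shelling out to webkit2png (must be installed).
    # By default it writes its capture files to the current directory.
    import subprocess

    def capture_page(page_url):
        subprocess.run(["webkit2png", page_url], check=True)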

0

Some sites use

<link rel="image_src" href="yourimage.jpg" /> 

included in the &lt;head&gt; of the page. See http://www.labnol.org/internet/design/set-thumbnail-images-for-web-pages/6482/

Facebook uses

 <meta property="og:image" content="thumbnail_image" /> 

See http://developers.facebook.com/docs/share/#basic-tags.
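
So a scraper can first check the page for these hints and only fall back to scanning every &lt;img&gt; tag if they are missing. A rough sketch (the regexes are a simplification; a real implementation would use an HTML parser and allow attributes in any order):

    # Look for explicit thumbnail hints (og:image, image_src) in a page's HTML.
    import re

    def find_thumbnail_hint(html):
        patterns = [
            r'<meta[^>]+property=["\']og:image["\'][^>]+content=["\']([^"\']+)["\']',
            r'<link[^>]+rel=["\']image_src["\'][^>]+href=["\']([^"\']+)["\']',
        ]
        for pattern in patterns:
            match = re.search(pattern, html, re.IGNORECASE)
            if match:
                return match.group(1)
        return None   # no hint found; fall back to scanning <img> tags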

-1
