How do large sites capture thumbnails from a link?

When you share a link on a major site such as Digg or Facebook, the site generates a thumbnail by capturing an image from the linked page. How do they capture images from a web page? Does it involve loading the entire page (e.g. with cURL) and parsing it (e.g. with preg_match)? To me that method seems slow and unreliable. Do they have a more practical approach?

PS: I think there should be a practical way to crawl a page quickly, skipping some parts (such as CSS and JavaScript) in order to extract the src attributes. Any ideas?
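
Roughly, the approach I mean is the one sketched below (shown in Python rather than cURL/preg_match just to keep it short; the URL is a placeholder):

    # Minimal sketch of "download the whole page, then pull out the src attributes".
    # Standard library only; a real crawler would need error handling and more.
    import re
    import urllib.request

    def extract_image_urls(page_url):
        # Fetch the raw HTML (the cURL step).
        with urllib.request.urlopen(page_url, timeout=10) as response:
            html = response.read().decode("utf-8", errors="replace")

        # Crude regex scan for <img ... src="..."> (the preg_match step).
        # Relative URLs are not resolved here.
        return re.findall(r'<img[^>]+src=["\']([^"\']+)["\']', html, re.IGNORECASE)

    print(extract_image_urls("http://example.com/"))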

+4
4 answers

They usually look for an image on the page and scale it on their own servers. Reddit's scraper code shows a lot of what they do; the Scraper class should give you some good ideas on how to handle this.
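
The "scale it on their servers" step could look roughly like the sketch below (Pillow is used here purely as an illustration; it is not necessarily what Reddit uses):

    # Fetch a candidate image and shrink it to a thumbnail on your own server.
    # Requires Pillow (pip install Pillow); illustration only.
    import io
    import urllib.request

    from PIL import Image

    def make_thumbnail(image_url, max_size=(120, 120), out_path="thumb.jpg"):
        with urllib.request.urlopen(image_url, timeout=10) as response:
            data = response.read()

        img = Image.open(io.BytesIO(data))
        img.thumbnail(max_size)                  # in place, keeps aspect ratio
        img.convert("RGB").save(out_path, "JPEG")
        return out_path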

+2

JohnD's answer shows that Reddit uses embed.ly as part of its Python solution. In fact, embed.ly does the hard part of finding the image, and it is free for up to 10,000 requests per month.
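
A request to embed.ly can be as simple as the sketch below. It assumes the oEmbed endpoint (http://api.embed.ly/1/oembed) and a thumbnail_url field in the JSON reply; YOUR_API_KEY is a placeholder, and embed.ly's current API docs are the authority on both.

    # Ask embed.ly for a page's thumbnail via its oEmbed endpoint (assumed here).
    import json
    import urllib.parse
    import urllib.request

    def embedly_thumbnail(page_url, api_key="YOUR_API_KEY"):
        query = urllib.parse.urlencode({"key": api_key, "url": page_url})
        url = "http://api.embed.ly/1/oembed?" + query
        with urllib.request.urlopen(url, timeout=10) as response:
            data = json.load(response)
        return data.get("thumbnail_url")         # None if no image was found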

+1

Usually they use a tool like webkit2png, which renders the page off-screen with WebKit and saves a screenshot as a PNG.
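
On the server that usually just means shelling out to the tool, roughly as below; the exact options and output file names depend on the webkit2png version installed, so only the bare invocation is shown.

    # Render a page to PNG by shelling out to webkit2png (must be installed).
    # By default it writes its capture files to the current directory.
    import subprocess

    def capture_page(page_url):
        subprocess.run(["webkit2png", page_url], check=True)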

0

Some sites use

<link rel="image_src" href="yourimage.jpg" /> 

included in the &lt;head&gt; of the page. See http://www.labnol.org/internet/design/set-thumbnail-images-for-web-pages/6482/

Facebook uses

 <meta property="og:image" content="thumbnail_image" /> 

See http://developers.facebook.com/docs/share/#basic-tags.
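
So a scraper can first check the page for these hints and only fall back to scanning every &lt;img&gt; tag if they are missing. A rough sketch (the regexes are a simplification; a real implementation would use an HTML parser and allow attributes in any order):

    # Look for explicit thumbnail hints (og:image, image_src) in a page's HTML.
    import re

    def find_thumbnail_hint(html):
        patterns = [
            r'<meta[^>]+property=["\']og:image["\'][^>]+content=["\']([^"\']+)["\']',
            r'<link[^>]+rel=["\']image_src["\'][^>]+href=["\']([^"\']+)["\']',
        ]
        for pattern in patterns:
            match = re.search(pattern, html, re.IGNORECASE)
            if match:
                return match.group(1)
        return None   # no hint found; fall back to scanning <img> tags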

-1
