How can I extract images from the site I'm linking to?

If you're familiar with Reddit, you'll know that every post linking to a picture gets a small thumbnail preview next to its title. How does reddit do this? Does it just check whether the link ends with .jpg, .png, .bmp, etc.?

+6
image hyperlink
3 answers

reddit will try to pull a thumbnail from any source, not just direct image URLs. It does this first by applying site-specific rules for certain sites, and second by running one generic process that retrieves thumbnails for unknown URLs. This runs as an automated periodic task.

One of the nice things about reddit is that its source code is open, so if you know Python, check /r2/lib/scraper.py for a more detailed look at how this process works.

Also, while StackOverflow is a great place to get programming questions answered, you can also check reddit's own /r/redditdev for information on developing for reddit.

Hey there redditor!

+3
  • If the URL itself ends in .jpg, .png, etc., use that image.
  • If the site is a popular domain (flickr.com, youtube.com, amazon.com, etc.), apply a set of predefined rules to retrieve what you know will be relevant (the image itself, a YouTube thumbnail, an Amazon product image, etc.).
  • Otherwise, if all you have to work with is arbitrary HTML, you will have to dig the image out yourself. You can pick the first image on the page, the largest one, or the one you algorithmically determine to be the most significant (for example, relatively large and inside what you judge to be the main content of the body).
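
The three tiers above can be sketched roughly as follows. This is only an illustration, not reddit's actual code: the names find_thumbnail, SITE_RULES, youtube_thumbnail and pick_largest are made up here, the YouTube thumbnail URL pattern is an assumption, and the page images are assumed to have been scraped already as (src, width, height) tuples.

```python
from urllib.parse import urlparse, parse_qs

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".bmp")


def youtube_thumbnail(url):
    # Hypothetical site rule: build a thumbnail URL from the ?v= video id.
    # The img.youtube.com pattern is an assumption for illustration.
    video_id = parse_qs(urlparse(url).query).get("v", [None])[0]
    if video_id is None:
        return None
    return "https://img.youtube.com/vi/%s/default.jpg" % video_id


# Hypothetical per-domain rules; a real table would have many more entries.
SITE_RULES = {
    "youtube.com": youtube_thumbnail,
}


def pick_largest(candidates):
    # candidates: list of (src, width, height) tuples scraped from the page.
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1] * c[2])[0]


def find_thumbnail(url, page_images=()):
    parsed = urlparse(url)

    # 1. The link points straight at an image file.
    if parsed.path.lower().endswith(IMAGE_EXTENSIONS):
        return url

    # 2. The domain has a predefined rule.
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    rule = SITE_RULES.get(host)
    if rule is not None:
        result = rule(url)
        if result is not None:
            return result

    # 3. Generic fallback: take the largest image found in the HTML.
    return pick_largest(list(page_images))
```

Picking the largest image is just one of the heuristics mentioned; "first on the page" or a content-area check would slot into step 3 the same way.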

If you have to resort to that last option (digging through the HTML), one method I would recommend is to extract several candidate images and A/B test them to find the one with the best click-through rate. That way you will almost always end up with the best one.
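
As a rough sketch of the selection step, assuming you already log clicks and impressions per candidate image (the function names and the stats layout are invented for this example):

```python
def click_through_rate(clicks, impressions):
    # Avoid dividing by zero for candidates that were never shown.
    return clicks / impressions if impressions else 0.0


def best_thumbnail(stats):
    # stats maps image URL -> (clicks, impressions).
    return max(stats, key=lambda url: click_through_rate(*stats[url]))
```

With only a handful of impressions the rates are noisy, so in practice you would want each candidate shown enough times before trusting the winner.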

+1

You can check the contents of the page's <img> tags.
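
For instance, a minimal sketch using Python's standard html.parser module to collect the src of every <img> in a page (ImgCollector and image_sources are names made up for this example; fetching the HTML is left out):

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_sources(html):
    collector = ImgCollector()
    collector.feed(html)
    return collector.sources
```

The src values may be relative, so you would typically resolve them against the page URL (for example with urllib.parse.urljoin) before downloading anything.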

0