Python3 urllib image retrieval

Question

I am writing a small Python script to grab images from Google Images. I have got as far as collecting the URLs of the images I want in a list. Now I just need to download them ...

For each image URL I do this:

    print("Retrieving:{0}".format(sFinalImageURL))
    sExt = sFinalImageURL.split('.')[-1]
    #u = urllib.request.urlopen(sFinalImageURL)
    try:
        u = urllib.request.urlopen(sFinalImageURL)
    except:
        print("error: cannot retrieve image")
        continue
    raw_data = u.read()
    print("read {0} bytes".format(len(raw_data)))
    u.close()
    global sImagesFolder
    try:
        f = open("{0}/{1}_{2}.{3}".format(sImagesFolder,sImage,i,sExt),'wb')
        f.write(raw_data)
        f.close()
    except:
        print("couldn't write to {0}/{1}_{2}.{3}".format(sImagesFolder,sImage,i,sExt))
    print()

Here is the issue I am facing:

Trying to open some of the URLs gives me a 403, although I can open the same URLs directly in my browser. So there is something in the headers of the HTTP request that the image server does not like ... any ideas?

Here are some of the results:

    Retrieving:http://upload.wikimedia.org/wikipedia/commons/thumb/4/43/Timba%2B1.jpg/220px-Timba%2B1.jpg
    error: cannot retrieve image

    Retrieving:http://upload.wikimedia.org/wikipedia/commons/thumb/2/26/YellowLabradorLooking_new.jpg/260px-YellowLabradorLooking_new.jpg
    error: cannot retrieve image

    Retrieving:http://1.bp.blogspot.com/-7SsJ1n3RdoA/Tf07NOgD5nI/AAAAAAAAABo/tl8qLLIU01Y/s1600/english-shepherd-dog-0003.jpg
    read 11123 bytes

    Retrieving:http://completedogfood.net/wp-content/uploads/2010/07/complete-dog-food.bmp
    read 419630 bytes

python-3.x urllib

Sheena Jun 08 '12 at 8:38

1 answer

Oleh prypin · Accepted Answer · 2012-06-08T09:47:14+0000

Wikipedia seems to allow access only to real browsers.
The problem can be solved by sending the User-Agent of a real browser, since Python's urllib sends something like Python-urllib/3.2 by default.

Here is an example that works (with the User-Agent string of the browser used):

    import urllib.request

    url = 'http://upload.wikimedia.org/wikipedia/commons/thumb/4/43/Timba%2B1.jpg/220px-Timba%2B1.jpg'
    # User-Agent string copied from a real browser
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.19 (KHTML, like Gecko) Ubuntu/12.04 Chromium/18.0.1025.168 Chrome/18.0.1025.168 Safari/535.19'
    u = urllib.request.urlopen(urllib.request.Request(url, headers={'User-Agent': user_agent}))
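
To fit this into a download loop like the one in the question, the same idea can be wrapped in a small helper. This is only a sketch: build_request, fetch_image, DEFAULT_USER_AGENT and the timeout value are names and choices made up for illustration, not something from the question or from urllib itself. Catching urllib.error.HTTPError separately also lets the script print the actual status code (for example 403) instead of a generic message.

```python
import urllib.error
import urllib.request

# Hypothetical default: the browser User-Agent string used in the answer above.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.19 (KHTML, like Gecko) "
    "Ubuntu/12.04 Chromium/18.0.1025.168 Chrome/18.0.1025.168 Safari/535.19"
)


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_image(url, user_agent=DEFAULT_USER_AGENT, timeout=10):
    """Return the raw bytes of the image, or None if the download fails."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent),
                                    timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 403.
        print("error: HTTP {0} {1} for {2}".format(err.code, err.reason, url))
    except urllib.error.URLError as err:
        # No usable answer at all (DNS failure, refused connection, ...).
        print("error: cannot reach {0}: {1}".format(url, err.reason))
    return None
```

In the loop from the question, raw_data = fetch_image(sFinalImageURL) followed by "if raw_data is None: continue" would then replace the bare try/except around urlopen.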
