Batch loading text and images from a URL using Python / urllib / beautifulsoup?

I was looking through a few posts here, but I just can't plunge into batch loading images and text from a given URL using Python.

    import urllib, urllib2
    import urlparse
    from BeautifulSoup import BeautifulSoup
    import os, sys

    def getAllImages(url):
        query = urllib2.Request(url)
        user_agent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 1.0.3705)"
        query.add_header("User-Agent", user_agent)
        page = BeautifulSoup(urllib2.urlopen(query))
        for div in page.findAll("div", {"class": "thumbnail"}):
            print "found thumbnail"
            for img in div.findAll("img"):
                print "found image"
                src = img["src"]
                if src:
                    src = absolutize(src, pageurl)
                    f = open(src, 'wb')
                    f.write(urllib.urlopen(src).read())
                    f.close()
            for h5 in div.findAll("h5"):
                print "found Headline"
                value = (h5.contents[0])
                print >> headlines.txt, value

    def main():
        getAllImages("http://www.nytimes.com/")

Above is the already updated code. What happens is: nothing. The code does not find any div with a thumbnail class, so, obviously, none of the prints produce any output. Perhaps I do not have the right selectors to get into the divs that contain the images and headlines?

Thanks a lot!

1 answer

Your OS does not know how to open the path that you pass to open() as src. Make sure the name you use to save the file to disk is one the OS can actually use:

 src = "abc.com/alpha/beta/charlie.jpg" with open(src, "wb") as f: # IOError - cannot open file abc.com/alpha/beta/charlie.jpg src = "alpha/beta/charlie.jpg" os.makedirs(os.path.dirname(src)) with open(src, "wb" as f: # Golden - write file here 

and everything will start to work.
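For illustration only, here is a minimal sketch of that idea applied to the image loop from the question (Python 2 to match it; the helper name save_image and the save_dir directory are made up for this example). It keeps only the path part of the image URL and joins it onto a local root directory before opening the file:

    import os
    import urllib2
    import urlparse

    def save_image(src, save_dir="images"):
        # Keep only the path portion of the URL, e.g. "/images/foo/bar.jpg"
        path = urlparse.urlparse(src).path
        local_name = os.path.join(save_dir, path.lstrip("/"))
        # Create the intermediate directories before opening the file
        target_dir = os.path.dirname(local_name)
        if target_dir and not os.path.isdir(target_dir):
            os.makedirs(target_dir)
        # Download the image bytes and write them to the local file
        data = urllib2.urlopen(src).read()
        with open(local_name, "wb") as f:
            f.write(data)
        return local_name

The loop would then call something like save_image(urlparse.urljoin(url, img["src"])) instead of opening the raw src directly.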

A few additional thoughts:

  • Be sure to normalize the path of the file you save to (for example, os.path.join(some_root_dir, relative_file_path)), otherwise you will end up writing images all over your hard drive depending on their src.
  • Unless you are only running a few tests, it is good practice to advertise that you are a bot in your user_agent and to honor the site's robots.txt rules (or, alternatively, provide some kind of contact information so people can ask you to stop if they need to); see the sketch after this list.
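As a rough illustration of the robots.txt point (not part of the original answer; the user-agent string MyImageBot/0.1 and contact URL are invented), the Python 2 standard library ships a robotparser module that can be consulted before fetching a page:

    import robotparser

    user_agent = "MyImageBot/0.1 (+http://example.com/contact)"

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.nytimes.com/robots.txt")
    rp.read()

    # Only fetch the page if the site's robots.txt allows it for this bot
    if rp.can_fetch(user_agent, "http://www.nytimes.com/"):
        getAllImages("http://www.nytimes.com/")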
