How to get a web page in Python, including any images

I am trying to retrieve the source of a webpage, including any images. At the moment I have this:

import urllib

page = urllib.urlretrieve('http://127.0.0.1/myurl.php', 'urlgot.php')
print open('urlgot.php').read()

which retrieves the source fine, but I also need to download any related images.

I thought I could write a regex that looks for img src and the like in the downloaded source; however, I was wondering whether there is a urllib function that will also download the images, like the wget command:

 wget -r --no-parent http://127.0.0.1/myurl.php 

I don't want to use the os module to shell out to wget, since I want the script to run on all systems. For the same reason, I can't use any third-party modules.

Any help is much appreciated! Thanks

2 answers

Don't use a regex when Python has a perfectly good parser built in:

import urllib
from HTMLParser import HTMLParser

base_url = 'http://127.0.0.1/'

class ImgParser(HTMLParser):
    def __init__(self, *args, **kwargs):
        self.downloads = []
        HTMLParser.__init__(self, *args, **kwargs)

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for attr in attrs:
                if attr[0] == 'src':
                    self.downloads.append(attr[1])

imgp = ImgParser()
with open('test.html') as f:
    # instead you could feed it the original url obj directly
    imgp.feed(f.read())

for path in imgp.downloads:
    url = base_url + path
    print url
    urllib.urlretrieve(url, path)
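As a minimal follow-up sketch, assuming you want to feed the parser the page fetched straight from the original URL (the one in the question) rather than a local file, and that some src values may be relative: it reuses the ImgParser class above and resolves each path against the page URL with urljoin from the standard urlparse module.

import urllib
from urlparse import urljoin

page_url = 'http://127.0.0.1/myurl.php'
html = urllib.urlopen(page_url).read()

imgp = ImgParser()  # the parser class defined above
imgp.feed(html)

for src in imgp.downloads:
    # urljoin handles relative paths ('images/foo.png') as well as absolute URLs
    img_url = urljoin(page_url, src)
    print img_url
    urllib.urlretrieve(img_url, src.split('/')[-1])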

Use BeautifulSoup to parse the returned HTML and find the links to images. You may also need to recursively fetch frames and iframes.
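A rough sketch of that approach, assuming BeautifulSoup 3 is installed (note the question rules out third-party modules, so this trades that constraint for simpler parsing); the URL is the one from the question.

import urllib
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, a third-party package

page_url = 'http://127.0.0.1/myurl.php'
html = urllib.urlopen(page_url).read()
soup = BeautifulSoup(html)

for img in soup.findAll('img'):
    src = img.get('src')
    if not src:
        continue
    img_url = urljoin(page_url, src)
    urllib.urlretrieve(img_url, src.split('/')[-1])

# frames/iframes would need a similar pass over soup.findAll(['frame', 'iframe'])
# and a recursive fetch of each frame's src.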
