How to get a web page in Python, including any images

I am trying to retrieve the source of a webpage, including any images. At the moment I have this:

import urllib

page = urllib.urlretrieve('http://127.0.0.1/myurl.php', 'urlgot.php')
print open('urlgot.php').read()

which retrieves the source fine, but I also need to download any related images.

I thought I could write a regex that looks for img src and the like in the downloaded source; however, I was wondering whether there is a urllib function that will also download the images, like the wget command:

 wget -r --no-parent http://127.0.0.1/myurl.php 

I don't want to use the os module to shell out to wget, since I want the script to run on all systems. For the same reason, I can't use any third-party modules.

Any help is much appreciated! Thanks

2 answers

Don't use a regex when Python has a perfectly good parser built in:

import urllib
from HTMLParser import HTMLParser

base_url = 'http://127.0.0.1/'

class ImgParser(HTMLParser):
    def __init__(self, *args, **kwargs):
        self.downloads = []
        HTMLParser.__init__(self, *args, **kwargs)

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for attr in attrs:
                if attr[0] == 'src':
                    self.downloads.append(attr[1])

imgp = ImgParser()
with open('test.html') as f:
    # instead you could feed it the original url obj directly
    imgp.feed(f.read())

for path in imgp.downloads:
    url = base_url + path
    print url
    urllib.urlretrieve(url, path)
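As a minimal follow-up sketch, assuming you want to feed the parser the page fetched straight from the original URL (the one in the question) rather than a local file, and that some src values may be relative: it reuses the ImgParser class above and resolves each path against the page URL with urljoin from the standard urlparse module.

import urllib
from urlparse import urljoin

page_url = 'http://127.0.0.1/myurl.php'
html = urllib.urlopen(page_url).read()

imgp = ImgParser()  # the parser class defined above
imgp.feed(html)

for src in imgp.downloads:
    # urljoin handles relative paths ('images/foo.png') as well as absolute URLs
    img_url = urljoin(page_url, src)
    print img_url
    urllib.urlretrieve(img_url, src.split('/')[-1])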

Use BeautifulSoup to parse the returned HTML and find the links to images. You may also need to recursively fetch frames and iframes.
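A rough sketch of that approach, assuming BeautifulSoup 3 is installed (note the question rules out third-party modules, so this trades that constraint for simpler parsing); the URL is the one from the question.

import urllib
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, a third-party package

page_url = 'http://127.0.0.1/myurl.php'
html = urllib.urlopen(page_url).read()
soup = BeautifulSoup(html)

for img in soup.findAll('img'):
    src = img.get('src')
    if not src:
        continue
    img_url = urljoin(page_url, src)
    urllib.urlretrieve(img_url, src.split('/')[-1])

# frames/iframes would need a similar pass over soup.findAll(['frame', 'iframe'])
# and a recursive fetch of each frame's src.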
