You have a couple of options: quick and dirty, or right. The quick and dirty way (which breaks easily when the layout changes) looks like this:
>>> from BeautifulSoup import BeautifulSoup
>>> import re
>>> soup = BeautifulSoup('<html><body><img style="background:url(/theRealImage.jpg) no-repeat 0 0; height:90px; width:92px;" src="notTheRealImage.jpg"/></body></html>')
>>> style = soup.find('img')['style']
>>> urls = re.findall(r'url\((.*?)\)', style)
>>> urls
[u'/theRealImage.jpg']
Obviously, you will have to play around with this to get it working with multiple img tags.
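As a rough sketch of what that could look like for several img tags: the version below applies the same regex idea to every img tag's style attribute. Note that it swaps BeautifulSoup for the standard library's html.parser so it runs without extra dependencies, and that the names ImgStyleUrlCollector and background_urls are made up for this example. It is still the quick and dirty approach, not real CSS parsing.

```python
import re
from html.parser import HTMLParser

# Matches url(...) and captures the path, with optional quotes around it.
# Still a regex over CSS text, so it shares the fragility described above.
URL_RE = re.compile(r'url\(\s*[\'"]?(.*?)[\'"]?\s*\)')


class ImgStyleUrlCollector(HTMLParser):
    """Hypothetical helper: collects url(...) targets from img style attributes."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag != 'img':
            return
        # Missing style attributes are treated as an empty string.
        style = dict(attrs).get('style') or ''
        self.urls.extend(URL_RE.findall(style))


def background_urls(html):
    """Return every url(...) target found in the style of any img tag."""
    collector = ImgStyleUrlCollector()
    collector.feed(html)
    collector.close()
    return collector.urls
```

Calling background_urls(page_source) returns the URLs in document order; img tags without a style attribute simply contribute nothing.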
The right way, since it would be terrible of me to assume that someone would use a regular expression on a CSS string :), is to use a CSS parser. cssutils, a library I just found on Google that is available on PyPI, looks like it can do the job.
Matt luongo