How to download multiple files and images from a website using Python

So, I'm trying to download several files from a site and save them to a folder. I'm trying to get highway information from this page ( http://www.wsdot.wa.gov/mapsdata/tools/InterchangeViewer/SR5.htm ), which is a list of PDF links. I want to write code that extracts the numerous PDF files found on the page. Is it possible to create a loop that goes through the website and saves each file to a local folder on my desktop? Does anyone know how I can do this?

+4
3 answers

Since your goal is to batch-download PDF files, the easiest way is not to write a script but to use ready-made software. Internet Download Manager can do what you need in two steps:

  • Copy all of the page text, including the links, from your web browser.
  • Choose Tasks > Add Batch Download From Clipboard.


-2

This is a coding problem. I can point you to some tools for this, but not a complete solution.

Requests library: communicating with an HTTP server (websites)

http://docs.python-requests.org/en/latest/

BeautifulSoup: an HTML parser (for parsing a site's source code)

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Example:

>>> import requests
>>> from bs4 import BeautifulSoup as BS
>>> 
>>> response = requests.get('http://news.ycombinator.com')
>>> response.status_code # 200 == OK
200
>>> 
>>> soup = BS(response.text) # Create an HTML parsing object
>>>
>>> soup.title # Here's the page title tag
<title>Hacker News</title>
>>>
>>> soup.title.text # The contents of the tag
u'Hacker News'
>>> 
>>> # Here are some article posts
... 
>>> post_containers = soup.find_all('tr', attrs={'class':'athing'})
>>> 
>>> print 'There are %d article posts.' % len(post_containers)
There are 30 article posts.
>>> 
>>> 
>>> # The article name is the 3rd and last object in a post_container
... 
>>> for container in post_containers:
...     title = container.contents[-1] # The last tag
...     title.a.text # Grab the `a` tag inside our title tag, print the text
... 
u'Show HN: \u201cWho is hiring?\u201d Map'
u'\u2018Flash Boys\u2019 Programmer in Goldman Case Prevails Second Time'
u'Forthcoming OpenSSL releases'
u'Show HN: YouTube Filesystem \u2013 YTFS'
u'Google launches Uber rival RideWith'
u'Finish your stuff'
u'The Plan to Feed the World by Hacking Photosynthesis'
u'New electric engine improves safety of light aircraft'
u'Hacking Team hacked, attackers claim 400GB in dumped data'
u'Show HN: Proof of concept \u2013 Realtime single page apps'
u'Berkeley CS 61AS \u2013 Structure and Interpretation of Computer Programs, Self-Paced'
u'An evaluation of Erlang global process registries: meet Syn'
u'Show HN: Nearby Buzz \u2013\xa0Take control of your online reviews'
u"The Grateful Dead Wall of Sound"
u'The Effects of Intermittent Fasting on Human and Animal Health'
u'JsCoq'
u'Taking stock of startup innovation in the Netherlands'
u'Hangout: Becoming a freelance developer'
u'Panning for Pangrams: The Search for the New Quick Brown Fox'
u'Show HN: MUI \u2013 Lightweight CSS Framework for Material Design'
u"Intel 10nm 'Cannonlake' delayed, replaced by 14nm 'Kaby Lake'"
u'VP of Logistics \u2013 EasyPost (YC S13) Hiring'
u'Colorado\u2019s Effort Against Teenage Pregnancies Is a Startling Success'
u'Lexical Scanning in Go (2011)'
u'Avoiding traps in software development with systems thinking'
u"Apache Cordova: after 10 months, I won't using it anymore"
u'An exercise in profiling a Go program'
u"The Science of Pixar \u2018Inside Out\u2019"
u'Ask HN: What tech blogs, podcasts do you follow outside of HN?'
u'NASA\u2019s New Horizons Plans July 7 Return to Normal Science Operations'
>>> 
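The same parsing idea answers the original question: collect only the hrefs that point at PDF files and resolve them against the page's URL. Here is a minimal Python 3 sketch using only the standard library (the class name `PdfLinkFinder` and the sample HTML are made up for illustration; on the real page you would feed it the downloaded HTML and then fetch each collected URL):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PdfLinkFinder(HTMLParser):
    """Collects the absolute URL of every <a> tag that links to a .pdf file."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value and value.lower().endswith('.pdf'):
                    # Resolve relative links against the page's own URL
                    self.links.append(urljoin(self.base_url, value))

finder = PdfLinkFinder('http://www.wsdot.wa.gov/mapsdata/tools/InterchangeViewer/')
finder.feed('<a href="docs/exit1.pdf">Exit 1</a> <a href="/about.htm">About</a>')
print(finder.links)
# ['http://www.wsdot.wa.gov/mapsdata/tools/InterchangeViewer/docs/exit1.pdf']
```

Each collected URL can then be saved to disk, e.g. with `requests.get(url).content` written to a local file.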
+3

You can use Python's urllib to fetch the page and then download each PDF it links to. Note that the page's HTML is not valid XML, so parse it with an HTML parser rather than xml.etree.ElementTree.

import urllib
from urlparse import urljoin
from bs4 import BeautifulSoup

page = 'http://www.wsdot.wa.gov/mapsdata/tools/InterchangeViewer/SR5.htm'
soup = BeautifulSoup(urllib.urlopen(page).read())
for a in soup.find_all('a', href=True):
    href = a['href']
    if href.lower().endswith('.pdf'):
        # Resolve relative links, then save each PDF under its own file name
        urllib.urlretrieve(urljoin(page, href), href.split('/')[-1])
0
