Python Web Scraping - Urlopen Error [Errno -2] Name or service not known

I am trying to extract data from Civic Commons Apps for my project. I am able to get the links to the pages I need, but when I try to open those links, I get "urlopen error [Errno -2] Name or service not known".

Python web scraping code:

    from bs4 import BeautifulSoup
    from urlparse import urlparse, parse_qs
    import re
    import urllib2
    import pdb

    base_url = "http://civiccommons.org"
    url = "http://civiccommons.org/apps"
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())

    # Collect the category links from the apps page, deduplicated.
    list_of_links = []
    for link_tag in soup.findAll('a', href=re.compile('^/civic-function.*')):
        string_temp_link = base_url + link_tag.get('href')
        list_of_links.append(string_temp_link)
    list_of_links = list(set(list_of_links))

    # For each category, enumerate every paginated URL up to the last page.
    list_of_next_pages = []
    for categorized_apps_url in list_of_links:
        categorized_apps_page = urllib2.urlopen(categorized_apps_url)
        categorized_apps_soup = BeautifulSoup(categorized_apps_page.read())

        last_page_tag = categorized_apps_soup.find('a', title="Go to last page")
        if last_page_tag:
            last_page_url = base_url + last_page_tag.get('href')
            index_value = last_page_url.find("page=") + 5
            base_url_for_next_page = last_page_url[:index_value]
            for pageno in xrange(0, int(parse_qs(urlparse(last_page_url).query)['page'][0]) + 1):
                list_of_next_pages.append(base_url_for_next_page + str(pageno))
        else:
            list_of_next_pages.append(categorized_apps_url)

I get the following error:

        urllib2.urlopen(categorized_apps_url)
      File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
        return _opener.open(url, data, timeout)
      File "/usr/lib/python2.7/urllib2.py", line 400, in open
        response = self._open(req, data)
      File "/usr/lib/python2.7/urllib2.py", line 418, in _open
        '_open', req)
      File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
        result = func(*args)
      File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
        return self.do_open(httplib.HTTPConnection, req)
      File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
        raise URLError(err)
    urllib2.URLError: <urlopen error [Errno -2] Name or service not known>

Is there anything specific I need to take care of when executing urlopen? I don't see a problem with the HTTP links that I receive.

[edit] On the second run, I received the following error:

  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen return _opener.open(url, data, timeout) File "/usr/lib/python2.7/urllib2.py", line 400, in open response = self._open(req, data) File "/usr/lib/python2.7/urllib2.py", line 418, in _open '_open', req) File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain result = func(*args) File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open return self.do_open(httplib.HTTPConnection, req) File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open raise URLError(err) 

The same code works fine on my other Mac, but it doesn't work on my Ubuntu 12.04 machine.

I also tried running the code on ScraperWiki, and it completed successfully. But several URLs were missing (compared to the run on the Mac). Is there a reason for this behavior?

+7
2 answers

The code works on my Mac and on your friend's Mac. It also runs fine from an Ubuntu 12.04 server virtual machine instance. Obviously, something in your particular environment, either your OS (Ubuntu Desktop?) or your network, is making it fail. For example, my home router's default settings throttle the number of calls to the same domain within x seconds, and that could cause this kind of problem if I didn't turn it off. It could be any number of things.

At this point, I suggest refactoring your code to catch URLError and set aside the problematic URLs for a retry. Also log/print the errors if they still fail after several retries. Maybe even throw in some code to time your calls between errors. That is better than having your script simply fail outright, and you will get feedback on whether it is just particular URLs causing the problem or a timing issue (i.e., does it fail after x number of urlopen calls, or after x number of urlopen calls in x number of micro/seconds?). If it is a timing issue, a simple time.sleep(1) inserted into your loops may do the trick.
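
For example, a minimal retry sketch along those lines (the helper name fetch_with_retries and the retry count/delay values are made up for illustration, not taken from the question's code):

    import time
    import urllib2

    def fetch_with_retries(url, attempts=3, delay=1.0):
        # Try the URL a few times, sleeping between failures, before giving up.
        for attempt in xrange(attempts):
            try:
                return urllib2.urlopen(url).read()
            except urllib2.URLError as e:
                print "attempt %d failed for %s: %s" % (attempt + 1, url, e.reason)
                time.sleep(delay)  # back off before the next attempt
        return None  # every attempt failed

    # Set aside the URLs that never succeeded so you can inspect them later.
    failed_urls = []
    for categorized_apps_url in list_of_links:
        page_content = fetch_with_retries(categorized_apps_url)
        if page_content is None:
            failed_urls.append(categorized_apps_url)

If failed_urls stays empty once the sleep is in place, that points at throttling rather than the URLs themselves.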

+4

SyncMaster,

I ran into the same issue recently when I jumped onto an old Ubuntu box that I hadn't played with in a while. This problem is actually caused by the DNS settings on your machine. I highly recommend that you check your DNS settings (edit /etc/resolv.conf and add the line nameserver 8.8.8.8) and then try again; you should succeed.
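
To confirm that name resolution really is the culprit before (or after) editing /etc/resolv.conf, a quick sanity check is to resolve the host directly from Python; the [Errno -2] in the traceback comes from exactly this kind of lookup. This is just a sketch using the hostname from the question:

    import socket

    try:
        # gaierror with errno -2 is the "Name or service not known" failure
        socket.gethostbyname("civiccommons.org")
        print "DNS resolution works"
    except socket.gaierror as e:
        print "DNS resolution failed:", e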

+4
