Using multiple web pages in a web scraper

I have been working on some Python code to collect links to social media accounts from government websites, for research into contacting municipalities. I managed to adapt some code to work in Python 2.7, and it prints all the links to Facebook, Twitter, LinkedIn and Google+ found on the given input website. The problem I'm currently running into is that I don't want to look for links on just one web page, but on a list of about 200 sites that I have in an Excel file. I have no experience importing such a list into Python, so I was wondering if anyone could take a look at the code and suggest the right way to set all these web pages as base_url, if that is possible.

    import cookielib
    import mechanize

    base_url = "http://www.amsterdam.nl"

    br = mechanize.Browser()
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)
    br.set_handle_robots(False)
    br.set_handle_equiv(False)
    br.set_handle_redirect(True)
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

    page = br.open(base_url, timeout=10)

    links = {}
    for link in br.links():
        if link.url.find('facebook') >= 0 or link.url.find('twitter') >= 0 or link.url.find('linkedin') >= 0 or link.url.find('plus.google') >= 0:
            links[link.url] = {'count': 1, 'texts': [link.text]}

    # printing
    for link, data in links.iteritems():
        print "%s - %s - %s - %d" % (base_url, link, ",".join(data['texts']), data['count'])
python social-media
1 answer

You mentioned that you have an Excel file with the list of all the sites. You can export that Excel file as a CSV file and then read the values from your Python code.

Here is more information about this.

Here's how to work directly with Excel files.
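
For example, a minimal sketch using the xlrd library (assuming the workbook is saved as urls.xls and the URLs sit in the first column of the first sheet; both the filename and the layout are only placeholders):

    import xlrd

    # Open the workbook and take the first sheet (assumed layout)
    book = xlrd.open_workbook('urls.xls')
    sheet = book.sheet_by_index(0)

    # Read the first column of every row as a URL string
    links = [sheet.cell_value(row, 0) for row in range(sheet.nrows)]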

You can do something along these lines:

    import csv

    links = []
    with open('urls.csv', 'r') as csv_file:
        csv_reader = csv.reader(csv_file)
        # Simple example where each row contains a single URL column;
        # take the first cell of every row so links is a flat list of URL strings
        links = [row[0] for row in csv_reader]

links is now a list of all the URLs. You can then iterate over that list inside a function that fetches each page and scrapes the data.

    def extract_social_links(links):
        for base_url in links:
            br = mechanize.Browser()
            cj = cookielib.LWPCookieJar()
            br.set_cookiejar(cj)
            br.set_handle_robots(False)
            br.set_handle_equiv(False)
            br.set_handle_redirect(True)
            br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
            br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

            page = br.open(base_url, timeout=10)

            # Collect the social media links found on this page
            # (renamed from links to avoid shadowing the function argument)
            social_links = {}
            for link in br.links():
                if link.url.find('facebook') >= 0 or link.url.find('twitter') >= 0 or link.url.find('linkedin') >= 0 or link.url.find('plus.google') >= 0:
                    social_links[link.url] = {'count': 1, 'texts': [link.text]}

            # printing
            for link, data in social_links.iteritems():
                print "%s - %s - %s - %d" % (base_url, link, ",".join(data['texts']), data['count'])
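
With the CSV reading from above, calling the function is then just the following (note that this script also needs the mechanize and cookielib imports from the original code):

    extract_social_links(links)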

As an aside, you should probably split up the if condition to make it more readable.
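
For instance (a sketch of one possible rewrite, with SOCIAL_DOMAINS introduced here only for illustration), the repeated find() calls could be replaced with a single any() over a tuple of substrings:

    # Substrings that identify social media links (illustrative name)
    SOCIAL_DOMAINS = ('facebook', 'twitter', 'linkedin', 'plus.google')

    for link in br.links():
        if any(domain in link.url for domain in SOCIAL_DOMAINS):
            social_links[link.url] = {'count': 1, 'texts': [link.text]}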

