Scrape an AJAX web page with Python and/or Scrapy

What I want to do is scrape the petition data - name, city, state, date, signature number - from one or more petitions at petitions.whitehouse.gov

I assume that at the moment Python is the way to go - perhaps with the Scrapy library - along with some code to handle the AJAX aspects of the site. The reason for this scraper is that this petition data is not otherwise available to the public.

I am an independent technical journalist, and I want to be able to dump the data of each petition into a CSV file, to analyze the number of people from each state who sign a state's petition and, with the data from several petitions, find the number of people signing multiple petitions, etc., and then draw some conclusions about the political viability of the petitions and the data themselves.
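As a sketch of that end goal, here is a minimal example of dumping rows into a CSV file with Python's standard csv module. The field names and the sample row are hypothetical placeholders based on the fields listed above, not actual petition data:

```python
import csv

# Hypothetical field names, based on the data described above.
FIELDS = ["name", "city", "state", "date", "signature_number"]

def write_signatures(path, rows):
    """Write a list of signature dicts to a CSV file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)

# Example usage with a made-up row:
sample = [
    {"name": "J. S.", "city": "Raleigh", "state": "NC",
     "date": "2012-11-20", "signature_number": 1},
]
write_signatures("petition.csv", sample)
```

Once the data is in this shape, per-state counts are a matter of grouping on the `state` column.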

The petition application at petitions.whitehouse.gov runs as a Drupal module, and the White House developers responded to an issue I opened on github https://github.com/WhiteHouse/petition/issues/44 saying that they are working on an API to allow access to the petition data from the module. But there is no release date for that API, and it doesn't solve the problem of accessing the petition data currently on petitions.whitehouse.gov.

I emailed the White House and the White House developers, stating that I am a freelance journalist and asking for some way to access the data. The White House Office of Digital Strategy told me that "unfortunately, we do not have the means to export the data at this time, but we are working to open up the data through an API." There is an Open Data initiative at the White House, but apparently the petition data is not part of it.

Privacy and TOS: There is little expectation of privacy when signing a petition, and there is no explicit TOS that prohibits scraping this data.

What has been done: Some professors at UNC wrote (what I assume is) a python script to scrape the data, but they don't want to release the script because, they said, they are still working on it. http://www.unc.edu/~ncaren/secessionists/ They did send me a CSV dump of the data from one petition I am particularly interested in.

What I did: I created a github project for this, because I want the petition data scraper to be useful to everyone - petition organizers, journalists, etc. - who wants this data. https://github.com/markratledge/whitehousescraper

I have no experience with Python and little experience with shell scripts, and what I'm trying to do is clearly beyond my experience at the moment.

I ran a GUI script to send a spacebar keypress to the web browser every five seconds or so, and in that way scraped ~10,000 signatures by cutting and pasting the browser text into a text editor. From there, I could process the text into CSV with grep and awk. This, of course, doesn't work very well: Chrome bogged down under the page size, and it took hours to get that many signatures.

What I have found so far: from what I can gather from other SO questions and answers, Python and Scrapy http://scrapy.org look like the way to avoid the browser issues. But the page uses an AJAX call to load the next set of signatures. It seems to be a "static" AJAX request, because the URL doesn't change.

In Firebug, the JSON request URLs appear to have a random string appended to them, with the page number immediately before it. Does that suggest what needs to be done? Does the script need to emulate these URLs and send them to the web server?

Request URL: https://petitions.whitehouse.gov/signatures/more/50ab2aa8eab72abc4a000020/2/50b32771ee140f072e000001
Request URL: https://petitions.whitehouse.gov/signatures/more/50ab2aa8eab72abc4a000020/3/50b640f40e00fe00610ff
Request URL: https://petitions.whitehouse.gov/signatures/more/50ab2aa8eab72abc4a000020/4/50afb3d7c988d47504000004

This is the JS function that loads the next set of signatures onto the page:

    (function ($) {
      Drupal.behaviors.morePetitions = {
        attach: function (context) {
          $('.petition-list .show-more-petitions-bar').unbind();
          $('.petition-list .show-more-petitions-bar').bind('click', function () {
            $('.show-more-petitions-bar').addClass('display-none');
            $('.loading-more-petitions-bar').removeClass('display-none');
            var petition_sort = retrieveSort();
            var petition_cols = retrieveCols();
            var petition_issues = retrieveIssues();
            var petition_search = retrieveSearch();
            var petition_page = parseInt($('#page-num').html());
            var url = "/petitions/more/" + petition_sort + "/" + (petition_page + 1) + "/" +
                      petition_cols + "/" + petition_issues + "/" + petition_search + "/";
            var params = {};
            $.getJSON(url, params, function (data) {
              $('#petition-bars').remove();
              $('.loading-more-petitions-bar').addClass('display-none');
              $('.show-more-petitions-bar').removeClass('display-none');
              $('.petition-list .petitions').append(data.markup).show();
              if (typeof wh_petition_adjustHeight == 'function') {
                wh_petition_adjustHeight();
              }
              Drupal.attachBehaviors('.petition-list .show-more-petitions-bar');
              if (typeof wh_petition_page_update_links == 'function') {
                wh_petition_page_update_links();
              }
            });
            return false;
          });
        }
      };
    })(jQuery);

and it fires when this element is revealed by scrolling to the bottom of the browser window:

    <a href="/petition/.../l76dWhwN?page=2&amp;last=50b3d98e7043012b24000011" class="load-next no-follow active" rel="509ec31cadfd958d58000005">Load Next 20 Signatures</a>
    <div id="last-signature-id" class="display-none">50b3d98e7043012b24000011</div>
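The page number and last-signature id needed for the next request can be pulled out of that anchor without a browser. Here is a minimal sketch using only the Python standard library, fed with a full example of such an anchor (the one quoted in the first answer below is used here, since the href above is elided):

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs

class LoadNextParser(HTMLParser):
    """Pull page number and last-signature id out of the
    'Load Next 20 Signatures' anchor."""
    def __init__(self):
        super().__init__()
        self.page = None
        self.last = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "load-next" in attrs.get("class", ""):
            # HTMLParser unescapes &amp; in attribute values for us.
            qs = parse_qs(urlparse(attrs["href"]).query)
            self.page = int(qs["page"][0])
            self.last = qs["last"][0]

html = ('<a href="/petition/shut-down-tar-sands-project-utah-it-begins-'
        'and-reject-keystone-xl-pipeline/H1MQJGMW?page=2&amp;'
        'last=50b5a1f9ee140f227a00000b" class="load-next no-follow active" '
        'rel="50ae9207eab72aed25000003">Load Next 20 Signatures</a>')

parser = LoadNextParser()
parser.feed(html)
```

After `feed()`, `parser.page` holds the next page number and `parser.last` the id to append to the next request URL.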

So what is the best way to do this? How far can I get with Scrapy? Or is there another Python library that is better suited for this?

Feel free to comment, point me in the right direction with code snippets or other SO questions/answers, or contribute on github.

2 answers

The "random" URL looks like this:

    https://petitions.whitehouse.gov/signatures/more/petitionid/pagenum/lastpetition

where petitionid is static for a given petition, pagenum increments each time, and lastpetition is returned by each previous request.
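Based on that pattern, the paging loop could look like the following sketch. The `extract_last_id` helper and the loop constants are hypothetical placeholders; exactly which field of the JSON response carries the last-signature id would need to be confirmed in Firebug (the JS in the question appends `data.markup`, and the page exposes a `last-signature-id` div):

```python
BASE = "https://petitions.whitehouse.gov/signatures/more"

def next_signatures_url(petition_id, page_num, last_signature):
    """Build the URL for the next block of signatures, following the
    petitionid/pagenum/lastpetition pattern observed in Firebug."""
    return "{}/{}/{}/{}".format(BASE, petition_id, page_num, last_signature)

# Hypothetical paging loop (requires the third-party `requests` library;
# PETITION_ID, FIRST_LAST_ID, MAX_PAGES and extract_last_id are placeholders):
#
# import requests
# s = requests.session()
# last = FIRST_LAST_ID
# for page in range(2, MAX_PAGES):
#     data = s.get(next_signatures_url(PETITION_ID, page, last)).json()
#     last = extract_last_id(data["markup"])
```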

My usual approach would be to use the requests library to emulate a browser session (for the cookies), and then work out what requests the browser is making.

    import requests

    s = requests.session()
    url = 'http://httpbin.org/get'
    params = {'cat': 'Persian', 'age': 3, 'name': 'Furball'}
    s.get(url, params=params)

I would pay particular attention to the following link:

<a href="/petition/shut-down-tar-sands-project-utah-it-begins-and-reject-keystone-xl-pipeline/H1MQJGMW?page=2&amp;last=50b5a1f9ee140f227a00000b" class="load-next no-follow active" rel="50ae9207eab72aed25000003">Load Next 20 Signatures</a>


It's hard to fully imitate jQuery/Javascript in Python. You could look at spidermonkey or at browser-automation tools like Selenium, which can fully automate any browser activity. Related SO question: How can Python work with javascript

