Automate webpage interaction in Python

I want to automate interaction with a web page. I have been using pycurl so far, but the page will eventually use JavaScript, so I am looking for alternatives. The typical interaction is: open the page, find some text, click a link (which opens a form), fill out the form, and submit it.

We are deploying on Google App Engine, if that matters.

Clarification: we are deploying a web page on App Engine, but the interaction is performed from a separate machine. So Selenium seems to be the best choice.

+4
5 answers

What about Selenium? ( http://seleniumhq.org )

+4

Twill and mechanize do not do JavaScript, and Qt and Selenium cannot run on App Engine (1), which only supports pure Python code. I do not know of any pure-Python JavaScript interpreter, which is what you would need to deploy a JS-capable scraper on App Engine :-(.

Maybe there is something in Java that would at least let you deploy on the Java version of App Engine? Java and Python App Engine applications can use the same datastore, so you could keep part of your application in Python... just not the part that needs to understand JavaScript. Unfortunately, I don't know enough about the Java/AE environment to suggest any particular package to try.

(1): to clarify, since there seems to be a misunderstanding here that went far enough to prompt this note: if you run Selenium or other scrapers on another machine, you can of course point them at a site deployed on App Engine (it doesn't matter how the site you are targeting is deployed, what programming language it uses, etc., as long as you can access it [[a real site using Flash, etc., may be different]]). As I read the question, the OP is looking for ways to do the scraping as part of the App Engine application itself; that is the problematic part, not where you (or someone else ;-)) run the site being scraped!

+6

Have you tried QtWebKit with PyQt? You can load a given URL and read its contents from Python, then find further URLs and use WebKit again to access them. I think all of this could be combined with some basic Django (assuming you use Django on GAE) to check the response code. Here is some sample PyQt/QtWebKit code to get started if you want to do this with a GUI:

import sys
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

app = QApplication(sys.argv)
web = QWebView()
settings = web.settings()
settings.setAttribute(QWebSettings.PluginsEnabled, True)
settings.setAttribute(QWebSettings.JavaEnabled, True)
settings.setAttribute(QWebSettings.JavascriptCanOpenWindows, True)
settings.setAttribute(QWebSettings.JavascriptCanAccessClipboard, True)
settings.setAttribute(QWebSettings.DeveloperExtrasEnabled, True)
settings.setAttribute(QWebSettings.ZoomTextOnly, True)
settings.setOfflineStoragePath('.')
settings.setIconDatabasePath('.')

url = 'http://stackoverflow.com'
web.load(QUrl(url))
web.show()
sys.exit(app.exec_())
+1

Check out mechanize. It should handle your "typical interaction" easily. Another option might be Selenium, but I have never used it personally.

0

twill is very lightweight, but it works well.

0
source
