Web page scraper with JavaScript support in Python

I'm working in Python 3.2 (newbie) on a Windows machine (I also have Ubuntu 10.04 in a virtual machine if necessary, but I'd prefer to work on Windows).

Basically, I can use the http module and the urllib module to scrape web pages, but only those that don't use JavaScript document.write ("

To handle such sites, I'm fairly sure I need a JavaScript engine to process the page and give me the final output, hopefully as a dict or text.

I tried to compile python-spidermonkey, but I understand that it isn't for Windows and doesn't work with Python 3.x :-?

Any suggestions? If anyone has done something like this before, I'd be grateful for the help!

0
3 answers

I recommend the Python bindings to the WebKit library - here is an example. WebKit is cross-platform and is used to render web pages in Chrome and Safari. Great library.

+2

Use Firebug to see exactly what is being requested to display the data (a POST or GET URL?). I suspect there is an AJAX call that retrieves the data from the server as either XML or JSON. Just make the same request yourself and parse the data.
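
As a rough sketch of that idea, assuming the endpoint returns JSON: the snippet below uses only the standard library (urllib.request and json). The URL in the usage comment is a placeholder, not a real endpoint, and the X-Requested-With header is an assumption about what the server expects; copy the actual URL and headers from what Firebug shows.

```python
import json
import urllib.request


def decode_json(raw_bytes, encoding="utf-8"):
    """Turn the raw body of an AJAX response into Python objects."""
    return json.loads(raw_bytes.decode(encoding))


def fetch_json(url):
    """Request the URL the page's JavaScript would call and decode the reply."""
    request = urllib.request.Request(
        url,
        headers={
            # Assumption: some servers only answer AJAX-style requests,
            # so these headers mimic what a browser script would send.
            "X-Requested-With": "XMLHttpRequest",
            "Accept": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return decode_json(response.read())


# Usage (hypothetical URL; replace it with the one seen in Firebug):
# data = fetch_json("http://example.com/api/items?page=1")
# print(data)
```

The result is an ordinary dict or list, so no JavaScript engine is involved at all.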

Alternatively, you can download Selenium for Firefox, start the Selenium server, load the page through Selenium, and read the contents of the DOM. MozRepl also works, but it is less well documented because it is not as widely used.
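
A minimal sketch of the Selenium route might look like the following. It uses the Selenium WebDriver Python bindings, which drive Firefox directly rather than going through a separate Selenium server, so it is a variation on the setup described above. It assumes the third-party selenium package and a Firefox driver are installed. The helper names (rendered_html, visible_text, TextExtractor) are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of an HTML document, ignoring script and style bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text fragments of an HTML string as a list."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def rendered_html(url):
    """Load the page in Firefox and return the DOM after scripts have run."""
    # Imported here so the parsing helpers work without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


# Usage (hypothetical URL):
# print(visible_text(rendered_html("http://example.com/page-with-js")))
```

Because the browser executes the page's scripts, content produced by document.write should be present in page_source by the time it is read, although pages that load data after a delay may need an explicit wait.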

+1

document.write is commonly used when content is generated on the fly, often by retrieving data from the server. The result is a web application that is more about JavaScript than HTML. Scraping is a matter of downloading HTML and processing it, but here there is no HTML to download. You are essentially trying to scrape a GUI program.

Most of these applications have some kind of API, often returning XML or JSON data, that you can use instead. If that is not the case, you should probably try to remote-control a real web browser.
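
If such an API returns XML, the response can be turned into plain dicts with the standard library. This is only a sketch: the element name "item" and the layout of the sample document are assumptions made for illustration, not the format of any particular site.

```python
import xml.etree.ElementTree as ET


def records_from_xml(xml_text, record_tag="item"):
    """Convert each <record_tag> element into a dict of attributes and child text."""
    root = ET.fromstring(xml_text)
    records = []
    for element in root.iter(record_tag):
        record = dict(element.attrib)
        for child in element:
            # Empty elements have text None; store them as an empty string.
            record[child.tag] = (child.text or "").strip()
        records.append(record)
    return records


# Made-up sample payload standing in for a real API response.
sample = (
    '<items>'
    '<item id="1"><name>Foo</name><price> 9.99 </price></item>'
    '<item id="2"><name>Bar</name><price/></item>'
    '</items>'
)
print(records_from_xml(sample))
```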

0