Scraping JavaScript-generated data using Python

Question

I want to scrape some data from the following URL using Python: http://www.hankyung.com/stockplus/main.php?module=stock&mode=stock_analysis_infomation&itemcode=078340

This page shows company summary information.

The data I want to scrape does not appear on the first page. Clicking the tab named "재무 제표" opens the financial statements, and clicking the tab named "현금 흐름표" opens the cash flow statement.

I want to scrape the cash flow data.

However, the cash flow data is generated by JavaScript through a different URL. The following link is the hidden URL: http://stock.kisline.com/compinfo/financial/main.action?vhead=N&vfoot=N&vstay=&omit=&vwidth=

The cash flow data is generated by sending some parameter values and a cookie to this URL.

As you can see, itemcode=078340 in the first link is the stock code, and there are 1680 stocks whose cash flow data I want to collect. I want to do this in a loop.
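To illustrate the loop, here is a minimal sketch that builds the summary URL for each stock code. The query parameters are taken from the first link above; the `summary_url` helper and the one-element `itemcodes` list are hypothetical names introduced only for this example, and the real list would hold all 1680 codes.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.hankyung.com/stockplus/main.php"


def summary_url(itemcode):
    """Build the company-summary URL for one stock code (hypothetical helper)."""
    params = {
        "module": "stock",
        "mode": "stock_analysis_infomation",  # spelled as in the original link
        "itemcode": itemcode,
    }
    return BASE_URL + "?" + urlencode(params)


# Hypothetical sample list; codes are kept as strings so that
# leading zeros (as in "078340") are not lost.
itemcodes = ["078340"]

for code in itemcodes:
    print(summary_url(code))
```

This only covers the first page. The hidden URL still needs the parameter values and cookie mentioned above, which are not shown here.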

Is there a good way to scrape the cash flow data? I tried writing a script for it, but it is hard for me to make it work with the other scraping code that I use.

+8

javascript python web-scraping screen-scraping

trigger Apr 7 '12 at 6:56

2 answers

Niklas B. · Answer 1 · 2012-04-07T10:20:31+0000

There is also dryscrape (a library written by me, so the recommendation is a bit biased, obviously :), which uses a fast WebKit-based in-memory browser for navigation. It also understands JavaScript, but is much more lightweight than Selenium.

Mikko Ohtamaa · Answer 2 · 2012-04-07T10:16:25+0000

If you need to scrape the contents of a page that is updated using AJAX, and you do not control this AJAX interface, I would use the Selenium browser automation tool for the task:

http://code.google.com/p/selenium/

Selenium has Python bindings.
It launches a real browser instance, so it can execute and scrape exactly the same thing that you see with your own eyes.
Retrieve the HTML document content after the AJAX updates through the Selenium API.
Use lxml with XPath / CSS selectors to parse the relevant parts out of the document.
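The steps above could be sketched roughly as follows. This is an illustration under assumptions, not a tested scraper for the site in the question: it assumes Firefox and its driver are installed, that the data of interest ends up in an HTML table, and that waiting for any `table` element is a good enough signal that the AJAX update has finished. The function names `fetch_rendered_html` and `table_rows` are hypothetical, and clicking through the tabs is not covered.

```python
from lxml import html


def fetch_rendered_html(url, wait_seconds=10):
    """Load url in a real browser and return the HTML after scripts have run."""
    # Imported here so that table_rows() below can be used without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Assumption: the generated content includes at least one table.
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((By.TAG_NAME, "table"))
        )
        return driver.page_source
    finally:
        driver.quit()


def table_rows(page_source):
    """Return every non-empty table row as a list of stripped cell texts."""
    tree = html.fromstring(page_source)
    rows = []
    for tr in tree.xpath("//table//tr"):
        cells = [cell.text_content().strip() for cell in tr.xpath("./th|./td")]
        if cells:
            rows.append(cells)
    return rows
```

A call such as `table_rows(fetch_rendered_html(url))` would then yield the table contents as plain lists. The XPath expression would need to be narrowed to the specific table once the real page structure is known.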
