Scraping JavaScript-generated data using Python

I want to scrape some data from the following URL using Python: http://www.hankyung.com/stockplus/main.php?module=stock&mode=stock_analysis_infomation&itemcode=078340

This page shows company summary information.

What I want to scrape does not appear on the first page. Clicking the tab named "재무 제표" opens the financial statements, and clicking the tab named "현금 흐름표" opens the cash flow statement.

I want to scrape the cash flow data.

However, the cash flow data is generated by JavaScript from a different URL. The following link is the hidden URL: http://stock.kisline.com/compinfo/financial/main.action?vhead=N&vfoot=N&vstay=&omit=&vwidth=

The cash flow data is generated by sending some parameter values and a cookie to this URL.

As you can see, itemcode=078340 in the first link is the stock code, and there are 1,680 stocks for which I want to collect cash flow data. I want to build a loop over them.
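The loop over stock codes could be sketched roughly as follows. This only builds the summary page URL for each code from the link above; the helper names `summary_url` and `iter_summary_urls` are made up for illustration, and the parameters and cookie expected by the hidden URL are not known here, so they are left out.

```python
from urllib.parse import urlencode

# Base address and query parameters taken from the summary link in the question.
BASE_URL = "http://www.hankyung.com/stockplus/main.php"


def summary_url(itemcode):
    """Build the company summary URL for one stock code.

    The code is kept as a string so that a leading zero
    (as in "078340") is not lost.
    """
    query = urlencode({
        "module": "stock",
        "mode": "stock_analysis_infomation",  # spelled as in the original URL
        "itemcode": str(itemcode),
    })
    return f"{BASE_URL}?{query}"


def iter_summary_urls(itemcodes):
    """Yield (itemcode, url) pairs for every stock code in the list."""
    for code in itemcodes:
        yield code, summary_url(code)


if __name__ == "__main__":
    # Only the first code comes from the question; a real run would
    # load all 1,680 codes from a file or a database.
    for code, url in iter_summary_urls(["078340"]):
        print(code, url)
```

Each generated URL can then be handed to whichever fetching approach ends up working for the JavaScript-generated part.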

Is there a good way to scrape the cash flow data? I tried writing a script, but it is hard to make it work with the other scraping code I use.

+8
javascript python web-scraping screen-scraping
2 answers

There is also dryscrape (a library written by me, so the recommendation is a bit biased, obviously :), which uses a fast WebKit-based in-memory browser. It also understands JavaScript, but is much more lightweight than Selenium.
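A minimal sketch of how this might look, assuming dryscrape is installed and that its `Session` object offers `visit()` and `body()` as I recall them. The helper names `render_with_session` and `fetch_with_dryscrape` are invented for this example and are not part of the library.

```python
def render_with_session(session, url):
    """Visit url with a browser-like session and return the rendered HTML.

    The session only needs visit(url) and body() methods, which is the
    interface a dryscrape Session is assumed to provide here.
    """
    session.visit(url)
    return session.body()


def fetch_with_dryscrape(url):
    """Create a dryscrape session and return the page HTML after JavaScript ran."""
    # Imported lazily so the helper above can be used without dryscrape.
    import dryscrape

    session = dryscrape.Session()
    return render_with_session(session, url)
```

Calling `fetch_with_dryscrape` on the summary URL from the question would return the HTML as the in-memory browser sees it, which can then be parsed like any static page.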

+9

If you need to scrape the contents of a page that is updated using AJAX, and you do not control that AJAX interface, I would use the Selenium browser automation tool for this task:

http://code.google.com/p/selenium/

  • Selenium has Python bindings

  • It launches a real browser instance, so it can execute and scrape exactly the same thing that you see with your own eyes.

  • Retrieve the HTML document content after the AJAX updates through the Selenium API

  • Use lxml with XPath / CSS selectors to parse the relevant parts of the document.
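The steps above could be sketched roughly like this, assuming Selenium with Firefox and lxml are installed. The function names, the `table` CSS selector used for waiting, and the `//table//tr` XPath are assumptions for illustration; the real page may need different selectors and an extra click on the cash flow tab.

```python
def fetch_rendered_html(url, wait_css="table", timeout=10):
    """Load url in a real browser and return the HTML after scripts have run."""
    # Imported lazily so the parsing helper below works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are installed
    try:
        driver.get(url)
        # Wait until at least one element matching wait_css is in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()


def extract_table_rows(html, row_xpath="//table//tr"):
    """Return the text of every header/data cell, one list per table row."""
    from lxml import html as lxml_html

    doc = lxml_html.fromstring(html)
    rows = []
    for tr in doc.xpath(row_xpath):
        cells = [cell.text_content().strip() for cell in tr.xpath("./th|./td")]
        if cells:  # skip rows without any cells
            rows.append(cells)
    return rows
```

Keeping the browser step and the parsing step separate means the lxml part can be tried out on saved HTML without starting a browser each time.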

+1
