How can I scrape a page with Python urlopen after all search results have finished loading?

I am trying to scrape flight information (including itinerary and price information, etc.) from http://flight.qunar.com/ using python3 and BeautifulSoup. Below is the Python code I am using. With this code, I tried to retrieve the flight information from Beijing (εŒ—δΊ¬) to Lijiang (丽江) on 2012-07-25.

    import urllib.parse
    import urllib.request
    from bs4 import BeautifulSoup

    url = 'http://flight.qunar.com/site/oneway_list.htm'
    values = {'searchDepartureAirport': 'εŒ—δΊ¬',
              'searchArrivalAirport': '丽江',
              'searchDepartureTime': '2012-07-25'}
    encoded_param = urllib.parse.urlencode(values)
    full_url = url + '?' + encoded_param
    response = urllib.request.urlopen(full_url)
    soup = BeautifulSoup(response, 'html.parser')
    print(soup.prettify())

What I get back is the initial page, on which the search results are still being loaded. What I want is the final page after the search results have finished loading. So how can I achieve this goal with Python?

+4
1 answer

The problem is actually quite tricky: the site's content is generated dynamically and loaded via JavaScript, but urllib only gets you what you would see in a browser with JavaScript turned off. So what can we do?

Use a browser-automation framework to fully render the web page (these are essentially headless, automated browsers built for testing and scraping).

Or, if you want a (semi-)pure Python solution, use PyQt4.QtWebKit to render the page. It works something like this:

    import sys
    import signal
    from PyQt4.QtCore import QUrl, SIGNAL
    from PyQt4.QtGui import QApplication
    from PyQt4.QtWebKit import QWebPage

    url = "http://www.stackoverflow.com"

    def page_to_file(ok):
        # loadFinished(bool) passes a success flag, not the page;
        # the page object comes from the enclosing scope.
        with open("output", "w") as f:
            f.write(page.mainFrame().toHtml())
        app.quit()

    app = QApplication(sys.argv)
    page = QWebPage()
    signal.signal(signal.SIGINT, signal.SIG_DFL)
    page.connect(page, SIGNAL('loadFinished(bool)'), page_to_file)
    page.mainFrame().load(QUrl(url))
    sys.exit(app.exec_())

Edit: There is a nice explanation of how this works here.

PS: You might want to look into requests instead of urllib :)
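For reference, the question's urllib query could be written with requests, which builds the query string from a plain dict. This is a sketch assuming the requests library is installed; the actual GET is commented out since it needs network access, but we can still inspect the URL requests would send:

```python
import requests

# Same one-way search as the urllib version in the question.
url = 'http://flight.qunar.com/site/oneway_list.htm'
params = {
    'searchDepartureAirport': 'εŒ—δΊ¬',
    'searchArrivalAirport': '丽江',
    'searchDepartureTime': '2012-07-25',
}

# resp = requests.get(url, params=params)            # network call
# soup = BeautifulSoup(resp.text, 'html.parser')     # then parse as before

# Without hitting the network, check the URL requests would build:
prepared = requests.Request('GET', url, params=params).prepare()
print(prepared.url.split('?')[0])
# → http://flight.qunar.com/site/oneway_list.htm
```

Note that this fetches the same JavaScript-free initial page as urllib; requests is just a nicer API, not a fix for the dynamic-content problem.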

+7
