The problem is actually quite complicated: the site uses dynamically generated content that is loaded via JavaScript, whereas urllib only retrieves what a browser would show with JavaScript turned off. So what can we do?
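To illustrate the limitation, here is a minimal sketch using Python 3's urllib.request (urllib2 in Python 2; the URL and User-Agent header are just placeholders). urllib hands back the HTML exactly as the server sends it, and no JavaScript ever runs:

```python
import urllib.request  # `urllib2` in Python 2

# Build the request locally; urlopen(req) would return the static
# HTML exactly as served -- any content injected afterwards by
# JavaScript would simply be missing from the response body.
req = urllib.request.Request(
    "http://www.stackoverflow.com",
    headers={"User-Agent": "Mozilla/5.0"},
)
print(req.get_full_url())  # -> http://www.stackoverflow.com
```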
Use a headless browser automation tool to fully render the page (these tools are essentially headless, automated browsers built for testing and scraping).
Or, if you want a (semi-)clean pure-Python solution, use PyQt4.QtWebKit to render the page. It works something like this:
import sys
import signal

from PyQt4.QtCore import QUrl, SIGNAL
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

url = "http://www.stackoverflow.com"

app = QApplication(sys.argv)  # QApplication needs the argument list
page = QWebPage()

def page_to_file(ok):
    # loadFinished(bool) passes a success flag, not the page;
    # the page object comes from the enclosing scope.
    with open("output", "w") as f:
        f.write(page.mainFrame().toHtml())
    app.quit()

# Let Ctrl+C terminate the Qt event loop.
signal.signal(signal.SIGINT, signal.SIG_DFL)

page.connect(page, SIGNAL('loadFinished(bool)'), page_to_file)
page.mainFrame().load(QUrl(url))

sys.exit(app.exec_())
Edit: there is a nice explanation of how this works here.
PS: you might want to look at requests instead of urllib :)
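A minimal sketch of why requests is pleasant (it is a third-party package, pip install requests, so this assumes it is available): a Session keeps headers, cookies, and connection pooling across calls, which urllib makes you assemble by hand.

```python
import requests  # third-party: pip install requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})
# resp = session.get("http://www.stackoverflow.com")  # needs network
# resp.text would still be the static HTML -- requests, like urllib,
# does not execute JavaScript either.
```

Note that requests only replaces the urllib plumbing; for JavaScript-generated content you still need a rendering approach like the ones above.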