Python urllib2 - wait for the page to finish loading / redirecting before scraping?

I am learning to make web scrapers and want to scrape TripAdvisor for a personal project, grabbing the html with urllib2. However, I ran into a problem: with the code below, the html I get back is not correct, as the page takes a second to redirect (you can verify this by visiting the url) - instead I get the code from the page that briefly appears first.

Is there any way to make sure the page has finished loading / redirecting before I grab the contents of the website?

    import urllib2
    from bs4 import BeautifulSoup

    bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
    soup = BeautifulSoup(bostonPage)
    print soup.prettify()

Edit: the answer below is helpful; however, in the end this is what solved my problem: https://stackoverflow.com/a/464829/

+8
python urllib2
1 answer

The problem is not the redirecting: the page modifies its content using javascript, but urllib2 has no JS engine, it just GETs the data. If you disable javascript in your browser, you will notice that it loads essentially the same content that urllib2 returns:

    import urllib2
    from bs4 import BeautifulSoup

    bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
    html = bostonPage.read()
    soup = BeautifulSoup(html)
    open('test.html', 'w').write(html)

Open test.html, then visit the site with JS disabled in your browser (the easiest way in Firefox is Tools → Options → Content → uncheck "Enable JavaScript"): the two produce virtually identical results.
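The point above can be illustrated offline. In this sketch the page markup and the `results` id are invented for the example: the `<script>` that would rewrite the page arrives as plain text, and a parser-only tool never executes it (Python 3 syntax and the stdlib `html.parser` are used here for brevity).

```python
from html.parser import HTMLParser

# Hypothetical page: a placeholder div plus a script that a real browser
# would run to fill in the actual content.
page = """
<html><body>
  <div id="results">Loading...</div>
  <script>
    document.getElementById('results').innerHTML = 'Hotel list here';
  </script>
</body></html>
"""

class StaticTextParser(HTMLParser):
    """Collects the text a parser actually sees, skipping script bodies."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script and data.strip():
            self.text.append(data.strip())

p = StaticTextParser()
p.feed(page)
print(p.text)  # ['Loading...'] - the JS-generated text never appears
```

Only the placeholder survives, which is exactly what urllib2 hands to BeautifulSoup.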

So, what can we do? Well, first we should check whether the site offers an API before resorting to scraping: http://www.tripadvisor.com/help/what_type_of_tripadvisor_content_is_available

A travel / hotel API? It seems they have one, albeit with some limitations.

But if we still need to scrape a JS-driven page, we can use selenium http://seleniumhq.org/ , which is mainly used for testing; it is not hard to pick up and has pretty good docs.
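As a sketch of what the selenium route could look like: the helper below is hypothetical, assumes the third-party selenium package plus a working Firefox driver, and the `document.readyState` wait is one common idiom rather than the only option.

```python
def fetch_rendered(url, timeout=10):
    """Return the page source after the browser has run the page's JS.

    Hypothetical helper: requires the third-party selenium package and a
    Firefox driver available on PATH.
    """
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Wait (up to `timeout` seconds) until the browser reports the
        # document fully loaded, which also covers the brief redirect.
        WebDriverWait(driver, timeout).until(
            lambda d: d.execute_script("return document.readyState") == "complete")
        return driver.page_source
    finally:
        driver.quit()
```

The returned string can then be handed to BeautifulSoup exactly as before.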

I also found this: Scraping websites with Javascript enabled? and this: http://grep.codeconsult.ch/2007/02/24/crowbar-scrape-javascript-generated-pages-via-gecko-and-rest/

Hope this helps.

As a note:

    >>> import urllib2
    >>> from bs4 import BeautifulSoup
    >>>
    >>> bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
    >>> value = bostonPage.read()
    >>> soup = BeautifulSoup(value)
    >>> open('test.html', 'w').write(value)
+5
