The crux of the problem is not the redirect. The page modifies its content using JavaScript, but urllib2 has no JS engine; it just GETs data. If you disable JavaScript in your browser, you will notice that it loads basically the same content as urllib2 returns:
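    import urllib2
    from BeautifulSoup import BeautifulSoup

    # Plain HTTP GET; no JavaScript is executed on the response.
    bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
    html = bostonPage.read()  # read the response once and reuse the string
    soup = BeautifulSoup(html)

    # Save exactly what urllib2 received so it can be opened in a browser.
    open('test.html', 'w').write(html)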
Open test.html, then disable JS in your browser and load the original page. The easiest way in Firefox is Content → uncheck "Enable JavaScript". Both produce identical result sets.
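The same effect can be reproduced offline. What follows is only a minimal sketch, written for Python 3 and its standard `html.parser` module rather than BeautifulSoup so that it has no dependencies. The sample page, the hotel name and the helper `visible_text` are made up for illustration; they are not taken from TripAdvisor. It shows that a client which only parses the downloaded markup never sees content a script would have inserted.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a non-JS client can read from the raw markup."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up page: the result list only exists after the script has run.
SAMPLE = """
<html><body>
<h1>Search results</h1>
<div id="results"></div>
<script>
document.getElementById("results").innerHTML = "<p>Example Inn</p>";
</script>
</body></html>
"""

print(visible_text(SAMPLE))  # prints: Search results
```

The `results` div stays empty because nothing ever executes the script, which is the situation urllib2 is in.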
So, what can we do? Well, first we should check whether the site offers an API, since scraping tends to be frowned upon: http://www.tripadvisor.com/help/what_type_of_tripadvisor_content_is_available
Travel/Hotel API? It looks like they might offer one, albeit with some restrictions.
But if we still need to scrape it, with JS, we can use selenium http://seleniumhq.org/. It is mainly used for testing, but it is easy to use and has fairly good docs.
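A rough sketch of the selenium route, under several assumptions: the Selenium 4 Python bindings are installed, Firefox and its driver (geckodriver) are available, and the helper name `fetch_rendered_html` and its parameters are my own invention. The CSS selector in the usage comment is a placeholder; you would have to inspect the real page to find an element that only appears once the results have loaded.

```python
def fetch_rendered_html(url, wait_css=None, timeout=15):
    """Return the page HTML after a real browser has run its JavaScript."""
    # Imported here so this file can be loaded where selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")  # run Firefox without opening a window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        if wait_css:
            # Block until an element matching the selector exists, or time out.
            WebDriverWait(driver, timeout).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
            )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors


# Hypothetical usage ("div.listing" is a placeholder selector):
# html = fetch_rendered_html(
#     "http://www.tripadvisor.com/HACSearch?geo=34438", wait_css="div.listing")
```

The returned string can then be handed to BeautifulSoup exactly like the output of urllib2, except that it now contains the script-generated content.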
I also found this: "scraping sites with Javascript enabled?" and this: http://grep.codeconsult.ch/2007/02/24/crowbar-scrape-javascript-generated-pages-via-gecko-and-rest/
Hope this helps.
As a note:
    >>> import urllib2
    >>> from bs4 import BeautifulSoup
    >>>
    >>> bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
    >>> value = bostonPage.read()
    >>> soup = BeautifulSoup(value)
    >>> open('test.html', 'w').write(value)