How to scrape a website URL that is bound to a click event?

I am trying to scrape / retrieve a company's / hotel's website from TripAdvisor.com web pages. I do not see the website URL while inspecting the page. Any idea how I can extract the website URL using Python? I apologize in advance, since I only recently started web scraping in Python. Thanks.

E.g., look at the two red arrows in the image: when I click the link to the website, it takes me to http://www.i-love-my-india.com/ - this is exactly what I want to extract using Python.

[Screenshot: the TripAdvisor page, with the website link marked by two red arrows]

Tags: python, extract, web-scraping, scrapy
3 answers

Try using Selenium:

    import time
    from selenium import webdriver

    # Must install geckodriver (it handles your browser) - see the instructions at
    # http://selenium-python.readthedocs.io/installation.html.
    # Change the path to wherever your geckodriver file is.
    browser = webdriver.Firefox(executable_path="C:\\Users\\Vader\\geckodriver.exe")

    browser.get('https://www.tripadvisor.co.uk/Attraction_Review-g304551-d4590508-Reviews-Ashok_s_Taxi_Tours-New_Delhi_National_Capital_Territory_of_Delhi.html')
    browser.find_element_by_css_selector('.blEntry.website').click()
    # browser.window_handles  # The result is 2 open tabs.

    browser.switch_to.window(browser.window_handles[1])  # switches the browser
                                                         # to the second tab
    time.sleep(1)  # Reading the URL immediately gave a 'blank' result, so a
                   # small delay was added and it worked (I really do not know why).

    res = browser.current_url  # the URL
    print(res)

    browser.quit()  # Closes the browser
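Note that in current Selenium (4+) the find_element_by_* helpers were removed and executable_path is gone as well. A rough modern equivalent of the block above (a sketch assuming Selenium 4.6+, whose Selenium Manager locates geckodriver by itself) would be:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Selenium 4.6+ finds geckodriver automatically, so no executable_path is needed.
    browser = webdriver.Firefox()
    browser.get('https://www.tripadvisor.co.uk/Attraction_Review-g304551-d4590508-'
                'Reviews-Ashok_s_Taxi_Tours-New_Delhi_National_Capital_Territory_of_Delhi.html')
    browser.find_element(By.CSS_SELECTOR, '.blEntry.website').click()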



If you inspect the element, you will notice that the redirect URL is there (in a data-ahref attribute), but it is encoded, and it gets decoded somewhere in the JS sources. Unfortunately, those are minified and obfuscated, so finding the decoder function would be difficult.
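As a quick illustration (my sketch, not part of either answer's approach), you can pull the raw attribute without a browser using requests and BeautifulSoup; the attribute selector is an assumption based on the markup described above, and the printed value is still encoded, which is exactly the limitation:

    import requests
    from bs4 import BeautifulSoup

    url = ('https://www.tripadvisor.co.uk/Attraction_Review-g304551-d4590508-'
           'Reviews-Ashok_s_Taxi_Tours-New_Delhi_National_Capital_Territory_of_Delhi.html')
    # A browser-like User-Agent header helps avoid being served an error page.
    html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text
    soup = BeautifulSoup(html, 'html.parser')

    element = soup.select_one('[data-ahref]')  # first element carrying the encoded link
    if element is not None:
        print(element['data-ahref'])  # still encoded - the decoder lives in the minified JS

So you have two options: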

Follow the redirect

This is what Roberval _T_ suggested in his answer: click the element, wait a moment for the page to load in another tab, and take the URL. That is a perfectly valid answer which, in my opinion, deserves an upvote; however, here is a small technique that I always try when the data is, for some reason, not available:

Scrape the mobile web page

The obvious advantage of scraping mobile pages is that they are much lighter than desktop ones. But often the mobile site also exposes data that the desktop version tries to hide for some reason. In this case, all the information (address, homepage, phone) is present directly in the mobile version and can be captured without explicitly following the redirect URL. This is what the page looks like when I launch Selenium with a mobile user agent:

[Screenshot: the mobile version of the page, with address, website and phone visible directly]

Sample code using an iPhone user agent:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    url = 'https://www.tripadvisor.co.uk/Attraction_Review-g304551-d4590508-Reviews-Ashok_s_Taxi_Tours-New_Delhi_National_Capital_Territory_of_Delhi.html'

    chrome_options = Options()
    chrome_options.add_argument('--user-agent=Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) CriOS/56.0.2924.75 Mobile/14E5239e Safari/602.1')

    driver = webdriver.Chrome(chrome_options=chrome_options)
    driver.get(url)

    element = driver.find_element_by_css_selector('div.website.contact_link')
    link = element.text

    driver.quit()
    print(link)

I would recommend using Selenium.

My answer can be seen as an improvement on what @Roberval T. suggested. I find his answer very good for this particular case.

This is my solution. I will point out some of the differences and why I think you should consider them:

    import sys

    # Selenium
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import TimeoutException

    # I would use argparse, for example
    try:
        assert len(sys.argv) == 2
        url = sys.argv[1]
    except AssertionError:
        # Invalid arguments
        sys.exit()

    # Set up the driver
    driver = webdriver.Chrome()
    driver.get(url)

    # Try to load the page and wait until it has loaded
    try:
        poll_frequency = 5  # used here as the wait's timeout, in seconds
        data_section_id = "taplc_location_detail_header_attractions_0"
        data_section = WebDriverWait(driver, poll_frequency).until(
            EC.presence_of_element_located((By.ID, data_section_id)))
    except TimeoutException:
        # Could not load the page
        sys.exit()

    # Get the third child (relative to the data section div that we got by ID)
    try:
        third_child = data_section.find_elements_by_xpath("./*")[2]
    except IndexError:
        sys.exit()

    # Get the child immediately under that (that is how the structure looks)
    container_div = third_child.find_elements_by_xpath("./*")[0]
    clickable_element = container_div.find_elements_by_xpath("./*")[3]

    # Click the node
    clickable_element.click()

    # Switch tabs
    driver.switch_to.window(driver.window_handles[1])

    try:
        # Catch Selenium's TimeoutException here, not the built-in TimeoutError,
        # which WebDriverWait never raises.
        new_page = WebDriverWait(driver, poll_frequency).until(
            EC.presence_of_element_located((By.TAG_NAME, "body")))
    except TimeoutException:
        sys.exit()

    print(driver.current_url)
    assert driver.current_url == "http://www.i-love-my-india.com/"

    driver.quit()
  • First, in my opinion, you should use Selenium's dedicated wait mechanisms instead of time.sleep(). They let you fine-tune your scraper and make it more reliable. I would suggest you explore WebDriverWait (also see the sketch after this list).

  • Secondly, I prefer to avoid class selectors. I am not saying they are wrong, but experience has shown me that they change easily, and the same class is often used in several places (that is what classes are for). In this particular case selecting by CSS class works, because the class is used in only one place.

    • What happens if in the next version the same class is used elsewhere?

    • Although the structure is not a guarantee, it is probably less likely to change.

  • Use Chrome. Starting with version 59, Google Chrome has a headless option, which in my opinion makes it much easier to work with than Firefox: to run Firefox headlessly you need to install and run an X server on the production machine and connect the Firefox instance to that server via geckodriver. You can skip all of that with Chrome (see the sketch below).
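To tie the last two points together, here is a sketch of mine (using the question's URL and the same pre-Selenium-4 API as the answers above) that runs Chrome headlessly and replaces time.sleep() with explicit WebDriverWait calls:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By

    url = ('https://www.tripadvisor.co.uk/Attraction_Review-g304551-d4590508-'
           'Reviews-Ashok_s_Taxi_Tours-New_Delhi_National_Capital_Territory_of_Delhi.html')

    chrome_options = Options()
    chrome_options.add_argument('--headless')     # no display or X server needed
    chrome_options.add_argument('--disable-gpu')  # historically recommended with --headless

    driver = webdriver.Chrome(chrome_options=chrome_options)
    driver.get(url)
    driver.find_element_by_css_selector('.blEntry.website').click()

    # Wait until the second tab actually exists instead of sleeping a fixed amount.
    WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
    driver.switch_to.window(driver.window_handles[1])

    # Wait until the new page has a body before reading its URL.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, 'body')))

    print(driver.current_url)
    driver.quit()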


Hope this helps!

