A simple webpage change or a deleted button renders my scraped data useless

I scrape many pages where the scrape breaks whenever a button is removed or the page changes even slightly.

This problem comes up a lot and I'm not sure how to get around it. For example, when the team, the odds and everything else disappear, the XPath //*[contains(@class, "sport-block") and .//div/div]//*[contains(@class, "purple-ar")] still grabs the link as expected, but not the team and the odds, which leaves the scraped data worthless.

I originally used CSS selectors, but I can't see how to do this within CSS's limitations.

The simple XPath I'm after:

 //*[contains(@class, "sport-block") and .//div/div]//*[contains(@class, "purple-ar")] 

The problem remains.

I'm not very familiar with the ancestor and preceding-sibling axes, but I'm thinking something like this XPath:

e.g. //a/ancestor::div[contains(@class, 'xpath')]/preceding-sibling::div[contains(@class, 'xpath')]//a

which for my page would be something like:

 //a/ancestor::div[contains(@class, 'table-grid')]/preceding-sibling::span[contains(@class, 'sprite-icon arrow-icon arrow-right arrow-purple')]//a 

might solve it (assuming I can get it to work).

Here is a sample of the HTML:

 <td class="top-subheader uppercase">
   <span> English Premier League Futures </span>
 </td>
 </tr>
 <tr>
   <td class="content">
     <div class="titles">
       <span class="match-name">
         <a href="/sports-betting/soccer/united-kingdom/english-premier-league-futures/outright-markets-20171226-616961-22079860"> Outright Markets </a>
       </span>
       <span class="tv"> 26/12 </span>
       <span class="other-matches">
         <a href="/sports-betting/soccer/united-kingdom/english-premier-league-futures/outright-markets-20171226-616961-22079860" class="purple-arrow">5 Markets
           <span class="sprite-icon arrow-icon arrow-right arrow-purple"></span>
         </a>
       </span>
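
Based on that HTML, my (untested) guess at using those axes would be something like:

 //a[contains(@class, "purple-arrow")]/ancestor::span[contains(@class, "other-matches")]/preceding-sibling::span[contains(@class, "match-name")]/a 

i.e. walk up from the arrow link to its other-matches span, then across to the match-name span that sits before it in the same row, but I'm not sure I'm using the axes correctly.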

Any ideas how I can get around this problem? Thanks.

Current output:

 Steaua Bucharest Link for below Celtic Link for below Napoli Link for below Lyon Link for below 

Desired:

 Steaua Bucharest Link for Steaua Bucharest Celtic Link for Celtic Napoli Link for Napoli Lyon Link for Lyon 


Any ideas how I can get around this, or even a more targeted approach? It's a constant problem. Thanks.

python css xpath selenium web-scraping
1 answer

To keep the data for each group intact, I iterate over the group containers and use nested (relative) XPaths inside each one to capture the data. An XPath becomes relative to the current element when you prefix it with a dot (.//...).
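
For illustration, the bare pattern looks like this (a minimal sketch only: the class names are lifted from your XPath and HTML snippet, so adjust them to whatever the real containers are called):

 from selenium import webdriver
 from selenium.common.exceptions import NoSuchElementException

 driver = webdriver.Chrome()
 driver.get('https://crownbet.com.au/sports-betting/soccer')

 # iterate over each group container, then search only inside that container
 # with relative XPaths -- the leading "." anchors the search to `block`
 for block in driver.find_elements_by_xpath('//div[contains(@class, "sport-block")]'):
     try:
         team = block.find_element_by_xpath('.//span[@class="match-name"]/a').text
     except NoSuchElementException:
         team = None  # keep the row; only this field is missing
     try:
         link = block.find_element_by_xpath(
             './/a[contains(@class, "purple-arrow")]').get_attribute('href')
     except NoSuchElementException:
         link = None
     print(team, link)

This way the team, odds and link either come from the same block or are recorded as None, so they can never get shifted against each other when one element is missing.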

I also cleaned things up a bit:

  • You were collecting a bunch of links up front and using them to iterate through the pages. I replaced this with a while loop that clicks "Next Page" until it runs out of pages.
  • I added try/except around each field so a missing element doesn't throw away the whole row, and we capture as much data as possible.
  • I added a sleep on each new page so the data has time to load (the timing can be adjusted based on your network connection).

Let me know if this solves the data consistency issues.

 import csv
 import time

 from selenium import webdriver
 from selenium.common.exceptions import TimeoutException, NoSuchElementException
 from selenium.webdriver.common.by import By
 from selenium.webdriver.support import expected_conditions as EC
 from selenium.webdriver.support.ui import WebDriverWait as wait

 driver = webdriver.Chrome()
 driver.set_window_size(1024, 600)
 driver.maximize_window()

 driver.get('https://crownbet.com.au/sports-betting/soccer')

 # hide the sticky headers so they don't cover elements we scroll to and click
 header = driver.find_element_by_tag_name('header')
 driver.execute_script('arguments[0].hidden="true";', header)
 header1 = driver.find_element_by_css_selector('div.row.no-margin.nav.sticky-top-nav')
 driver.execute_script('arguments[0].hidden="true";', header1)

 # XPaths for the data: one absolute XPath for the group containers,
 # and relative XPaths (note the leading dot) for the fields inside each group
 groups = '//div[@id="sports-matches"]/div[@class="container-fluid"]'
 xp_match_link = './/span[@class="match-name"]/a'
 xp_bp1 = './/div[@data-id="1"]//span[@class="bet-party"]'
 xp_ba1 = './/div[@data-id="1"]//span[@class="bet-amount"]'
 xp_bp3 = './/div[@data-id="3"]//span[@class="bet-party"]'
 xp_ba3 = './/div[@data-id="3"]//span[@class="bet-amount"]'

 while True:
     try:
         # wait for the data to populate the tables
         wait(driver, 5).until(EC.element_to_be_clickable((By.XPATH, xp_bp1)))
         time.sleep(2)

         data = []
         for elem in driver.find_elements_by_xpath(groups):
             # each field gets its own try/except so one missing element
             # doesn't invalidate the rest of the row
             try:
                 match_link = elem.find_element_by_xpath(xp_match_link).get_attribute('href')
             except NoSuchElementException:
                 match_link = None
             try:
                 bp1 = elem.find_element_by_xpath(xp_bp1).text
             except NoSuchElementException:
                 bp1 = None
             try:
                 ba1 = elem.find_element_by_xpath(xp_ba1).text
             except NoSuchElementException:
                 ba1 = None
             try:
                 bp3 = elem.find_element_by_xpath(xp_bp3).text
             except NoSuchElementException:
                 bp3 = None
             try:
                 ba3 = elem.find_element_by_xpath(xp_ba3).text
             except NoSuchElementException:
                 ba3 = None
             data.append([match_link, bp1, ba1, bp3, ba3])

         print(data)

         # write this page's rows before paginating, so the last page isn't lost
         with open('test.csv', 'a', newline='', encoding='utf-8') as outfile:
             writer = csv.writer(outfile)
             for row in data:
                 writer.writerow(row)

         # go to the next page; NoSuchElementException here ends the loop
         element = driver.find_element_by_xpath('//span[text()="Next Page"]')
         driver.execute_script('arguments[0].scrollIntoView();', element)
         wait(driver, 5).until(EC.element_to_be_clickable((By.XPATH, '//span[text()="Next Page"]')))
         element.click()
     except TimeoutException as ex:
         pass
     except NoSuchElementException as ex:
         print(ex)
         break
