I am scraping a site where a lot of the content is generated by javascript when the page is requested. As a result, the traditional scraping approaches (BeautifulSoup, etc.) do not work for my purposes (at least I couldn't get them to work; all the important data sits in the javascript-generated parts), so I started using Selenium WebDriver. I need to scrape several hundred pages, each of which contains 10 to 80 data points (each with about 12 fields), so it is important that this script (is that the correct terminology?) can run for quite a while without my needing to babysit it.
I have the code working for a single page, and I have a control section that tells the scraping section which page to scrape. The problem is that sometimes the javascript-generated parts of the page load and sometimes they don't (roughly 1 load in 7). A refresh usually fixes things, but sometimes the refresh hangs the webdriver, and therefore the python runtime as well. Annoyingly, when it hangs like this the code never times out. What is going on?
Here is a stripped down version of my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException, TimeoutException
import time, re, random, csv
from collections import namedtuple
def main(url_full):
    driver = webdriver.Firefox()
    driver.implicitly_wait(15)
    driver.set_page_load_timeout(30)

    HealthPlan = namedtuple("HealthPlan", ("State, County, FamType, Provider, PlanType, Tier,") +
                            (" Premium, Deductible, OoPM, PrimaryCareVisitCoPay, ER, HospitalStay,") +
                            (" GenericRx, PreferredPrescription, RxOoPM, MedicalDeduct, BrandDrugDeduct"))

    pageNotLoaded = True
    while pageNotLoaded:
        try:
            # load the page, then give the javascript some time to populate the data
            driver.get(url_full)
            time.sleep(6 + abs(random.normalvariate(1.8, 3)))
        except TimeoutException:
            # page load timed out: close the browser, pause, and request the page again
            driver.quit()
            time.sleep(3 + abs(random.normalvariate(1.8, 3)))
            driver.get(url_full)
            time.sleep(6 + abs(random.normalvariate(1.8, 3)))

        try:
            # check whether the javascript actually rendered the element I scrape from
            driver.find_element_by_xpath('//*[@id="showAll"]').text
        except NoSuchElementException:
            # element missing: refresh and hope the javascript loads this time
            try:
                driver.refresh()
            except TimeoutException:
                driver.quit()
                time.sleep(3 + abs(random.normalvariate(1.8, 3)))
                driver.get(url_full)
                time.sleep(6 + abs(random.normalvariate(1.8, 3)))
        pageNotLoaded = False

    scrapePage()
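For what it's worth, this is roughly the explicit-wait approach I've been considering in place of the fixed sleeps. It's only a minimal sketch: showAll is the element id I check for above, and the 30-second timeout is a guess, not something I've tuned:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def wait_for_page(driver, url, timeout=30):
    # load the page and block until the javascript has rendered the element I need
    driver.get(url)
    try:
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.ID, "showAll")))
        return True
    except TimeoutException:
        # the javascript never produced the element within the timeout
        return False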
I have searched for similar problems many times, and I don't think anyone else has posted about this here or on the other sites I looked at. I am using Python 2.7 and Selenium 2.39.0, and I am trying to scrape Healthcare.gov's premium-estimate pages.
EDIT2: I am running Windows 7 64-bit and Firefox 17.
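In case it helps, this is the kind of restart-on-timeout wrapper I've been thinking about as a workaround. It's just a sketch under the assumption that a fresh browser per attempt avoids the hang; I haven't confirmed that it actually does, since in my case the timeout sometimes never fires at all:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

def get_with_restart(url, attempts=3, load_timeout=30):
    # start a fresh browser for each attempt so a hung session can't block later ones
    for attempt in range(attempts):
        driver = webdriver.Firefox()
        driver.set_page_load_timeout(load_timeout)
        try:
            driver.get(url)
            return driver  # caller is responsible for driver.quit()
        except TimeoutException:
            # this load timed out: throw the whole browser away and retry
            driver.quit()
    raise TimeoutException("%s failed to load after %d attempts" % (url, attempts))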