Splinter saves bodiless html

Question

I am using the splinter 0.7.3 module with Python 2.7.2 on Linux to scrape a directory listing on a website, using the default Firefox browser.

This is the piece of code that iterates through the paginated listing by clicking the "Next" link in the html.

    links = True
    i = 0
    while links:
        with open('html/register_%03d.html' % i, 'w') as f:
            f.write(browser.html.encode('utf-8'))
        links = browser.find_link_by_text('Next')
        print 'links:', links
        if links:
            links[0].click()
        i += 1

I know that links work, as I see output that looks like this:

    links: [<splinter.driver.webdriver.WebDriverElement object at 0x2e6da10>, <splinter.driver.webdriver.WebDriverElement object at 0x2e6d710>]
    links: [<splinter.driver.webdriver.WebDriverElement object at 0x2e6d5d0>, <splinter.driver.webdriver.WebDriverElement object at 0x2e6d950>]
    links: [<splinter.driver.webdriver.WebDriverElement object at 0x2e6d710>, <splinter.driver.webdriver.WebDriverElement object at 0x2e6dcd0>]
    links: []

When the html of each page is saved using f.write(browser.html.encode('utf-8')), it works fine for the first page. On subsequent pages, although I can see the pages rendered in Firefox, the html/register_...html file is either empty or missing the body tag:

    <!DOCTYPE html>
    <!--[if lt IE 7]> <html prefix="og: http://ogp.me/ns#" class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en-gb"> <![endif]-->
    <!--[if IE 7]> <html prefix="og: http://ogp.me/ns#" class="no-js lt-ie9 lt-ie8" lang="en-gb"> <![endif]-->
    <!--[if IE 8]> <html prefix="og: http://ogp.me/ns#" class="no-js lt-ie9" lang="en-gb"> <![endif]-->
    <!--[if gt IE 8]><!--> <html xmlns="http://www.w3.org/1999/xhtml" lang="en-gb" class="no-js" prefix="og: http://ogp.me/ns#"><!--<![endif]--><head>
    <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible" />
    ...
    </style>
    <script src="/media/com_magebridge/js/frototype.min.js" type="text/javascript"></script></head></html>

Is this a known bug in how splinter saves html? Is there a better way to do this?

+6

python screen-scraping splinter

Chrisguest Sep 17 '15 at 21:54

1 answer

alecxe · Accepted Answer · 2016-01-01T06:04:47+0000

This looks like a timing issue: you are reading the page source before the page has fully loaded. There are several ways to solve the problem:

wait for the presence of body:

 browser.is_element_present_by_tag("body", wait_time=5)

increase the page load timeout - set this right after the initialization of the browser object:
```
 browser.driver.set_page_load_timeout(10) # 10 seconds 
```
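
To show how the first suggestion could fit into the loop from the question, here is a minimal sketch that waits for body before every save. The function name `save_paginated_html` and its parameters `out_dir`, `link_text` and `wait_time` are hypothetical; the `browser` argument is assumed to expose the splinter 0.7.x calls already used above (`is_element_present_by_tag`, `find_link_by_text`, `html`), and later splinter releases may expose link lookup differently.

```python
import io
import os


def save_paginated_html(browser, out_dir, link_text="Next", wait_time=5):
    """Save browser.html for every page, following `link_text` until it is gone.

    Hypothetical helper: `browser` is assumed to behave like a splinter 0.7.x
    browser object. Returns the list of file paths written.
    """
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)

    saved = []
    page = 0
    while True:
        # Wait (up to wait_time seconds) for <body> before reading the source.
        if not browser.is_element_present_by_tag("body", wait_time=wait_time):
            raise RuntimeError(
                "no <body> on page %d after %s seconds" % (page, wait_time))

        path = os.path.join(out_dir, "register_%03d.html" % page)
        # io.open with an explicit encoding accepts text on Python 2 and 3,
        # so no manual .encode('utf-8') is needed.
        with io.open(path, "w", encoding="utf-8") as f:
            f.write(browser.html)
        saved.append(path)

        links = browser.find_link_by_text(link_text)
        if not links:
            break  # last page reached
        links[0].click()
        page += 1
    return saved
```

It would be called as `save_paginated_html(browser, 'html')` once the first page is open. Note that the check only confirms that a body element exists; it does not by itself prove that the new page has replaced the old one, so a site-specific condition may still be needed.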
