Mechanize and Python: clicking href="javascript:void(0);" links and getting the response

I need to scrape some data from a page where I fill out a search form (I have already done that part with mechanize). The problem is that the results are spread over many pages, and I am having trouble getting the data from those pages.

There is no problem getting the first page of results, because it is displayed right after the search: I just submit the form and read the response.

I have analyzed the source code of the results page, and it looks like it uses JavaScript and RichFaces (a JSF library with AJAX support, but I may be wrong, since I am not a web expert).

I did, however, figure out how to get to the remaining result pages: I need to click links of this form ( href="javascript:void(0);" , full code below):

 <td class="pageNumber"><span class="rf-ds " id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233"><span class="rf-ds-nmb-btn rf-ds-act " id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_1">1</span><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_2">2</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_3">3</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_4">4</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_5">5</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_6">6</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_7">7</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_8">8</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_9">9</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_10">10</a><a class="rf-ds-btn rf-ds-btn-next" href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_next">ยป</a><a class="rf-ds-btn rf-ds-btn-last" href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_l">ยปยปยปยป</a> <script type="text/javascript">new RichFaces.ui.DataScroller("SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233",function(event,element,data){RichFaces.ajax("SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233",event,{"parameters":{"SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233:page":data.page} ,"incId":"1"} )},{"digitals":{"SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_9":"9","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_8":"8","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_7":"7","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_6":"6","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_5":"5","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_4":"4","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_3":"3","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_1":"1","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_10":"10","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_2":"2"} ,"buttons":{"right":{"SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_next":"next","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_l":"last"} } ,"currentPage":1} )</script></span></td> <td class="pageExport"><script type="text/javascript" src="/opi/javax.faces.resource/download.js?ln=js/component&amp;b="></script><script type="text/javascript"> 

So, I would like to ask if there is a way to click all of these links and fetch all the result pages using mechanize (note that more pages become available after the » symbol). Please keep the answer at a level that a complete beginner in web technologies can follow :)
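From the script at the end of the snippet above, I gather that clicking a page link just makes the DataScroller fire an AJAX POST back to the form, with the parameter SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233:page set to the chosen page. So in principle one could try replaying that POST with mechanize by hand. Below is a sketch of what I mean; the javax.faces.* fields are the standard JSF 2 partial-AJAX parameters (not something I have confirmed against this site), the ViewState should already travel as a hidden field of the form, and whether the server accepts such a hand-made request is exactly what I do not know:

    # a sketch, not a working solution: replay the DataScroller's AJAX
    # POST by hand, since mechanize cannot execute JavaScript
    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.open('https://polon.nauka.gov.pl/opi/aa/drh/zestawienie')

    br.select_form(nr=0)   # the search form; fill in the criteria here
    br.submit()            # first page of results, as before

    scroller = 'SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233'

    # re-select the results form (javax.faces.ViewState is a hidden field
    # in it) and add the partial-AJAX parameters as extra hidden controls
    br.select_form(nr=0)
    for name, value in [('javax.faces.partial.ajax', 'true'),
                        ('javax.faces.source', scroller),
                        (scroller + ':page', '2')]:   # or 'next'
        br.form.new_control('hidden', name, {'value': value})
    br.form.fixup()

    page2 = br.submit().read()   # partial-response XML, if it works at all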

+7
javascript python ajax mechanize mechanize-python
2 answers

First of all, I would stick with selenium here anyway, as this is a pretty "javascript-heavy" website. Note that you can use a headless browser (PhantomJS, or a regular browser inside a virtual display) if necessary.
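For example, a minimal sketch of both options (assuming PhantomJS is on the PATH for the first, and Xvfb plus the pyvirtualdisplay package for the second):

    from selenium import webdriver

    # option 1: a headless browser
    driver = webdriver.PhantomJS()

    # option 2: a real browser inside a virtual display (Linux, needs Xvfb)
    # from pyvirtualdisplay import Display
    # display = Display(visible=0, size=(1024, 768))
    # display.start()
    # driver = webdriver.Firefox()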

The idea here is to paginate 100 rows at a time and click the "»" link for as long as it is present on the page; once it disappears, we have reached the last page and there are no more results to process. To make the solution reliable, we use Explicit Waits: every time we go to the next page, we wait for the loading spinner to become invisible.

Working implementation:

    # -*- coding: utf-8 -*-
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By
    from selenium import webdriver
    from selenium.webdriver.support.select import Select
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Firefox()
    driver.maximize_window()
    driver.get('https://polon.nauka.gov.pl/opi/aa/drh/zestawienie?execution=e1s1')

    wait = WebDriverWait(driver, 30)

    # paginate by 100
    select = Select(driver.find_element_by_id("drhPageForm:drhPageTable:j_idt211:j_idt214:j_idt220"))
    select.select_by_visible_text("100")

    while True:
        # wait until there is no loading spinner
        wait.until(EC.invisibility_of_element_located((By.ID, "loadingPopup_content_scroller")))

        current_page = driver.find_element_by_class_name("rf-ds-act").text
        print("Current page: %s" % current_page)

        # TODO: collect the results

        # proceed to the next page
        try:
            next_page = driver.find_element_by_link_text(u"»")
            next_page.click()
        except NoSuchElementException:
            break
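To fill in the TODO part, something along these lines could work for grabbing the rows of the result table on the current page. The "tbody tr" selector is an assumption on my side, since the question does not show the table markup; check the actual page source:

    # hypothetical helper: collect the visible result rows as lists of
    # cell texts; adjust the selector to the real table markup
    def collect_rows(driver):
        rows = []
        for row in driver.find_elements_by_css_selector("tbody tr"):
            cells = [cell.text for cell in row.find_elements_by_tag_name("td")]
            rows.append(cells)
        return rows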
+4

This works for me; it seems that the full HTML of each result page is available in driver.page_source :

    import time
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get('https://polon.nauka.gov.pl/opi/aa/drh/zestawienie')

    next_id = 'drhPageForm:drhPageTable:j_idt211:j_idt233_ds_next'
    pages = []
    it = 0
    while it < 1795:
        time.sleep(1)
        it += 1
        # the "next" link is not always clickable right away,
        # so keep retrying until the click succeeds
        bad = True
        while bad:
            try:
                driver.find_element_by_id(next_id).click()
                bad = False
            except Exception:
                print('retry')
        # store the full HTML of the page we just navigated to
        page = driver.page_source
        pages.append(page)

Instead of collecting and storing all the HTML first, you can also extract just the data you want as you go, but for that you need lxml or BeautifulSoup .
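For instance, a minimal parsing sketch with BeautifulSoup, assuming the interesting data sits in plain table rows (the "tbody tr" selector is my guess, not taken from the page):

    from bs4 import BeautifulSoup

    results = []
    for page in pages:
        soup = BeautifulSoup(page, 'html.parser')
        for row in soup.select('tbody tr'):
            results.append([cell.get_text(strip=True) for cell in row.find_all('td')])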

EDIT: After running it, I did notice that an error occurs now and then; it was easy enough to just catch the exception and retry.
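A variation that should avoid most retries is to borrow the explicit-wait trick from the other answer and wait for the loading popup to disappear before each click. This sketch continues the snippet above (driver and next_id as defined there); the spinner id is copied from the other answer and may or may not apply here:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    wait = WebDriverWait(driver, 30)
    # wait for the loading popup to go away instead of sleeping blindly
    wait.until(EC.invisibility_of_element_located((By.ID, "loadingPopup_content_scroller")))
    driver.find_element_by_id(next_id).click()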

+2
