Scraping information spread across multiple screens of a web page

I am trying to get some business information from the Internet. Most of the information is on this page: http://appscvs.supercias.gob.ec/portalInformacion/sector_societario.zul

On this page, I need to click the Busqueda de Companias tab, which is where the interesting part begins. After clicking, I get a new screen where I have to select the Nombre option and then enter a string with the company name. For example, I enter the string PROAÑO & ASOCIADOS CIA. LTDA.

Then I need to press Buscar, which takes me to the next screen.

On this screen, I have the information for this enterprise. Next, I have to click the Informacion Estados Financieros tab, which opens another screen.

On this final screen, I need to click the Estado Situacion tab, and I get the information for the enterprise in the columns Codigo de la cuenta contable, Nombre de la cuenta contable and Valor. I would like to store this information in a data frame. The hardest part for me is selecting the Nombre element, entering the string, pressing Buscar, and then clicking through to the Informacion Estados Financieros tab. I tried using html_session and html_form from rvest, but they return empty results.

Could you help me with some steps to solve this problem?

r rvest
2 answers

RSelenium coding example

Here is a self-contained code example using the website linked in the question.

Note: Please do not run this code.

Why? 1k Stack users hitting one website amounts to a DDoS attack.


Background

The code below installs RSelenium if it is missing. As the comments in the code note, it assumes Firefox and the Selenium IDE add-on are already installed.

The code takes you from the second page [ http://appscvs.supercias.gob.ec/portaldeinformacion/consulta_cia_param.zul ] to the last page, where the information you are interested in ...

Useful links:

If you are interested in using RSelenium, I highly recommend that you read the following links. Thanks to John Harrison for developing the RSelenium package.

Code example


    # We want to make this as easy as possible to use,
    # so we install the required packages for the user...
    if (!require(RSelenium)) install.packages("RSelenium")
    if (!require(XML)) install.packages("XML")
    if (!require(RJSONIO)) install.packages("RJSONIO")
    if (!require(stringr)) install.packages("stringr")

    # Data
    mainPage <- "http://appscvs.supercias.gob.ec/portalInformacion/sector_societario.zul"
    businessPage <- "http://appscvs.supercias.gob.ec/portaldeinformacion/consulta_cia_param.zul"

    # Start server
    # We assume RSelenium is not set up, so we check whether the RSelenium
    # server is available; if not, we install the RSelenium server.
    # Note: in recent RSelenium releases checkForServer() and startServer()
    # are defunct; rsDriver() is the replacement there.
    RSelenium::checkForServer()

    # OK, now we start the server
    RSelenium::startServer()
    remDr <- RSelenium::remoteDriver$new()

    # We assume the user has installed Firefox and the Selenium IDE
    # https://addons.mozilla.org/en-US/firefox/addon/selenium-ide/
    #
    # OK, we open Firefox
    remDr$open(silent = TRUE)  # Open up a Firefox window...

    # Now we navigate the browser to the required URL...
    # This is the page that matters...
    remDr$navigate(businessPage)

    # First things first: on the first page, let's get the radio buttons,
    # the name element, and the search button. We need all three.
    radioButton  <- remDr$findElements(using = 'css selector', ".z-radio-cnt")
    nameElement  <- remDr$findElements(using = 'css selector', ".z-combobox-inp")
    searchButton <- remDr$findElements(using = 'css selector', ".z-button-cm")

    # Optional: we can highlight the radio elements returned
    # lapply(radioButton, function(x){x$highlightElement()})
    # Optional: we can highlight the nameElement returned
    # lapply(nameElement, function(x){x$highlightElement()})
    # Optional: we can highlight the searchButton returned
    # lapply(searchButton, function(x){x$highlightElement()})

    # Now we can select and press the third radio button (Nombre)
    radioButton[[3]]$clickElement()

    # We fill in the required name...
    nameElement[[1]]$sendKeysToElement(list("PROAÑO & ASOCIADOS CIA. LTDA."))

    # This is subtle but required: the page triggers a drop-down list, so rather
    # than hitting the searchButton straight away, we first select the matching
    # entry in the drop-down menu...
    selectElement <- remDr$findElements(using = 'css selector', ".z-comboitem-text")
    selectElement[[1]]$clickElement()

    # OK, now we can click the search button, which causes the next page to open
    searchButton[[1]]$clickElement()

    # New page opens...
    #
    # OK, so now we first pull the list of buttons...
    finPageButton <- remDr$findElements(using = 'class name', "m_iconos")

    # Now we can press the required button to open the page we want to get to...
    finPageButton[[9]]$clickElement()

    # We are now on the required page.

We are now on the target page.

Retrieving table values

The next step is to extract the table values. To do this, we select the elements matching the .z-listitem CSS selector and check whether we can see data rows. We can, so we extract the returned values and use them to populate a list or data frame.

    # OK, now we need to extract the table: we identify and pull out the
    # '.z-listitem' elements and assign them to modalWindow
    modalWindow <- remDr$findElements(using = 'css selector', ".z-listitem")

    # Now we can extract the rows from modalWindow. Each row is returned
    # as a single string, so we split it into three fields on the
    # newline marker "\n"
    lineText <- stringr::str_split(modalWindow[[1]]$getElementText()[1], '\n')
    lineText

Here is the result:

    > lineText <- stringr::str_split(modalWindow[[1]]$getElementText()[1], '\n')
    > lineText
    [[1]]
    [1] "10"
    [2] "OPERACIONES DE INGRESO CON PARTES RELACIONADAS EN PARAÍSOS FISCALES, JURISDICCIONES DE MENOR IMPOSICIÓN Y REGÍMENES FISCALES PREFERENTES"
    [3] "0.00"

Working with hidden data

Selenium WebDriver, and therefore RSelenium, only interacts with the visible elements of a web page. If we try to read the entire table, we get back only those table elements that are currently visible (not hidden).

We can work around this issue by scrolling to the bottom of the table. The scroll action makes the table populate all of its rows, after which we can extract the complete table.

    # Select the .z-listbox-body
    modalWindow <- remDr$findElements(using = 'css selector', ".z-listbox-body")

    # Now we tell the window we want to scroll to the bottom of the table.
    # This triggers the table to populate all the rows
    modalWindow[[1]]$executeScript("window.scrollTo(0, document.body.scrollHeight)")

    # Now we can extract the complete table
    modalWindow <- remDr$findElements(using = 'css selector', ".z-listitem")
    lineText <- stringr::str_split(modalWindow[[9]]$getElementText(), '\n')
    lineText
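Since the question asks for a data frame, here is a minimal sketch of that last step. It assumes every row's text has the same three newline-separated fields as the output shown earlier (code, name, value) and that the value uses a dot as decimal separator with no thousands separator. The helper name rows_to_df and the column names are hypothetical, not part of RSelenium.

```r
# Hypothetical helper: turn a character vector of row texts
# ("code\nname\nvalue") into a data frame.
rows_to_df <- function(row_texts) {
  parts <- strsplit(row_texts, "\n", fixed = TRUE)
  # Keep only rows that split into exactly three fields
  parts <- parts[lengths(parts) == 3]
  data.frame(
    codigo = vapply(parts, `[`, character(1), 1),
    nombre = vapply(parts, `[`, character(1), 2),
    # Assumption: values look like "0.00" (dot decimal, no separators)
    valor  = as.numeric(vapply(parts, `[`, character(1), 3)),
    stringsAsFactors = FALSE
  )
}

# With a live session, the row texts would come from the elements, e.g.:
# row_texts <- vapply(modalWindow,
#                     function(el) el$getElementText()[[1]],
#                     character(1))

# Made-up sample rows, only to illustrate the shape of the input
sample_rows <- c("10\nCUENTA DE EJEMPLO A\n0.00",
                 "101\nCUENTA DE EJEMPLO B\n12.50")
estado <- rows_to_df(sample_rows)
print(estado)
```

Rows that do not split into exactly three fields are dropped silently here; in real use you may prefer to inspect them instead.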

What the code does

The above code example should be self-sufficient. By this I mean that it installs everything it needs, including the necessary packages. After installing the dependent R packages, the R code calls checkForServer(); if the Selenium server is not installed, the call installs it. This may take some time.

I recommend stepping through the code, since I have not added any delays (which you would want in production). Also note that I have optimized not for speed but for clarity [from my point of view]...

The code has been tested and works on:
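As a rough illustration of the kind of delay meant here, the sketch below polls until a lookup returns something or a timeout expires. The helper wait_for and its parameters are hypothetical, not part of RSelenium. It relies only on the lookup returning an empty list while nothing matches, which is an assumption about how findElements behaves in your session.

```r
# Hypothetical helper: call find_fn() repeatedly until it returns a
# non-empty result, or stop with an error after `timeout` seconds.
wait_for <- function(find_fn, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # Treat an error from the lookup the same as "nothing found yet"
    found <- tryCatch(find_fn(), error = function(e) list())
    if (length(found) > 0) return(found)
    if (Sys.time() >= deadline) stop("Timed out waiting for element")
    Sys.sleep(interval)
  }
}

# Possible usage with the session from the example (needs a live browser):
# selectElement <- wait_for(function() {
#   remDr$findElements(using = "css selector", ".z-comboitem-text")
# })
# selectElement[[1]]$clickElement()
```

Wrapping each findElements call this way replaces manual stepping with a bounded wait, at the cost of a slightly slower run.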

  • Mac OS X 10.11.5
  • RStudio 0.99.893
  • R version 3.2.4 (2016-03-10) - "Very Secure Dishes"


Check out RSelenium
