Using R to scrape the download link address from a web page?

I am trying to automate a process that involves downloading .zip files from several web pages and extracting the CSV files they contain. The challenge is that the .zip file names, and therefore the link addresses, change weekly or annually, depending on the page. Is there a way to scrape the current link addresses from those pages, so that I can then pass the addresses to a function that downloads the files?

One of the landing pages is this one. The file I want to download is the second bullet under the heading "2015 Realtime Complete All Africa File" --- i.e., the zipped .csv. As I write this, that file is labelled "Realtime 2015 All Africa File (updated July 11, 2015) (csv)" on the web page, and the link address I want is http://www.acleddata.com/wp-content/uploads/2015/07/ACLED-All-Africa-File_20150101-to-20150711_csv.zip, but it should change later today, because the data are updated every Monday --- hence my task.

I tried, but failed, to automate the extraction of this .zip file address using the rvest package and SelectorGadget in Chrome. Here's how that went:

    > library(rvest)
    > realtime.page <- "http://www.acleddata.com/data/realtime-data-2015/"
    > realtime.html <- html(realtime.page)
    > realtime.link <- html_node(realtime.html, xpath = "//ul[(((count(preceding-sibling::*) + 1) = 7) and parent::*)]//li+//li//a")
    > realtime.link
    [1] NA

The xpath in this html_node() call came from highlighting just the "(csv)" portion of "Realtime 2015 All Africa File (updated July 11, 2015) (csv)" in green, and then clicking on enough other highlighted bits of the page to eliminate all the yellow and leave only red and green.

Did I make a small mistake somewhere in this process, or am I on completely the wrong track? As you can probably tell, I have no experience with HTML or web scraping, so I would be very grateful for any help.

1 answer

I think you are trying to do too much in a single xpath expression - I would attack the problem as a sequence of smaller steps:

    library(rvest)
    library(stringr)

    page <- html("http://www.acleddata.com/data/realtime-data-2015/")

    page %>%
      html_nodes("a") %>%          # find all links on the page
      html_attr("href") %>%        # extract each link's url
      str_subset("csv\\.zip") %>%  # keep those ending in csv.zip
      .[[1]]                       # take the first one
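From there, the rest of your task (downloading the archive and pulling out the CSV) can be done with base R. A minimal sketch, assuming the pipeline above returns a single valid .zip url; the temporary file name, the extraction directory, and the `zip.url` / `acled` variable names are just placeholders for illustration:

    library(rvest)
    library(stringr)

    page <- html("http://www.acleddata.com/data/realtime-data-2015/")

    # same pipeline as above, but store the result instead of printing it
    zip.url <- page %>%
      html_nodes("a") %>%
      html_attr("href") %>%
      str_subset("csv\\.zip") %>%
      .[[1]]

    # download the archive to a temporary file, then extract the csv it contains
    tmp <- tempfile(fileext = ".zip")
    download.file(zip.url, tmp, mode = "wb")   # mode = "wb" matters on Windows
    csv.name <- grep("\\.csv$", unzip(tmp, list = TRUE)$Name, value = TRUE)[[1]]
    unzip(tmp, files = csv.name, exdir = tempdir())
    acled <- read.csv(file.path(tempdir(), csv.name))

Since the data are refreshed every Monday, re-running this script always picks up whatever the current .zip link happens to be, which is the point of scraping the address rather than hard-coding it. Note also that in later versions of rvest, html() was renamed read_html(); the rest of the pipeline is unchanged.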
