Recursively download zip files from a web page (Windows)

Is it possible to download all zip files from a web page without specifying the individual links one at a time?

I want to download all monthly zip files from http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html .

I am using Windows 8.1 and R 3.1.1. I don't have wget on the PC, so I cannot use a recursive wget call.

As a workaround, I tried to load the text of the web page itself. I would then like to extract the name of each zip file, which could be passed to download.file in a loop. However, I am struggling to extract the names.

pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"

# download the page source to a temporary file and read it line by line
temp <- tempfile()
download.file(pth, temp)
dat <- readLines(temp)
unlink(temp)

# keep only the lines that mention the zip files
g <- dat[grepl("accounts_monthly", tolower(dat))]

g contains character strings with the file names embedded in the surrounding HTML:

g
 [1] "                    <li><a href=\"Accounts_Monthly_Data-September2013.zip\">Accounts_Monthly_Data-September2013.zip  (775Mb)</a></li>"
 [2] "                    <li><a href=\"Accounts_Monthly_Data-October2013.zip\">Accounts_Monthly_Data-October2013.zip  (622Mb)</a></li>" 

You can extract just the file names (e.g. Accounts_Monthly_Data-September2013.zip, dropping the size in parentheses) with gsub. Note that \\w does not match the hyphen in the names, so match non-space characters instead:

    gsub(".*>(\\S+\\.zip)\\s.*", "\\1", g)

With the sample data from the question:

g <- c("                    <li><a href=\"Accounts_Monthly_Data-September2013.zip\">Accounts_Monthly_Data-September2013.zip  (775Mb)</a></li>", 
"                    <li><a href=\"Accounts_Monthly_Data-October2013.zip\">Accounts_Monthly_Data-October2013.zip  (622Mb)</a></li>"
)
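
From here, a minimal sketch of the full loop (assuming, as the XML answer below does, that the file names resolve against the site root):

zips <- gsub(".*>(\\S+\\.zip)\\s.*", "\\1", g)
fileURLs <- paste0("http://download.companieshouse.gov.uk/", zips)
# mode = "wb" forces a binary transfer, which is the safe choice for
# zip files on Windows
for (i in seq_along(zips)) {
    download.file(fileURLs[i], destfile = zips[i], mode = "wb")
}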

Alternatively, with the XML package:

pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"
library(XML)
doc <- htmlParse(pth)
# select the anchor nodes whose text mentions the monthly files and
# return their attributes (here just href, i.e. the file names)
myfiles <- doc["//a[contains(text(),'Accounts_Monthly_Data')]", fun = xmlAttrs]
fileURLS <- file.path("http://download.companieshouse.gov.uk", myfiles)
mapply(download.file, url = fileURLS, destfile = myfiles)

"//a[contains(text(),'Accounts_Monthly_Data')]" XPATH. XML , (a), "Accounts_Monthly_Data". . fun = xmlAttrs XML xmlAttrs. xml. href, .

