Using R to scrape multiple pages

Here's my situation. Please keep in mind that I'm completely green when it comes to writing code, and I have no experience outside R.

Context. Every page I want to crawl has a URL following this format:

http://www.hockey-reference.com/friv/dailyleaders.cgi?month=10&day=8&year=2014

The variables that change in this URL are month, day, and year.

The URLs need to run from 10-8-2014 through 6-18-2015. Of course, not every day has an NHL game, so some of those pages will be blank.

Every page that does have games contains one HTML table for skaters and one for goalies.
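(For concreteness, here is a minimal sketch of how one such URL could be built from an R Date; the format string simply mirrors the pattern above.)

d <- as.Date("2014-10-08")                    # first day of the range
url <- sprintf(
  "http://www.hockey-reference.com/friv/dailyleaders.cgi?month=%d&day=%d&year=%d",
  as.integer(format(d, "%m")),                # month, no leading zero
  as.integer(format(d, "%d")),                # day, no leading zero
  as.integer(format(d, "%Y")))                # four-digit year
url
# [1] "http://www.hockey-reference.com/friv/dailyleaders.cgi?month=10&day=8&year=2014"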

I figured out how to scrape a SINGLE page and export it to CSV, but I don't know where to go from here so that I can do it in one fell swoop for every game last season (covering the dates mentioned above).

Code below:

library(XML)
NHL <- htmlParse("http://www.hockey-reference.com/friv/dailyleaders.cgi?month=10&day=8&year=2014")
class(NHL)
NHL.tables <- readHTMLTable(NHL, stringsAsFactors = FALSE)
length(NHL.tables)

head(NHL.tables[[1]])
tail(NHL.tables[[1]])

head(NHL.tables[[2]])
tail(NHL.tables[[2]])

write.csv(NHL.tables, file = "NHLData.csv")
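
(Side note: write.csv() on the whole list may error or combine the two tables side by side, since the skater and goalie tables have different numbers of rows. One simple workaround, with example file names, is to write each table to its own file.)

write.csv(NHL.tables[[1]], file = "NHLData_skaters.csv", row.names = FALSE)
write.csv(NHL.tables[[2]], file = "NHLData_goalies.csv", row.names = FALSE)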

Thanks in advance!

1 answer

I'm not sure exactly how you want to write the CSV, but here is how you can get all the tables between those dates. I tested this on the first few URLs and it worked well. Note that you don't need to parse the HTML before reading the table, since readHTMLTable() is able to read and parse directly from the URL.

library(XML)
library(RCurl)

# create the days
x <- seq(as.Date("2014-10-08"), as.Date("2015-06-18"), by = "day")
# create a url template for sprintf()
utmp <- "http://www.hockey-reference.com/friv/dailyleaders.cgi?month=%d&day=%d&year=%d"
# convert to numeric matrix after splitting for year, month, day
m <- do.call(rbind, lapply(strsplit(as.character(x), "-"), type.convert))
# create the list to hold the results
tables <- vector("list", nrow(m))
# get the tables
for(i in seq_len(nrow(m))) {
  # create the url for the day and if it exists, read it - if not, NULL
  tables[[i]] <- if(url.exists(u <- sprintf(utmp, m[i, 2], m[i, 3], m[i, 1]))) 
    readHTMLTable(u, stringsAsFactors = FALSE) 
  else NULL
}

The output of str() is quite long, so here is a quick look at the dimensions of the first element:

lapply(tables[[1]], dim)
# $skaters
# [1] 72 23
#
# $goalies
# [1]  7 15

Inside the for() loop, the URL for each day is built with sprintf() and checked with url.exists(). If the page exists, its tables are read with readHTMLTable(); if not, that list element is left as NULL, so afterwards you can tell which days had no games.
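
Since the question also asks about exporting, here is one possible way to finish up (a sketch only: it assumes every non-empty day returns skater tables with the same column layout, and the file name is just an example):

names(tables) <- as.character(x)               # label each element with its date
tables <- Filter(Negate(is.null), tables)      # drop days that had no page
# stack the skater tables into one data frame, keeping the date column;
# the same idea works for the goalie tables
skaters <- do.call(rbind, lapply(names(tables), function(d)
  cbind(date = d, tables[[d]]$skaters)))
write.csv(skaters, file = "NHLData_skaters.csv", row.names = FALSE)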
