XML package error in R 2.12 but not in 2.10

I use the XML package in R to read HTML tables from a web page. In R 2.12.1, I get the following error:

Error in names(ans) = header : 
  'names' attribute [24] must be the same length as the vector [19]

However, when I run the same code in R 2.10, there are no errors and everything is parsed (almost) correctly. I say "almost" because the column names are taken from the first row of the table, but I can work around that.

Here is my code:

## load the libraries
library(XML)

## set the season
SEASON <- "2011"

## create the URL
URL <- paste("http://www.hockey-reference.com/leagues/NHL_", SEASON, "_goalies.html", sep="")

## grab the page and parse its tables -- why does this work in 2.10, but not in 2.12.1?
tables <- readHTMLTable(URL)

Any help you can provide would be greatly appreciated.

1 answer

I am not sure whether this problem is caused by the move to R 2.12.1. I tried your code on 2.12.1 and got the same error.

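As far as I can tell, the problem lies in the HTML of the page itself. The HTML table has a header, but that header is not a single row of column names. The HTML header appears to have two parts: 1) an extra row of grouped headings, 2) the row of actual column names.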

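As a result, the header does not match the body of the table. The table has 19 columns, but the header yields more names than that: 19 plus 5, i.e. 24. That mismatch is what triggers the error.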
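The error message itself can be reproduced in base R without any HTML, which may help confirm that a length mismatch between header and columns is the cause. This is only a minimal sketch: `ans` and `header` echo the names in the error message, and their contents are made-up placeholders, not data from the page.

```r
# Placeholder data: 19 columns, but 24 header cells
# (the 19 vs. 24 counts come from the error message in the question).
ans <- as.list(seq_len(19))
header <- paste0("h", seq_len(24))

# Catch the error instead of stopping, so the message can be inspected.
msg <- tryCatch(
  {
    names(ans) <- header
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

Assigning more names than there are elements fails in the same way as the `names(ans) = header` call reported in the question.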

I worked around the problem by not using readHTMLTable(). Instead, I used the scrapeR and XML packages:

# load the libraries
library(XML)
library(scrapeR)
library(plyr)
library(stringr)

# scrape and parse page
page <- scrape(url=URL, parse=TRUE)
raw <- xpathSApply(page[[1]], "//table//tr", xmlValue)
# split strings at each line break
rows <- strsplit(raw, "\n")
# now check for longest row length, and discard all short rows
rowlength <- (laply(rows, length))
rows <- rows[rowlength==max(rowlength)]
# unlist each row
rows <- laply(rows, function(x)unlist(x))
# trim white space
rows <- aaply(rows, c(1,2), str_trim)
# convert to data frame
df <- as.data.frame(rows, stringsAsFactors = FALSE)
# read names from first row
names(df) <- laply(df[1, ], str_trim)
# remove all rows without a numeric index
df <- df[which(!is.na(as.numeric(df$Rk))), ]
df
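The core of this workaround (keep only the rows with the most cells, use the first of them as column names, then drop rows whose `Rk` is not a number) can also be tried offline. The sketch below uses base R only and a small made-up `rows` list in place of the scraped page. The player names, team codes, and the repeated mid-table header row are all invented for illustration.

```r
# Made-up rows standing in for the cells of each <tr> after splitting.
rows <- list(
  c("", "Goalie Stats", ""),              # short extra header row
  c("Rk", "Player", "Tm", "GP", "W"),     # column names
  c("1", "Player A", "AAA", "10", "6"),
  c("2", "Player B", "BBB", "12", "5"),
  c("Rk", "Player", "Tm", "GP", "W")      # header repeated mid-table
)

# Keep only the rows that have the maximum number of cells.
n_cells <- vapply(rows, length, integer(1))
full <- rows[n_cells == max(n_cells)]

# Stack the rows into a character matrix; the first row holds the names.
mat <- do.call(rbind, full)
df <- as.data.frame(mat[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(df) <- mat[1, ]

# Drop rows whose Rk is not a number (for example, repeated header rows).
rk <- suppressWarnings(as.numeric(df$Rk))
df <- df[!is.na(rk), ]
print(df)
```

Wrapping `as.numeric()` in `suppressWarnings()` only hides the expected coercion warnings for non-numeric `Rk` values; the filtering itself is the same idea as in the code above.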

