sum() on a factor column returns an incorrect result

I have a strange problem here. I use data.table for a very routine task, but there is something that I cannot explain. I found a way around the problem, but I think it's still important for me to understand what is going wrong.

This code loads the data into the workspace:

    library(XML)
    library(data.table)
    theurl <- "http://goo.gl/hOKW3a"
    tables <- readHTMLTable(theurl)
    new.Res <- data.table(tables[[2]][4:5][-(1:2),])
    suppressWarnings(names(new.Res) <- c("Party","Cases"))

There are two columns, Party and Cases. Both are of class factor by default, although Cases should be numeric. Ultimately, I just want the total number of Cases for each Party. Something like this should work:

 new.Res[,sum(Cases), by=Party] 

But this does not give the correct answer. I thought it would work if I changed the class of Cases from factor to numeric, so I tried the following:

    new.Res[,Cases := as.numeric(Cases)]
    new.Res[,sum(Cases), by=Party]

But I got the same wrong answer. I realized that the problem lies in how the class of Cases is changed from factor to numeric, so I tried another method, and it worked:

Step 1: Re-initialize the data:

    theurl <- "http://goo.gl/hOKW3a"
    tables <- readHTMLTable(theurl)
    new.Res <- data.table(tables[[2]][4:5][-(1:2),])
    suppressWarnings(names(new.Res) <- c("Party","Cases"))

Step 2: Use another method to change the class from factor to numeric:

    new.Res[,Cases := strtoi(Cases)]
    new.Res[,sum(Cases), by=Party]

This works! However, I'm not sure what went wrong with the first two methods. What am I missing?

r data.table
1 answer

The correct way to convert from factor to numeric or integer is to go through character. This is because, internally, a factor is a vector of integer codes that index into its vector of levels. When you tell R to convert it to numeric, it simply converts those underlying integer codes rather than trying to convert the level labels.
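To make the difference between codes and labels concrete, here is a minimal base R sketch. The values in `cases` are invented for illustration and are not the table from the question.

```r
# Toy factor (hypothetical values, not the data from the question)
cases <- factor(c("10", "200", "3", "10"))

levels(cases)      # the labels, sorted as strings: "10" "200" "3"
as.integer(cases)  # the internal codes: 1 2 3 1
as.numeric(cases)  # the same codes as doubles, not the labels

# Summing after a direct conversion adds up the codes
sum_of_codes <- sum(as.numeric(cases))                 # 1 + 2 + 3 + 1

# Going through character converts the labels instead
sum_of_labels <- sum(as.numeric(as.character(cases)))  # 10 + 200 + 3 + 10

# strtoi() coerces its argument to character first, which would
# explain why it gave the right answer in the question
labels_via_strtoi <- strtoi(cases)
```

The two sums differ because the first one never looks at the labels at all.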

Short answer: do Cases := as.numeric(as.character(Cases)).
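Applied to a data.table, the short answer might look like the sketch below. The table `toy` is a hypothetical stand-in for new.Res with made-up values, so the totals are only illustrative.

```r
library(data.table)

# Hypothetical stand-in for new.Res; the values are invented for illustration
toy <- data.table(
  Party = factor(c("A", "B", "A", "B")),
  Cases = factor(c("10", "200", "3", "10"))
)

# Convert via character so the labels, not the codes, become numbers
toy[, Cases := as.numeric(as.character(Cases))]

# Per-party totals on the converted column
totals <- toy[, .(Total = sum(Cases)), by = Party]
print(totals)
```

Here party A should total 13 and party B 210, which is the sum of the labels rather than of the factor codes.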

Edit: As an alternative, the ?factor help page recommends as.numeric(levels(Cases))[Cases] as slightly more efficient. h/t @Gsee in the comments.
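A small base R sketch, using an invented toy factor, suggests why the two idioms agree: the second converts each distinct level once and then indexes that short numeric vector by the factor's integer codes.

```r
# Toy factor (hypothetical values)
cases <- factor(c("10", "200", "3", "10"))

# Idiom 1: convert every element's label
via_character <- as.numeric(as.character(cases))

# Idiom 2: convert the levels once, then index by the factor's codes
via_levels <- as.numeric(levels(cases))[cases]

same_result <- identical(via_character, via_levels)
```

With many rows and few distinct levels, the second form parses far fewer strings, which is presumably where the efficiency gain comes from.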
