Consistent factor levels for the same value across different datasets

Question

I'm not sure I fully understand how factors work, so please correct me in an easy-to-understand way if I am wrong.

I always assumed that when running regressions and the like, R converts categorical variables to integers behind the scenes, but I never gave that part much thought.

I assumed it would use the categorical values in the training set and, after building the model, look for the same categorical values in the test set. Whatever the underlying "levels" were didn't matter to me.

However, I have thought about it some more and need clarification, especially on how to fix it if I am doing it wrong.

     train <- c("March", "April", "January", "November", "January")
     train <- as.factor(train)
     str(train)
     # Factor w/ 4 levels "April","January",..: 3 1 2 4 2

     test <- c("March", "April")
     test <- as.factor(test)
     str(test)
     # Factor w/ 2 levels "April","March": 2 1

Question

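In the example above, the integer codes are not in sync between the two factors: "April" is 1 in both, but "January" is 2 in train while "March" is 2 in test.

Does this mismatch matter when a model built on train is applied to test? If it does, how do I make the levels consistent across both datasets?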

r categorical-data factors

runningbirds · Feb 24 '16 at 7:08

Alexandre Halm · Answer 1 · 2016-02-24T07:15:38+0000

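When you call as.factor / factor without specifying levels, R sorts the unique values (alphabetically for strings); the first level is then stored as 1, the second as 2, and so on.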
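To see that default coding directly, here is a small sketch reusing the month values from the question. The names `lv`, `codes` and `labels_back` are only illustrative.

```r
# Default behaviour: levels are the sorted unique values.
train <- factor(c("March", "April", "January", "November", "January"))

lv <- levels(train)          # "April" "January" "March" "November"
codes <- as.integer(train)   # 3 1 2 4 2 (position of each value in lv)

# The codes are just indices into the levels vector,
# so indexing the levels with them recovers the original labels.
labels_back <- lv[codes]
print(labels_back)
```

The integer code is therefore a property of each factor object, not of the value itself, which is why the same month can get a different code in another vector.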

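If you want "consistent" levels across objects (for example, to merge vectors/dfs or match them by id), set the levels explicitly: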

x <- letters[1:5]
y <- letters[3:8]
allvalues <- unique(union(x,y))  # superfluous but I think it adds clarity
x <- factor(x, levels = allvalues)
y <- factor(y, levels = allvalues)
str(x)   # Factor w/ 8 levels "a","b","c","d",..: 1 2 3 4 5
str(y)   # Factor w/ 8 levels "a","b","c","d",..: 3 4 5 6 7 8
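The same idea applied to a train/test pair might look like the sketch below. It assumes two hypothetical data frames, `train_df` and `test_df`, with a `month` column, and that every value in the test set also occurs in the training set.

```r
# Hypothetical data frames mirroring the question's vectors.
train_df <- data.frame(month = c("March", "April", "January", "November", "January"),
                       stringsAsFactors = FALSE)
test_df  <- data.frame(month = c("March", "April"),
                       stringsAsFactors = FALSE)

# Build the factor on the training data first ...
train_df$month <- factor(train_df$month)

# ... then reuse exactly those levels for the test data.
test_df$month <- factor(test_df$month, levels = levels(train_df$month))

str(test_df$month)
test_codes <- as.integer(test_df$month)   # 3 1, the same codes as in train_df
```

Note that any test value missing from `levels(train_df$month)` would become `NA` here, so it is worth checking `anyNA(test_df$month)` afterwards.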

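That said, when fitting a model R does not rely on the integer codes; it works with the level labels, as this example shows: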

y <- sample(1:2, size = 20, replace = TRUE)
x <- factor(letters[y], levels = c("b","a"))  # so a~2 and b~1
y <- y + rnorm(n = 20, mean = 0, sd = 0.2)    # noisy response around 1 and 2
Set <- data.frame(x = x, y = y)
fit <- lm(y ~ x, data = Set)

Have a look at: str(x), str(y), summary(fit).

As you can see, fit associates x = a (coded 2) with y ~= 1 and x = b (coded 1) with y ~= 2.
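One way to convince yourself that the level order only changes the parameterisation, not the predictions, is to fit the same data twice with opposite level orders. This is a sketch on made-up data; all object names (`g`, `d_ab`, `d_ba`, `fit_ab`, `fit_ba`, `new_x`) are illustrative.

```r
set.seed(1)
g   <- rep(c("a", "b"), times = 10)                       # 10 of each group
resp <- ifelse(g == "a", 1, 2) + rnorm(20, mean = 0, sd = 0.2)

# Same data, opposite level order (and therefore opposite integer codes).
d_ab <- data.frame(x = factor(g, levels = c("a", "b")), y = resp)
d_ba <- data.frame(x = factor(g, levels = c("b", "a")), y = resp)

fit_ab <- lm(y ~ x, data = d_ab)   # reference level "a", coefficient "xb"
fit_ba <- lm(y ~ x, data = d_ba)   # reference level "b", coefficient "xa"

new_x <- data.frame(x = factor(c("a", "b")))
pred_ab <- predict(fit_ab, newdata = new_x)
pred_ba <- predict(fit_ba, newdata = new_x)

same <- isTRUE(all.equal(pred_ab, pred_ba))
print(same)
```

The coefficients differ because the reference level differs, but the fitted value for each label should agree up to numerical tolerance.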

Now take "new" data whose factor has a different set of levels:

x2 <- factor(c("a","b"), levels = c("c","d","a","b"))
str(x2)   # Factor w/ 4 levels "c","d","a","b": 3 4

predict still returns the right values, because it matches on the labels rather than on R's integer codes:

predict(fit, newdata = data.frame(x = x2))
#        1        2 
# 1.060569 1.961109
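One caveat worth knowing: extra unused levels are harmless, but a value that never appeared in the training data makes predict fail. The sketch below uses made-up data and hypothetical names (`d_demo`, `fit_demo`) so it does not overwrite the objects above.

```r
set.seed(1)
d_demo <- data.frame(x = factor(rep(c("a", "b"), times = 10)))
d_demo$y <- ifelse(d_demo$x == "a", 1, 2) + rnorm(20, mean = 0, sd = 0.2)
fit_demo <- lm(y ~ x, data = d_demo)

# Extra, unused levels ("c", "d") are dropped before matching: this works.
ok <- predict(fit_demo,
              newdata = data.frame(x = factor(c("a", "b"),
                                              levels = c("c", "d", "a", "b"))))

# An actual value the model has never seen raises an error instead.
res <- tryCatch(
  predict(fit_demo, newdata = data.frame(x = factor("c"))),
  error = function(e) e
)
failed <- inherits(res, "error")
print(failed)
```

So if the test set may contain categories missing from the training set, those rows have to be handled separately before calling predict.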

In short, R matches factor values by their labels, not by their internal integer codes.
