Consistent factor levels for the same value across different datasets

I'm not sure if I fully understand how factors work. So please correct me in an easy to understand way if I am wrong.

I always assumed that when performing regressions and the like, R converts categorical variables into integers behind the scenes, but I never gave that part much thought.

It would use the categorical values in the training set and, after building the model, look for the same categorical values in the test data set. Whatever the underlying "levels" were did not matter to me.

However, I have thought about it more ... and need clarification - especially, if I am doing it wrong, how to fix it.

     train <- c("March", "April", "January", "November", "January")
     train <- as.factor(train)
     str(train)
     # Factor w/ 4 levels "April","January",..: 3 1 2 4 2

     test <- c("March", "April")
     test <- as.factor(test)
     str(test)
     # Factor w/ 2 levels "April","March": 2 1

Question

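In the training set, "April" is stored as "1" and "January" as 2, but in the test set "March" is stored as 2, because it is the 2nd level there.

So when the model is applied to the TEST set, will it mix up the values? If so, how do I keep the levels consistent?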


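When you call as.factor / factor without specifying levels, R sorts the unique values; the first level is then stored as 1, the second as 2, and so on.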
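A minimal sketch of that default behaviour, using the month names from the question (the variable names `month_values` and `f` are only illustrative):

```r
# Default levels are the sorted unique values, not the order of appearance.
month_values <- c("March", "April", "January", "November", "January")
f <- factor(month_values)

levels(f)       # the sorted unique values
as.integer(f)   # position of each element within levels(f)

# The integer codes only carry meaning together with the levels.
identical(levels(f), sort(unique(month_values)))
```

The integer codes on their own say nothing about which label they stand for; they are just indices into `levels(f)`.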

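If you need the "same" levels in both objects (for example, vectors or columns of different dfs, such as an id), you can set them explicitly: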

x <- letters[1:5]
y <- letters[3:8]
allvalues <- unique(union(x,y))  # superfluous but I think it adds clarity
x <- factor(x, levels = allvalues)
y <- factor(y, levels = allvalues)
str(x)   # Factor w/ 8 levels "a","b","c","d",..: 1 2 3 4 5
str(y)   # Factor w/ 8 levels "a","b","c","d",..: 3 4 5 6 7 8
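The same idea applied to the train/test vectors from the question, as a sketch (the names `all_levels`, `train_f` and `test_f` are made up for this example):

```r
train <- c("March", "April", "January", "November", "January")
test  <- c("March", "April")

# Build one shared set of levels, then use it for both vectors.
all_levels <- union(train, test)
train_f <- factor(train, levels = all_levels)
test_f  <- factor(test,  levels = all_levels)

as.integer(train_f)  # 1 2 3 4 3
as.integer(test_f)   # 1 2
identical(levels(train_f), levels(test_f))  # TRUE
```

Here the levels follow the order of first appearance rather than alphabetical order; what matters is only that both factors share the same level vector.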

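That said, when fitting and predicting, R matches factor values by their labels, not by the underlying integers, as this example shows: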

y <- sample(1:2, size = 20, replace = TRUE)
x <- factor(letters[y], levels = c("b","a"))  # so a~2 and b~1
y <- y + rnorm(n = 20, mean = 0, sd = 0.2)    # add a little noise
Set <- data.frame(x = x, y = y)
fit <- lm(y ~ x, data = Set)

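Have a look at: str(x), str(y), summary(fit).

So fit predicts y ~= 1 for x = a (stored as 2) and y ~= 2 for x = b.

Now create "new" data whose levels are ordered differently: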

x2 <- factor(c("a","b"), levels = c("c","d","a","b"))
str(x2)   # Factor w/ 4 levels "c","d","a","b": 3 4

Call predict on it and see what R does:

predict(fit, newdata = data.frame(x = x2))
#        1        2 
# 1.060569 1.961109 

So R matches by label, not by integer code, and gets it right...
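One way to check this is a small reproducible sketch. The seed, the data and the names `g`, `d`, `new1` and `new2` are assumptions made for illustration. A fitted lm object keeps the training levels in its `xlevels` component, and the predictions do not depend on the level order of the new data:

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
g <- rep(c("a", "b"), each = 10)
d <- data.frame(x = factor(g, levels = c("b", "a")),
                y = ifelse(g == "a", 1, 2) + rnorm(20, sd = 0.2))
fit <- lm(y ~ x, data = d)

fit$xlevels$x   # levels seen during fitting: "b" "a"

# Same labels, different level order, hence different integer codes.
new1 <- data.frame(x = factor(c("a", "b"), levels = c("a", "b")))
new2 <- data.frame(x = factor(c("a", "b"), levels = c("b", "a")))
all.equal(predict(fit, new1), predict(fit, new2))  # TRUE

# Optionally align new data with the training levels explicitly.
new1$x <- factor(new1$x, levels = fit$xlevels$x)
```

Aligning the levels by hand is not needed for predict here, but it keeps the integer codes comparable if you inspect or merge the data yourself.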
