It is impossible to predict for new factor levels in fixed-effect modelling, including linear models and generalized linear models. glm (as well as lm) keeps a record of which factor levels were present and used during model fitting; this record can be found in testreg$xlevels.
Your model formula for model estimation is:
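To see which levels a fitted model knows about, and which levels in new data it has never seen, one can compare the new data against xlevels. A minimal sketch on made-up data (the data frames train_toy and test_toy and the factor g are hypothetical, not taken from your data):

```r
## toy data: factor "g" has levels a, b, c in training; "d" appears only in the test data
set.seed(1)
train_toy <- data.frame(y = rbinom(60, 1, 0.5),
                        g = factor(rep(c("a", "b", "c"), 20)))
test_toy <- data.frame(g = factor(c("a", "c", "d")))

fit <- glm(y ~ g, family = binomial, data = train_toy)

## levels recorded at fitting time
fit$xlevels$g

## levels in the test data that the model has never seen
new_levels <- setdiff(levels(test_toy$g), fit$xlevels$g)
new_levels
```

Any level listed in new_levels would trigger the "new levels" error in predict.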
returnShipment ~ size + color + price + manufacturerID + salutation + state + age + deliverytime
then predict complains about new factor levels 125, 136, 137 for manufacturerID. This means that these levels are not in testreg$xlevels$manufacturerID, so they have no associated coefficient for prediction. In this case, we have to drop this factor variable and use a prediction formula:
returnShipment ~ size + color + price + salutation + state + age + deliverytime
However, the standard predict procedure cannot accept your customized prediction formula. There are usually two solutions:
- extract the model matrix and model coefficients from testreg and manually predict the model terms we want using matrix-vector multiplication. This is what the link provided in your post suggests;
- reset the factor levels in test to any one level that appears in testreg$xlevels$manufacturerID, for example testreg$xlevels$manufacturerID[1]. This way we can still use the standard predict for prediction.
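The first option is not worked out in this answer, so here is a minimal sketch of the idea on made-up data. The names train_toy, test_toy, x and g are hypothetical; with further factor covariates in the model you would also need to pass the training levels to model.matrix through its xlev argument.

```r
## toy data: numeric covariate "x" and factor "g"; level "d" is new in the test data
set.seed(2)
train_toy <- data.frame(x = rnorm(60),
                        g = factor(rep(c("a", "b", "c"), 20)))
train_toy$y <- rbinom(60, 1, plogis(0.5 * train_toy$x))
test_toy <- data.frame(x = c(-1, 0, 1), g = factor(c("a", "d", "d")))

fit <- glm(y ~ x + g, family = binomial, data = train_toy)

## design matrix for the terms we keep (intercept and x), built from the test data
X <- model.matrix(~ x, data = test_toy)

## pick the coefficients with matching names and multiply
beta <- coef(fit)[colnames(X)]
eta <- drop(X %*% beta)          ## linear predictor without the factor "g"
prob <- fit$family$linkinv(eta)  ## back to the probability scale
```

Leaving out the columns for g amounts to predicting at its baseline level under the default treatment contrasts.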
Now, let's first pick a factor level that was used for model fitting:
xlevels <- testreg$xlevels$manufacturerID
mID125 <- xlevels[1]
Then we assign this level to your prediction data:
replacement <- factor(rep(mID125, length.out = nrow(test)), levels = xlevels)
test$manufacturerID <- replacement
And we are ready to predict:
pred <- predict(testreg, test, type = "link")
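In the end, we adjust this linear predictor by subtracting the coefficient estimate for that factor level:
est <- coef(testreg)[paste0("manufacturerID", mID125)]
## the baseline level has no coefficient under default treatment contrasts (NA here); its effect is 0
if (is.na(est)) est <- 0
pred <- pred - est
Finally, if you want predictions on the original scale, apply the inverse link function: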
testreg$family$linkinv(pred)
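For reference, the whole procedure can be tried end to end on made-up data. This is only a sketch: train_toy, test_toy, x and g are hypothetical names, and it picks the second level so that a coefficient with that name exists (the first level is the baseline under the default treatment contrasts).

```r
## toy data: level "d" of factor "g" appears only in the test data
set.seed(3)
train_toy <- data.frame(x = rnorm(60),
                        g = factor(rep(c("a", "b", "c"), 20)))
train_toy$y <- rbinom(60, 1, plogis(0.5 * train_toy$x))
test_toy <- data.frame(x = c(-1, 0, 1), g = factor(c("a", "d", "d")))

fit <- glm(y ~ x + g, family = binomial, data = train_toy)

## overwrite the factor with a level seen at fitting time (here the second one)
lev <- fit$xlevels$g
pick <- lev[2]
test_toy$g <- factor(rep(pick, length.out = nrow(test_toy)), levels = lev)

## linear predictor, then remove the contribution of the picked level
eta <- predict(fit, test_toy, type = "link")
eta <- eta - coef(fit)[paste0("g", pick)]

## the inverse link gives probabilities
prob <- fit$family$linkinv(eta)
```

After the subtraction, eta contains only the intercept and the x term, which is what the reduced prediction formula asks for.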
Update:
You mentioned that you ran into various problems when trying the above solutions. Here is why.
Your code:
testreg <- glm(train$returnShipment~ train$size + train$color + train$price + train$manufacturerID + train$salutation + train$state + train$age + train$deliverytime, family=binomial(link="logit"), data=train)
is a very bad way to specify a model formula. Writing train$returnShipment etc. restricts the environment for looking up variables strictly to the data frame train, and you will run into trouble in later prediction with other data sets, such as test.
As a simple example of this drawback, we simulate some toy data and fit a GLM:
set.seed(0); y <- rnorm(50, 0, 1)
set.seed(0); a <- sample(letters[1:4], 50, replace = TRUE)
foo <- data.frame(y = y, a = factor(a))
toy <- glm(foo$y ~ foo$a, data = foo)  ## bad style

> toy$formula
foo$y ~ foo$a
> toy$xlevels
$`foo$a`
[1] "a" "b" "c" "d"
Now we see that everything comes with the prefix foo$. At prediction time:
newdata <- foo[1:2, ]  ## take first 2 rows of "foo" as "newdata"
rm(foo)  ## remove "foo" from R session
predict(toy, newdata)
we get the error:
Error in eval(expr, envir, enclos) : object 'foo' not found
A good style is to specify the environment for getting data via the data argument of the function:
foo <- data.frame(y = y, a = factor(a))
toy <- glm(y ~ a, data = foo)
then the foo$ prefix goes away:
> toy$formula
y ~ a
> toy$xlevels
$a
[1] "a" "b" "c" "d"
This explains two things:
- You told me in the comments that when you do testreg$xlevels$manufacturerID, you get NULL;
- The prediction error
Error in model.frame.default(Terms, newdata, na.action=na.action, xlev=object$xlevels): Factor 'train$manufacturerID' has new levels 125, 136, 137
complains about train$manufacturerID instead of test$manufacturerID.
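As a quick check that the cleaner formula avoids both symptoms, the toy example can be rerun with the variables looked up through the data argument. This is a sketch reusing the simulated data from above, not your data:

```r
set.seed(0); y <- rnorm(50, 0, 1)
set.seed(0); a <- sample(letters[1:4], 50, replace = TRUE)
foo <- data.frame(y = y, a = factor(a))

toy <- glm(y ~ a, data = foo)   ## variables are looked up in "data"

newdata <- foo[1:2, ]
rm(foo)                         ## "foo" is gone, prediction still works
p <- predict(toy, newdata)

names(toy$xlevels)              ## "a", so toy$xlevels$a is not NULL
```

Applied to your model, this means fitting with the formula returnShipment ~ size + color + price + manufacturerID + salutation + state + age + deliverytime and data = train, so that predict looks up manufacturerID in test.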