predict.glm() with three new categories in the test data (R) (error)

I have a data set called data that contains 481,092 rows.

I split data into two equal halves:

  • The first half (rows 1 to 240,546) is called train and was used for glm() ;
  • the second half (rows 240,547 to 481,092) is called test and should be used to test the model (see the sketch below).
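
In code, this row-order split presumably looked something like the following sketch (the actual split code is not shown in the question; the indices come from the description above):

 train <- data[1:240546, ]       ## first half
 test  <- data[240547:481092, ]  ## second half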

Then I ran the regression:

 testreg <- glm(train$returnShipment ~ train$size + train$color + train$price + train$manufacturerID + train$salutation + train$state + train$age + train$deliverytime, family=binomial(link="logit"), data=train) 

Now the prediction:

 prediction <- predict.glm(testreg, newdata=test, type="response") 

gives me an error:

 Error in model.frame.default(Terms, newdata, na.action=na.action, xlev=object$xlevels): Factor 'train$manufacturerID' has new levels 125, 136, 137 

Now, I know that these levels were omitted in the regression, because no coefficients are shown for them.

I tried the approach from predict.lm() with an unknown factor level in test data. But somehow it does not work for me, or maybe I just don't understand how to implement it. I want to predict the dependent binary variable, but of course only with the existing coefficients. In the linked answer it is suggested to tell R that rows with new levels should simply be treated as NA.

How can I continue?

Edit: trying the approach proposed by Z. Li

I ran into a problem at the first step:

 xlevels <- testreg$xlevels$manufacturerID
 mID125 <- xlevels[1]

but mID125 is NULL! What did I do wrong?

2 answers

Because you split your train and test samples by row order, some factor levels of your variables are not represented in both the train and the test sample.

You need to take a stratified sample to ensure that both the train and the test sample contain all factor levels. Use stratified() from the splitstackshape package.
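
For example, a minimal sketch with stratified() might look like this (stratifying on manufacturerID and keeping the 50/50 split are assumptions based on the question; bothSets = TRUE returns both the sample and the remainder):

 library(splitstackshape)
 set.seed(1)
 ## take 50% of the rows within every manufacturerID level
 parts <- stratified(data, group = "manufacturerID", size = 0.5, bothSets = TRUE)
 train <- parts[[1]]  ## the stratified sample
 test  <- parts[[2]]  ## the remaining rows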


It is not possible to estimate new factor levels in fixed-effect models, including linear models and generalized linear models. glm (as well as lm) keeps a record of which factor levels were present and used during model fitting; this record can be found in testreg$xlevels .

Your model formula used for fitting is:

 returnShipment ~ size + color + price + manufacturerID + salutation + state + age + deliverytime 

so predict complains about the new levels 125, 136, 137 of manufacturerID . This means that these levels are not in testreg$xlevels$manufacturerID and therefore have no estimated coefficient to use for prediction. In this case we have to drop this factor variable and effectively predict with the formula:

 returnShipment ~ size + color + price + salutation + state + age + deliverytime 

However, the standard predict routine cannot take a customized prediction formula. There are usually two solutions:

  • extract the model matrix and the model coefficients from testreg and predict manually, via matrix-vector multiplication, using only the terms we want. This is what the link provided in your post suggests (see the sketch after this list);
  • reset the factor levels in test to some level that appears in testreg$xlevels$manufacturerID , for example testreg$xlevels$manufacturerID[1] . This way we can still use the standard predict for prediction.
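
A minimal sketch of the first solution, assuming the model has been refit in the good style described in the update below (bare variable names and data = train, so the coefficient and xlevels names carry no train$ prefix):

 b <- coef(testreg)
 ## model matrix for the reduced formula (manufacturerID dropped), reusing the
 ## factor levels recorded at fitting time; the other factor columns of test
 ## must not contain unseen levels either
 xlev <- testreg$xlevels[names(testreg$xlevels) != "manufacturerID"]
 X <- model.matrix(~ size + color + price + salutation + state + age + deliverytime,
                   data = test, xlev = xlev)
 eta  <- drop(X %*% b[colnames(X)])   ## linear predictor without the manufacturerID term
 prob <- testreg$family$linkinv(eta)  ## back to the probability scale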

Now, for the second solution, let's first pick a factor level that was used during model fitting:

 xlevels <- testreg$xlevels$manufacturerID
 mID125 <- xlevels[1]

Then we assign this level to all rows of your prediction data:

 replacement <- factor(rep(mID125, length = nrow(test)), levels = xlevels)
 test$manufacturerID <- replacement

And we are ready to predict:

 pred <- predict(testreg, test, type = "link")  ## don't use type = "response" here: the adjustment below is done on the link scale

In the end, we adjust this linear predictor by subtracting the factor estimate:

 est <- coef(testreg)[paste0("manufacturerID", mID125)]  ## if mID125 is the baseline level it has no coefficient (its contribution is 0), so skip this adjustment
 pred <- pred - est

Finally, if you want predictions on the original (response) scale, apply the inverse link function:

 testreg$family$linkinv(pred) 
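
For this binomial model with the logit link, linkinv is just the logistic function, so the same result can be obtained with plogis:

 prob <- plogis(pred)  ## identical to testreg$family$linkinv(pred) for binomial(link = "logit")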

Update:

You mentioned in the comments that you ran into problems when trying the solution above. Here is why.

Your code:

 testreg <- glm(train$returnShipment~ train$size + train$color + train$price + train$manufacturerID + train$salutation + train$state + train$age + train$deliverytime, family=binomial(link="logit"), data=train) 

is a very bad way to specify the model formula. Writing train$returnShipment etc. hard-codes where the variables are taken from to the train data frame, and you will run into problems later when predicting with other data sets, such as test .

As a simple demonstration of this flaw, let's simulate some toy data and fit a GLM:

 set.seed(0); y <- rnorm(50, 0, 1)
 set.seed(0); a <- sample(letters[1:4], 50, replace = TRUE)
 foo <- data.frame(y = y, a = factor(a))
 toy <- glm(foo$y ~ foo$a, data = foo)  ## bad style
 > toy$formula
 foo$y ~ foo$a
 > toy$xlevels
 $`foo$a`
 [1] "a" "b" "c" "d"

Now we see that everything carries the prefix foo$ . During prediction:

 newdata <- foo[1:2, ]  ## take the first 2 rows of "foo" as "newdata"
 rm(foo)                ## remove "foo" from the R session
 predict(toy, newdata)

we get the error:

 Error in eval(expr, envir, enclos): object 'foo' not found

The good style is to specify, via the data argument, the data frame from which variables are taken:

 foo <- data.frame(y = y, a = factor(a))
 toy <- glm(y ~ a, data = foo)

and the foo$ prefix goes away:

 > toy$formula
 y ~ a
 > toy$xlevels
 $a
 [1] "a" "b" "c" "d"

This explains two things:

  • You commented that testreg$xlevels$manufacturerID gives you NULL : with the bad style, the levels are stored under the name train$manufacturerID, i.e. in testreg$xlevels$`train$manufacturerID` ;
  • The prediction error

     Error in model.frame.default(Terms, newdata, na.action=na.action, xlev=object$xlevels): Factor 'train$manufacturerID' has new levels 125, 136, 137 

    complains about 'train$manufacturerID' rather than about the manufacturerID column of test . A sketch of the corrected fit is given after this list.
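
Putting this together, a sketch of how your model could be refit in the good style (same variables as in your original call), so that testreg$xlevels$manufacturerID is populated and predict works naturally with newdata = test:

 testreg <- glm(returnShipment ~ size + color + price + manufacturerID +
                  salutation + state + age + deliverytime,
                family = binomial(link = "logit"), data = train)
 testreg$xlevels$manufacturerID  ## no longer NULL: the levels seen during fitting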

