How to update an `lm` or `glm` model on the same subset of data?

I am trying to fit two nested models and then test them against each other using the anova function. The commands I used:

    probit <- glm(grad ~ afqt1 + fhgc + mhgc + hisp + black + male,
                  data = dt, family = binomial(link = "probit"))
    nprobit <- update(probit, . ~ . - afqt1)
    anova(nprobit, probit, test = "Rao")

However, the variable afqt1 apparently contains NAs, and since the update call does not refit on the same subset of data, anova() returns the error

    Error in anova.glmlist(c(list(object), dotargs), dispersion = dispersion, :
      models were not all fitted to the same size of dataset

Is there an easy way to get the updated model fitted on the same dataset as the original model?

1 answer

As suggested in the comments, a straightforward approach is to use the model data from the first fit (e.g. probit) together with update's ability to override arguments of the original call.

Here's a reproducible example:

    data(mtcars)
    mtcars[1, 2] <- NA
    nobs( xa <- lm(mpg ~ cyl + disp, mtcars) )
    ## [1] 31
    nobs( update(xa, . ~ . - cyl) )  ## not nested
    ## [1] 32
    nobs( xb <- update(xa, . ~ . - cyl, data = xa$model) )  ## nested
    ## [1] 31

Simply define a convenient wrapper around this:

    update_nested <- function(object, formula., ..., evaluate = TRUE){
        update(object = object, formula. = formula., data = object$model, ...,
               evaluate = evaluate)
    }

This forces the data argument of the updated call to reuse the model frame from the first fit, so rows dropped because of NAs stay dropped.

    nobs( xc <- update_nested(xa, . ~ . - cyl) )
    ## [1] 31
    all.equal(xb, xc)  ## only the `call` component will be different
    ## [1] "Component "call": target, current do not match when deparsed"
    identical(xb[-10], xc[-10])
    ## [1] TRUE

So now you can easily run anova:

    anova(xa, xc)
    ## Analysis of Variance Table
    ##
    ## Model 1: mpg ~ cyl + disp
    ## Model 2: mpg ~ disp
    ##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)
    ## 1     28 269.97
    ## 2     29 312.96 -1   -42.988 4.4584 0.04378 *
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
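The same wrapper should carry over to glm fits such as the probit in the question, because glm objects also store their model frame in `$model` by default. Here is a minimal sketch, assuming simulated data in place of the question's `dt`; the variables `y`, `x1` and `x2` are made up for illustration, and the wrapper is repeated so the snippet runs on its own:

```r
# Simulated stand-in for the question's data (hypothetical variables).
set.seed(1)
n  <- 200
dt <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dt$y <- rbinom(n, 1, pnorm(0.5 * dt$x1 - 0.3 * dt$x2))
dt$x1[1:10] <- NA  # missing values in the regressor that will be dropped

# Wrapper from the answer: refit on the model frame of the first fit.
update_nested <- function(object, formula., ..., evaluate = TRUE) {
  update(object = object, formula. = formula., data = object$model, ...,
         evaluate = evaluate)
}

probit  <- glm(y ~ x1 + x2, data = dt, family = binomial(link = "probit"))
nprobit <- update_nested(probit, . ~ . - x1)

nobs(probit)   # 190: the 10 rows with NA in x1 are dropped
nobs(nprobit)  # 190: same rows, although x1 is no longer in the formula

# Both fits now use the same observations, so the score test can run.
anova(nprobit, probit, test = "Rao")
```

Note that this relies on the first model having been fitted with the default `model = TRUE`; if the model frame was not stored, `object$model` is `NULL` and the wrapper has nothing to reuse.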

Another suggested approach is to call na.omit on the data frame before calling lm(). At first I thought this would be impractical when working with a large data frame (e.g. 1000 columns) and a large number of variables across the various specifications (e.g. 15 variables), though not because of speed. Such an approach requires manual bookkeeping of which variables should be cleaned of NAs and which should not, which is exactly what the OP seems to be trying to avoid. The biggest drawback is that you must always keep the formula in sync with the subsetted data frame.

This, however, can be overcome quite easily, as it turns out:

    data(mtcars)
    for(i in 1:ncol(mtcars)) mtcars[i, i] <- NA
    nobs( xa <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb, mtcars) )
    ## [1] 21
    nobs( xb <- update(xa, . ~ . - cyl) )  ## not nested
    ## [1] 22
    nobs( xb <- update_nested(xa, . ~ . - cyl) )  ## nested
    ## [1] 21
    nobs( xc <- update(xa, . ~ . - cyl, data = na.omit(mtcars[ , all.vars(formula(xa))])) )  ## nested
    ## [1] 21
    all.equal(xb, xc)
    ## [1] "Component "call": target, current do not match when deparsed"
    identical(xb[-10], xc[-10])
    ## [1] TRUE
    anova(xa, xc)
    ## Analysis of Variance Table
    ##
    ## Model 1: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
    ## Model 2: mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
    ##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
    ## 1     10 104.08
    ## 2     11 104.42 -1  -0.34511 0.0332 0.8591
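If the na.omit route is preferred, the `all.vars` trick can be pulled into a small helper so that the formula and the cleaned data frame cannot drift apart. This is only a sketch: `complete_cases_for` is a hypothetical name, not an existing function, and it assumes every variable in the formula is a column of the data frame (so no `.` shorthand in the formula):

```r
# Hypothetical helper: keep only the columns named in `formula`,
# then drop every row that has an NA in any of them.
complete_cases_for <- function(formula, data) {
  vars <- all.vars(formula)
  na.omit(data[, vars, drop = FALSE])
}

data(mtcars)
mtcars[1, "cyl"] <- NA

full_formula <- mpg ~ cyl + disp
d <- complete_cases_for(full_formula, mtcars)

# Fit the largest model on the cleaned data; a plain update() then
# stays on the same rows because `d` no longer contains any NA.
fit_full    <- lm(full_formula, data = d)
fit_reduced <- update(fit_full, . ~ . - cyl)

nobs(fit_full)     # 31
nobs(fit_reduced)  # 31
anova(fit_reduced, fit_full)
```

The design choice here is to clean the data once, based on the largest formula, so that every nested model derived from it inherits the same rows.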