Use stepAIC in model list

Question

Use stepAIC in model list

I want to do stepwise regression using AIC in the list of linear models. the idea is to use an e-list of linear models, and then apply stepAIC to each element of the list. He is failing.

Hi guys, I tried to track the problem. I seem to have found a problem. However, I do not understand the reason. Try the code to see the difference between the three cases.

require(MASS) n<-30 x1<-rnorm(n, mean=0, sd=1) #create rv x1 x2<-rnorm(n, mean=1, sd=1) x3<-rnorm(n, mean=2, sd=1) epsilon<-rnorm(n,mean=0,sd=1) # random error variable dat<-as.data.frame(cbind(x1,x2,x3,epsilon)) # combine to a data frame dat$id<-c(rep(1,10),rep(2,10),rep(3,10)) # y is combination from all three x and a random uniform variable dat$y<-x1+x2+x3+epsilon # apply lm() only resulting in a list of models dat.lin.model.lst<-lapply(split(dat,dat$id),function(d) lm(y~x1+x2+x3,data=d)) stepAIC(dat.lin.model.lst[[1]]) # FAIL!!! # apply function stepAIC(lm())- works dat.lin.model.stepAIC.lst<-lapply(split(dat,dat$id),function(d) stepAIC(lm(y~x1+x2+x3,data=d))) # create model for particular group with id==1 k<-which(dat$id==1) # manually select records with id==1 lin.model.id1<-lm(dat$y[k]~dat$x1[k]+dat$x2[k]+dat$x3[k]) stepAIC(lin.model.id1) # check stepAIC - works!

I am sure that stepAIC () needs the raw data from data.frame "dat". This is what I thought about before. (I hope I'm right.) But there is no parameter in stepAIC () where I can pass the original data frame. Obviously, for simple models that are not wrapped in a list, just go through the model. (last three lines in the code) Therefore, I wonder:
Q1: How does stepAIC know where to find the original "dat" data (not just the model data that is passed as a parameter)?
Q2: How can I know that there is another parameter in stepAIC () that is not explicitly listed on the help pages? (maybe my english is too bad to find)
Q3: How to pass this parameter stepAIC ()?

It should be somewhere in the environment of the apply function and pass data. Either lm () or stepAIC (), and the pointer / link to the raw data must be lost somewhere. I don’t understand very well what the environment is in R. For me it was like isolating local variables from global ones. But maybe this is more complicated. Anyone who can explain this to me in relation to the problem above? Honestly, I don't read much from the R documentation . Any better understanding will help me. Thanks.

OLD: I have data in a dataframe df that can be divided into several subgroups. To do this, I created a groupID with the name df $ id. lm () returns the coefficient, as expected for the first subgroup. I want to do stepwise regression using AIC as a criterion for each subgroup separately. I use lmList {lme4}, which leads to a model for each subgroup (id). But if I use stepAIC {MASS} for list items, it throws an error. See below.

So the question is: what is the error in my procedure / syntax? I get results for individual models, but not for those created using lmList. Does lmList () store different model information than lm ()?
But the help says: class "lmList": a list of objects of the lm class with a common model.

 >lme4.list.lm<-lmList(formula=Scherkraft.N~Gap.um+Standoff.um+Voidflaeche.px |df$id,data = df) >lme4.list.lm[[1]] Call: lm(formula = formula, data = data) Coefficients: (Intercept) Gap.um Standoff.um Voidflaeche.px 62.306133 -0.009878 0.026317 -0.015048 >stepAIC(lme4.list.lm[[1]], direction="backward") #stepAIC on first element on the list of linear models Start: AIC=295.12 Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px Df Sum of Sq RSS AIC - Standoff.um 1 2.81 7187.3 293.14 - Gap.um 1 29.55 7214.0 293.37 <none> 7184.4 295.12 - Voidflaeche.px 1 604.38 7788.8 297.97 Error in terms.formula(formula, data = data) : 'data' argument is of the wrong type

Obviously something is not working with the list. But I have no idea what it could be. Since I tried to do the same with the base package that creates the same model (at least the same coefficients). The results are shown below:

 >lin.model<-lm(Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px,df[which(df$id==1),]) # id is in order, so should be the same subgroup as for the first list element in lmList Coefficients: (Intercept) Gap.um Standoff.um Voidflaeche.px 62.306133 -0.009878 0.026317 -0.015048

Well, this is what I get with stepAIC on my linear .model. As far as I know, the akaike information criterion can be used to assess which model balances better between fit and generalization, given some data.

 >stepAIC(lin.model,direction="backward") Start: AIC=295.12 Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px Df Sum of Sq RSS AIC - Standoff.um 1 2.81 7187.3 293.14 - Gap.um 1 29.55 7214.0 293.37 <none> 7184.4 295.12 - Voidflaeche.px 1 604.38 7788.8 297.97 Step: AIC=293.14 Scherkraft.N ~ Gap.um + Voidflaeche.px Df Sum of Sq RSS AIC - Gap.um 1 28.51 7215.8 291.38 <none> 7187.3 293.14 - Voidflaeche.px 1 717.63 7904.9 296.85 Step: AIC=291.38 Scherkraft.N ~ Voidflaeche.px Df Sum of Sq RSS AIC <none> 7215.8 291.38 - Voidflaeche.px 1 795.46 8011.2 295.65 Call: lm(formula = Scherkraft.N ~ Voidflaeche.px, data = df[which(df$id == 1), ]) Coefficients: (Intercept) Voidflaeche.px 71.7183 -0.0151

I read from the conclusion that I should use the model: Scherkraft.N ~ Voidflaeche.px, because this is the minimum AIC. Well, it would be nice if someone could briefly describe the way out. My understanding of stepwise regression (subject to reverse cancellation) is that all regressors are included in the original model. Then the least important is eliminated. The criterion for the decision is AIC. and so on ... Somehow I am having problems to correctly interpret the tables. It would be nice if someone could confirm my interpretation. "-" (minus) means excluded regression. In the upper part there is a “starting” model and in the table below RSS and AIC are calculated for possible exceptions. So, the first row in the first table speaks about the Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px model - Standoff.um will lead to the creation of AIC 293.14. Choose one without Standoff.um: Scherkraft.N ~ Gap.um + Voidflaeche.px

EDIT:
I replaced lmList {lme4} with dlply () to create a list of models. However, stepAIC does not cope with the list. It gives another error. In fact, I believe this is a problem with the data step that the AIC must go through. I was wondering how it calculates the AIC value for each step only from the model data. I will take the initial data to build the models, leaving each regressor every time. In this regard, I would calculate AIC and compare. So how does stepAIC work if it does not have access to the source data. (I cannot see the parameter where I pass the initial stepAIC data). However, I do not know why it works with a simple model, but not with a model wrapped in a list.

 >model.list.all <- dlply(df, .id, function(x) {return(lm(Scherkraft.N~Gap.um+Standoff.um+Voidflaeche.px,data=x)) }) >stepAIC(model.list.all[[1]]) Start: AIC=295.12 Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px Df Sum of Sq RSS AIC - Standoff.um 1 2.81 7187.3 293.14 - Gap.um 1 29.55 7214.0 293.37 <none> 7184.4 295.12 - Voidflaeche.px 1 604.38 7788.8 297.97 Error in is.data.frame(data) : object 'x' not found

+3

r regression lme4

Sebastian Feb 06 '12 at 2:04

source share

2 answers

It seems that stepAIC leaves the loop environment (i.e. in the global environment) in order to search for the data it needs, I trick it with the assignment function:

  results <- do.call(rbind, lapply(response, function (i) { assign("i", response, envir = .GlobalEnv) mdl <- gls(as.formula(paste0(i,"~",paste(expvar, collapse = "+")), data= parevt, correlation = corARMA(p=1,q=1,form= ~as.integer(Year)), weights= varIdent(~1/Linf_var), method="ML") mdl <- stepAIC(mdl, direction ="backward") }))

+3

Xochitl C. Feb 03 '15 at 15:53

source share

Aaron · Accepted Answer · 2012-04-23T14:32:33+0000

I'm not sure what might change in version to make debugging so difficult, but one solution would be to use do.call , which evaluates the expressions in a call before executing it. This means that instead of saving only d in the call, so that update and stepAIC need to find d in order to do its job, it stores a complete representation of the data frame itself.

That is do

 do.call("lm", list(y~x1+x2+x3, data=d))

instead

 lm(y~x1+x2+x3, data=d)

You can see what he is trying to do by looking at the call element of the model, perhaps like this:

 dat.lin.model.lst <- lapply(split(dat, dat$id), function(d) do.call("lm", list(y~x1+x2+x3, data=d)) ) dat.lin.model.lst[[1]]$call

You can also make your list of data frames in the global environment, and then build a call so that update and stepAIC look for each data frame in turn, since their environment chains always return to the global environment; eg:

 dats <- split(dat, dat$id) dat.lin.model.list <- lapply(seq_along(dats), function(d) do.call("lm", list(y~x1+x2+x3, data=call("[[", quote(dats),i))) )

To find out what has changed, run dat.lin.model.lst[[1]]$call again.

Use stepAIC in model list

More articles: