Alternatives to bestglm for a multi-variable dataset

R version 2.15.0 (2012-03-30), RStudio 0.96.316, Windows XP (latest updates)

I have a dataset with 40 variables and 15,000 observations. I would like to use bestglm to find possible good models (logistic regression). I tried bestglm, but it does not work for such a medium-sized dataset: after several tests, I think bestglm fails once there are more than roughly 30 variables, at least on my machine (4 GB RAM, dual core).

You can test bestglm's limits yourself:

    library(bestglm)

    bestBIC_test <- function(number_of_vars) {
      # Simulate a data frame for logistic regression
      glm_sample <- as.data.frame(matrix(rnorm(100 * number_of_vars), 100))
      # Recode the last column into a 0/1 response
      glm_sample[, number_of_vars][glm_sample[, number_of_vars] > mean(glm_sample[, number_of_vars])] <- 1
      glm_sample[, number_of_vars][glm_sample[, number_of_vars] != 1] <- 0
      # Try to find the best model
      bestBIC <- bestglm(glm_sample, IC = "BIC", family = binomial)
    }

    # Test bestglm with an increasing number of variables
    bestBIC_test(10)  # OK, runs
    bestBIC_test(20)  # OK, runs
    bestBIC_test(25)  # OK, runs
    bestBIC_test(28)  # Error: cannot allocate vector of size 1024.0 Mb
    bestBIC_test(30)  # Error: cannot allocate vector of size 2.0 Gb
    bestBIC_test(40)  # Error in rep(-Inf, 2^p) : invalid 'times' argument

Are there any alternatives that I can use in R to look for possible good models?

2 answers

Well, for starters, an exhaustive search for the best subset of 40 variables requires fitting 2^40 models, which is more than a trillion. This is probably your problem.

Exhaustive best-subset search is usually not considered feasible for more than 20 or so variables.

Your best bet is something like forward stepwise selection, which fits on the order of (40^2 + 40) / 2 models, i.e. roughly 800.
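
For example, forward selection can be done with step() from base R. A minimal sketch, assuming your data frame is called dat and its 0/1 response column is named y (both placeholder names, not from the question):

    # Forward stepwise logistic regression with step()
    null_model <- glm(y ~ 1, data = dat, family = binomial)  # intercept-only start
    full_scope <- reformulate(setdiff(names(dat), "y"))      # all predictors as the upper scope
    fwd <- step(null_model, scope = full_scope,
                direction = "forward",
                k = log(nrow(dat)))                          # k = log(n) gives a BIC-style penalty
    summary(fwd)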

Or even BETTER (best of all, in my opinion): regularized logistic regression using the lasso, via the glmnet package.
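
A minimal sketch of that approach, again assuming dat holds the predictors in columns 1:39 and a 0/1 response in column 40 (layout assumed from the question):

    library(glmnet)
    x <- as.matrix(dat[, 1:39])                                # glmnet wants a numeric matrix
    y <- dat[, 40]                                             # 0/1 response
    cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # alpha = 1 is the lasso
    coef(cv_fit, s = "lambda.min")                             # nonzero coefficients = selected variables

The lasso shrinks some coefficients exactly to zero, so the fit at the cross-validated lambda does variable selection and estimation in one step.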

Good overview here.


You can also look at the caret package, which has tools for model selection as well. I was able to fit a model with 40 variables and 15,000 observations without problems:

    number_of_vars <- 40
    dat <- as.data.frame(matrix(rnorm(15000 * number_of_vars), 15000))
    # Recode the last column into a 0/1 response
    dat[, number_of_vars][dat[, number_of_vars] > mean(dat[, number_of_vars])] <- 1
    dat[, number_of_vars][dat[, number_of_vars] != 1] <- 0

    library(caret)
    result <- train(dat[, 1:39], dat[, 40], family = "binomial", method = "glm")
    result$finalModel
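
One caveat, which is my addition rather than part of the original answer: with a numeric 0/1 outcome, train() treats this as a regression problem. If you want caret's classification machinery (accuracy, kappa, class probabilities), convert the outcome to a factor first, roughly like this:

    # Sketch: a factor outcome makes caret run in classification mode
    dat[, 40] <- factor(dat[, 40], labels = c("class0", "class1"))
    result <- train(dat[, 1:39], dat[, 40], method = "glm")  # two-class glm is fit as binomial
    result$finalModel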

For finer control over model fitting, I would recommend its extensive documentation.

