R error - cv.glmnet: matrices must have the same number of columns

When running the R function cv.glmnet from the glmnet package on large sparse data sets, the following error often occurs:

# Error: Matrices must have same number of columns in .local(x, y, ...) 

I reproduced the error with random data:

    library(glmnet)

    set.seed(10)
    X <- matrix(rbinom(5000, 1, 0.1), nrow=1000, ncol=5)
    X[, 1] <- 0
    X[1, 1] <- 1   # column 1 now has a single nonzero entry
    Y <- rep(0, 1000)
    Y[c(1:20)] <- 1
    model <- cv.glmnet(x=X, y=Y, family="binomial", alpha=0.9, standardize=TRUE, nfolds=4)

This may be due to the initial variable screening (based on the inner product of X and Y). Instead of fixing the coefficient at zero, glmnet drops the variable from the X matrix, and this is done separately for each of the cross-validation folds. If the variable is dropped in some folds and kept in others, the error appears.

Increasing nfolds sometimes helps. This is consistent with the hypothesis, since a larger nfolds means each model is fit on a larger training subset, leaving less chance that the variable is dropped in any one of them.
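To see why the folds can disagree about column 1, here is a minimal base R sketch. It does not call glmnet; the fold assignment below is only an illustrative stand-in for whatever split cv.glmnet makes internally, and the names foldid and nonzero_in_training are chosen for this example. It counts how many nonzero entries column 1 has in the training rows of each fold.

```r
set.seed(10)
X <- matrix(rbinom(5000, 1, 0.1), nrow = 1000, ncol = 5)
X[, 1] <- 0
X[1, 1] <- 1   # same degenerate column as in the example above

nfolds <- 4
# Balanced random fold labels (an assumption, used here only for illustration)
foldid <- sample(rep(seq_len(nfolds), length.out = nrow(X)))

# For each fold k, the training rows are those with foldid != k.
# Count the nonzero entries of column 1 among those training rows.
nonzero_in_training <- sapply(seq_len(nfolds), function(k) {
  sum(X[foldid != k, 1] != 0)
})
print(nonzero_in_training)
```

Whichever fold holds out row 1 trains on a column that is entirely zero, while every other fold still sees the single 1. That is the kind of fold-to-fold difference the hypothesis relies on, although it does not by itself prove that screening is the cause.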

A few additional notes:

The error appears only for alpha close to 1 (alpha=1 is equivalent to L1 regularization) and only when standardization is used. It does not occur with family="gaussian".

Any idea what might be going on?

1 answer

This example is problematic because one variable has a single 1 and is zero everywhere else. This is a case where logistic regression can diverge (if not regularized), since driving this coefficient to infinity (plus or minus, depending on the response) will predict this observation perfectly and not affect anything else.
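The divergence can be seen with an unregularized fit in base R. This is a minimal sketch using glm rather than glmnet; the names x1, y and fit are illustrative, and the data mimic the question's setup (a predictor with a single 1, whose observation has response 1).

```r
# Predictor with a single 1; the matching observation has response 1
x1 <- c(1, rep(0, 999))
y <- rep(0, 1000)
y[1:20] <- 1

# Unregularized logistic regression; glm typically warns here about
# fitted probabilities of 0 or 1, so the warnings are suppressed
fit <- suppressWarnings(glm(y ~ x1, family = binomial()))

# The x1 coefficient is very large: the fit keeps pushing it upward
# until the iteration stops, because no finite value is optimal
print(coef(fit))
```

The intercept settles at a finite value determined by the other 999 observations, while the x1 coefficient only stops growing because the iterations end.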

The model here is regularized, so this should not happen, but it still causes problems. I found that by making alpha smaller (toward ridge, e.g. 0.5), the problem went away.

The real problem here is with the lambda sequence used for each fold, but this gets a bit technical. I will try to make a fix to cv.glmnet that eliminates this problem.

Trevor Hastie (maintainer of glmnet)
