When running the R function cv.glmnet from the glmnet package on large sparse data sets, I often get the following error:
# Error: Matrices must have same number of columns in .local(x, y, ...)
I have reproduced the error with random data:

    library(glmnet)

    set.seed(10)
    X <- matrix(rbinom(5000, 1, 0.1), nrow=1000, ncol=5)
    X[, 1] <- 0      # make the first column all zeros ...
    X[1, 1] <- 1     # ... except for a single entry
    Y <- rep(0, 1000)
    Y[c(1:20)] <- 1
    model <- cv.glmnet(x=X, y=Y, family="binomial", alpha=0.9, standardize=TRUE, nfolds=4)
This may be related to an initial variable screening (based on the inner product of X and Y). Instead of fixing the coefficient at zero, glmnet appears to drop the variable from the X matrix, and this is done separately for each cross-validation fold. If the variable is dropped in some folds and kept in others, the resulting matrices have different numbers of columns and the error appears.
Increasing nfolds sometimes helps. This is consistent with the hypothesis: with more folds, each fit uses a larger share of the data, so there is less chance that the variable is dropped in any of them.
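To illustrate the hypothesis without calling glmnet at all, here is a minimal base R sketch that checks which columns become constant once a fold is held out. It assumes folds are assigned at random with roughly equal sizes; the names foldid, constant_in_training and surviving are introduced here only for illustration, and the check says nothing about what glmnet does internally.

```r
set.seed(10)
X <- matrix(rbinom(5000, 1, 0.1), nrow = 1000, ncol = 5)
X[, 1] <- 0
X[1, 1] <- 1

nfolds <- 4
# Assumption: random, roughly equal-sized fold assignment
foldid <- sample(rep(seq_len(nfolds), length.out = nrow(X)))

# For each fold, take the rows used for fitting (everything outside
# the fold) and flag the columns that take a single value there.
constant_in_training <- sapply(seq_len(nfolds), function(k) {
  train <- X[foldid != k, , drop = FALSE]
  apply(train, 2, function(col) length(unique(col)) == 1)
})
dimnames(constant_in_training) <- list(paste0("V", 1:5),
                                       paste0("fold", seq_len(nfolds)))
print(constant_in_training)

# Number of non-constant columns left in each fitting subset
surviving <- colSums(!constant_in_training)
print(surviving)
```

Because the first column has a single nonzero entry (row 1), it is constant in exactly one fitting subset: the one that holds out row 1. If a constant column is dropped there and kept elsewhere, the per-fold matrices would differ in their number of columns, which matches the error message.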
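As a rough check on this reasoning, the chance that a sparse column ends up all zero in some fitting subset can be computed directly. The sketch below assumes random folds of exactly equal size (N divisible by nfolds) and a column with m nonzero entries; p_all_zero is a hypothetical helper, not part of glmnet.

```r
# Probability that all m nonzero rows of a column land in the same fold,
# so that the column is all zero in that fold's fitting subset.
# Assumes N is divisible by nfolds and folds are assigned at random.
p_all_zero <- function(N, nfolds, m) {
  s <- N / nfolds                      # rows per fold
  nfolds * choose(s, m) / choose(N, m)
}

for (k in c(4, 10, 20)) {
  cat("nfolds =", k, " m = 2:", p_all_zero(1000, k, 2), "\n")
}

# With a single nonzero entry the probability is 1 for any nfolds,
# because the fold that holds out that row always loses it.
p_all_zero(1000, 4, 1)
```

Under these assumptions the probability falls as nfolds grows whenever m is at least 2. For a column with a single nonzero entry, as in the reproduction above, increasing nfolds would not be expected to help.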
A few additional notes:
The error appears only for alpha close to 1 (alpha=1 is equivalent to L1 regularization) and only when standardization is used. It does not appear for family="gaussian".
What do you think could be going on?