I am working through the logistic regression described in the book "An Introduction to Statistical Learning with Applications in R" by James, Witten, Hastie, and Tibshirani (2013).
In particular, I am fitting the binary classification model to the Wage data set from the R ISLR package, as described in §7.8.1.
The predictor age (expanded into a degree-4 polynomial basis) is regressed against the binary outcome wage > 250. Age is then plotted against the predicted probability of the "true" class.
The model in R is as follows:
library(ISLR)   # provides the Wage data set
attach(Wage)    # so that age and wage can be referenced directly, as in the book's lab
fit = glm(I(wage > 250) ~ poly(age, 4), data = Wage, family = binomial)
agelims = range(age)
age.grid = seq(from = agelims[1], to = agelims[2])
preds = predict(fit, newdata = list(age = age.grid), se = T)
pfit = exp(preds$fit) / (1 + exp(preds$fit))
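To be explicit, the last line only maps the predictions from the logit (log-odds) scale back to probabilities. A minimal numpy sketch of that same inverse-logit step (the function name inv_logit and the example values are mine, just for illustration):

import numpy as np

def inv_logit(logits):
    # maps values on the logit scale to probabilities in (0, 1),
    # the same transformation as exp(fit) / (1 + exp(fit)) in the R code above
    return np.exp(logits) / (1.0 + np.exp(logits))

# example: a logit of 0 corresponds to a probability of 0.5
print(inv_logit(np.array([-2.0, 0.0, 2.0])))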
Full code (author's site): http://www-bcf.usc.edu/~gareth/ISL/Chapter%207%20Lab.txt
Corresponding plot from the book: http://www-bcf.usc.edu/~gareth/ISL/Chapter7/7.1.pdf (right)
I tried fitting the model to the same data in scikit-learn:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

# df holds the Wage data as a pandas DataFrame
poly = PolynomialFeatures(4)
X = poly.fit_transform(df.age.values.reshape(-1, 1))
y = (df.wage > 250).astype(int).values   # 0/1 labels for wage > 250
clf = LogisticRegression()
clf.fit(X, y)
X_test = poly.fit_transform(np.arange(df.age.min(), df.age.max()).reshape(-1, 1))
prob = clf.predict_proba(X_test)
Then I plotted the predicted probabilities of the True class against the age range, but the resulting plot looks very different from the book's. (I am not even talking about the confidence intervals or the rug plot, just the probability curve.) Did I miss something here?
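For completeness, this is roughly how I produce the plot in Python (a minimal sketch assuming matplotlib; prob[:, 1] is the predicted probability of the positive class, i.e. wage > 250, and age_grid is just a name for the same grid used to build X_test above):

import numpy as np
import matplotlib.pyplot as plt

age_grid = np.arange(df.age.min(), df.age.max())  # same grid as used for X_test
plt.plot(age_grid, prob[:, 1])                    # predicted P(wage > 250) against age
plt.xlabel("age")
plt.ylabel("P(wage > 250)")
plt.show()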