I am working through the logistic regression described in the book "An Introduction to Statistical Learning with Applications in R" by James, Witten, Hastie, and Tibshirani (2013).
In particular, I am fitting the binary classification model to the Wage data set from the R ISLR package, as described in §7.8.1.
The predictor age (expanded into a degree-4 polynomial basis) is regressed against the binary outcome wage > 250. Age is then plotted against the predicted probability of the "true" class.
The model in R is as follows:
library(ISLR)   # provides the Wage data set
attach(Wage)    # so that age and wage can be referenced directly, as in the book's lab
fit = glm(I(wage > 250) ~ poly(age, 4), data = Wage, family = binomial)
agelims = range(age)
age.grid = seq(from = agelims[1], to = agelims[2])
preds = predict(fit, newdata = list(age = age.grid), se = T)
pfit = exp(preds$fit) / (1 + exp(preds$fit))
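To be explicit, the last line only maps the predictions from the logit (log-odds) scale back to probabilities. A minimal numpy sketch of that same inverse-logit step (the function name inv_logit and the example values are mine, just for illustration):

import numpy as np

def inv_logit(logits):
    # maps values on the logit scale to probabilities in (0, 1),
    # the same transformation as exp(fit) / (1 + exp(fit)) in the R code above
    return np.exp(logits) / (1.0 + np.exp(logits))

# example: a logit of 0 corresponds to a probability of 0.5
print(inv_logit(np.array([-2.0, 0.0, 2.0])))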
Full code (author's site): http://www-bcf.usc.edu/~gareth/ISL/Chapter%207%20Lab.txt
Corresponding plot from the book: http://www-bcf.usc.edu/~gareth/ISL/Chapter7/7.1.pdf (right)
I tried fitting the model to the same data in scikit-learn:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

# df holds the Wage data as a pandas DataFrame
poly = PolynomialFeatures(4)
X = poly.fit_transform(df.age.values.reshape(-1, 1))
y = (df.wage > 250).astype(int).values   # 0/1 labels for wage > 250
clf = LogisticRegression()
clf.fit(X, y)
X_test = poly.fit_transform(np.arange(df.age.min(), df.age.max()).reshape(-1, 1))
prob = clf.predict_proba(X_test)
Then I plotted the predicted probabilities of the True class against the age range, but the resulting plot looks very different from the book's. (I am not even talking about the confidence intervals or the rug plot, just the probability curve.) Did I miss something here?
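For completeness, this is roughly how I produce the plot in Python (a minimal sketch assuming matplotlib; prob[:, 1] is the predicted probability of the positive class, i.e. wage > 250, and age_grid is just a name for the same grid used to build X_test above):

import numpy as np
import matplotlib.pyplot as plt

age_grid = np.arange(df.age.min(), df.age.max())  # same grid as used for X_test
plt.plot(age_grid, prob[:, 1])                    # predicted P(wage > 250) against age
plt.xlabel("age")
plt.ylabel("P(wage > 250)")
plt.show()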