Can scipy.stats identify and mask apparent outliers?

With scipy.stats.linregress, I perform a simple linear regression on some sets of highly correlated x, y experimental data and initially visually check each x, y scatter plot for outliers. More generally (i.e., programmatically) is there a way to identify and mask outliers?

+15
python scipy statistics linear-regression
Apr 19
4 answers

The statsmodels package has what you need. Take a look at this small piece of code and its output:

    # Imports #
    import statsmodels.api as smapi
    import statsmodels.graphics.regressionplots as smgraphics
    # Make data #
    x = list(range(30))
    y = [yy * 10 for yy in x]
    # Add outlier #
    x.insert(6, 15)
    y.insert(6, 220)
    # Make graph (endog first, i.e. regress y on x) #
    regression = smapi.OLS(y, x).fit()
    figure = smgraphics.plot_fit(regression, 0)
    # Find outliers: column 2 of outlier_test() is the Bonferroni-corrected p-value #
    test = regression.outlier_test()
    outliers = ((x[i], y[i]) for i, t in enumerate(test.values) if t[2] < 0.5)
    print('Outliers:', list(outliers))

Example figure 1

Outliers: [(15, 220)]

Edit

In newer versions of statsmodels things have changed a bit. Here is a new piece of code that shows the same kind of outlier detection.

    # Imports #
    from random import random
    from statsmodels.formula.api import ols
    import statsmodels.graphics.regressionplots as smgraphics
    # Make data #
    x = list(range(30))
    y = [yy * (10 + random()) + 200 for yy in x]
    # Add outlier #
    x.insert(6, 15)
    y.insert(6, 220)
    # Make fit #
    regression = ols("data ~ x", data=dict(data=y, x=x)).fit()
    # Find outliers: column 2 of outlier_test() is the Bonferroni-corrected p-value #
    test = regression.outlier_test()
    outliers = ((x[i], y[i]) for i, t in enumerate(test.iloc[:, 2]) if t < 0.5)
    print('Outliers:', list(outliers))
    # Figure #
    figure = smgraphics.plot_fit(regression, 1)
    # Add line #
    smgraphics.abline_plot(model_results=regression, ax=figure.axes[0])

Example figure 2

Outliers: [(15, 220)]

+20
Apr 23 '13 at 9:43

scipy.stats doesn't have anything directly for outliers, so here are some links and some advertising for statsmodels (which is a statistics complement to scipy.stats):

for identifying outliers:

http://jpktd.blogspot.ca/2012/01/influence-and-outlier-measures-in.html

http://jpktd.blogspot.ca/2012/01/anscombe-and-diagnostic-statistics.html

http://statsmodels.sourceforge.net/devel/generated/statsmodels.stats.outliers_influence.OLSInfluence.html
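
As a sketch of what the linked OLSInfluence measures look like in use (my own example, not from the answer; the data are invented to mirror the first answer):

    # Influence/outlier measures via OLSInfluence (data invented for illustration)
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import OLSInfluence

    x = np.arange(30, dtype=float)
    y = 10 * x
    y[6] = 220.0  # plant an outlier
    results = sm.OLS(y, sm.add_constant(x)).fit()

    influence = OLSInfluence(results)
    print(influence.summary_frame().head())  # studentized residuals, DFFITS, Cook's distance, ...
    print(influence.cooks_distance[0][:8])   # Cook's distance for the first few observations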

Instead of masking, it is better to use robust estimation:

http://statsmodels.sourceforge.net/devel/rlm.html

with examples, where unfortunately the plots are currently not displayed: http://statsmodels.sourceforge.net/devel/examples/generated/tut_ols_rlm.html

RLM downweights outliers. The estimation results have a weights attribute, and for outliers the weights are smaller than 1. This can also be used to find outliers. RLM is also more robust if there are several outliers.
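
A rough sketch of that idea (my own example, not from the answer; the data are invented and the 0.5 weight cutoff is an arbitrary choice):

    # Flag outliers via RLM weights (sketch; the 0.5 cutoff is arbitrary)
    import numpy as np
    import statsmodels.api as sm

    x = np.arange(30, dtype=float)
    y = 10 * x
    y[6] = 220.0  # plant an outlier

    rlm_results = sm.RLM(y, sm.add_constant(x), M=sm.robust.norms.HuberT()).fit()
    # Points the robust fit downweights are the outlier candidates.
    candidates = [(x[i], y[i]) for i, w in enumerate(rlm_results.weights) if w < 0.5]
    print('Candidate outliers:', candidates)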

+7
Apr 20

More generally (i.e., programmatically), is there a way to identify and mask outliers?

There are various outlier detection algorithms; scikit-learn implements several of them.

[Disclaimer: I'm the author of scikit-learn.]
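
For example, a minimal sketch (my own, not from the answer; LocalOutlierFactor is just one of several available detectors, and the data are invented to match the earlier answers):

    # Flag points that fall off the x-y trend with LocalOutlierFactor
    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    x = np.arange(30, dtype=float)
    y = 10 * x
    y[6] = 220.0  # plant an outlier

    X = np.column_stack([x, y])
    labels = LocalOutlierFactor(n_neighbors=5).fit_predict(X)  # -1 marks outliers
    print('Outliers:', X[labels == -1])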

+6
Apr 19

You can also limit the effect of outliers using scipy.optimize.least_squares. In particular, see the f_scale parameter:

Value of the soft margin between inlier and outlier residuals, default is 1.0. ... This parameter has no effect with loss='linear', but for other loss values it is of crucial importance.

On that page they compare three different fits: a plain least_squares call and two calls that use f_scale:
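
The snippet below, like the docs page, assumes fun, x0, t_train and y_train are already defined. A minimal stand-in (my assumption, loosely following the tutorial's exponential-decay setup) could be:

    # Hypothetical stand-ins for the tutorial's fun, x0, t_train, y_train
    import numpy as np
    from scipy.optimize import least_squares

    def model(p, t):
        return p[0] + p[1] * np.exp(p[2] * t)

    def fun(p, t, y):
        return model(p, t) - y  # residuals to be minimized

    rng = np.random.default_rng(0)
    t_train = np.linspace(0, 10, 60)
    y_train = model([0.5, 2.0, -1.0], t_train) + 0.1 * rng.standard_normal(60)
    y_train[::10] += 2.0  # plant some outliers
    x0 = np.ones(3)

With those definitions in place, the three fits are: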

    res_lsq = least_squares(fun, x0, args=(t_train, y_train))
    res_soft_l1 = least_squares(fun, x0, loss='soft_l1', f_scale=0.1, args=(t_train, y_train))
    res_log = least_squares(fun, x0, loss='cauchy', f_scale=0.1, args=(t_train, y_train))

Least squares comparison

As you can see, ordinary least squares is affected much more strongly by outliers, and it is worth experimenting with different loss functions in combination with different values of f_scale. The possible loss functions are (taken from the documentation):

    'linear'  : Gives a standard least-squares problem.
    'soft_l1' : The smooth approximation of l1 (absolute value) loss. Usually a good choice for robust least squares.
    'huber'   : Works similarly to 'soft_l1'.
    'cauchy'  : Severely weakens outliers influence, but may cause difficulties in optimization process.
    'arctan'  : Limits a maximum loss on a single residual, has properties similar to 'cauchy'.

The scipy cookbook contains a neat tutorial on robust nonlinear regression.

0
May 04 '17 at 14:57


