What method does outline = FALSE use to determine outliers?

In R, I used the parameter outline = FALSE to exclude outliers when drawing a box-and-whisker plot for a particular data set. It worked spectacularly, but I wonder how exactly it determines which elements are outliers.

boxplot(x, horizontal = TRUE, axes = FALSE, outline = FALSE) 
+4
3 answers

An "outlier" in the terminology of box-and-whisker plots is any point in the data set that falls farther than a specified distance from the box, by default 1.5 times the length of the box (roughly the interquartile range) beyond the lower or upper hinge (roughly the 0.25 and 0.75 quantiles). To get there, see ?boxplot.stats: first, look at the definition of out in the output:

out: the values of any data points which lie beyond the extremes of the whiskers (if(do.out)).

These are the "outliers".

Second, look at the definition of the whiskers, which are based on the coef parameter, which is 1.5 by default:

the whiskers extend to the most extreme data point which is no more than coef times the length of the box away from the box.

Finally, look at the definition of the "hinges", which are the ends of the box:

The two "hinges" are versions of the first and third quartile, i.e., close to quantile(x, c(1,3)/4).

Put them together and you get outliers defined (approximately) as points that lie more than 1.5 box lengths beyond the nearer quartile. For roughly symmetric data, that puts the cutoff about 2 box lengths from the median, or 4 times the distance between the median and the corresponding quartile. The reasons for these somewhat convoluted definitions are (I think) partly historical and partly the desire to have the components of the plot reflect actual values that are present in the data (rather than, say, a value halfway between two data points) as much as possible. (You would probably need to go back to the original literature referenced on the help page for full justifications and explanations.)
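As a rough check of the rule just described, the fences can be rebuilt by hand from the hinges and compared with what boxplot.stats reports. This is only a sketch: the vector x is made-up illustration data, and the variable names are not taken from the help page.

```r
# Hypothetical sample; any numeric vector would do.
x <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 30)

# The hinges are elements 2 and 4 of Tukey's five-number summary.
hinges  <- fivenum(x)[c(2, 4)]
box_len <- diff(hinges)   # length of the box
coef    <- 1.5            # default used by boxplot.stats

# Anything beyond coef box lengths from the box is reported as an outlier.
lower_fence <- hinges[1] - coef * box_len
upper_fence <- hinges[2] + coef * box_len
manual_out  <- x[x < lower_fence | x > upper_fence]

# Compare with the values reported by boxplot.stats.
stats_out <- boxplot.stats(x, coef = coef)$out
print(manual_out)
print(stats_out)
```

Both vectors should contain the same points, which suggests that nothing more elaborate than this fence rule is involved.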

The thing to keep in mind is that points defined as "outliers" by this algorithm are not necessarily outliers in the usual statistical sense (for example, points that are surprisingly extreme under a particular statistical model of the data). In particular, if you have a large data set, you will necessarily see a lot of "outliers" (one sign that you might want to switch to a more flexible graphical summary, such as a violin plot or beanplot).
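To illustrate the point about large data sets, here is a small sketch that counts the flagged points in simulated standard normal data, where no observation is unusual under the model. The seed and sample size are arbitrary choices for illustration.

```r
set.seed(1)          # arbitrary seed, only for reproducibility
n <- 1e5
z <- rnorm(n)        # well-behaved data with no "real" outliers

# Number of points beyond the default 1.5 * box length fences.
n_out <- length(boxplot.stats(z)$out)

# Theoretical share of a normal distribution beyond those fences.
q <- qnorm(c(0.25, 0.75))
expected_share <- 2 * pnorm(q[1] - 1.5 * diff(q))

print(n_out / n)
print(expected_share)
```

For normal data the expected share is roughly 0.7%, so a sample of this size should show several hundred "outliers" purely by construction.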

+7

For boxplot, outliers are the points that lie above or below the whiskers. By default, the whiskers extend to the most extreme data points that are no more than the interquartile range times the range argument away from the box. The default value for range is 1.5, but you can change it, and in doing so you also change which points are listed as outliers.

You can also see this with the boxplot.stats function, which performs the calculations used by the plot.

For example, if you have the following vector:

 v <- c(runif(10), -0.5, -1)
 boxplot(v)


By default, only -1 is considered an outlier. You can see it with boxplot.stats:

 boxplot.stats(v)$out
 [1] -1

But if you change the range argument (or the coef argument for boxplot.stats), then -1 is no longer considered an outlier:

 boxplot(v, range=2) 


 boxplot.stats(v, coef=2)$out
 numeric(0)
+2

Admittedly, this is not immediately apparent from ?boxplot. Look at the range parameter:

this determines how far the plot whiskers extend out from the box. If range is positive, the whiskers extend to the most extreme data point which is no more than range times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.

Thus the range value, together with the interquartile range and the box (given by the quartiles), determines where the whiskers end. And everything that lies outside the whiskers is an outlier.
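A short sketch of how range and outline interact, using plot = FALSE so that only the computed summary is returned. The data vector and the variable names are made up for illustration.

```r
x <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 30)   # hypothetical data

# Default rule: range = 1.5.
b_default <- boxplot(x, plot = FALSE)

# range = 0: whiskers reach the data extremes, so nothing is flagged.
b_zero <- boxplot(x, range = 0, plot = FALSE)

# outline = FALSE only affects drawing; the outliers are still computed.
b_hidden <- boxplot(x, outline = FALSE, plot = FALSE)

print(b_default$out)              # points beyond the whiskers
print(b_default$stats[c(1, 5), ]) # whisker ends under the default rule
print(b_zero$out)
print(b_zero$stats[c(1, 5), ])    # whisker ends with range = 0
print(b_hidden$out)
```

In other words, outline = FALSE does not apply a detection method of its own. It hides the points that the range rule has already placed beyond the whiskers.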

I will be the first to agree that this definition is not intuitive. Unfortunately, it is already well established.

+1