I have a List<int> and I need to remove outliers, so I want to use an approach where I take only the middle n. I want the middle in terms of values, not indices.
The correct way to eliminate outliers depends entirely on the statistical model that accurately describes the distribution of the data, and you have not told us what that is.
Assuming the data is normally (Gaussian) distributed, here is what you want to do.
First calculate the average. It's simple; it's just the sum divided by the number of elements.
Second, calculate the standard deviation. The standard deviation is a measure of how "spread out" the data is around the average. To calculate it:
- take the difference of each point from the average
- square each difference
- take the average of the squares - this is the variance
- take the square root of the variance - this is the standard deviation
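The steps above can be sketched in a few lines. This is a minimal illustration in Python rather than a definitive implementation; the function name `mean_and_std` and the sample values are assumptions made for the example, and the same arithmetic carries over to a `List<int>`.

```python
import math

def mean_and_std(values):
    """Return (average, standard deviation) of a non-empty list of numbers."""
    n = len(values)
    mean = sum(values) / n                              # sum divided by count
    squared_diffs = [(v - mean) ** 2 for v in values]   # squared difference from the average
    variance = sum(squared_diffs) / n                   # average of the squares
    return mean, math.sqrt(variance)                    # square root of the variance

mean, std = mean_and_std([2, 4, 4, 4, 5, 5, 7, 9])
print(mean, std)  # 5.0 2.0
```

Note that this divides by the number of elements, exactly as in the steps above. If the list is a sample from a larger population, dividing by n - 1 instead is the usual estimate.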
In a normal distribution, about 80% of the items are within 1.28 standard deviations of the average. For example, suppose the average is 50 and the standard deviation is 20. You would expect about 80% of the sample to fall between 50 - 1.28 * 20 = 24.4 and 50 + 1.28 * 20 = 75.6. You can then filter out the items in the list that fall outside this range.
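As a rough sketch of that filter, again in Python and under the assumption that the data really is normal: the hypothetical helper `keep_within` derives the multiplier from the desired coverage using the standard library's `NormalDist`, which for 80% comes out near 1.28, roughly the multiplier quoted above. The sample data is made up for illustration.

```python
from statistics import NormalDist, mean, pstdev

def keep_within(values, coverage=0.80):
    """Keep only the values inside the central `coverage` interval of a fitted normal."""
    m = mean(values)
    s = pstdev(values)                              # standard deviation, dividing by n
    k = NormalDist().inv_cdf(0.5 + coverage / 2)    # about 1.28 for 80% coverage
    low, high = m - k * s, m + k * s
    return [v for v in values if low <= v <= high]

data = [10, 48, 50, 52, 49, 51, 50, 90, 47, 53]
print(keep_within(data))  # [48, 50, 52, 49, 51, 50, 47, 53]
```

The average and standard deviation here are computed from the full list, including the extreme values that end up being dropped, so those values widen the interval that is used to judge them.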
Please note that this is not eliminating outliers. This is eliminating elements that lie more than a chosen number of standard deviations from the average, in order to keep an 80% interval around the average. In a normal distribution, you expect to see "outliers" on a regular basis. 99.73% of the items are within three standard deviations of the average, which means that if you have a thousand observations, it is perfectly normal to see two or three observations more than three standard deviations from the average! In fact, anywhere up to, say, five observations more than three standard deviations from the average in a set of a thousand observations probably does not indicate an outlier.
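To make the "two or three in a thousand" figure concrete, here is a small Python sketch. The helper names `tail_fraction` and `prob_at_least` are assumptions for the example, and the second one treats the observations as independent draws from a normal distribution.

```python
from math import comb
from statistics import NormalDist

def tail_fraction(k):
    """Fraction of a normal distribution more than k standard deviations from the average."""
    return 2 * (1 - NormalDist().cdf(k))

def prob_at_least(count, n, p):
    """Binomial probability of at least `count` hits in n independent observations."""
    return 1 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(count))

p = tail_fraction(3)          # roughly 0.0027, the complement of 99.73%
expected = 1000 * p
print(f"{expected:.1f}")      # 2.7

# Chance of seeing at least 3, 4, 5 or 6 such observations among a thousand
for count in range(3, 7):
    print(count, prob_at_least(count, 1000, p))
```

Under these assumptions, seeing a few observations beyond three standard deviations is the expected behavior of the model, not evidence that something is wrong with those points.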
I think you need to define carefully what you mean by an outlier and describe why you are trying to eliminate them. Things that look like outliers are potentially not outliers at all; they are real data that you should be paying attention to.
Also note that none of this analysis is correct if the assumption of a normal distribution is wrong! You can get into big trouble eliminating what look like outliers when in fact you have got the statistical model wrong. If the distribution has heavier tails than the normal distribution, then extreme values are common and are not actually outliers. Be careful! If your distribution is not normal, you need to tell us what the distribution is before we can recommend how to identify outliers and eliminate them.