Return the middle n (by value, not index) from a collection

I have a List<int> and I need to remove outliers, so I want to use an approach where I take only the middle n. I want the middle in terms of values, not index.

For example, given the following list, if I wanted the middle 80%, I would expect 11 and 100 to be removed.

11,22,22,33,44,44,55,55,55,100.

Is there a simple or built-in way to do this in LINQ?

+7
5 answers

I have a List<int> and I need to remove outliers, so I want to use an approach where I take only the middle n. I want the middle in terms of values, not index.

Correctly eliminating outliers depends entirely on the statistical model that accurately describes the distribution of the data, and you have not told us what that model is.

Assuming this is a normal (Gaussian) distribution, this is what you want to do.

First, calculate the mean. It's simple; it's just the sum divided by the number of elements.

Second, calculate the standard deviation. The standard deviation is a measure of how "spread out" the data is around the mean. To calculate it:

  • take the difference between each point and the mean
  • square each difference
  • take the average of the squares - this is the variance
  • take the square root of the variance - this is the standard deviation
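
The steps above can be sketched in C# roughly as follows. This is a minimal sketch, not code from the original answer: the class name Stats and the method names Mean and StandardDeviation are hypothetical, and it computes the population standard deviation (dividing by the count), which is what the steps describe. Both methods throw on an empty sequence, because Enumerable.Average does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper class; the names are illustrative only.
public static class Stats
{
    // Mean: the sum divided by the number of elements.
    public static double Mean(IEnumerable<int> values)
    {
        return values.Average();
    }

    // Population standard deviation, following the four steps in the text.
    public static double StandardDeviation(IEnumerable<int> values)
    {
        List<int> list = values.ToList();
        double mean = list.Average();

        double variance = list
            .Select(v => v - mean)   // difference between each point and the mean
            .Select(d => d * d)      // square each difference
            .Average();              // average of the squares = variance

        return Math.Sqrt(variance);  // square root of the variance
    }
}
```

For example, Stats.StandardDeviation(new[] { 2, 4, 4, 4, 5, 5, 7, 9 }) has a mean of 5 and a variance of 4, so it returns 2.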

In a normal distribution, 80% of the items are within about 1.28 standard deviations of the mean. For example, suppose the mean is 50 and the standard deviation is 20. You would expect 80% of the sample to fall between 50 - 1.28 * 20 = 24.4 and 50 + 1.28 * 20 = 75.6. You can then filter out the items in the list that fall outside this range.
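
Under the same normality assumption, the filtering step might look like the sketch below. The class name NormalFilter, the method WithinStandardDeviations, and its parameter k are hypothetical names chosen for illustration. k is the number of standard deviations to keep on either side of the mean; for roughly 80% coverage of a normal distribution, k is about 1.28.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper class; the names are illustrative only.
public static class NormalFilter
{
    // Keeps the values that lie within k population standard deviations of the mean.
    public static List<int> WithinStandardDeviations(IEnumerable<int> values, double k)
    {
        List<int> list = values.ToList();
        if (list.Count == 0) return list;   // nothing to filter

        double mean = list.Average();
        double sd = Math.Sqrt(list.Average(v => (v - mean) * (v - mean)));

        double low = mean - k * sd;
        double high = mean + k * sd;

        // Original order is preserved; only out-of-range values are dropped.
        return list.Where(v => v >= low && v <= high).ToList();
    }
}
```

For the list in the question the mean is 44.1 and the population standard deviation is about 23.8, so with k = 1.28 the kept range is roughly 13.6 to 74.6, which drops 11 and 100.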

Please note that this is not eliminating outliers. It is eliminating the items that are more than a fixed number of standard deviations from the mean, in order to obtain an 80% interval around the mean. In a normal distribution, you expect to see "outliers" on a regular basis. 99.73% of the items are within three standard deviations of the mean, which means that if you have a thousand observations, it is perfectly normal to see two or three observations more than three standard deviations away from the mean! In fact, anywhere up to, say, five observations more than three standard deviations from the mean, out of a thousand observations, probably does not indicate an outlier.

I think you need to define carefully what you mean by an outlier and describe why you are trying to eliminate them. Things that look like outliers are potentially not outliers at all; they are real data that you should be paying attention to.

Also note that none of this analysis is correct if the assumption of a normal distribution is wrong! You can run into big problems eliminating what look like outliers when in fact you have the statistical model wrong. If the model has heavier tails than the normal distribution, then outliers are common rather than rare. Be careful! If your distribution is not normal, you need to tell us what the distribution is before we can recommend how to identify outliers and eliminate them.

+11

You can use the Enumerable.OrderBy method to sort your list, then use Enumerable.Skip and Enumerable.Take, for example:

 var result = nums.OrderBy(x => x).Skip(1).Take(8); 

Where nums is your list of integers.

Working out which values to use as arguments for Skip and Take should look something like this, if you just want the "middle n values":

 nums.OrderBy(x => x).Skip((nums.Count - n) / 2).Take(n); 

However, when the result of (nums.Count - n) / 2 is not an integer, how do you want the code to behave?
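
One possible answer to that question is sketched below, as an assumption rather than anything the answer above prescribes. C# integer division truncates, so when the surplus nums.Count - n is odd, the extra element is dropped from the top end. The class name MiddleSelector and the method name MiddleN are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper class; the names are illustrative only.
public static class MiddleSelector
{
    // Returns the middle n values of nums after sorting by value.
    public static List<int> MiddleN(IReadOnlyList<int> nums, int n)
    {
        if (n < 0) throw new ArgumentOutOfRangeException(nameof(n));

        // Integer division truncates: with an odd surplus, one more
        // element is trimmed from the high end than from the low end.
        int skip = Math.Max(0, (nums.Count - n) / 2);

        return nums.OrderBy(x => x).Skip(skip).Take(n).ToList();
    }
}
```

With the ten values from the question, n = 8 trims 11 and 100. With n = 7 the surplus is 3, so the sketch trims 11 from the bottom and both 55 and 100 from the top. If you would rather trim the extra element from the bottom, round the skip count up instead.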

+4

Assuming you are not doing any weighted-average funny business:

    List<int> ints = new List<int>() { 11, 22, 22, 33, 44, 44, 55, 55, 55, 100 };
    int min = ints.Min();
    double range = ints.Max() - min;

    // Weight is each value's position within [min, max], scaled to 0..1.
    var results = ints.Select(o => new { IntegralValue = o, Weight = (o - min) / range });

    // Keep the values whose weight lies in the middle 80% of the range.
    var middle = results.Where(o => o.Weight >= .1 && o.Weight < .9);

Then you can filter by weight as needed. Remove the top / bottom n % as desired.

In your case:

 results.Where(o => o.Weight >= .1 && o.Weight < .9) 

Edit: As an extension method, because I like extension methods:

    public static class Lulz
    {
        public static List<int> MiddlePercentage(this List<int> ints, double Percentage)
        {
            int min = ints.Min();
            double range = ints.Max() - min;

            // All values equal: nothing to trim, and dividing by zero would give NaN weights.
            if (range == 0) return ints.ToList();

            var results = ints.Select(o => new { IntegralValue = o, Weight = (o - min) / range });
            double tolerance = (1 - Percentage) / 2;
            return results
                .Where(o => o.Weight >= tolerance && o.Weight < 1 - tolerance)
                .Select(o => o.IntegralValue)
                .ToList();
        }
    }

Usage:

    List<int> ints = new List<int>() { 11, 22, 22, 33, 44, 44, 55, 55, 55, 100 };
    var results = ints.MiddlePercentage(.8);

+2

Usually, if you want to exclude statistical outliers from a set of values, you calculate the arithmetic mean and standard deviation of the set, and then remove the values that are farther from the mean than you want (measured in standard deviations). A normal distribution, your classic bell-shaped curve, has the following properties:

  • About 68% of the data will be within +/- 1 standard deviation from the mean.
  • About 95% of the data will be within +/- 2 standard deviations from the mean.
  • About 99.7% of the data will be within +/- 3 standard deviations from the mean.

You can get Linq extension methods for calculating standard deviation (and other statistical functions) at http://www.codeproject.com/KB/linq/LinqStatistics.aspx

+2

I won't question whether this is the correct way to handle outliers, since I had a similar need to make exactly this kind of selection. The answer to the specific question of taking the middle n:

    List<int> ints = new List<int>() { 11, 22, 22, 33, 44, 44, 55, 55, 55, 100 };
    // Works by position, so it assumes the list is already sorted in ascending order.
    var result = ints.Skip(1).Take(ints.Count() - 2);

This skips the first element and stops before the last, giving you only the middle elements. Here is a link to a .NET Fiddle demonstrating this query.

https://dotnetfiddle.net/p1z7em

0
