"Online" (iterative) algorithms for assessing the statistical median, mode, asymmetry, kurtosis?

Is there an algorithm to estimate the median, mode, skewness, and/or kurtosis of a set of values, but one that does NOT require storing all the values in memory at once?

I would like to calculate the basic statistics:

  • mean: arithmetic average
  • variance: average of squared deviations from the mean
  • standard deviation: square root of the variance
  • median: a value that separates the larger half of the numbers from the smaller half
  • mode: the most common value found in the set
  • skewness: TL;DR
  • kurtosis: TL;DR

The basic formulas for calculating any of these are grade-school arithmetic, and I do know them. There are also many statistics libraries that implement them.

My problem is the large number (billions) of values in the sets I process: working in Python, I can't just make a list or hash with billions of elements. Even if I wrote this in C, arrays with billions of elements aren't too practical.

Data is not sorted. It was produced randomly, on the fly, by other processes. The size of each set varies greatly, and the sizes will not be known in advance.

I have already worked out how to handle the mean and variance correctly, iterating over each value in the set in any order. (Actually, in my case, I take them in the order in which they are generated.) Here is the algorithm, courtesy of http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm :

  • Initialize three variables: count, sum, and sum_of_squares
  • For each value:
    • Increment count.
    • Add the value to sum.
    • Add the square of the value to sum_of_squares.
  • Divide sum by count, storing as the variable mean.
  • Divide sum_of_squares by count, storing as mean_of_squares.
  • Square mean, storing as square_of_mean.
  • Subtract square_of_mean from mean_of_squares, storing as variance.
  • Output mean and variance.

This online algorithm has weaknesses (for example, accuracy problems, since sum_of_squares quickly grows beyond the integer range or float precision), but it basically gives me what I need without having to store every value of every set.
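For concreteness, a minimal Python sketch of those steps (naive sum / sum_of_squares form as described above; I use total in place of sum, since sum is a Python builtin):

    def online_mean_variance(stream):
        # single pass; no values are stored, only three running quantities
        count = 0
        total = 0.0
        sum_of_squares = 0.0
        for value in stream:
            count += 1
            total += value
            sum_of_squares += value * value
        mean = total / count
        mean_of_squares = sum_of_squares / count
        square_of_mean = mean * mean
        variance = mean_of_squares - square_of_mean
        return mean, variance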

But I don't know whether similar techniques exist for estimating the additional statistics (median, mode, skewness, kurtosis). I could live with a biased estimate, or even a method that compromises accuracy to some degree, as long as the memory needed to process N values is substantially less than O(N).

A pointer to an existing statistics library would also help, if that library has functions to compute one or more of these statistics online.

+77
iterator algorithm statistics median
Jun 29 '09 at 15:02
12 answers

Skewness and kurtosis

For online algorithms for skewness and kurtosis (along the lines of the variance one), see the same wiki page for the parallel algorithms for higher-moment statistics.

Median

The median is tough without sorted data. If you know how many data points you have, in theory you only have to sort partially, e.g. by using a selection algorithm. However, that doesn't help much with billions of values. I would suggest using frequency counts; see the next section.

Median and frequency counting mode

If they are integers, I would count frequencies, possibly cutting off the highest and lowest values beyond some point where I'm sure they are no longer relevant. For floats (or too many integers) I would probably create buckets/intervals and then use the same approach as for integers. (Approximate) mode and median calculation then becomes easy, based on the frequency table.
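As a rough illustration of that frequency-table idea, a Python sketch (the names and the bucketing scheme are my own; bucket_width = 1 keeps integer data exact, and memory grows with the number of distinct buckets rather than with N):

    from collections import Counter

    def frequency_table(stream, bucket_width=1):
        counts = Counter()
        for value in stream:
            counts[int(value // bucket_width)] += 1
        return counts

    def approx_mode(counts, bucket_width=1):
        bucket, _ = counts.most_common(1)[0]   # most frequent bucket (assumes non-empty input)
        return bucket * bucket_width

    def approx_median(counts, bucket_width=1):
        total = sum(counts.values())
        seen = 0
        for bucket in sorted(counts):          # walk buckets in value order
            seen += counts[bucket]
            if seen >= total / 2.0:
                return bucket * bucket_width
        return None                            # empty input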

Normally distributed random variables

If it is normally distributed, I would use the sample mean, variance, skewness, and kurtosis as maximum-likelihood estimates for a small subset. The (online) algorithms to calculate those you already know. E.g. read in a couple of hundred thousand or a million data points, until your estimation error gets small enough. Just make sure that you pick randomly from your set (e.g. that you don't introduce a bias by picking the first 100,000 values). The same approach can also be used to estimate the mode and median for the normal case (for both, the sample mean is an estimator).
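For the "pick randomly from your set" step, one standard option (my addition, not part of the answer above) is reservoir sampling, which keeps a uniform random subset of a stream of unknown length; a sketch:

    import random

    def reservoir_sample(stream, k=100000):
        sample = []
        for i, value in enumerate(stream):
            if i < k:
                sample.append(value)          # fill the reservoir first
            else:
                j = random.randint(0, i)      # new value survives with probability k/(i+1)
                if j < k:
                    sample[j] = value
        return sample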

Additional comments

All of the algorithms above can be run in parallel (including many sorting and selection algorithms, e.g. QuickSort and QuickSelect), if that helps.

I have always assumed (with the exception of the section on the normal distribution) that we are talking about sample moments, median, and mode, not estimators for theoretical moments given a known distribution.

In general, sampling the data (i.e. only looking at a subset) should be quite successful given the amount of data, as long as all observations are realizations of the same random variable (have the same distribution) and the moments, mode, and median actually exist for that distribution. The last caveat is not innocuous. For example, the mean (and all higher moments) of the Cauchy distribution do not exist. In that case, the sample mean of a "small" subset can be massively off from the sample mean of the whole sample.

+50
Jun 29 '09 at 16:14

I use these incremental/recursive mean and median estimators, which both use constant storage:

 mean += eta * (sample - mean)
 median += eta * sgn(sample - median)

where eta is a small learning-rate parameter (e.g. 0.001), and sgn() is the signum function, which returns one of {-1, 0, 1}. (Use a constant eta if the data is non-stationary and you want to track changes over time; otherwise, for stationary sources you can use something like eta = 1/n for the mean estimator, where n is the number of samples seen so far... unfortunately, this does not appear to work for the median estimator.)
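A Python sketch of those two updates applied over a stream (eta and the initial estimates are arbitrary choices here):

    def incremental_mean_median(stream, eta=0.001):
        mean = 0.0      # starting guesses; any reasonable values work
        median = 0.0
        for sample in stream:
            mean += eta * (sample - mean)
            sgn = (sample > median) - (sample < median)   # signum in {-1, 0, 1}
            median += eta * sgn
        return mean, median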

This type of incremental mean estimator seems to be used all over the place, e.g. in unsupervised neural network learning rules, but the median version seems much less common despite its benefits (robustness to outliers). It seems that the median version could be used as a replacement for the mean estimator in many applications.

I would love to see an incremental mode estimator of a similar form...

UPDATE

I just modified the incremental median estimator to estimate arbitrary quantiles. In general, the quantile function ( http://en.wikipedia.org/wiki/Quantile_function ) tells you the value that divides the data into two fractions: p and 1-p. The following estimates this value incrementally:

 quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0) 

The value of p must be within [0,1]. This essentially shifts the sgn() function's symmetric output {-1, 0, 1} to lean toward one side, partitioning the data samples into two unequally sized bins (fractions p and 1-p of the data are smaller/larger than the quantile estimate, respectively). Note that for p = 0.5 this reduces to the median estimator.
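In code, the quantile version looks something like this sketch (p = 0.5 recovers the median update above):

    def incremental_quantile(stream, p, eta=0.001):
        quantile = 0.0                                      # arbitrary starting estimate
        for sample in stream:
            sgn = (sample > quantile) - (sample < quantile)
            quantile += eta * (sgn + 2.0 * p - 1.0)
        return quantile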

+50
Jan 27 '10 at 5:24

I implemented the P-Square Algorithm for dynamic calculation of quantiles and histograms without storing observations in a neat Python module I wrote called LiveStats. It should solve your problem quite effectively. The library supports every statistic you mention except for the mode. I have not yet found a satisfactory solution for mode estimation.

+9
Mar 09 '13 at 16:34

Ryan, I'm afraid you are not doing the mean and variance right... This came up a few weeks ago here. And one of the strengths of the online version (which actually goes by the name of Welford's method) is that it is particularly accurate and stable; see the discussion there. One of the nice points is that you do not need to store the running sum or running sum of squares...

I can't think of any online approach to the mode and the median that doesn't seem to require consideration of the whole list at once. But it may well be that an approach similar to the one for variance and mean will work for skewness and kurtosis too...
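For reference, a minimal sketch of the Welford-style update mentioned above (single pass, no running sum or sum of squares kept):

    def welford_mean_variance(stream):
        n = 0
        mean = 0.0
        m2 = 0.0                        # sum of squared deviations from the running mean
        for x in stream:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
        variance = m2 / n if n else float('nan')   # use (n - 1) for the sample variance
        return mean, variance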

+7
Jun 29 '09 at 17:55

The Wikipedia article quoted in the question contains the formulas for calculating skewness and kurtosis online.

For the mode, I believe, there is no way of doing it online. Why? Suppose all values of your input are different except the last one, which duplicates a previous one. In that case you have to remember all values already seen in the input to detect that the last value duplicates a value seen before and makes it the most frequent one.

For the median it is almost the same: up until the last input you don't know which value will become the median if all input values are different, because it could fall before or after your current median. If you know the length of the input, you can find the median without storing all values in memory, but you will still have to store many of them (around half, I guess), because a bad input sequence could shift the median heavily into the second half, possibly making any value from the first half the median.

(Note that I am referring to exact calculation only.)

+3
Jun 29 '09 at 16:03

If you have billions of data points, then it's unlikely you need exact answers as opposed to close answers. Generally, if you have billions of data points, the underlying process which generates them will likely obey some kind of statistical stationarity / ergodicity / mixing property. It may also matter whether you expect the distributions to be reasonably continuous or not.

Under these circumstances, there exist algorithms for online, low-memory estimation of quantiles (the median is a special case, the 0.5 quantile), as well as of modes, if you don't need exact answers. This is an active field of statistics.

quantile estimation example: http://www.computer.org/portal/web/csdl/doi/10.1109/WSC.2006.323014

mode estimation example: Bickel DR. Robust estimators of the mode and skewness of continuous data. Computational Statistics and Data Analysis. 2002;39:153-163. doi:10.1016/S0167-9473(01)00057-3.

These are active fields of computational statistics. You are getting into territory where there is no single best exact algorithm, but a diversity of them (statistical estimators, in truth) with different properties, assumptions, and performance. It is experimental mathematics. There are probably hundreds to thousands of papers on the subject.

The final question is whether you really need skewness and kurtosis by themselves, or rather some other parameters which may be more reliable at characterizing the probability distribution (assuming you have a probability distribution!). Are you expecting a Gaussian?

Do you have ways of cleaning/preprocessing the data to make it mostly Gaussian? (For instance, financial transaction amounts are often somewhat Gaussian after taking logarithms.) Do you expect finite standard deviations? Do you expect fat tails? Are the quantities you care about in the tails or in the bulk?

+2
Oct 18 '09 at 23:08

Everyone keeps saying that you can't do the mode online, but that is simply not true. Here is an article describing an algorithm to do just this problem, invented in 1982 by Michael E. Fisher and Stephen L. Salzberg of Yale University. From the article:

The majority-finding algorithm uses one of its registers for temporary storage of a single item from the stream; this item is the current candidate for the majority element. The second register is a counter initialized to 0. For each element of the stream, we ask the algorithm to perform the following routine. If the counter reads 0, install the current stream element as the new majority candidate (displacing any other element that might already be in the register). Then, if the current element matches the majority candidate, increment the counter; otherwise, decrement the counter. At this point in the cycle, if the part of the stream seen so far has a majority element, that element is in the candidate register, and the counter holds a value greater than 0. What if there is no majority element? Without making a second pass through the data (which isn't possible in a stream environment), the algorithm cannot always give an unambiguous answer in this circumstance. It merely promises to correctly identify the majority element if there is one.

It can also be extended to find the top N with more memory, but this should solve it for the mode.
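A Python sketch of that candidate/counter routine (it reports a majority element if one exists; it is not a general-purpose mode estimator):

    def majority_candidate(stream):
        candidate = None
        count = 0
        for item in stream:
            if count == 0:
                candidate = item        # install a new majority candidate
                count = 1
            elif item == candidate:
                count += 1
            else:
                count -= 1
        return candidate                # only guaranteed correct if a true majority exists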

+2
Feb 19 '12 at 17:22

Ultimately, if you have no a priori parametric knowledge of the distribution, I think you have to store all the values.

That said, unless you are dealing with some sort of pathological situation, the remedian (Rousseeuw and Bassett 1990) may well be good enough for your purposes.

Very simply, it consists of calculating the median of batches of medians.
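A rough Python sketch of that idea (fixed-size buffers of medians promoted recursively; the final combining step here is cruder than the published remedian, so treat it only as an approximation):

    import statistics

    def remedian(stream, b=11):
        buffers = [[]]                          # buffers[i] holds medians at level i
        for value in stream:
            buffers[0].append(value)
            level = 0
            while len(buffers[level]) == b:     # a full buffer promotes its median upward
                med = statistics.median(buffers[level])
                buffers[level] = []
                level += 1
                if level == len(buffers):
                    buffers.append([])
                buffers[level].append(med)
        leftovers = [m for buf in buffers for m in buf]
        return statistics.median(leftovers) if leftovers else None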

+1
Jul 27 '09 at 21:14

The median and mode can't be calculated online using only constant space. However, since the median and mode are in any case more "descriptive" than "quantitative", you can estimate them, e.g. by sampling the data set.

If the data is normally distributed in the long run, then you could just use your mean to estimate the median.

You can also estimate the median using the following technique: keep a median estimate M[i] for every, say, 1,000,000 entries in the data stream, so that M[0] is the median of the first million entries, M[1] the median of the second million entries, and so on. Then use the median of M[0]..M[k] as the median estimator. This, of course, saves space, and you can control how much space you want to use by tuning the 1,000,000 parameter. It can also be generalized recursively.
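A sketch of that block-median scheme in Python (block_size plays the role of the 1,000,000 parameter):

    import statistics

    def block_median_estimate(stream, block_size=1000000):
        block = []
        block_medians = []                      # M[0], M[1], ... from the description above
        for value in stream:
            block.append(value)
            if len(block) == block_size:
                block_medians.append(statistics.median(block))
                block = []
        if block:                               # last, possibly partial, block
            block_medians.append(statistics.median(block))
        return statistics.median(block_medians) if block_medians else None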

0
Jun 29 '09 at 16:18

OK dude, try these:

For C++:

    double skew(double* v, unsigned long n) {
        double sigma = pow(svar(v, n), 0.5);
        double mu = avg(v, n);

        double* t = new double[n];
        for (unsigned long i = 0; i < n; ++i) {
            t[i] = pow((v[i] - mu) / sigma, 3);
        }

        double ret = avg(t, n);
        delete[] t;
        return ret;
    }

    double kurt(double* v, unsigned long n) {
        double sigma = pow(svar(v, n), 0.5);
        double mu = avg(v, n);

        double* t = new double[n];
        for (unsigned long i = 0; i < n; ++i) {
            t[i] = pow((v[i] - mu) / sigma, 4) - 3;
        }

        double ret = avg(t, n);
        delete[] t;
        return ret;
    }

Since you say you can already calculate the sample variance (svar) and the mean (avg), just point these at your functions for doing that.

Also, have a look at Pearson's approximation; on such a large dataset it would be pretty similar: 3 * (mean - median) / standard deviation, taking the median as (max - min) / 2.

For float data the mode has no meaning as such. One would typically stick the values into bins of a sensible size (such as 1/100 * (max - min)).

0
Jan 29 '13 at 13:37

I would tend to use buckets, which could be adaptive. The bucket size should be whatever accuracy you need. Then, as each data point comes in, you add one to the relevant bucket's count. These should give you simple approximations to the median and kurtosis, by treating each bucket as its value weighted by its count.

The one problem could be loss of floating-point resolution after billions of operations, i.e. adding one no longer changes the value! To get around this, if the maximum bucket count exceeds a certain limit, you could subtract a large number from all the counts.
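A small sketch of reading an approximate median back out of such bucket counts (the bucket edges and layout here are assumptions on my part):

    def bucket_median(counts, edges):
        # counts[i] is how many samples fell into [edges[i], edges[i+1])
        total = sum(counts)
        seen = 0
        for i, c in enumerate(counts):
            seen += c
            if seen >= total / 2.0:
                return (edges[i] + edges[i + 1]) / 2.0   # bucket midpoint as the estimate
        return None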

-1
Oct 28 '10 at 23:38
    import numpy as np
    import numpy.random as npr

    # (assumes M, T, y0, var, corr and an output container yy are defined elsewhere)
    for j in range(1, M):
        # build the vector y
        y = np.zeros(M)
        y[0] = y0
        # generate the white noise
        eps = npr.randn(M - 1) * np.sqrt(var)
        # increment the y vector
        for k in range(1, T):
            y[k] = corr * y[k - 1] + eps[k - 1]
        yy[j] = y
-1
Jan 18 '16 at 14:27


