Rolling Variance Algorithm

I am trying to find an efficient, numerically stable algorithm to calculate a rolling variance (for instance, the variance over a 20-period rolling window). I am aware of the Welford algorithm, which efficiently computes the running variance for a stream of numbers (it requires only one pass), but I am not sure whether it can be adapted to a rolling window. I would also like the solution to avoid the accuracy problems discussed at the top of this article. A solution in any language is fine.

+51
algorithm statistics variance
10 answers

I have run into this problem as well. There are some excellent posts out there on computing the running cumulative variance, such as John Cook's Accurately computing running variance and the post from Digital Explorations, Python code for computing sample and population variances, covariance and correlation coefficient. I just could not find any that were adapted to a rolling window.

The Running Standard Deviations post by Subluminal Messages was critical in getting the rolling window formula to work. Jim takes the power sum of the squared values, versus Welford's approach of using the sum of the squared differences from the mean. The formula is as follows:

PSA(today) = PSA(yesterday) + ((x(today) * x(today)) - PSA(yesterday)) / n

  • x = value in your time series
  • n = number of values you have analyzed so far.

But to convert the Power Sum Average formula to a windowed variety, you need to modify the formula as follows:

PSA(today) = PSA(yesterday) + ((x(today) * x(today)) - (x(today - n) * x(today - n))) / n

  • x = value in your time series
  • n = period used for your rolling window.

You will also need the Rolling Simple Moving Average formula:

SMA(today) = SMA(yesterday) + (x(today) - x(today - n)) / n

  • x = value in your time series
  • n = period used for your rolling window.

From there, you can calculate the rolling population variance:

Population Var(today) = (PSA(today) * n - n * SMA(today) * SMA(today)) / n

Or the rolling sample variance:

Sample Var(today) = (PSA(today) * n - n * SMA(today) * SMA(today)) / (n - 1)

I covered this topic, along with sample Python code, in a blog post a few years ago, Running Variance.
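Putting the two windowed formulas together, a minimal Python sketch (my own code, not the blog's) might look like this:

```python
def rolling_sample_variance(xs, n):
    """Rolling sample variance over windows of size n, using the
    windowed power-sum-average (PSA) and SMA updates."""
    psa = sum(x * x for x in xs[:n]) / n   # power sum average of the first window
    sma = sum(xs[:n]) / n                  # simple moving average of the first window
    out = [(psa * n - n * sma * sma) / (n - 1)]
    for i in range(n, len(xs)):
        x_new, x_old = xs[i], xs[i - n]
        psa += (x_new * x_new - x_old * x_old) / n
        sma += (x_new - x_old) / n
        out.append((psa * n - n * sma * sma) / (n - 1))
    return out
```

Each output element is the sample variance of the window ending at that position.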

Hope this helps.

Please note: I had provided links to all of the blog posts and the mathematical formulas in LaTeX (as images) for this answer, but because of my low reputation (< 10) I am limited to only two hyperlinks and no images at all. Sorry about that. I hope this doesn't take away from the content.

+20
Jul 29 '12 at 18:10

I have dealt with the same problem.

The mean is simple to compute iteratively, but you need to keep the complete history of values in a circular buffer:

    next_index = (index + 1) % window_size;  // oldest x value is at next_index, wrapping if necessary
    new_mean = mean + (x_new - xs[next_index]) / window_size;

I adapted Welford's algorithm, and it works for all the values I have tested with:

    var_sum = var_sum + (x_new - mean) * (x_new - new_mean)
                      - (xs[next_index] - mean) * (xs[next_index] - new_mean);
    xs[next_index] = x_new;
    index = next_index;

To get the current variance, just divide var_sum by the window size: variance = var_sum / window_size;
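The same circular-buffer update, sketched as a small Python class (names and initialization are my own; the buffer is seeded with the first full window):

```python
class RollingWelford:
    """Adapted-Welford rolling mean/variance over a circular buffer.
    Seeded with the first full window of values."""

    def __init__(self, first_window):
        self.xs = list(first_window)             # circular buffer
        self.window_size = len(self.xs)
        self.index = 0                           # oldest value lives here
        self.mean = sum(self.xs) / self.window_size
        self.var_sum = sum((x - self.mean) ** 2 for x in self.xs)

    def push(self, x_new):
        old = self.xs[self.index]
        new_mean = self.mean + (x_new - old) / self.window_size
        self.var_sum += ((x_new - self.mean) * (x_new - new_mean)
                         - (old - self.mean) * (old - new_mean))
        self.xs[self.index] = x_new
        self.index = (self.index + 1) % self.window_size
        self.mean = new_mean

    def variance(self):
        return self.var_sum / self.window_size   # population variance
```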

+7
Jul 12 '11 at 12:33

If you prefer code over words (heavily based on DanS' post): http://calcandstuff.blogspot.se/2014/02/rolling-variance-calculation.html

    public IEnumerable<double> RollingSampleVariance(IEnumerable<double> data, int sampleSize)
    {
        double mean = 0;
        double accVar = 0;
        int n = 0;
        var queue = new Queue<double>(sampleSize);

        foreach (var observation in data)
        {
            queue.Enqueue(observation);
            if (n < sampleSize)
            {
                // Calculating first variance
                n++;
                double delta = observation - mean;
                mean += delta / n;
                accVar += delta * (observation - mean);
            }
            else
            {
                // Adjusting variance
                double then = queue.Dequeue();
                double prevMean = mean;
                mean += (observation - then) / sampleSize;
                accVar += (observation - prevMean) * (observation - mean)
                        - (then - prevMean) * (then - mean);
            }

            if (n == sampleSize)
                yield return accVar / (sampleSize - 1);
        }
    }
+6
Apr 24 '14 at 18:32

Here's a divide-and-conquer approach that has O(log k)-time updates, where k is the number of samples. It should be relatively stable for the same reasons that pairwise summation and FFTs are stable, but it's a bit complicated and the constant factor isn't great.

Suppose we have a sequence A of length m with mean E(A) and variance V(A), and a sequence B of length n with mean E(B) and variance V(B). Let C be the concatenation of A and B. We have:

    p = m / (m + n)
    q = n / (m + n)
    E(C) = p * E(A) + q * E(B)
    V(C) = p * (V(A) + (E(A) + E(C)) * (E(A) - E(C)))
         + q * (V(B) + (E(B) + E(C)) * (E(B) - E(C)))

Now stick the elements into a red-black tree, where each node is decorated with the mean and variance of the subtree rooted at that node. Insert on the right; delete on the left. (Since we only ever access the ends, a splay tree might get O(1) amortized, but I'm guessing amortized behavior is a problem for your application.) If k is known at compile time, you could probably unroll the inner loop, FFTW-style.
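The concatenation rule is easy to sanity-check numerically; here is a small Python version of it (my own helper, not part of the answer):

```python
def merge_stats(m, ea, va, n, eb, vb):
    """Combine (length, mean, population variance) of sequences A and B
    into the statistics of their concatenation C, per the rule above."""
    p = m / (m + n)
    q = n / (m + n)
    ec = p * ea + q * eb
    vc = (p * (va + (ea + ec) * (ea - ec))
          + q * (vb + (eb + ec) * (eb - ec)))
    return m + n, ec, vc
```

Decorating each tree node with the merged statistics of its children is exactly this operation applied bottom-up.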

+5

Actually, Welford's algorithm can, AFAICT, easily be adapted to compute weighted variance. And by setting the weight to -1, you should be able to effectively cancel out items. I haven't checked the math on whether it allows negative weights, but at first glance it should!

I did a little experiment using ELKI :

    void testSlidingWindowVariance() {
        MeanVariance mv = new MeanVariance(); // ELKI implementation of weighted Welford!
        MeanVariance mc = new MeanVariance(); // Control.
        Random r = new Random();
        double[] data = new double[1000];
        for (int i = 0; i < data.length; i++) {
            data[i] = r.nextDouble();
        }
        // Pre-roll:
        for (int i = 0; i < 10; i++) {
            mv.put(data[i]);
        }
        // Compare to window approach
        for (int i = 10; i < data.length; i++) {
            mv.put(data[i - 10], -1.); // Remove
            mv.put(data[i]);
            mc.reset(); // Reset statistics
            for (int j = i - 9; j <= i; j++) {
                mc.put(data[j]);
            }
            assertEquals("Variance does not agree.", mv.getSampleVariance(),
                         mc.getSampleVariance(), 1e-14);
        }
    }

I get about 14 digits of precision compared to the exact two-pass algorithm; this is about as much as can be expected from doubles. Note that Welford does incur some computational cost because of the extra divisions - it takes about twice as long as the exact two-pass algorithm. If your window size is small, it may be much more sensible to actually recompute the mean, and then the variance in a second pass, every time.

I have added this experiment as a unit test to ELKI; you can see the full source here: http://elki.dbs.ifi.lmu.de/browser/elki/trunk/test/de/lmu/ifi/dbs/elki/math/TestSlidingVariance.java It also compares against the exact two-pass variance.

However, the behavior may differ on skewed data sets. This data set is obviously uniformly distributed, but I also tried a sorted array, and it worked.
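For anyone without ELKI at hand, the weighted-Welford trick can be sketched in plain Python using a West-style weighted update; this is my own minimal version, not ELKI's API:

```python
class WeightedMeanVariance:
    """West-style weighted mean/variance; putting a value again with
    weight -1 removes it. A sketch of the idea only, not ELKI's API."""

    def __init__(self):
        self.wsum = 0.0   # sum of weights
        self.mean = 0.0
        self.m2 = 0.0     # weighted sum of squared deviations

    def put(self, x, weight=1.0):
        new_wsum = self.wsum + weight
        delta = x - self.mean
        r = delta * weight / new_wsum
        self.mean += r
        self.m2 += self.wsum * delta * r
        self.wsum = new_wsum

    def sample_variance(self):
        return self.m2 / (self.wsum - 1.0)
```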

+4
Jan 05 '13

I expect to be proven wrong, but I don't think this can be done "quickly." That said, a large part of the calculation is keeping track of the EV over the window, which can be done easily.

I'll leave with the question: are you sure you need a windowed function? Unless you are working with very large windows, it is probably better to just use a well-known predefined algorithm.

+1

I take it that keeping your 20 samples, Sum(X^2 from 1..20), and Sum(X from 1..20), and then successively recomputing the two sums at each iteration isn't efficient enough? It is possible to recompute the new variance without adding up, squaring, etc., all of the samples each time.

How in:

    Sum(X^2 from 2..21) = Sum(X^2 from 1..20) - X_1^2 + X_21^2
    Sum(X from 2..21)   = Sum(X from 1..20) - X_1 + X_21
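A quick Python sketch of this running-sums approach (my code, with the two update lines matching the identities above):

```python
def sliding_sums_variance(xs, n):
    """Rolling sample variance by maintaining Sum(X) and Sum(X^2),
    swapping one sample in and one out per step."""
    s = sum(xs[:n])
    s2 = sum(x * x for x in xs[:n])
    out = [(s2 - s * s / n) / (n - 1)]
    for i in range(n, len(xs)):
        s += xs[i] - xs[i - n]                       # Sum(X) update
        s2 += xs[i] * xs[i] - xs[i - n] * xs[i - n]  # Sum(X^2) update
        out.append((s2 - s * s / n) / (n - 1))
    return out
```

Note that this form can suffer from catastrophic cancellation when the values are large relative to the variance, which is exactly the accuracy issue the question wants to avoid.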
+1

Here's another O(log k) solution: take squares of the original sequence, then sums of pairs, then sums of fours, etc. (You need a bit of a buffer to be able to find all of these efficiently.) Then add up the values you need to get your answer. For example:

 ||||||||||||||||||||||||| // Squares | | | | | | | | | | | | | // Sum of squares for pairs | | | | | | | // Pairs of pairs | | | | // (etc.) | | ^------------------^ // Want these 20, which you can get with | | // one... | | | | // two, three... | | // four... || // five stored values. 

Now you use the standard formula E(x^2) - E(x)^2, and you're done. (Not if you need good stability for small sets of numbers; this was assuming that it was only the accumulation of rounding errors that was causing problems.)

Besides, summing 20 squared numbers is very fast these days on most architectures. If you were doing more - say, a couple of hundred - a more efficient method would clearly be better. But I'm not sure brute force isn't the way to go here.
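The levels of pairwise sums can be sketched in Python as a static structure (my own toy version; it omits the circular-buffer update needed for true rolling use):

```python
class PairSums:
    """Levels of pairwise partial sums over a fixed sequence; any
    contiguous range can then be summed from O(log k) stored blocks."""

    def __init__(self, values):
        self.levels = [list(values)]
        while len(self.levels[-1]) > 1:
            prev = self.levels[-1]
            self.levels.append([sum(prev[i:i + 2]) for i in range(0, len(prev), 2)])

    def range_sum(self, lo, hi):
        """Sum of values[lo:hi], taking the largest blocks available."""
        total, level = 0.0, 0
        while lo < hi:
            if lo % 2:            # ragged left edge: take one block at this level
                total += self.levels[level][lo]
                lo += 1
            if hi % 2:            # ragged right edge
                hi -= 1
                total += self.levels[level][hi]
            lo //= 2
            hi //= 2
            level += 1
        return total


def window_pvariance(ps_sq, ps_val, lo, hi):
    """Population variance of values[lo:hi] via E(x^2) - E(x)^2."""
    n = hi - lo
    return ps_sq.range_sum(lo, hi) / n - (ps_val.range_sum(lo, hi) / n) ** 2
```

With one `PairSums` over the values and one over their squares, any window's variance comes from a handful of stored blocks, as in the diagram.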

+1

For only 20 values, it is trivial to adapt the method exposed here (I did not say quickly, though).

You can simply keep an array of 20 of these RunningStat instances.

The first 20 elements of the stream are somewhat special; however, once past that, it gets much simpler:

  • when a new element arrives, clear the current RunningStat instance, add the element to all 20 instances, and increment the "counter" (modulo 20), which identifies the new "full" RunningStat instance
  • at any given moment, you can consult the current "full" instance to get your running variance.

Obviously, this approach does not really scale...

You may also notice that there is some redundancy in the figures we keep (if you go with the full RunningStat class). An obvious improvement would be to keep the last 20 Mk and Sk values directly.

I cannot think of a better formulation using this particular algorithm; I am afraid its recursive formulation ties our hands somewhat.
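To make the staggered bookkeeping concrete, here is a toy Python sketch (plain lists stand in for the RunningStat instances; all names are mine):

```python
import statistics


class StaggeredRolling:
    """Toy version of the scheme above: `window` staggered accumulators,
    cleared in rotation so that one always covers the last `window` values."""

    def __init__(self, window=20):
        self.window = window
        self.buffers = [[] for _ in range(window)]
        self.counter = 0   # identifies the instance to clear next
        self.count = 0     # total values seen

    def push(self, x):
        self.buffers[self.counter] = []        # clear the current instance
        for b in self.buffers:
            b.append(x)                        # add the element to all instances
        self.counter = (self.counter + 1) % self.window
        self.count += 1
        full = self.buffers[self.counter]      # the new "full" instance
        if self.count >= self.window:
            return statistics.variance(full)   # sample variance of the window
        return None
```

Each buffer is cleared once every `window` pushes, so no buffer ever holds more than `window` values.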

+1
Mar 01 '11 at 9:01

I know this question is old, but in case anyone else is interested, here is some Python code. It is inspired by johndcook's blog post and by the comments from @Joachim, @DanS, and @Jaime. The code below still gives slightly inaccurate results for small data window sizes. Enjoy.

    from __future__ import division
    import collections
    import math


    class RunningStats:
        def __init__(self, WIN_SIZE=20):
            self.n = 0
            self.mean = 0
            self.run_var = 0
            self.WIN_SIZE = WIN_SIZE
            self.windows = collections.deque(maxlen=WIN_SIZE)

        def clear(self):
            self.n = 0
            self.mean = 0
            self.run_var = 0
            self.windows.clear()

        def push(self, x):
            if self.n < self.WIN_SIZE:
                # Calculating first variance (plain Welford)
                self.n += 1
                delta = x - self.mean
                self.mean += delta / self.n
                self.run_var += delta * (x - self.mean)
            else:
                # Adjusting variance: the deque is full, so the append()
                # below will evict the oldest value automatically.
                x_removed = self.windows[0]
                old_m = self.mean
                self.mean += (x - x_removed) / self.WIN_SIZE
                self.run_var += (x + x_removed - old_m - self.mean) * (x - x_removed)
            self.windows.append(x)

        def get_mean(self):
            return self.mean if self.n else 0.0

        def get_var(self):
            return self.run_var / (self.n - 1) if self.n > 1 else 0.0

        def get_std(self):
            return math.sqrt(self.get_var())

        def get_all(self):
            return list(self.windows)

        def __str__(self):
            return "Current window values: {}".format(list(self.windows))
0
Aug 29 '17 at 22:26


