Getting the average, p95 and p99 of a data stream

I have a stream of incoming data and I want to calculate its average, 95th percentile and 99th percentile. I am mostly interested in the last 1000 values. I would like to be able to query this object for any of the three values at any time (not only when the number of values seen so far is a multiple of 1000). Is there a way to get these three values without storing the last 1000 samples?

This does not have to be perfect, so we can use some tricks to get a good approximation. Speed is also a concern. Thanks.

(I will be doing this in C++, but I do not think that matters much.)

2 answers

At a minimum, you will need to maintain a queue of the last 1000 items.

To maintain the average, keep a running sum of the last 1000 elements: when you add a new item to the queue, add its value to the sum and subtract the value of the oldest item, which you have just removed from the queue. Return the sum divided by 1000 and there you go.
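
As a rough illustration of the queue-plus-running-sum idea, here is a minimal C++ sketch. The class name `RollingMean` and its members are made up for this example, and it divides by the number of samples actually held so that it also gives a sensible answer before the window has filled up.

```cpp
#include <cstddef>
#include <deque>

// Rolling mean over the most recent `capacity` samples (hypothetical helper).
class RollingMean {
public:
    explicit RollingMean(std::size_t capacity) : capacity_(capacity) {}

    // Add a sample; once the window is over capacity, drop the oldest one
    // and subtract it from the running sum.
    void add(double x) {
        window_.push_back(x);
        sum_ += x;
        if (window_.size() > capacity_) {
            sum_ -= window_.front();
            window_.pop_front();
        }
    }

    // Mean of the samples currently held; 0.0 if nothing has been added yet.
    double mean() const {
        if (window_.empty()) return 0.0;
        return sum_ / static_cast<double>(window_.size());
    }

private:
    std::size_t capacity_;
    std::deque<double> window_;
    double sum_ = 0.0;
};
```

One caveat: with floating-point input, repeatedly adding and subtracting can let rounding error accumulate in the running sum over a very long stream, so recomputing the sum from the queue every so often may be worthwhile.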

To maintain the current Nth percentile, maintain two heaps and track the number of elements in each. The "bottom" heap holds the lowest N% of the values and the "top" heap holds the highest (100-N)% (for example, for the 95th percentile the bottom heap holds 950 items and the top heap holds 50). At any time, the lowest element of the top heap is your percentile. When you remove an item from the queue of recent values, also remove that value from whichever heap holds it. If this leaves the heaps unbalanced (for example, the bottom heap has 951 elements and the top heap has 49), move elements to rebalance them (here, remove the largest element from the bottom heap and add it to the top heap).
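
A minimal sketch of this two-container scheme follows. It swaps `std::multiset` in for the heaps, because the standard heap adaptor `std::priority_queue` cannot remove an arbitrary element, while a multiset can erase the evicted value in O(lg(n)). The class name `WindowPercentile`, the integer `percent` parameter and the warm-up behaviour are assumptions made for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <iterator>
#include <set>

// Sliding-window percentile (hypothetical helper).
// Assumes capacity >= 1 and 0 <= percent <= 100.
class WindowPercentile {
public:
    WindowPercentile(std::size_t capacity, std::size_t percent)
        : capacity_(capacity), percent_(percent) {}

    void add(double x) {
        if (window_.size() == capacity_) {   // window full: evict the oldest
            erase_value(window_.front());
            window_.pop_front();
        }
        window_.push_back(x);
        if (!upper_.empty() && x >= *upper_.begin()) upper_.insert(x);
        else lower_.insert(x);
        rebalance();
    }

    // Smallest element of the upper container.
    // Precondition: at least one sample has been added.
    double value() const { return *upper_.begin(); }

private:
    // Every value in lower_ is <= every value in upper_, so comparing with
    // the smallest upper value tells us which container holds `old`.
    void erase_value(double old) {
        if (!upper_.empty() && old >= *upper_.begin()) upper_.erase(upper_.find(old));
        else lower_.erase(lower_.find(old));
    }

    // Keep floor(n * percent / 100) values in lower_, but always leave at
    // least one value in upper_ so value() has something to return.
    void rebalance() {
        const std::size_t n = lower_.size() + upper_.size();
        const std::size_t target = std::min(n * percent_ / 100, n - 1);
        while (lower_.size() > target) {     // largest of lower_ moves up
            auto it = std::prev(lower_.end());
            upper_.insert(*it);
            lower_.erase(it);
        }
        while (lower_.size() < target) {     // smallest of upper_ moves down
            lower_.insert(*upper_.begin());
            upper_.erase(upper_.begin());
        }
    }

    std::size_t capacity_;
    std::size_t percent_;
    std::deque<double> window_;              // arrival order
    std::multiset<double> lower_, upper_;    // duplicates are allowed
};
```

With a window of 1000 and `percent` set to 95, the lower container settles at 950 values and `value()` returns the 951st smallest, matching the description above. One simple way to get both percentiles is to run two instances, one for 95 and one for 99, at the cost of storing the window twice.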

Since you want two percentiles, use three heaps: the bottom heap holds the lowest 950 elements, the middle heap the next 40, and the top heap the highest 10. Return the lowest element of the middle heap for the 95th percentile and the lowest element of the top heap for the 99th percentile.

Adding and removing heap elements is O(lg(n)), provided each heap can locate the element being removed (for example, through a stored index). That is therefore the cost of adding a new element to the queue and the three heaps: remove the oldest queue element from its heap (O(lg(n))), add the new queue element to the appropriate heap (O(lg(n))) and, if necessary, rebalance the heaps (again, O(lg(n))). Add the new element to the lowest heap whose largest element is greater than the new element, i.e.

    if (newElement < lowestHeap.maxElement) {
        lowestHeap.add(newElement)
    } else if (newElement < middleHeap.maxElement) {
        middleHeap.add(newElement)
    } else {
        highestHeap.add(newElement)
    }

Make sure your heaps allow duplicate items.


First, assume you can afford to store the 1000 numbers (that is, memory of k times 1000 for some constant k).

Keep 3 heaps:

  • A min-heap storing the top 10 (or 50) elements (heapA)
  • A max-heap storing the remaining 990 (or 950) elements (heapB)
  • A min-heap that preserves the arrival order of the elements, so that the oldest element is always at the top (heapC)

The three heaps are special: each entry in heapC also stores a reference to the corresponding element in heapA or heapB, and each element in heapA and heapB tracks its entry in heapC.

Here's how it works:

  • Suppose you have 1000 elements in the system: heapA has 10 elements, heapB has 990 and heapC has 1000.
  • Delete the oldest item from the system: remove it from heapC and, using the link, remove it from heapA or heapB.
  • Rebalance the heaps.
  • Add the new element to heapA or heapB, depending on how it compares with the top of heapA.
  • Add the new element to heapC.
  • Link the two new entries to each other.
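
The steps above could be sketched in C++ roughly as follows. This is an illustration under several assumptions, not a faithful heap implementation: ordered `std::set` containers stand in for heapA and heapB, a `std::deque` in arrival order stands in for heapC, the forward link is a stored set iterator, and the back link is an arrival number stored next to each value. The names `LinkedWindow`, `Slot` and `topCount` are invented for the example, and rebalancing is done once, after the new element has been added.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <iterator>
#include <set>
#include <utility>

// Assumes 1 <= topCount <= capacity.
class LinkedWindow {
    using Key = std::pair<double, std::uint64_t>;  // (value, arrival number)
    using Set = std::set<Key>;
    struct Slot { bool inA; Set::iterator it; };   // heapC entry: link into A or B

public:
    LinkedWindow(std::size_t capacity, std::size_t topCount)
        : capacity_(capacity), topCount_(topCount) {}

    void add(double x) {
        if (order_.size() == capacity_) {          // delete the oldest via its link
            const Slot s = order_.front();
            (s.inA ? a_ : b_).erase(s.it);
            order_.pop_front();
            ++oldest_;
        }
        const Key k{x, next_++};
        const bool toA = !a_.empty() && k > *a_.begin();  // compare with top of A
        const Set::iterator it = (toA ? a_ : b_).insert(k).first;
        order_.push_back(Slot{toA, it});           // link the new entries
        rebalance();
    }

    // Smallest of the top `topCount` values (e.g. p99 for 10 out of 1000).
    // Precondition: at least one sample has been added.
    double percentile() const { return a_.begin()->first; }

private:
    // Move one element between containers and refresh its heapC link;
    // the arrival number in the key says which heapC entry to update.
    void move(Set& from, Set& to, Set::iterator it, bool toA) {
        const Key k = *it;
        from.erase(it);
        order_[k.second - oldest_] = Slot{toA, to.insert(k).first};
    }

    void rebalance() {
        const std::size_t want = std::min(topCount_, order_.size());
        while (a_.size() > want) move(a_, b_, a_.begin(), false);
        while (a_.size() < want) move(b_, a_, std::prev(b_.end()), true);
    }

    std::size_t capacity_, topCount_;
    std::uint64_t next_ = 0;     // arrival number of the next sample
    std::uint64_t oldest_ = 0;   // arrival number of order_.front()
    Set a_, b_;                  // a_: top values, b_: the rest
    std::deque<Slot> order_;     // arrival order
};
```

Because each arrival-order entry holds an iterator to its element, evicting the oldest sample needs no search by value. Duplicates are handled by the arrival number, which makes every key unique.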
