How to measure the variability of a benchmark made up of many subelements?

(Not strictly programming, but a question that programmers need to answer.)

I have a benchmark, X , which consists of many subelements x_1 .. x_n. It is a rather noisy test, the results of which are quite variable. To measure accurately, I must reduce this "variability", which first requires that I measure it.

I can easily calculate the variability of each subelement, using, say, standard deviation or variance. However, I'd like to get a single number that represents the overall variability.

My own attempt to solve this problem:

    sum = 0
    foreach i in 1..n
        calculate mean across the 60 runs of x_i
        foreach j in 1..60
            sum += abs(mean[i] - x_i[j])
    variability = sum / 60
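For concreteness, that attempt could be written with NumPy roughly as follows (a sketch; the 10 x 60 data matrix is made up, and note that the pseudocode's `sum / 60` arguably should be `sum / (n * 60)`, which is the pooled mean absolute deviation computed here):

```python
import numpy as np

def variability(runs):
    """Pooled mean absolute deviation: each subelement's runs measured
    about that subelement's own mean. `runs` is an (n, 60) array."""
    means = runs.mean(axis=1, keepdims=True)   # per-subelement mean
    return np.abs(runs - means).mean()         # pooled over all n * 60 values

rng = np.random.default_rng(0)
data = rng.exponential(size=(10, 60))          # 10 subelements, 60 runs each
print("pooled variability: %.3f" % variability(data))
```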
+4
5 answers

Best idea: ask on the Statistics Stack Exchange once it hits public beta (in a week).

In the meantime: you are really more interested in the extremes of variability than in its central tendency (mean, etc.). For many applications, I believe there is relatively little to be gained by improving the typical user experience, but much to be gained by improving the worst user experience. Try the 95th percentile of the standard deviations and work on reducing that. Alternatively, if the typical variability is what you want to reduce, pool the standard deviations. If they are roughly normally distributed, I don't know of any reason why you couldn't just take their mean.
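Both suggestions are one-liners with NumPy (a sketch; the runs matrix here is simulated data):

```python
import numpy as np

rng = np.random.default_rng(1)
runs = rng.exponential(size=(100, 60))   # 100 subelements, 60 runs each

devs = runs.std(axis=1, ddof=1)          # one sample stddev per subelement
worst = np.percentile(devs, 95)          # focus on the worst offenders
typical = devs.mean()                    # or pool for the typical case
print("95th percentile of stddevs: %.3f" % worst)
print("mean stddev: %.3f" % typical)
```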

+2

I think you underestimate the standard deviation: if you run your test 50 times and get 50 different runtimes, the standard deviation is the one number that describes how tightly or loosely those 50 numbers are spread around your average. Combined with the average runtime, the standard deviation helps you judge how consistent your results are.

Consider the following runtimes:

12 15 16 18 19 21 12 14

The mean of this set is 15.875 . The standard deviation of this set is 3.27. There is a longer explanation of what 3.27 actually means (in a normally distributed population, approximately 68% of the samples fall within one standard deviation of the mean, i.e. between 15.875-3.27 and 15.875+3.27 ), but I think you're just looking for a way to quantify how "tight" or "spread out" the results are around your average.

Now consider another set of runtimes (say, after compiling all your tests with -O2 ):

14 16 14 17 19 21 12 14

The mean of this set is also 15.875 . The standard deviation of this set is 3.00. (So approximately 68% of the samples lie between 15.875-3.00 and 15.875+3.00 .) This set is more tightly grouped than the first.

And you have one number that summarizes how tightly or loosely a group of numbers clusters around its average.
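Python's standard library will do the arithmetic for you on the two runtime sets above (sample standard deviation, matching the 3.27 and 3.00 quoted):

```python
from statistics import mean, stdev

first  = [12, 15, 16, 18, 19, 21, 12, 14]
second = [14, 16, 14, 17, 19, 21, 12, 14]   # say, after compiling with -O2

print("mean %.3f  stdev %.2f" % (mean(first), stdev(first)))    # 15.875, 3.27
print("mean %.3f  stdev %.2f" % (mean(second), stdev(second)))  # 15.875, 3.00
```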

Warning

The standard deviation rests on the assumption of a normal distribution, but your runtimes may not be normally distributed, so keep in mind that the standard deviation may be an approximation at best. Plot your timings in a histogram to see whether your data looks roughly normal, or uniform, or multimodal, or ...

Also, I use the sample standard deviation, because your runs are just a sample from the space of all possible test runs. I am not a professional statistician, so even this basic assumption may be wrong. Either population standard deviation or sample standard deviation will give you good enough results in your application, IFF you stick to either sample or population consistently. Don't mix the two.

I mentioned that the standard deviation in combination with the mean helps you understand your data: if the standard deviation is almost as large as your mean, or worse, larger, then your data is very scattered and perhaps your process is not very repeatable. Interpreting a 3% speedup in the face of a large standard deviation is nearly useless, as you have recognized. And the best yardstick for the standard deviation (in my experience) is the mean.

A last note: yes, you can calculate standard deviation by hand, but it gets tedious after the first ten or so values. Best to use a spreadsheet, or Wolfram Alpha, or your handy high-school calculator.

+1

From Variance : "the variance of the total group is equal to the mean of the variances of the subgroups, plus the variance of the means of the subgroups." I had to read that several times, then run it: the 464 from this formula == 464, the standard deviation of all the data taken directly, which is the one number you want.

    #!/usr/bin/env python
    import sys
    import numpy as np

    N = 10
    exec "\n".join( sys.argv[1:] )  # this.py N= ...
    np.set_printoptions( 1, threshold=100, suppress=True )  # .1f
    np.random.seed(1)

    data = np.random.exponential( size=( N, 60 )) ** 5  # N rows, 60 cols
    row_avs = np.mean( data, axis=-1 )   # av of each row
    row_devs = np.std( data, axis=-1 )   # spread, stddev, of each row about its av
    print "row averages:", row_avs
    print "row spreads:", row_devs
    print "average row spread: %.3g" % np.mean( row_devs )

    # http://en.wikipedia.org/wiki/Variance:
    # variance of the total group
    #   = mean of the variances of the subgroups + variance of the means of the subgroups
    avvar = np.mean( row_devs ** 2 )
    varavs = np.var( row_avs )
    print "sqrt total variance: %.3g = sqrt( av var %.3g + var avs %.3g )" % (
        np.sqrt( avvar + varavs ), avvar, varavs )

    var_all = np.var( data )  # std^2 of all N x 60 about the av of the lot
    print "sqrt variance all: %.3g" % np.sqrt( var_all )

    row averages: [  49.6  151.4   58.1   35.7   59.7   48.   115.6   69.4  148.1   25. ]
    row devs:     [ 244.7  932.1  251.5   76.9  201.1  280.   513.7  295.9  798.9  159.3]
    average row dev: 375
    sqrt total variance: 464 = sqrt( av var 2.13e+05 + var avs 1.88e+03 )
    sqrt variance all: 464


To see how the group variance adds up, run through the example from Wikipedia. Say we have
    60 men of heights 180 +- 10:  exactly 30 at 170 and 30 at 190
    60 women of heights 160 +- 7: 30 at 153 and 30 at 167

The average standard dev is (10 + 7) / 2 = 8.5. However, the combined heights

 -------|||----------|||-|||-----------------|||--- 153 167 170 190 

spread like 170 +- 13.2, much wider than 170 +- 8.5.
Why? Because we have not only the spreads within men, +- 10, and within women, +- 7, but also the spread of the subgroup means 160 / 180 about the overall mean of 170.
Exercise: calculate the spread 13.2 in two ways, from the formula above and directly.
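A quick check of the exercise in NumPy (both routes land on sqrt(174.5), about 13.2):

```python
import numpy as np

heights = np.array([153]*30 + [167]*30 + [170]*30 + [190]*30)

# Directly: population stddev of all 120 heights about their common mean
direct = heights.std()

# Via the formula: mean of subgroup variances + variance of subgroup means
women, men = heights[:60], heights[60:]
avvar  = np.mean([women.var(), men.var()])    # (49 + 100) / 2 = 74.5
varavs = np.var([women.mean(), men.mean()])   # var of {160, 180} = 100
via_formula = np.sqrt(avvar + varavs)         # sqrt(174.5)

print("%.1f  %.1f" % (direct, via_formula))
```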

+1

This is a hard problem, because benchmarks can have different natural lengths anyway. So the first thing to do is convert each of the individual benchmark figures into scale-invariant values (for example, a "speedup factor" relative to some plausible baseline) so that you at least have a chance of comparing different benchmarks.

Then you need to choose a way to combine the figures. Some kind of mean. There are, however, many kinds of mean. We can dismiss the mode and the median here; they throw away too much information. The different kinds of mean are useful because of the different ways they weight outliers. I used to know (but have forgotten) whether it was the geometric mean or the harmonic mean that was most useful in practice (the arithmetic mean is less useful here). The geometric mean is the arithmetic mean in the log domain, and the harmonic mean is the arithmetic mean in the reciprocal domain. (Spreadsheets make this trivial.)
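The three means are a few lines each (the speedup factors below are illustrative, not from any real benchmark; note that harmonic <= geometric <= arithmetic always holds):

```python
from math import exp, log

speedups = [1.10, 0.95, 2.00, 1.25]   # illustrative per-benchmark ratios
n = len(speedups)

arith = sum(speedups) / n
geo   = exp(sum(log(s) for s in speedups) / n)   # arithmetic mean in log domain
harm  = n / sum(1 / s for s in speedups)         # arithmetic mean in reciprocal domain

print("arithmetic %.3f  geometric %.3f  harmonic %.3f" % (arith, geo, harm))
```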

Now that you have a way of combining the values from a run of the whole test suite into something informative, you need to do many runs. You might want the computer to do that while you get on with other tasks. :-) Then try combining the values in different ways. In particular, look at the variance of the individual subelements and the variance of the combined benchmark number. Also consider doing some of the analyses in the log and reciprocal domains.

Be warned that this is a slow business that is hard to get right, and it is often uninformative to boot. A benchmark only tests the performance of exactly what is in the benchmark, and that is mostly not how people use the code. It may be best to strictly time-box the benchmarking work and instead focus on whether users perceive the software as fast enough, or on whether the required transaction rates are actually achieved in deployment (there are plenty of non-programming methods for that).

Good luck

0

You are trying to solve the wrong problem. Rather than measuring the variability, better to try to minimize it. The differences may be due to caching.

Try running the code on one and the same core, using SetThreadAffinityMask() on Windows.

Drop the first measurement.

Increase the thread priority.

Disable hyperthreading.

If you have many conditional jumps, different inputs can cause visible differences between calls. (This could be addressed by providing exactly the same input for the i-th iteration, and then comparing the measured times across iterations.)
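Several of these tips can be sketched together in Python (assumptions: `work` is a stand-in for the code under test, and core pinning uses os.sched_setaffinity, the Linux counterpart of SetThreadAffinityMask, guarded because it is not available on all platforms):

```python
import os
import time

def work(data):
    return sum(x * x for x in data)     # stand-in for the code under test

def measure(fn, data, runs=20):
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {0})    # pin to one core (Linux counterpart
                                        # of Windows SetThreadAffinityMask)
    times = []
    for _ in range(runs + 1):
        start = time.perf_counter()
        fn(data)                        # identical input every iteration
        times.append(time.perf_counter() - start)
    return times[1:]                    # drop the first (cold-cache) run

times = measure(work, list(range(10000)))
print("min %.6f  max %.6f" % (min(times), max(times)))
```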

Here you can find some useful tips: http://www.agner.org/optimize/optimizing_cpp.pdf

-1

Source: https://habr.com/ru/post/1316515/

