Counterintuitive result relating struct size and performance

I was curious about the overhead of a large struct versus a small struct when doing math with the + and * operators. So I created two structs: Small with a single double field (8 bytes) and Big with 10 doubles (80 bytes). In all operations I only ever touch one field, x.
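For context, the two structs look roughly like this (a minimal sketch; the padding field names p1..p9 are my own, the actual code is in the zip linked below):

    public struct Small
    {
        public double x;                                   // the only field ever used
        public Small(double x) { this.x = x; }
    }

    public struct Big
    {
        public double x;                                   // the only field ever used
        public double p1, p2, p3, p4, p5, p6, p7, p8, p9;  // 9 extra doubles -> 80 bytes total
        public Big(double x) : this() { this.x = x; }      // ": this()" zeroes the padding fields
    }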

First, I defined mathematical operators on both structs, like

    public static Small operator +(Small a, Small b) { return new Small(a.x + b.x); }
    public static Small operator *(double x, Small a) { return new Small(x * a.x); }

which is expected to do a lot of copying on the stack. I ran 5,000,000 iterations of a mathematical operation and got what I suspected: roughly a 3x slowdown for the big struct.

    public double TestSmall()
    {
        pt.Start();                             // pt = performance timing object
        Small r = new Small(rnd.NextDouble());  // rnd = random number generator
        for (int i = 0; i < N; i++)
        {
            a = 0.6 * a + 0.4 * r;              // a is a field of type Small
        }
        pt.Stop();
        return pt.ElapsedSeconds;
    }

Release build results (in seconds):

 Small=0.33940 Big=0.98909 Big is Slower by x2.91 

Now for the interesting part. I defined the same operations as static methods with ref arguments

    public static void Add(ref Small a, ref Small b, ref Small res) { res.x = a.x + b.x; }
    public static void Scale(double x, ref Small a, ref Small res)  { res.x = x * a.x; }

and ran the same number of iterations in this test code:

    public double TestSmall2()
    {
        pt.Start();                             // pt = performance timing object
        Small a1 = new Small();                 // local
        Small a2 = new Small();                 // local
        Small r = new Small(rnd.NextDouble());  // rnd = random number generator
        for (int i = 0; i < N; i++)
        {
            Small.Scale(0.6, ref a, ref a1);
            Small.Scale(0.4, ref r, ref a2);
            Small.Add(ref a1, ref a2, ref a);
        }
        pt.Stop();
        return pt.ElapsedSeconds;
    }

And the results (in seconds):

 Small=0.11765 Big=0.07130 Big is Slower by x0.61 

So compared to the copy-heavy operator versions I get roughly x3 and x14 speedups, which is great. But compare the Small struct times with the Big ones and you will see that Small is now about 60% slower than Big.

Can anyone explain this? Does it have to do with the CPU pipeline, with the operations working on separate (spatially distinct) memory allowing more efficient data prefetching?

If you want to try this for yourself, grab the code from my dropbox http://dl.dropbox.com/u/11487099/SmallBigCompare.zip

5 answers

I cannot reproduce your results. On my box, the "ref" version has basically the same performance for Big and Small, within tolerance.

(I am running a Release build without a debugger attached, with 10 or 100 times more iterations to try to get a good long run.)

Have you tried running your version for many more iterations? Is it possible that during the tests your processor gradually increases its clock speed (since it sees it has work to do)?


There are several flaws in your test.

  • Use Stopwatch instead of the PerformanceTimer type. I am not familiar with the latter and it appears to be a third-party component. Particularly worrying is that it measures time as ElapsedSeconds instead of ElapsedMilliseconds.
  • Each test should be run twice and only the second run measured, in order to eliminate JIT compilation costs.
  • Marshal.SizeOf does not give the actual size of the struct, only its marshaled size.

After switching to Stopwatch I see the performance I would expect: almost equal times for both types in the static ref case.
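For illustration, a minimal Stopwatch-based harness along those lines (a sketch only; the Benchmark class and Measure helper are my own names, not from the original code):

    using System;
    using System.Diagnostics;

    static class Benchmark
    {
        // Runs the test once untimed to pay JIT and type-loading costs,
        // then times a second run with Stopwatch.
        public static long Measure(Action test)
        {
            test();                          // warm-up pass
            var sw = Stopwatch.StartNew();
            test();                          // measured pass
            sw.Stop();
            return sw.ElapsedMilliseconds;
        }
    }

Called, for example, as Benchmark.Measure(() => TestSmall()) and Benchmark.Measure(() => TestBig()); the return value of the test methods is simply discarded.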


Agreed with Jared, this is a flawed comparison.

The essence of the problem/inconsistency is that your observation is the result of not "warming up" the tests. Warming up ensures that all types and methods have been JIT-compiled and loaded by the CLR. You should put a for loop around the main test (a sketch follows the numbers below) and always run the tests several times; notice how the numbers change after the first set in the following results:

    Size of Small is 8 bytes
    Size of Big is 80 bytes

    5,000,000.00 Iterations
    Operator Results   Small=523.00000   Big=1953.00000  Slower=x3.73
    StaticRef Results  Small=2042.00000  Big=2125.00000  Slower=x1.04
    Small=x0.26  Big=x0.92

    5,000,000.00 Iterations
    Operator Results   Small=2464.00000  Big=3510.00000  Slower=x1.42
    StaticRef Results  Small=3578.00000  Big=3647.00000  Slower=x1.02
    Small=x0.69  Big=x0.96

    5,000,000.00 Iterations
    Operator Results   Small=3921.00000  Big=4817.00000  Slower=x1.23
    StaticRef Results  Small=4880.00000  Big=4944.00000  Slower=x1.01
    Small=x0.80  Big=x0.97
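Here is roughly what that outer loop looks like (RunAllTests is a placeholder for whatever method invokes both benchmark suites; the name is mine):

    for (int pass = 0; pass < 3; pass++)
    {
        // The first pass pays JIT and type-loading costs;
        // only the later passes reflect steady-state performance.
        RunAllTests(5000000);
    }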

I have some suggestions.

  • Use the Stopwatch class. It uses the same Win32 APIs but is already written and tested for you.
  • Increase the iteration count so that your benchmarks take at least a second (or more) to run; otherwise one-off anomalies can dominate the timings.
  • Consider the impact of the vshost.exe process. You will get different results for both Debug and Release builds depending on whether you run the application standalone or through the Visual Studio host process.

When I ran your code I saw similar results for the pass-by-ref test in all test cases. What really made me think was how much faster the smaller struct was in the standalone Release build (i.e. not via vshost.exe).

Standalone Release build:

    Size of Small is 8 bytes
    Size of Big is 80 bytes

    50,000,000.00 Iterations
    Operator Results   Small=0.57173   Big=25.58988  Slower=x44.76
    StaticRef Results  Small=26.06602  Big=26.68569  Slower=x1.02
    Small=x0.02  Big=x0.96

Release build run via vshost:

    Size of Small is 8 bytes
    Size of Big is 80 bytes

    50,000,000.00 Iterations
    Operator Results   Small=4.56601   Big=35.33387  Slower=x7.74
    StaticRef Results  Small=37.94317  Big=39.64959  Slower=x1.04
    Small=x0.12  Big=x0.89

Thanks to everyone for their input. Here are some final thoughts.

PerformanceCounter gives the same results as Stopwatch, so that is not the issue.

The final results:

1. For the small struct, using operators or by-ref gives the same performance.
2. For the large struct, using by-ref is 14 times faster.
3. The large struct is x20 slower than the small struct with operators (as expected).
4. The large struct is approximately 50% slower than the small struct with by-ref (still interesting).

So the last question is: what mechanism slows down the Big struct with by-ref, given that the struct should not be copied on the stack?


Results from the executable run outside Visual Studio:

    Size of Small is 8 bytes
    Size of Big is 80 bytes

    5,000,000.00 Iterations
    Warming up the CPU's
    Using QueryPerformanceCounter
    Operator Results   Small=0.03545  Big=0.71519  Slower=x20.18
    StaticRef Results  Small=0.03526  Big=0.05194  Slower=x1.47
    Small=x1.01  Big=x13.77
